
Napkin fundamentals tutorial

This is the first of a series of tutorials to get up and running with Napkin development. We do this through the example of building a simple Napkin data pipeline. The goal is that by the end of this tutorial, you’ll have a good idea of:

  • How to start a Napkin project
  • How to modify, verify, and execute Napkin projects
  • How to automatically test data
  • How Napkin helps you build a real-life application quickly

With that, let’s get started!

Getting started

There are a few preliminaries for this tutorial. First, you need to have Napkin ready. While Napkin is a CLI-based tool, we recommend using Visual Studio Code with the Remote Containers extension, as it delivers the best experience. In addition to the instructions, we provide a completed project as a Git repository with a devcontainer setup for an easy start, as it includes all the extra necessary setup.

In this tutorial, we will use the Chinook database, a commonly used example dataset. It represents a digital media store, with tables for artists, albums, media tracks, invoices, and customers. For the sake of simplicity, we will use the SQLite backend, so no additional database setup is required. In the final exercises we will deploy our pipeline to a Postgres-based, production-grade environment. We will demonstrate Napkin's features with simple queries to give you a sense of the benefits of using Napkin for more complex data pipelines.

Setting up with devcontainer

First, you need to install Docker, Visual Studio Code, and the Remote Containers extension. Note that VSCodium is not supported. Next, clone the Git repository, open it with VSCode, and select the Reopen in Container command to start the Docker-based development environment. You may be asked to trust the repository. The devcontainer contains both SQLite and Postgres databases populated with an example dataset, as well as predefined VSCode tasks.

Setting up without devcontainer

We assume you’re running Linux or macOS. You need to install Napkin. If you are running Linux, you also need to install SQLite for this tutorial. For final exercises, you will also need to have access to the Postgres server. Windows users need to use Docker or Devcontainer setup that will bring all necessary tools.

Let’s download the database file from GitHub with the following command.

shell

curl -L -o chinook-sql.db https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite

To verify that the Chinook database and SQLite work correctly, run:

shell

sqlite3 chinook-sql.db "SELECT * FROM Artist"

With the preliminaries out of the way, we are ready to proceed with bootstrapping our first Napkin data pipeline.

Bootstrapping a Napkin project

The most intuitive way to start a Napkin project is to use the init CLI command: napkin init creates a new project skeleton from Napkin's default template.

shell

mkdir napkinFundamentalTutorial
cd napkinFundamentalTutorial
napkin init --project-name chinook-analytics

The corresponding directory structure created by Napkin’s init command for the chinook-analytics project is:

output

chinook-analytics
├── hie.yaml
├── README.md
├── specs
│   └── spec.yaml
└── sql
    └── example.sql

Napkin’s CLI is the point of entry for all typical workflows of data engineering and pipeline curation. napkin --help provides a comprehensive list of available Napkin commands. Commands may take additional parameters. To get help with these parameters, use: napkin command --help. For instance, to get help with the init command, use napkin init --help. Please refer to CLI reference page for further information.

The data pipeline we will be working on is specified in the specs/spec.yaml file. Let's explore what it looks like:

specs/spec.yaml

# yaml-language-server: $schema=https://napkin-public-storage-bucket.s3.us-east-1.amazonaws.com/schema/schema.json

# Connect to database:
backend: Postgres
db_url: postgresql://user@host/database
# set your password by providing it using --uri or set PGPASSWORD variable

# backend: BigQuery
# db_url: bigquery://bigquery.googleapis.com/project_name?dataset=default_dataset_name
# Run `napkin auth` to obtain authentication token

tables:
  example:
    create_action:
      sql_file:
        source: example.sql

First, we enable YAML schema support for IDE plugins; then we choose a backend and provide DB connection details; finally, we'll replace the example Napkin-managed table.

Before moving to the next section, let’s validate our project using Napkin’s validate command:

shell

napkin validate

Let's keep validating our spec as we move through the tutorial. You can also conveniently run validation in interactive mode, so that Napkin revalidates the spec whenever the Spec file or queries change.

shell

napkin validate --interactive

By default, Napkin will use specs/spec.yaml relative to the current directory for all commands. You can explicitly select a different spec with the --spec-file argument.
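
For example (my-other-spec.yaml being a hypothetical alternative Spec file):

shell

napkin validate --spec-file specs/my-other-spec.yaml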

Napkin follows a common paradigm in which source tables are never mutated by Napkin. Instead, Napkin creates a series of dependent tables, called managed tables, based on the SQL files in the sql/ folder. Internally, Napkin uses specs/spec.yaml to build a DAG that describes the data dependencies between managed and unmanaged tables, and uses this DAG to drive the execution plan. In addition to the DAG, the Napkin Spec contains the connection details for the supported target databases.

Setting DB connection

Before we can do any operations on the database, we need to configure the database connection. In our case, we need to point Napkin to the SQLite file. We can do this by replacing backend and db_url in the specs/spec.yaml file.

specs/spec.yaml

# Connect to Sqlite database:
backend: Sqlite
db_url: sqlite:chinook-sql.db

Let's also remove the definition of the example table from our Spec, as well as the sql/example.sql file it references, as we no longer need them.
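
Deleting the file is a one-liner (the tables: entry is removed by editing the Spec directly):

shell

rm sql/example.sql

Now, we can proceed with running our queries.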

The first queries

Let's start exploring our dataset by calculating a summary of sales in each country. First, we create a SQL query and store it in the sql/sales_by_country.sql file.

sql/sales_by_country.sql

SELECT
    "BillingCountry" AS "Country",
    SUM("Total") AS "Sales",
    COUNT(*) AS "NumInvoices"
FROM "Invoice"
GROUP BY "BillingCountry"

Now we need to link our SQL file to the sales_by_country table we’d like to create. Let’s add the following to the specs/spec.yaml file.

specs/spec.yaml

tables:
  sales_by_country:
    create_action:
      sql_file:
        source: sales_by_country.sql

By convention, Napkin will load SQL files from ../sql relative to the Spec file. This can be overridden with sql_dir: some_other_dir in the Spec file.
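
For instance, to load queries from a queries/ directory instead (a sketch; the directory name is arbitrary):

specs/spec.yaml

sql_dir: queries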

Let’s validate the spec again with napkin validate, and then we will be ready to run it for the first time:

shell

napkin run

output

[2022-01-12 17:40:28][Info] Determining tables for update...
[2022-01-12 17:40:28][Info] Forced tables: none
[2022-01-12 17:40:28][Info] Skipped tables: none
[2022-01-12 17:40:28][Info] Unmanaged tables that will be used as an input in this run:
- Invoice

[2022-01-12 17:40:28][Info] Managed tables that will be used as an input in this run, but will not be updated: none
[2022-01-12 17:40:28][Info] Managed tables that will be updated in this run:
- sales_by_country

[2022-01-12 17:40:28][Info] Estimated runtime: 0s
[2022-01-12 17:40:28][Info] Running table hooks (Pre)
[2022-01-12 17:40:28][Info] Executing table's action
[2022-01-12 17:40:28][Info] Table's action complete.
[2022-01-12 17:40:28][Info] Running table hooks (Post)
[2022-01-12 17:40:28][Info] Table' processing complete.
[2022-01-12 17:40:28][Info] TableSpec "sales_by_country" server stats:
[2022-01-12 17:40:28][Info] Execution completed. Please see below for a summary.
[2022-01-12 17:40:28][Info] ----------------------------------------------------------
[2022-01-12 17:40:28][Info] Table "sales_by_country" ran in 0.02:
[2022-01-12 17:40:28][Info] Run complete. Total cost:

Now the sales_by_country table should exist in the database. Let's double-check:

shell

sqlite3 chinook-sql.db "SELECT * FROM sales_by_country"

output

Argentina|37.62|7
Australia|37.62|7
Austria|42.62|7
Belgium|37.62|7
...

Best markets

We have decided to perform a more detailed analysis of the ten countries with the highest sales. We expect to reuse that list over and over, and we would like to follow the DRY principle. Since we have already calculated country sales, we can use the sales_by_country table as an input. Let's add a top_10_countries table to our Spec:

sql/top_10_countries.sql

SELECT *
FROM sales_by_country
ORDER BY "Sales" DESC
FETCH FIRST 10 ROWS ONLY

specs/spec.yaml

tables:
  # ...
  top_10_countries:
    create_action:
      sql_file:
        source: top_10_countries.sql

Next, we would like to know the number of customers in our top countries. Let's add this to the spec as well:

sql/top_10_countries_customers.sql

SELECT
    top_10_countries."Country" AS "Country",
    COUNT(*) AS "NumCustomers"
FROM top_10_countries
JOIN "Customer" ON top_10_countries."Country" = "Customer"."Country"
GROUP BY top_10_countries."Country"

specs/spec.yaml

tables:
  # ...
  top_10_countries_customers:
    create_action:
      sql_file:
        source: top_10_countries_customers.sql

Let's run the spec again and check that everything is correct. First, execute the pipeline:

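shell

napkin run

Then query the new table:
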
shell

sqlite3 chinook-sql.db "SELECT * FROM top_10_countries_customers"

output

Brazil|5
Canada|8
Chile|1
Czech Republic|2
France|5
Germany|4
India|2
Portugal|2
USA|13
United Kingdom|3

Discovering dependencies

We have added three tables to the Spec so far and successfully run them. Napkin did some heavy lifting for us: it analyzed all queries, figured out the dependencies, and executed the queries in the correct order. Let's explore this with the dump command, which stores all queries (after being processed by Napkin), as well as the dependency graph, in the dump/ folder:

shell

napkin dump

output

dump
├── 1_sales_by_country.sql
├── 2_top_10_countries.sql
├── 3_top_10_countries_customers.sql
├── dependency_graph.dot
├── dependency_graph.pdf
└── MANIFEST.txt

The dependency_graph.pdf file shows the dependencies discovered by Napkin. Note that the Invoice and Customer tables are displayed differently, as they are not managed by Napkin.

DAG generated by Napkin

DRY: Don’t Repeat Yourself

Let's explore our customers, starting with aggregating total sales per customer:

sql/customers_total_purchase.sql

SELECT
    "Customer"."CustomerId" AS "CustomerId",
    "Customer"."Email" AS "Email",
    "Customer"."FirstName" AS "FirstName",
    "Customer"."LastName" AS "LastName",
    SUM("Invoice"."Total") AS "CustomerSales"
FROM
    "Customer"
    JOIN "Invoice" ON "Customer"."CustomerId" = "Invoice"."CustomerId"
GROUP BY
    "Customer"."CustomerId"

specs/spec.yaml

tables:
  # ...
  customers_total_purchase:
    create_action:
      sql_file:
        source: customers_total_purchase.sql

Next, we’d like to find out our top 20 customers:

sql/top_20_customers.sql

SELECT *
FROM customers_total_purchase
ORDER BY "CustomerSales" DESC
FETCH FIRST 20 ROWS ONLY

specs/spec.yaml

tables:
  # ...
  top_20_customers:
    create_action:
      sql_file:
        source: top_20_customers.sql

Doesn't that look familiar? Can we do better? The answer is: yes! Let's start by creating a template query in the sql/top_performers.sql file with some placeholders:

sql/top_performers.sql

SELECT *
FROM "{{table}}"
ORDER BY "{{top_by}}" DESC
FETCH FIRST {{max_limit}} ROWS ONLY

How do we use it? Let’s refactor our table definitions:

specs/spec.yaml

tables:
  # ...
  top_10_countries:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: sales_by_country
          max_limit: 10
          top_by: "Sales"
  # ...
  top_20_customers:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: customers_total_purchase
          max_limit: 20
          top_by: "CustomerSales"

Napkin uses the Mustache template language to substitute variables written in double curly braces with the values provided in the Spec file. This makes it possible to reuse queries while keeping the configuration in the Spec.
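
For example, with the vars given for top_10_countries, the template renders to a query equivalent to our original hand-written one:

sql

SELECT *
FROM "sales_by_country"
ORDER BY "Sales" DESC
FETCH FIRST 10 ROWS ONLY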

We can now run our spec again, and all the new tables will be created for us. If we run napkin dump again, the dependency graph will be updated:

DAG generated by Napkin

Garbage in, garbage out: validating data

So far, we have been able to verify manually that the results look good. However, as the spec grows and the source data is constantly updated by upstream services, this becomes a tedious and error-prone job. We would rather automate the process to some extent. Napkin provides a way to run assertions on the input data and on the results of each table. Let's state a few invariants we expect to always hold for the data we have computed:

  • sales_by_country:
    • Country column should be unique,
    • should have more than 20 rows;
  • top_10_countries:
    • Country column should be unique,
    • should have exactly 10 rows;
  • top_10_countries_customers:
    • Country column should be unique,
    • should have exactly 10 rows,
    • NumCustomers column should have all values greater than zero;
  • customers_total_purchase:
    • CustomerSales column should have all values greater than zero,
    • CustomerId column should be unique;
  • top_20_customers:
    • should have exactly 20 rows,
    • CustomerId column should be unique.

Let's start with the checks for the sales_by_country table; we will use the assert_unique and assert_count assertions. Please refer to the User Manual for reference on other hooks.

specs/spec.yaml

tables:
  # ...
  sales_by_country:
    # create_action: ...
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          greater_than: 20

The invariant on customers_total_purchase that the CustomerSales column has all values greater than zero is equivalent to requiring that the minimum of that column is greater than zero; it can be implemented with the assert_expression hook:

specs/spec.yaml

tables:
  # ...
  customers_total_purchase:
    # create_action: ...
    post_hooks:
      # ...
      - assert_expression:
          expression: MIN("CustomerSales") > 0
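
Conceptually, the hook evaluates the expression against the table, roughly like:

sql

SELECT MIN("CustomerSales") > 0 FROM customers_total_purchase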

We can also check invariants on input tables. We recommend using pre_hooks for these, as they run prior to the table's execution. For example, we may expect that all invoices have a non-negative total:

specs/spec.yaml

tables:
  # ...
  sales_by_country:
    # create_action: ...
    # post_hooks: ...
    pre_hooks:
      - assert_expression:
          table: "Invoice"
          expression: MIN("Total") >= 0

Note that in this case we had to specify the table name explicitly: the table argument defaults to the table being defined only for post hooks. After all checks have been implemented, our spec should look like this:

specs/spec.yaml

# yaml-language-server: $schema=https://napkin-public-storage-bucket.s3.us-east-1.amazonaws.com/schema/schema.json

app_name: chinook-sqlite
# Connect to database:
backend: Sqlite
db_url: sqlite:chinook-sql.db

tables:
  sales_by_country:
    create_action:
      sql_file:
        source: sales_by_country.sql
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          greater_than: 20
    pre_hooks:
      - assert_expression:
          table: "Invoice"
          expression: MIN("Total") >= 0

  top_10_countries:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: sales_by_country
          max_limit: 10
          top_by: "Sales"
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          equal: 10

  top_10_countries_customers:
    create_action:
      sql_file:
        source: top_10_countries_customers.sql
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          equal: 10
      - assert_expression:
          expression: MIN("NumCustomers") > 0

  customers_total_purchase:
    create_action:
      sql_file:
        source: customers_total_purchase.sql
    post_hooks:
      - assert_unique:
          columns: ["CustomerId"]
      - assert_expression:
          expression: MIN("CustomerSales") > 0

  top_20_customers:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: customers_total_purchase
          max_limit: 20
          top_by: "CustomerSales"
    post_hooks:
      - assert_unique:
          columns: ["CustomerId"]
      - assert_count:
          equal: 20
      - assert_expression:
          expression: MIN("CustomerSales") > 0

When we run our spec, Napkin reports the status of all checks. By default, if any check fails, the spec execution is aborted. We can downgrade a failing assertion to a warning by adding the on_failure: warn_only attribute, as sketched below.
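
A minimal sketch (assuming the attribute is set on the individual assertion):

specs/spec.yaml

    post_hooks:
      - assert_count:
          greater_than: 20
          on_failure: warn_only

With all assertions in place, a successful run summary looks like this: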

output

[2021-12-29 12:02:44][Info] Execution completed. Please see below for a summary.
[2021-12-29 12:02:44][Info] ----------------------------------------------------------
[2021-12-29 12:02:44][Info] Table "customers_total_purchase" ran in 0.15: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion customers_total_purchase / Post / 1 / assertUniqueBy: customers_total_purchase: OK
[2021-12-29 12:02:44][Info] Assertion customers_total_purchase / Post / 2 / assertExpression: customers_total_purchase: OK
[2021-12-29 12:02:44][Info] Table "sales_by_country" ran in 0.19: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion sales_by_country / Pre / 1 / assertExpression: Invoice: OK
[2021-12-29 12:02:44][Info] Assertion sales_by_country / Post / 1 / assertUniqueBy: sales_by_country: OK
[2021-12-29 12:02:44][Info] Assertion sales_by_country / Post / 2 / assertCountConst: row count of "sales_by_country" should be greater than 20: OK
[2021-12-29 12:02:44][Info] Table "top_10_countries" ran in 0.16: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion top_10_countries / Post / 1 / assertUniqueBy: top_10_countries: OK
[2021-12-29 12:02:44][Info] Assertion top_10_countries / Post / 2 / assertCountConst: row count of "top_10_countries" should be equal to 10: OK
[2021-12-29 12:02:44][Info] Table "top_10_countries_customers" ran in 0.09: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion top_10_countries_customers / Post / 1 / assertUniqueBy: top_10_countries_customers: OK
[2021-12-29 12:02:44][Info] Assertion top_10_countries_customers / Post / 2 / assertCountConst: row count of "top_10_countries_customers" should be equal to 10: OK
[2021-12-29 12:02:44][Info] Assertion top_10_countries_customers / Post / 3 / assertExpression: top_10_countries_customers: OK
[2021-12-29 12:02:44][Info] Table "top_20_customers" ran in 0.13: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion top_20_customers / Post / 1 / assertUniqueBy: top_20_customers: OK
[2021-12-29 12:02:44][Info] Assertion top_20_customers / Post / 2 / assertCountConst: row count of "top_20_customers" should be equal to 20: OK
[2021-12-29 12:02:44][Info] Assertion top_20_customers / Post / 3 / assertExpression: top_20_customers: OK

Napkin remembers

Napkin internally keeps track of your application state. You might find it helpful to spend some time with:

shell

napkin history list --spec-file specs/spec.yaml

The data is stored by default in the data/ directory in a SQLite database. Alternatively, it may be stored in a Postgres database (see the metadata_url field in the YAML reference). Data from historical runs is used for planning subsequent executions. As always, napkin history --help provides a comprehensive list of history parameters.
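
A minimal sketch of switching the metadata store to Postgres (hypothetical connection string):

specs/spec.yaml

metadata_url: postgresql://user@host/napkin_metadata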

Terminal Interface

If napkin run is started with the --show-progress flag, a navigable view of the current progress and basic statistics is shown.
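
For example:

shell

napkin run --show-progress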

  • Use the arrow keys or mouse to navigate the tables.
  • Hit Enter to see more details on a specific table.
  • Hit s to switch between preprocessed and postprocessed table names (which can be affected, for example, by a table renamer).
  • Hit q to quit or to cancel the current run.

Deploying to production with PostgreSQL

We started this tutorial with the SQLite backend for convenience. In this optional section, we'll change the backend to PostgreSQL, which will let us run our spec in a more realistic, production-ready setup.

Setting up the database

Devcontainer

If you are using the devcontainer, all the necessary setup for this tutorial has already been done.

Otherwise, configure PGHOST, PGUSER, PGPASSWORD, and PGDATABASE to point to the appropriate PostgreSQL server. Setting up PostgreSQL itself is beyond the scope of this tutorial.
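
For example (hypothetical values; adjust them to your server):

shell

export PGHOST=localhost
export PGUSER=postgres
export PGPASSWORD=secret
export PGDATABASE=mydb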

For production usage, it's convenient to separate raw and derived data. We will use PostgreSQL schemas for this: a chinook schema for the raw data, and development and production schemas for the derived data. Then we will load the Chinook database into the chinook schema. For more details, please refer to our Multi-environment pipelines in a team setting tutorial.

shell

curl -L -O https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_PostgreSql.sql
psql <<-EOSQL
  CREATE SCHEMA chinook;
  CREATE SCHEMA development;
  CREATE SCHEMA production;
EOSQL
PGOPTIONS='--search_path=chinook' PGCLIENTENCODING=iso-8859-1 psql -1 -q -f Chinook_PostgreSql.sql

Changing the backend

First, we need to change the backend and db_url in our spec file; we'll keep the PostgreSQL variant in a separate specs/spec-postgres.yaml:

specs/spec-postgres.yaml

# Connect to database:
backend: Postgres
db_url: postgresql://

Next, we need to specify table renaming rules, so we don't have to spell out the chinook, development, and production schemas explicitly in each query.

specs/spec-postgres.yaml

preprocessors:
  - table_namespace:
      value: development
      scope: managed
      override_with_arg: environment

  - table_namespace:
      value: chinook
      scope: unmanaged

With this configuration, Napkin will assume that all unmanaged tables are located in the chinook schema, and all managed tables will be created in the development schema by default. This can later be overridden with the --arg environment=production CLI argument.

shell

napkin run --spec-file specs/spec-postgres.yaml

output

[2022-01-14 14:43:43][Info] Determining tables for update...
[2022-01-14 14:43:43][Info] Forced tables: none
[2022-01-14 14:43:43][Info] Skipped tables: none
[2022-01-14 14:43:43][Info] Unmanaged tables that will be used as an input in this run:
- chinook.Customer
- chinook.Invoice

[2022-01-14 14:43:43][Info] Managed tables that will be used as an input in this run, but will not be updated: none
[2022-01-14 14:43:43][Info] Managed tables that will be updated in this run:
- customers_total_purchase -> development.customers_total_purchase
- sales_by_country -> development.sales_by_country
- top_10_countries -> development.top_10_countries
- top_10_countries_customers -> development.top_10_countries_customers
- top_20_customers -> development.top_20_customers

[2022-01-14 14:43:43][Info] Estimated runtime: 0s
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] TableSpec "development.sales_by_country" server stats: 24 rows affected
[2022-01-14 14:43:43][Info] TableSpec "development.customers_total_purchase" server stats: 59 rows affected
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] TableSpec "development.top_10_countries" server stats: 10 rows affected
[2022-01-14 14:43:43][Info] TableSpec "development.top_20_customers" server stats: 20 rows affected
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] TableSpec "development.top_10_countries_customers" server stats: 10 rows affected
[2022-01-14 14:43:43][Info] Execution completed. Please see below for a summary.
[2022-01-14 14:43:43][Info] ----------------------------------------------------------
[2022-01-14 14:43:43][Info] Table "development"."customers_total_purchase" ran in 0.07: 59 rows affected
[2022-01-14 14:43:43][Info] Assertion development.customers_total_purchase / Post / 1 / assertUniqueBy: customers_total_purchase: OK
[2022-01-14 14:43:43][Info] Assertion development.customers_total_purchase / Post / 2 / assertExpression: customers_total_purchase: OK
[2022-01-14 14:43:43][Info] Table "development"."sales_by_country" ran in 0.06: 24 rows affected
[2022-01-14 14:43:43][Info] Assertion development.sales_by_country / Pre / 1 / assertExpression: Invoice: OK
[2022-01-14 14:43:43][Info] Assertion development.sales_by_country / Post / 1 / assertUniqueBy: sales_by_country: OK
[2022-01-14 14:43:43][Info] Assertion development.sales_by_country / Post / 2 / assertCountConst: row count of "sales_by_country" should be greater than 20: OK
[2022-01-14 14:43:43][Info] Table "development"."top_10_countries" ran in 0.06: 10 rows affected
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries / Post / 1 / assertUniqueBy: top_10_countries: OK
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries / Post / 2 / assertCountConst: row count of "top_10_countries" should be equal to 10: OK
[2022-01-14 14:43:43][Info] Table "development"."top_10_countries_customers" ran in 0.04: 10 rows affected
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries_customers / Post / 1 / assertUniqueBy: top_10_countries_customers: OK
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries_customers / Post / 2 / assertCountConst: row count of "top_10_countries_customers" should be equal to 10: OK
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries_customers / Post / 3 / assertExpression: top_10_countries_customers: OK
[2022-01-14 14:43:43][Info] Table "development"."top_20_customers" ran in 0.06: 20 rows affected
[2022-01-14 14:43:43][Info] Assertion development.top_20_customers / Post / 1 / assertUniqueBy: top_20_customers: OK
[2022-01-14 14:43:43][Info] Assertion development.top_20_customers / Post / 2 / assertCountConst: row count of "top_20_customers" should be equal to 20: OK
[2022-01-14 14:43:43][Info] Assertion development.top_20_customers / Post / 3 / assertExpression: top_20_customers: OK
[2022-01-14 14:43:43][Info] Run complete. Total cost: 123 rows affected

We can also perform a dry run against the production environment. We can observe in the logs that the tables will be renamed according to the specified environment.

shell

napkin run --spec-file specs/spec-postgres.yaml --arg environment=production --dry-run

output

[2022-01-14 14:46:32][Info] Determining tables for update...
[2022-01-14 14:46:32][Warning] Dry run, table specs and hooks will not be executed.
[2022-01-14 14:46:32][Info] Forced tables: none
[2022-01-14 14:46:32][Info] Skipped tables: none
[2022-01-14 14:46:32][Info] Unmanaged tables that will be used as an input in this run:
- chinook.Customer
- chinook.Invoice

[2022-01-14 14:46:32][Info] Managed tables that will be used as an input in this run, but will not be updated: none
[2022-01-14 14:46:32][Info] Managed tables that will be updated in this run:
- customers_total_purchase -> production.customers_total_purchase
- sales_by_country -> production.sales_by_country
- top_10_countries -> production.top_10_countries
- top_10_countries_customers -> production.top_10_countries_customers
- top_20_customers -> production.top_20_customers

[2022-01-14 14:46:32][Info] Estimated runtime: 0s
[2022-01-14 14:46:32][Warning] Dry run, aborting execution.

Summary

In this tutorial we:

  • bootstrapped a data pipeline (a Spec),
  • declaratively managed dependencies among Spec components,
  • used assertions to verify that Spec execution meets the given criteria,
  • executed our Spec (our data pipeline),
  • stored our work,
  • changed the data pipeline's backend database.

Please stay tuned for future Napkin tutorials.