Napkin fundamentals tutorial
This is the first in a series of tutorials to get you up and running with Napkin development. We do this through the example of building a simple Napkin data pipeline. The goal is that by the end of this tutorial, you’ll have a good idea of:
- How to start a Napkin project
- How to modify, verify, and execute Napkin projects
- How to automatically test data
- How Napkin helps you build a real-life application quickly
With that, let’s get started!
- Getting started
- Bootstrapping a Napkin project
- Setting DB connection
- The first queries
- Best markets
- Discovering dependencies
- DRY: Don’t Repeat Yourself
- Garbage in, garbage out: validating data
- Napkin remembers
- Terminal Interface
- Deploying to production with PostgreSQL
- Summary
Getting started
There are a few preliminaries for this tutorial. First, you need to have Napkin ready. While Napkin is a CLI-based tool, we recommend using Visual Studio Code with the Remote Containers extension, as it delivers the best experience. In addition to the instructions, we provide a completed project as a Git repository with a Devcontainer setup for an easy start, as it includes all the extra necessary setup.
In this tutorial, we will use the Chinook database, which is commonly used as an example dataset. The Chinook database represents a digital media store, with tables for artists, albums, media tracks, invoices, and customers. For the sake of simplicity, we will use the SQLite backend, so no additional database setup is required. In the final exercises, we will deploy our pipeline to a Postgres-based, production-grade environment. We will demonstrate Napkin’s features with simple queries to give the reader a sense of the benefits of using Napkin for more complex data pipelines.
Setting up with devcontainer
First, you need to install Docker, Visual Studio Code, and the Remote Containers extension. Note that VSCodium is not supported. Next, clone the Git repository, open it with VSCode, and select the Reopen in Container command to start the Docker-based development environment. You may be asked to trust the repository. The devcontainer contains both Sqlite and Postgres databases populated with an example dataset, as well as predefined VSCode tasks.
Setting up without devcontainer
We assume you’re running Linux or macOS. You need to install Napkin. If you are running Linux, you also need to install SQLite for this tutorial. For the final exercises, you will also need access to a Postgres server. Windows users need to use the Docker or Devcontainer setup, which provides all the necessary tools.
Let’s download the database file from GitHub with the following command.
shell
curl -L -o chinook-sql.db https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
To verify that the Chinook database and SQLite work correctly, run:
shell
sqlite3 chinook-sql.db "SELECT * FROM Artist"
With the preliminaries out of the way, we are ready to proceed with bootstrapping our first Napkin data pipeline.
Bootstrapping a Napkin project
The most intuitive way of initiating a Napkin project is to use the init CLI command: napkin init creates a new project skeleton with Napkin’s default template.
shell
mkdir napkinFundamentalTutorial
cd napkinFundamentalTutorial
napkin init --project-name chinook-analytics
The corresponding directory structure created by Napkin’s init command for the chinook-analytics project is:
shell
chinook-analytics
├── hie.yaml
├── README.md
├── specs
│   └── spec.yaml
└── sql
    └── example.sql
Napkin’s CLI is the point of entry for all typical workflows of data engineering and pipeline curation. napkin --help provides a comprehensive list of available Napkin commands. Commands may take additional parameters; to get help with these, use napkin command --help. For instance, to get help with the init command, use napkin init --help. Please refer to the CLI reference page for further information.
The data pipeline we will be working on is specified in the specs/spec.yaml file; let’s see what it looks like:
specs/spec.yaml
# yaml-language-server: $schema=https://napkin-public-storage-bucket.s3.us-east-1.amazonaws.com/schema/schema.json
# Connect to database:
backend: Postgres
db_url: postgresql://user@host/database
# set your password by providing it using --uri or set PGPASSWORD variable
# backend: BigQuery
# db_url: bigquery://bigquery.googleapis.com/project_name?dataset=default_dataset_name
# Run `napkin auth` to obtain authentication token
tables:
  example:
    create_action:
      sql_file:
        source: example.sql
First, we enable YAML schema support for IDE plugins; then we choose a backend and provide DB connection details; finally, we’ll replace the example Napkin-managed table.
Before moving to the next section, let’s validate our project using Napkin’s validate command:
shell
napkin validate
Let’s keep validating our spec as we move through the tutorial. You can also conveniently run validation in interactive mode, so that Napkin revalidates the spec whenever the Spec file or queries change.
shell
napkin validate --interactive
By default, Napkin will use specs/spec.yaml relative to the current directory for all commands. You can explicitly select the spec with the --spec-file argument.
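For example:
shell
napkin validate --spec-file specs/spec.yaml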
Napkin follows a common paradigm in which source tables are never mutated by Napkin. Instead, Napkin creates a series of dependent, managed tables based on the SQL files in the sql/ folder. Internally, Napkin uses specs/spec.yaml to build a DAG that describes the data dependencies between managed and unmanaged tables, and uses this DAG to drive the execution plan. In addition to the DAG, the Napkin Spec contains backend connection details for the supported target databases.
Setting DB connection
Before we can do any operations on the database, we need to configure the database connection. In our case, we need to point Napkin to the SQLite file. We can do this by replacing backend and db_url in the specs/spec.yaml file.
specs/spec.yaml
# Connect to Sqlite database:
backend: Sqlite
db_url: sqlite:chinook-sql.db
Let’s also remove the definition of the example table from our Spec, as well as the sql/example.sql file it references, as we no longer need them. Now we can proceed with running our queries.
The first queries
Let’s start exploring our dataset by calculating a summary of sales in each country. First, we need to create a SQL query and store it in the sql/sales_by_country.sql file.
sql/sales_by_country.sql
SELECT
  "BillingCountry" AS "Country",
  SUM("Total") AS "Sales",
  COUNT(*) AS "NumInvoices"
FROM "Invoice"
GROUP BY "BillingCountry"
Now we need to link our SQL file to the sales_by_country table we’d like to create. Let’s add the following to the specs/spec.yaml file.
specs/spec.yaml
tables:
  sales_by_country:
    create_action:
      sql_file:
        source: sales_by_country.sql
By convention, Napkin will load SQL files from ../sql relative to the Spec file. This can be overridden with sql_dir: some_other_dir in the Spec file.
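For instance, a spec that keeps its queries in a sibling queries/ directory could set the following (a sketch; the directory name is illustrative, and we assume the path is resolved relative to the Spec file, like the default):
specs/spec.yaml
sql_dir: ../queries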
Let’s validate the spec again with napkin validate, and then we will be ready to run it for the first time:
shell
napkin run
output
[2022-01-12 17:40:28][Info] Determining tables for update...
[2022-01-12 17:40:28][Info] Forced tables: none
[2022-01-12 17:40:28][Info] Skipped tables: none
[2022-01-12 17:40:28][Info] Unmanaged tables that will be used as an input in this run:
- Invoice
[2022-01-12 17:40:28][Info] Managed tables that will be used as an input in this run, but will not be updated: none
[2022-01-12 17:40:28][Info] Managed tables that will be updated in this run:
- sales_by_country
[2022-01-12 17:40:28][Info] Estimated runtime: 0s
[2022-01-12 17:40:28][Info] Running table hooks (Pre)
[2022-01-12 17:40:28][Info] Executing table's action
[2022-01-12 17:40:28][Info] Table's action complete.
[2022-01-12 17:40:28][Info] Running table hooks (Post)
[2022-01-12 17:40:28][Info] Table' processing complete.
[2022-01-12 17:40:28][Info] TableSpec "sales_by_country" server stats:
[2022-01-12 17:40:28][Info] Execution completed. Please see below for a summary.
[2022-01-12 17:40:28][Info] ----------------------------------------------------------
[2022-01-12 17:40:28][Info] Table "sales_by_country" ran in 0.02:
[2022-01-12 17:40:28][Info] Run complete. Total cost:
Now the sales_by_country table should exist in the database. Let’s double-check:
shell
sqlite3 chinook-sql.db "SELECT * FROM sales_by_country"
output
Argentina|37.62|7
Australia|37.62|7
Austria|42.62|7
Belgium|37.62|7
...
Best markets
We have decided to perform a more detailed analysis of the ten countries with the top sales. We expect to reuse that list over and over, and we would like to follow the DRY principle while doing so. Since we have already calculated country sales, we can use the sales_by_country table as an input. Let’s add a top_10_countries table to our Spec:
sql/top_10_countries.sql
SELECT *
FROM sales_by_country
ORDER BY "Sales" DESC
FETCH FIRST 10 ROWS ONLY
specs/spec.yaml
tables:
  # ...
  top_10_countries:
    create_action:
      sql_file:
        source: top_10_countries.sql
Next, we would like to know the number of customers in each of our top countries. Let’s add this to the Spec as well:
sql/top_10_countries_customers.sql
SELECT
  top_10_countries."Country" AS "Country",
  COUNT(*) AS "NumCustomers"
FROM top_10_countries
JOIN "Customer" ON top_10_countries."Country" = "Customer"."Country"
GROUP BY top_10_countries."Country"
specs/spec.yaml
tables:
  # ...
  top_10_countries_customers:
    create_action:
      sql_file:
        source: top_10_countries_customers.sql
Let’s run the spec again and check that everything is correct:
shell
sqlite3 chinook-sql.db "SELECT * FROM top_10_countries_customers"
output
Brazil|5
Canada|8
Chile|1
Czech Republic|2
France|5
Germany|4
India|2
Portugal|2
USA|13
United Kingdom|3
Discovering dependencies
We have added three tables to the Spec so far and successfully run them. Napkin did some heavy lifting for us: it analyzed all the queries, figured out the dependencies, and executed the queries in the correct order. Let’s explore this with the dump command, which stores all queries (after processing by Napkin), as well as the dependency graph, in the dump/ folder:
shell
napkin dump
dump
├── 1_sales_by_country.sql
├── 2_top_10_countries.sql
├── 3_top_10_countries_customers.sql
├── dependency_graph.dot
├── dependency_graph.pdf
└── MANIFEST.txt
The dependency_graph.pdf file shows the dependencies Napkin has discovered. Note that the Invoice and Customer tables are displayed differently, as they are not managed by Napkin.
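If you’d like to render the graph in another format, the dependency_graph.dot file can be processed with Graphviz (assuming Graphviz is installed; it is not part of Napkin):
shell
dot -Tpng dump/dependency_graph.dot -o dependency_graph.png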
DRY: Don’t Repeat Yourself
Let’s explore our customers, starting by aggregating total sales per customer:
sql/customers_total_purchase.sql
SELECT
  "Customer"."CustomerId" AS "CustomerId",
  "Customer"."Email" AS "Email",
  "Customer"."FirstName" AS "FirstName",
  "Customer"."LastName" AS "LastName",
  SUM("Invoice"."Total") AS "CustomerSales"
FROM "Customer"
JOIN "Invoice" ON "Customer"."CustomerId" = "Invoice"."CustomerId"
GROUP BY "Customer"."CustomerId"
specs/spec.yaml
tables:
  # ...
  customers_total_purchase:
    create_action:
      sql_file:
        source: customers_total_purchase.sql
Next, we’d like to find out our top 20 customers:
sql/top_20_customers.sql
SELECT *
FROM customers_total_purchase
ORDER BY "CustomerSales" DESC
FETCH FIRST 20 ROWS ONLY
specs/spec.yaml
tables:
  # ...
  top_20_customers:
    create_action:
      sql_file:
        source: top_20_customers.sql
Doesn’t that look familiar? Can we do better? The answer is: yes! Let’s start by creating our template query, with some placeholders, in the sql/top_performers.sql file:
sql/top_performers.sql
SELECT *
FROM "{{table}}"
ORDER BY "{{top_by}}" DESC
FETCH FIRST {{max_limit}} ROWS ONLY
How do we use it? Let’s refactor our table definitions:
specs/spec.yaml
tables:
  # ...
  top_10_countries:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: sales_by_country
          max_limit: 10
          top_by: "Sales"
  # ...
  top_20_customers:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: customers_total_purchase
          max_limit: 20
          top_by: "CustomerSales"
Napkin uses the Mustache template language to substitute the variables in curly braces with the values provided in the Spec file. This enables us to reuse queries and keep the configuration in the Spec.
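For example, with the vars given for top_10_countries above, the rendered query is equivalent to our original top_10_countries.sql:
sql
SELECT *
FROM "sales_by_country"
ORDER BY "Sales" DESC
FETCH FIRST 10 ROWS ONLY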
We can now run our spec again, and all the new tables will be created for us. If we run dump again, the dependency graph will be updated.
Garbage in, garbage out: validating data
So far, we have been able to verify manually that the results look good. However, as the spec grows and upstream services constantly update the source data, this becomes a tedious and error-prone job. We would rather automate the process, at least to some extent. Napkin provides a way to run assertions on input data and on the results of each table. Let’s state a few invariants we expect to always hold for the data we have computed:
- sales_by_country:
  - Country column should be unique,
  - should have more than 20 rows;
- top_10_countries:
  - Country column should be unique,
  - should have exactly 10 rows;
- top_10_countries_customers:
  - Country column should be unique,
  - should have exactly 10 rows,
  - NumCustomers column should have all values greater than zero;
- customers_total_purchase:
  - CustomerSales column should have all values greater than zero,
  - CustomerId column should be unique;
- top_20_customers:
  - should have exactly 20 rows,
  - CustomerId column should be unique.
Let’s start with the checks for the sales_by_country table; we will use the assert_unique and assert_count assertions. Please refer to the User Manual for a reference on the other hooks.
specs/spec.yaml
tables:
  # ...
  sales_by_country:
    # create_action: ...
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          greater_than: 20
The invariant on customers_total_purchase that the CustomerSales column has all values greater than zero is equivalent to requiring that the minimum of that column be greater than zero, and can be implemented with the assert_expression hook:
specs/spec.yaml
tables:
  # ...
  customers_total_purchase:
    # create_action: ...
    post_hooks:
      # ...
      - assert_expression:
          expression: MIN("CustomerSales") > 0
We can also check invariants on input tables. We recommend using pre_hooks for these, as they run prior to the table’s action. For example, we may expect all invoices to have a non-negative total:
specs/spec.yaml
tables:
  # ...
  sales_by_country:
    # create_action: ...
    # post_hooks: ...
    pre_hooks:
      - assert_expression:
          table: "Invoice"
          expression: MIN("Total") >= 0
Note that in this case we had to specify the table name explicitly: the table argument defaults to the table being defined for post hooks only. After all the checks have been implemented, our spec should look like this:
specs/spec.yaml
# yaml-language-server: $schema=https://napkin-public-storage-bucket.s3.us-east-1.amazonaws.com/schema/schema.json
app_name: chinook-sqlite
# Connect to database:
backend: Sqlite
db_url: sqlite:chinook-sql.db
tables:
  sales_by_country:
    create_action:
      sql_file:
        source: sales_by_country.sql
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          greater_than: 20
    pre_hooks:
      - assert_expression:
          table: "Invoice"
          expression: MIN("Total") >= 0
  top_10_countries:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: sales_by_country
          max_limit: 10
          top_by: "Sales"
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          equal: 10
  top_10_countries_customers:
    create_action:
      sql_file:
        source: top_10_countries_customers.sql
    post_hooks:
      - assert_unique:
          columns: ["Country"]
      - assert_count:
          equal: 10
      - assert_expression:
          expression: MIN("NumCustomers") > 0
  customers_total_purchase:
    create_action:
      sql_file:
        source: customers_total_purchase.sql
    post_hooks:
      - assert_unique:
          columns: ["CustomerId"]
      - assert_expression:
          expression: MIN("CustomerSales") > 0
  top_20_customers:
    create_action:
      sql_file:
        source: top_performers.sql
        vars:
          table: customers_total_purchase
          max_limit: 20
          top_by: "CustomerSales"
    post_hooks:
      - assert_unique:
          columns: ["CustomerId"]
      - assert_count:
          equal: 20
      - assert_expression:
          expression: MIN("CustomerSales") > 0
By default, if any assertion fails, the spec execution is aborted. We can make an assertion result in a warning only by adding the on_failure: warn_only attribute.
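For example, to demote the row-count check on sales_by_country to a warning (a sketch; we assume on_failure sits on the assertion itself, alongside its other parameters):
specs/spec.yaml
tables:
  # ...
  sales_by_country:
    # create_action: ...
    post_hooks:
      - assert_count:
          greater_than: 20
          on_failure: warn_only
When we run our spec, Napkin reports the status of every check in the summary: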
output
[2021-12-29 12:02:44][Info] Execution completed. Please see below for a summary.
[2021-12-29 12:02:44][Info] ----------------------------------------------------------
[2021-12-29 12:02:44][Info] Table "customers_total_purchase" ran in 0.15: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion customers_total_purchase / Post / 1 / assertUniqueBy: customers_total_purchase: OK
[2021-12-29 12:02:44][Info] Assertion customers_total_purchase / Post / 2 / assertExpression: customers_total_purchase: OK
[2021-12-29 12:02:44][Info] Table "sales_by_country" ran in 0.19: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion sales_by_country / Pre / 1 / assertExpression: Invoice: OK
[2021-12-29 12:02:44][Info] Assertion sales_by_country / Post / 1 / assertUniqueBy: sales_by_country: OK
[2021-12-29 12:02:44][Info] Assertion sales_by_country / Post / 2 / assertCountConst: row count of "sales_by_country" should be greater than 20: OK
[2021-12-29 12:02:44][Info] Table "top_10_countries" ran in 0.16: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion top_10_countries / Post / 1 / assertUniqueBy: top_10_countries: OK
[2021-12-29 12:02:44][Info] Assertion top_10_countries / Post / 2 / assertCountConst: row count of "top_10_countries" should be equal to 10: OK
[2021-12-29 12:02:44][Info] Table "top_10_countries_customers" ran in 0.09: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion top_10_countries_customers / Post / 1 / assertUniqueBy: top_10_countries_customers: OK
[2021-12-29 12:02:44][Info] Assertion top_10_countries_customers / Post / 2 / assertCountConst: row count of "top_10_countries_customers" should be equal to 10: OK
[2021-12-29 12:02:44][Info] Assertion top_10_countries_customers / Post / 3 / assertExpression: top_10_countries_customers: OK
[2021-12-29 12:02:44][Info] Table "top_20_customers" ran in 0.13: 0 rows affected
[2021-12-29 12:02:44][Info] Assertion top_20_customers / Post / 1 / assertUniqueBy: top_20_customers: OK
[2021-12-29 12:02:44][Info] Assertion top_20_customers / Post / 2 / assertCountConst: row count of "top_20_customers" should be equal to 20: OK
[2021-12-29 12:02:44][Info] Assertion top_20_customers / Post / 3 / assertExpression: top_20_customers: OK
Napkin remembers
Napkin internally keeps track of your application state. You might find it helpful to spend some time with:
shell
napkin history list --spec-file specs/spec.yaml
By default, the data is stored in a Sqlite database in the data/ directory. Alternatively, it may be stored in a Postgres database (see the metadata_url field in the YAML reference). Data from historical runs is used for planning subsequent executions. As always, napkin history --help provides a comprehensive list of history parameters.
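For instance, to keep the history in Postgres instead (a sketch; we assume metadata_url is a top-level Spec field that takes a connection string in the same format as db_url, and the database name here is hypothetical):
specs/spec.yaml
metadata_url: postgresql://user@host/napkin_metadata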
Terminal Interface
If napkin run is started with the --show-progress flag, a navigable view of the current Napkin progress and basic statistics is shown, for example:
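shell
napkin run --show-progress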
- Use the arrow keys or mouse to navigate the tables.
- Hit Enter to see more details on a specific table.
- Hit s to switch between preprocessed and postprocessed table names (which can be affected, for example, by a table renamer).
- Hit q to quit or to cancel the current run.
Deploying to production with PostgreSQL
We started this tutorial with the SQLite backend for convenience. In this optional section, we’ll change the backend to PostgreSQL, which will let us run our spec in a more realistic, production-ready setup.
Setting up the database
Devcontainer
If you are using the devcontainer, all the necessary setup for this tutorial has already been done.
First, configure the PGHOST, PGUSER, PGPASSWORD, and PGDATABASE environment variables to point to the appropriate PostgreSQL server. Setting up PostgreSQL itself is beyond the scope of this tutorial.
For production usage, it’s convenient to separate raw and derived data. We will use PostgreSQL schemas for this: a chinook schema for the raw data, and development and production schemas for derived data. Then we will load the Chinook database into the chinook schema. For more details, please refer to our Multi-environment pipelines in a team setting tutorial.
shell
curl -L -O https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_PostgreSql.sql
psql <<-EOSQL
CREATE SCHEMA chinook;
CREATE SCHEMA development;
CREATE SCHEMA production;
EOSQL
PGOPTIONS='--search_path=chinook' PGCLIENTENCODING=iso-8859-1 psql -1 -q -f Chinook_PostgreSql.sql
Changing the backend
First, we need to change backend and db_url in our spec file:
specs/spec-postgres.yaml
# Connect to database:
backend: Postgres
db_url: postgresql://
Next, we need to specify table-renaming rules so that we don’t have to reference the chinook, development, and production schemas explicitly in each query.
specs/spec-postgres.yaml
preprocessors:
  - table_namespace:
      value: development
      scope: managed
      override_with_arg: environment
  - table_namespace:
      value: chinook
      scope: unmanaged
With this configuration, Napkin will assume that all unmanaged tables are located in the chinook schema, and all managed tables will be created in the development schema by default. The latter can later be overridden with the --arg environment=production CLI argument.
shell
napkin run --spec-file specs/spec-postgres.yaml
output
[2022-01-14 14:43:43][Info] Determining tables for update...
[2022-01-14 14:43:43][Info] Forced tables: none
[2022-01-14 14:43:43][Info] Skipped tables: none
[2022-01-14 14:43:43][Info] Unmanaged tables that will be used as an input in this run:
- chinook.Customer
- chinook.Invoice
[2022-01-14 14:43:43][Info] Managed tables that will be used as an input in this run, but will not be updated: none
[2022-01-14 14:43:43][Info] Managed tables that will be updated in this run:
- customers_total_purchase -> development.customers_total_purchase
- sales_by_country -> development.sales_by_country
- top_10_countries -> development.top_10_countries
- top_10_countries_customers -> development.top_10_countries_customers
- top_20_customers -> development.top_20_customers
[2022-01-14 14:43:43][Info] Estimated runtime: 0s
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] TableSpec "development.sales_by_country" server stats: 24 rows affected
[2022-01-14 14:43:43][Info] TableSpec "development.customers_total_purchase" server stats: 59 rows affected
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] TableSpec "development.top_10_countries" server stats: 10 rows affected
[2022-01-14 14:43:43][Info] TableSpec "development.top_20_customers" server stats: 20 rows affected
[2022-01-14 14:43:43][Info] Running table hooks (Pre)
[2022-01-14 14:43:43][Info] Executing table's action
[2022-01-14 14:43:43][Info] Table's action complete.
[2022-01-14 14:43:43][Info] Running table hooks (Post)
[2022-01-14 14:43:43][Info] Table' processing complete.
[2022-01-14 14:43:43][Info] TableSpec "development.top_10_countries_customers" server stats: 10 rows affected
[2022-01-14 14:43:43][Info] Execution completed. Please see below for a summary.
[2022-01-14 14:43:43][Info] ----------------------------------------------------------
[2022-01-14 14:43:43][Info] Table "development"."customers_total_purchase" ran in 0.07: 59 rows affected
[2022-01-14 14:43:43][Info] Assertion development.customers_total_purchase / Post / 1 / assertUniqueBy: customers_total_purchase: OK
[2022-01-14 14:43:43][Info] Assertion development.customers_total_purchase / Post / 2 / assertExpression: customers_total_purchase: OK
[2022-01-14 14:43:43][Info] Table "development"."sales_by_country" ran in 0.06: 24 rows affected
[2022-01-14 14:43:43][Info] Assertion development.sales_by_country / Pre / 1 / assertExpression: Invoice: OK
[2022-01-14 14:43:43][Info] Assertion development.sales_by_country / Post / 1 / assertUniqueBy: sales_by_country: OK
[2022-01-14 14:43:43][Info] Assertion development.sales_by_country / Post / 2 / assertCountConst: row count of "sales_by_country" should be greater than 20: OK
[2022-01-14 14:43:43][Info] Table "development"."top_10_countries" ran in 0.06: 10 rows affected
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries / Post / 1 / assertUniqueBy: top_10_countries: OK
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries / Post / 2 / assertCountConst: row count of "top_10_countries" should be equal to 10: OK
[2022-01-14 14:43:43][Info] Table "development"."top_10_countries_customers" ran in 0.04: 10 rows affected
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries_customers / Post / 1 / assertUniqueBy: top_10_countries_customers: OK
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries_customers / Post / 2 / assertCountConst: row count of "top_10_countries_customers" should be equal to 10: OK
[2022-01-14 14:43:43][Info] Assertion development.top_10_countries_customers / Post / 3 / assertExpression: top_10_countries_customers: OK
[2022-01-14 14:43:43][Info] Table "development"."top_20_customers" ran in 0.06: 20 rows affected
[2022-01-14 14:43:43][Info] Assertion development.top_20_customers / Post / 1 / assertUniqueBy: top_20_customers: OK
[2022-01-14 14:43:43][Info] Assertion development.top_20_customers / Post / 2 / assertCountConst: row count of "top_20_customers" should be equal to 20: OK
[2022-01-14 14:43:43][Info] Assertion development.top_20_customers / Post / 3 / assertExpression: top_20_customers: OK
[2022-01-14 14:43:43][Info] Run complete. Total cost: 123 rows affected
We can also perform a dry run against the production environment. We can observe in the logs that the tables will be renamed according to the environment specified.
shell
napkin run --spec-file specs/spec-postgres.yaml --arg environment=production --dry-run
output
[2022-01-14 14:46:32][Info] Determining tables for update...
[2022-01-14 14:46:32][Warning] Dry run, table specs and hooks will not be executed.
[2022-01-14 14:46:32][Info] Forced tables: none
[2022-01-14 14:46:32][Info] Skipped tables: none
[2022-01-14 14:46:32][Info] Unmanaged tables that will be used as an input in this run:
- chinook.Customer
- chinook.Invoice
[2022-01-14 14:46:32][Info] Managed tables that will be used as an input in this run, but will not be updated: none
[2022-01-14 14:46:32][Info] Managed tables that will be updated in this run:
- customers_total_purchase -> production.customers_total_purchase
- sales_by_country -> production.sales_by_country
- top_10_countries -> production.top_10_countries
- top_10_countries_customers -> production.top_10_countries_customers
- top_20_customers -> production.top_20_customers
[2022-01-14 14:46:32][Info] Estimated runtime: 0s
[2022-01-14 14:46:32][Warning] Dry run, aborting execution.
Summary
In this tutorial we:
- bootstrapped a data pipeline – a Spec,
- declaratively managed dependencies among Spec components,
- used assertions to verify that Spec execution meets the given criteria,
- executed our Spec, the data pipeline,
- stored our work,
- changed the data pipeline's backend database.
Please stay tuned for future Napkin tutorials.