- What is Napkin
-
Napkin’s Philosophy
- Base as much data compute as possible on SQL
- Do as much compute as possible on modern analytics DBs like BigQuery/Redshift/Snowflake
- Abstract and reuse complex transformations where possible
- Data pipelines should be declarative and managed on Git
- Data pipelines should be regenerative
- Data pipeline dev should be lightweight on bare laptops
- Doctrine of extreme convenience
- Napkin’s Benefits
- Napkin’s Future
- Next Steps
What is Napkin
Napkin is a command line application that executes data pipelines of all sizes, backed by a feature-rich Haskell library offering programmatic freedom. It’s lightweight, offers a quick start for new projects and yet scales to massive data pipelines with powerful meta-programming possibilities.
Napkin has a broad vision in making life easier for data scientists and engineers, encapsulating a large portion of the data engineering landscape. It therefore bundles several key features together:
-
A consumer-grade Command Line Interface (CLI) that acts as the single point of entry for all typical workflows of data engineering and pipeline curation. The
napkin
app can refresh entire data pipelines, re-create individual tables, validate/typecheck pipelines in seconds, export dependency graphs and more. -
A multi-backend (w.g. BigQuery and Redshift) database runtime environment that provides for all key capabilities in executing a modern data pipeline, including interacting the database (see what’s there, query tables, create/recreate/update tables, etc.), performing runtime unit-tests/assertions, logging, timing and interacting with the outside world.
-
A built-in DAG orchestrator that can automatically detect all the dependency relationships in a data pipeline (e.g. 30+ tables) and perform the pipeline updates in the correct order. Data pipelines are called “Spec”s in napkin and ship with all batteries included: Ability to rewrite table destinations into different schemas/datasets for different environments (e.g. devel vs. prod), mass-prefixing/renaming tables, setting different “Refresh Strategies” for each table (e.g. update daily vs. only update when missing), a wide range of data unit-tests (e.g. table must be unique by columns X+Y) that are automatically performed each time the table is updated.
-
For the power user, a SQL wrapper DSL in Haskell that stays as close as possible to SQL, without any intermediary object or relational mappings. This DSL looks almost like regular SQL, but allows sophisticated programmatic manipulation and composition of SQL queries and statements. Napkin can parse regular SQL into this internal DSL, perform any desired manipulations and render it back out as regular SQL.
-
A sophisticated SQL meta-programming environment that accelerates modern data engineering efforts. Napkin users can interweave several options for crafting SQL as they see fit, even in the same file. These options include:
- Writing plain SQL files without any low-grade templating noise. Napkin will still auto-detect all dependencies and make the pipeline “just work”.
- Using lightweight variable substitutions in
.sql
files via Mustache templates. - Using sophisticated
{{#sExp}} ... {{/sExp}}
splices directly in.sql
files to write Haskell code that dynamically generates SQL fragments on the fly. - Expressing entire queries directly using napkin’s Haskell DSL, often used for dynamic generation of SQL code based on complex logic. For example, prediction trees can be rendered into SQL this way, sometimes generated 100K LOC SQL files from a single model.
Napkin’s Philosophy
Napkin was created to capitalize on an opportunity we noticed back in 2015 to (massively) accelerate our team’s data engineering capabilities and yet make the resulting code-bases way more sustainable/maintainable. At the time, we were drowning in the complexity of custom Hadoop MapReduce programs, Spark programs and repositories of ad-hoc SQL scripts targeted on Redshift/Hive/etc at the time. We created napkin because we sorely needed something more practical and reliable for our own work.
Over time, the opportunities we saw got crystallized into a set of philosophies we can articulate about what napkin is trying to achieve and whether it may be the huge catalyst for your team that it has been for us.
Base as much data compute as possible on SQL
Despite its age and missed opportunities, SQL code is declarative, functional and highly expressive. It’s easy to construct even for non-engineer data scientists/analysts and tends to offer good “equational reasoning”. It’s constrained just the right amount that business logic does not go “off the hook” like it can in typical programming languages like Python, R, Scala, etc. Once written and tested, SQL tends to produce reliable results.
Over the years, we have found almost all data engineering efforts outside of SQL to be error-prone, hard to grow and expensive (e.g. needs data engineers) to maintain over time.
If you can imagine how a table should be structured and express that table as a query in SQL, you can use napkin to engineer a pipeline.
Do as much compute as possible on modern analytics DBs like BigQuery/Redshift/Snowflake
Napkin aims to be a data engineering superpower even for very small teams. This is accomplished in large part by leaning on the amazing compute capabilities of modern analytics databases like BigQuery. Napkin’s creation goes back to our realization that if we could express even a very complex computation in SQL on these databases, no matter how convoluted, they would get the work done in astonishingly little time for minimal cost.
In our work, we have produced numerous 200,000+ LOC SQL queries using napkin’s meta-programming capabilities that run within minutes on databases like Amazon Redshift and Google’s BigQuery. Fun fact: BigQuery has a ~1M character limit on queries, which we sometimes bypass by breaking complex queries into parts and joining them up / unioning them later. Even this transformation can be done automatically for you by napkin in certain cases!
Abstract and reuse complex transformations where possible
Data pipelines should be declarative and managed on Git
Data pipelines should be regenerative
Data pipeline dev should be lightweight on bare laptops
Doctrine of extreme convenience
With napkin, we aim to make various data engineering and data science workflows so easy to perform that practitioners change their behavior to lean on them more frequently. We believe that speed and convenience without sacrificing correctness and reliability makes a huge difference in sustaining data ecosystem effectiveness.
Napkin’s Benefits
Here’s our best description of benefits you can expect after you’ve gotten a hang of napkin:
-
You’ll be able to see and manage your entire data pipeline in a simple codebase, in declarative fashion and in source control - just like any modern software project.
-
You’ll always be able to “blow away and fully refresh” your entire pipeline from raw data at the push of a button - recovering from mistakes will be a breeze.
-
Your data pipeline will entirely rely on the power of your backend database, whatever it may be. The likes of BigQuery for large datasets or Postgres (or even Sqlite) when you can get away with it on small data. You won’t rely on error prone Python pandas code, your own custom data processing application and similar constructs that are hard to grow/maintain and ensure correctness over time.
-
Your data will have actual unit tests that will confirm correctness with each update. (Example: Making sure you don’t double count sales)
-
You’ll benefit from tens of combinators we ship with napkin, such as incrementally updating large tables, column-to-row transformations, union-in same-structured tables into one, etc. As we improve napkin, you’ll get all that for free.
-
You’ll be able to implement your own clever SQL meta-programming to express logic that’d be too tedious to do in plain SQL. Yet the result will still have all the benefits of declarative SQL running on modern analytics databases, instead of your custom Python/R/Scala scripting machine. You’ll be able to create your own mini programs that produce 10-table “purchasing funnel” computations that connect just the right way based on configuration parameters supplied.
Napkin’s Future
Napkin is utilized heavily in commercial projects both at Soostone and at our clients. We improve napkin all the time and have a long backlog of major features we will realize in the future.
We would like to be transparent with our roadmap and are looking for ways to best communicate our plans. We’re currently maintaining a Trello board with our roadmap where we would love to hear your reactions and feedback. You can access our roadmap board at Napkin Roadmap