Multi-environment pipelines in a team setting

Napkin provides several ways to segregate environments. Typically, a production-grade project will have at least two environments: production and development. It may also be desirable to separate the development environments of individual developers.

The recommended data organization is as follows:

There should be at least two datasets in a project (e.g. BigQuery) or schemas in a database (e.g. Postgres): production and development.
Raw input tables used by the pipeline can either be shared between environments or separated by environment as well. In either case, we recommend placing raw inputs in their own dataset/schema for clarity. For example: raw_data when they are shared, or raw_data_development and raw_data_production when they are separated.

If desired, tables in the development environment can be prefixed with the developer's user name to avoid clashes when team members work simultaneously. An example Spec snippet:

preprocessors:
  - table_namespace:
      value: development
      override_with_arg: environment
  - table_prefix:
      override_with_arg: developer
      separator: _

These values are set on the command line, for example --arg environment=production and --arg developer=kate. Production runs should pass --arg developer= (an empty value) to disable the developer prefix. Note that Napkin will fail if the arg used by table_prefix (developer in this snippet) is not provided, since it has no default value.
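To make the combined effect of the two preprocessors concrete, here is a minimal Python sketch of how a managed table name might be resolved from the args. It is only an illustrative model of the behaviour described above, not Napkin's implementation: the function resolve_managed_table and the namespace.table output format are assumptions made for this example.

```python
from typing import Mapping

DEFAULT_ENVIRONMENT = "development"  # `value` of table_namespace in the Spec snippet
SEPARATOR = "_"                      # `separator` of table_prefix


def resolve_managed_table(table: str, args: Mapping[str, str]) -> str:
    """Hypothetical model of the final name of a managed table."""
    # table_namespace falls back to its `value` when the arg is absent.
    namespace = args.get("environment", DEFAULT_ENVIRONMENT)

    # table_prefix has no default value, so a missing arg is an error.
    if "developer" not in args:
        raise KeyError("missing --arg developer (pass developer= to disable the prefix)")

    developer = args["developer"]
    # An empty value disables the prefix entirely, separator included.
    prefix = f"{developer}{SEPARATOR}" if developer else ""
    return f"{namespace}.{prefix}{table}"


if __name__ == "__main__":
    # Development run by kate, then a production run without a prefix.
    print(resolve_managed_table("orders", {"developer": "kate"}))
    print(resolve_managed_table("orders", {"environment": "production", "developer": ""}))
```

Under these assumptions, the first call yields development.kate_orders and the second yields production.orders, which is why production runs pass an empty developer value rather than omitting it.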

By default, table_prefix and table_namespace are applied to all tables that are managed by Napkin. In projects that use different input datasets for production and development, the input namespace can also be specified with renamers by changing the scope from managed (the default) to unmanaged:

preprocessors:
  # ...
  - table_namespace:
      value: development
      override_with_arg: input_dataset
      scope: unmanaged
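The distinction between the two scopes can be sketched in a few lines of Python. This is a hedged illustration rather than Napkin's actual logic: the function resolve_namespace is hypothetical, and it assumes a managed-scope table_namespace driven by the environment arg is configured alongside the unmanaged-scope one shown here, both with the default value development.

```python
from typing import Mapping


def resolve_namespace(managed: bool, args: Mapping[str, str]) -> str:
    """Hypothetical model: pick the namespace a table is placed in."""
    if managed:
        # Default scope (managed): tables created by the pipeline,
        # driven by --arg environment.
        return args.get("environment", "development")
    # scope: unmanaged: input tables the pipeline only reads,
    # driven by --arg input_dataset.
    return args.get("input_dataset", "development")


if __name__ == "__main__":
    args = {"environment": "production", "input_dataset": "raw_data_production"}
    print(resolve_namespace(True, args))   # namespace for pipeline outputs
    print(resolve_namespace(False, args))  # namespace for raw inputs
```

In this model, each scope is controlled by its own arg, so inputs and outputs can point at different datasets within the same run.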
