Multi-environment pipelines in a team setting
Napkin provides several ways to segregate environments. Typically, a production-grade project will have at least two environments: production and development. It may also be desirable to separate the development environments of individual developers.
Recommended data organization is as follows:
- There should be at least two datasets in a project (e.g. BigQuery) or schemas in a database (e.g. Postgres): production and development.
- Raw dataset input tables used by the pipeline can either be shared between environments or separated by the environment as well. In either case, we recommend placing raw inputs into their own dataset/schema for clarity. For example: raw_data, or raw_data_development and raw_data_production for cases where they are separated.
If desired, tables in the development environment can be prefixed with the developer's user name to avoid clashes when team members work simultaneously. Below is an example Spec snippet:
preprocessors:
  - table_namespace:
      value: development
      override_with_arg: environment
  - table_prefix:
      override_with_arg: developer
      separator: _
Defaults can be changed by providing --arg environment=production and --arg developer=kate. Production runs should pass --arg developer= (an empty value) to disable the developer prefix. Note that Napkin will fail if the table_prefix arg is not provided.
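To make the effect of these two preprocessors concrete, the following Python sketch models how a managed table name might be resolved from the spec defaults and the --arg overrides. It is not Napkin's implementation: the function resolve_table_name and its parameters are hypothetical, and the sketch assumes that an empty developer value drops the prefix together with the separator, and that a missing developer argument is an error.

```python
from typing import Mapping

# Hypothetical model of the spec snippet above; not Napkin's actual code.
DEFAULT_NAMESPACE = "development"  # table_namespace.value
SEPARATOR = "_"                    # table_prefix.separator


def resolve_table_name(table: str, args: Mapping[str, str]) -> str:
    """Return the qualified name for a managed table given --arg values."""
    # table_namespace: fall back to the spec default when the arg is absent.
    namespace = args.get("environment", DEFAULT_NAMESPACE)

    # table_prefix: the spec gives no default value, so the arg is required.
    if "developer" not in args:
        raise ValueError("missing required --arg developer")
    prefix = args["developer"]

    # Assumption: an empty prefix disables prefixing, separator included.
    name = f"{prefix}{SEPARATOR}{table}" if prefix else table
    return f"{namespace}.{name}"


# Development run by kate, then a production run with the prefix disabled.
print(resolve_table_name("orders", {"developer": "kate"}))
print(resolve_table_name("orders", {"environment": "production", "developer": ""}))
```

Under these assumptions, the first call yields development.kate_orders and the second yields production.orders, so each developer writes to their own tables while production names stay unprefixed.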
By default, table_prefix and table_namespace are applied to all tables that are managed by Napkin. In projects that need different input datasets for the production and development environments, the input namespace can also be specified with renamers by changing scope from managed (the default) to unmanaged:
preprocessors:
  # ...
  - table_namespace:
      value: development
      override_with_arg: input_dataset
      scope: unmanaged
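The interaction between the two scopes can be sketched in Python as follows. This is an illustrative model only, not Napkin's code: NamespaceRule, RULES, and namespace_for are hypothetical names, and the sketch assumes that each table_namespace entry affects only tables of its own scope, with the spec value used when the corresponding --arg is absent.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NamespaceRule:
    """Hypothetical stand-in for one table_namespace preprocessor entry."""
    value: str               # default namespace from the spec
    override_with_arg: str   # name of the --arg that overrides it
    scope: str = "managed"   # "managed" (default) or "unmanaged"


# One rule for tables Napkin manages, one for raw inputs it does not manage.
RULES = [
    NamespaceRule("development", "environment"),
    NamespaceRule("development", "input_dataset", scope="unmanaged"),
]


def namespace_for(table: str, managed: bool, args: Mapping[str, str]) -> str:
    """Qualify a table with the namespace of the rule matching its scope."""
    wanted = "managed" if managed else "unmanaged"
    for rule in RULES:
        if rule.scope == wanted:
            return f"{args.get(rule.override_with_arg, rule.value)}.{table}"
    return table  # no rule for this scope: leave the name untouched


args = {"environment": "production", "input_dataset": "raw_data_production"}
print(namespace_for("daily_report", True, args))   # table created by the pipeline
print(namespace_for("events", False, args))        # raw input table
```

With these arguments, the managed table resolves to production.daily_report and the raw input resolves to raw_data_production.events, which matches the separated raw-data layout recommended earlier.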