Preprocessors
Preprocessors systematically apply certain modifications to the Spec. Napkin has two built-in preprocessors that facilitate table renaming. Additionally, the user may implement custom preprocessors when needed.
Preprocessors are configured in the preprocessors section of the YAML file. Multiple preprocessors can be used in a single Spec; they are applied in sequence. The syntax can be summarized as follows:
spec.yaml
preprocessors:
  - builtin_a: # built-in preprocessor with arguments
      param_foo: foo
      param_bar: bar
  - builtin_b # built-in preprocessor without arguments
  - MyPreprocessors.custom_a: # custom preprocessor with arguments
      param_foo: foo
      param_bar: bar
  - MyPreprocessors.custom_b # custom preprocessor without arguments
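As a more concrete sketch of sequential application, the two built-in preprocessors described below could be chained in one Spec. The argument names environment and developer are only illustrative choices, and the combined effect is inferred from the description of each preprocessor rather than from a verified run:

```yaml
preprocessors:
  # applied first: place managed tables in the environment namespace
  - table_namespace:
      value: development
      override_with_arg: environment
  # applied second: prefix managed table names with the developer name
  - table_prefix:
      override_with_arg: developer
      separator: _
```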
Built-in preprocessors
table_prefix
This preprocessor adds a prefix to table references. The tables are renamed when created by Napkin, and all queries are updated accordingly. By default, it affects only managed tables, but it can also be applied to unmanaged tables.
Arguments:
- value – prefix, optional.
- override_with_arg – the name of the spec argument that allows overriding the prefix with the --arg CLI option, optional.
- separator – an optional separator that the preprocessor inserts between the prefix and the table name if the prefix is not empty.
- scope – indicates which tables should be prefixed, optional, can be one of:
  - managed (default)
  - unmanaged
  - all
- only – applies the renamer to the selected tables only, optional.
- except – applies the renamer to all but the selected tables, optional.
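The scope and except arguments can be combined. The following is a minimal sketch, not a verified configuration: the prefix sandbox and the table name raw_events are hypothetical, and the list form of except is assumed to mirror the only lists used elsewhere on this page.

```yaml
preprocessors:
  - table_prefix:
      value: sandbox       # hypothetical prefix
      separator: _
      scope: all           # managed and unmanaged tables
      except:              # assumed to take a list of table names, like only
        - raw_events       # hypothetical table that keeps its original name
```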
Example: Segregating environments
Please refer to our Multi-environment pipelines in a team setting tutorial for recommended setup.
Prefix managed tables with environment
preprocessors:
  - table_prefix:
      value: development
      override_with_arg: environment
      separator: _
      scope: managed
Napkin will rename all managed tables and prefix them with the environment name. The default environment is development. For staging and production runs, one needs to provide --arg environment=staging or --arg environment=production to adjust the prefix accordingly.
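To make the renaming concrete, the comments below sketch the names one would expect for a hypothetical managed table called customers. They follow from the prefix, separator, table name rule described above and have not been checked against a real run.

```yaml
preprocessors:
  - table_prefix:
      value: development
      override_with_arg: environment
      separator: _
      scope: managed
# Expected name of a hypothetical managed table `customers`:
#   no --arg given                  -> development_customers
#   --arg environment=staging       -> staging_customers
#   --arg environment=production    -> production_customers
```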
Example: Isolating development environments
Prefix managed tables with developer name
preprocessors:
  - table_prefix:
      override_with_arg: developer # the preprocessor will fail if the developer name has not been provided explicitly when executing the spec
      separator: _
      scope: managed
Napkin will rename all managed tables and prefix them with the developer name. There is no default developer name, so unless one is provided the preprocessor will fail and prevent the Spec from running. One needs to explicitly pass --arg developer=john for each Spec run. Note that if an empty developer argument is provided (--arg developer=), the tables keep their original names with no extra _.
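The three cases can be summarized as follows for a hypothetical managed table called customers. This is a sketch derived from the behavior described above, not the output of a real run.

```yaml
preprocessors:
  - table_prefix:
      override_with_arg: developer
      separator: _
      scope: managed
# Expected behavior for a hypothetical managed table `customers`:
#   --arg developer=john    -> john_customers
#   --arg developer=        -> customers (empty prefix, so no separator is added)
#   no --arg given          -> the preprocessor fails and the Spec does not run
```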
table_namespace
This preprocessor sets the namespace (BigQuery dataset or Postgres schema) for all table references. The table namespaces are changed when the tables are created by Napkin, and all queries are updated accordingly. By default, it affects only managed tables, but it can also be applied to unmanaged tables.
Arguments:
- value – namespace, optional.
- override_with_arg – the name of the spec argument that allows overriding the namespace with the --arg CLI option, optional.
- scope – indicates which tables should be affected, optional, can be one of:
  - managed (default)
  - unmanaged
  - all
- only – applies the renamer to the selected tables only, optional.
- except – applies the renamer to all but the selected tables, optional.
- on_exists – by default, this preprocessor overwrites any schema that was explicitly provided in the Spec or in any SQL query. This behavior can be changed to keep an existing schema and set it only when missing. Can be one of:
  - overwrite (default)
  - keep_original
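For instance, a Spec that already names a schema for some of its tables might keep those and only fill in the missing ones. A minimal sketch, assuming on_exists is written alongside the other arguments:

```yaml
preprocessors:
  - table_namespace:
      value: development
      override_with_arg: environment
      on_exists: keep_original # tables that already have a schema keep it
```

With the default (overwrite), explicitly provided schemas would be replaced by the configured namespace instead.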
Example: Segregating environments
Isolate environments in namespaces
preprocessors:
  - table_namespace:
      value: development
      override_with_arg: environment
Napkin will move all managed tables to the environment namespace. The default environment is development. For staging and production runs, one needs to provide --arg environment=staging or --arg environment=production to adjust the namespace accordingly. Note that the BigQuery dataset or Postgres schema has to exist.
Example: Segregating output tables by the audience
Segregate output tables in namespaces by audience
preprocessors:
  - table_namespace:
      value: data_science_internal # will be overwritten if necessary
  - table_namespace:
      value: marketing
      only:
        - conversion
        - customer_churn
  - table_namespace:
      value: warehouse
      only:
        - overstock
In some cases, we need to segregate tables by the expected audience and possibly limit access to the data. In the above example, we first put all managed tables into data_science_internal. The subsequent preprocessors then move only the selected tables into separate namespaces (marketing and warehouse). Database engine ACLs can be used to configure access rules as needed.
Custom preprocessors
Coming soon