Preprocessors

Preprocessors let you systematically apply certain modifications to the Spec. Napkin ships with two built-in preprocessors that facilitate table renaming. Users can also implement custom preprocessors when needed.

Preprocessors are configured in the preprocessors section of the YAML file. Multiple preprocessors can be used in a single spec; they are applied in sequence. The syntax can be summarized as follows:

spec.yaml

preprocessors:
  - builtin_a: # built-in preprocessor with arguments
      param_foo: foo
      param_bar: bar
  - builtin_b # built-in preprocessor without arguments
  - MyPreprocessors.custom_a: # custom preprocessor with arguments
      param_foo: foo
      param_bar: bar
  - MyPreprocessors.custom_b # custom preprocessor without arguments
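
As a more concrete sketch of sequencing, the two built-in preprocessors described below could be chained in one spec. The values used here (analytics, dev, and the environment argument name) are illustrative assumptions, not required names.

```yaml
preprocessors:
  # applied first: set the namespace of managed tables
  - table_namespace:
      value: analytics
  # applied second: prefix managed table names, e.g. dev_<table>
  - table_prefix:
      value: dev
      override_with_arg: environment
      separator: _
```

With this configuration, a managed table would be expected to end up in the analytics namespace under a dev_-prefixed name, unless --arg environment=... supplies a different prefix.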

Built-in preprocessors

table_prefix

This preprocessor adds a prefix to table references. The tables are renamed when created by Napkin and all the queries are updated accordingly. By default, it affects only managed tables, but it can also be applied to unmanaged tables.

Arguments:

  • value – the prefix, optional.
  • override_with_arg – name of the spec argument that overrides the prefix when passed with the --arg CLI option, optional.
  • separator – string inserted between the prefix and the table name when the prefix is not empty, optional.
  • scope – indicates the tables that should be prefixed, optional, can be one of:
    • managed (default)
    • unmanaged
    • all
  • only – allows applying the renamer to selected tables only, optional.
  • except – allows applying the renamer to all but selected tables, optional.
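
The scope and except arguments can be combined. The following is a minimal sketch that assumes except takes a list of table names in the same form that only uses elsewhere on this page; the prefix and table names are hypothetical.

```yaml
preprocessors:
  - table_prefix:
      value: sandbox
      separator: _
      scope: all # both managed and unmanaged tables
      except: # hypothetical tables that keep their original names
        - raw_events
        - country_codes
```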

Example: Segregating environments

Please refer to our Multi-environment pipelines in a team setting tutorial for the recommended setup.

Prefix managed tables with environment

preprocessors:
  - table_prefix:
      value: development
      override_with_arg: environment
      separator: _
      scope: managed

Napkin will rename all managed tables and prefix them with the environment name. The default environment is development. For staging and production runs, one needs to provide --arg environment=staging or --arg environment=production to adjust the prefix accordingly.

Example: Isolating development environments

Prefix managed tables with developer name

preprocessors:
  - table_prefix:
      override_with_arg: developer # the preprocessor will fail if the developer name has not been provided explicitly when executing the spec
      separator: _
      scope: managed

Napkin will rename all managed tables and prefix them with the developer name. There is no default developer name, so unless one is provided the preprocessor will fail and prevent the Spec from running. One needs to explicitly pass --arg developer=john for each Spec run. Note that if an empty developer argument is provided (--arg developer=), the tables keep their original names with no extra _.

table_namespace

This preprocessor sets the namespace (BigQuery dataset or Postgres schema) for all table references. The table namespaces are set when the tables are created by Napkin and all the queries are updated accordingly. By default, it affects only managed tables, but it can also be applied to unmanaged tables.

Arguments:

  • value – the namespace, optional.
  • override_with_arg – name of the spec argument that overrides the namespace when passed with the --arg CLI option, optional.
  • scope – indicates the tables whose namespace should be set, optional, can be one of:
    • managed (default)
    • unmanaged
    • all
  • only – allows applying the renamer to selected tables only, optional.
  • except – allows applying the renamer to all but selected tables, optional.
  • on_exists – by default, this preprocessor overwrites any schema that was explicitly provided in the Spec or in any SQL query. This behavior can be changed to keep an existing schema and set one only when it is missing. Optional, can be one of:
    • overwrite (default)
    • keep_original
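
For instance, to fill in a namespace only for tables that do not already specify one, on_exists can be set to keep_original. This is a sketch; the namespace value and argument name are illustrative.

```yaml
preprocessors:
  - table_namespace:
      value: development
      override_with_arg: environment
      on_exists: keep_original # tables with an explicit schema are left as-is
```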

Example: Segregating environments

Isolate environments in namespaces

preprocessors:
  - table_namespace:
      value: development
      override_with_arg: environment

Napkin will move all managed tables to the environment namespace. The default environment is development. For staging and production runs, one needs to provide --arg environment=staging or --arg environment=production to adjust the namespace accordingly. Note that the BigQuery dataset or Postgres schema has to exist.

Example: Segregating output tables by the audience

Segregate output tables in namespaces by audience

preprocessors:
  - table_namespace:
      value: data_science_internal # will be overwritten if necessary
  - table_namespace:
      value: marketing
      only:
        - conversion
        - customer_churn
  - table_namespace:
      value: warehouse
      only:
        - overstock

In some cases, we need to segregate tables by the expected audience and possibly limit access to the data. In the above example, all managed tables are put in data_science_internal first. The later preprocessors then move only the selected tables to the separate marketing and warehouse namespaces. Database engine ACLs can be used to configure access rules as needed.

Custom preprocessors

Coming soon.