As a data engineer, unless you've been living under a rock, you've probably been working with DBT, or aspire to do so. DBT is a great step in the right direction for data engineering, removing boilerplate tasks and establishing observable contracts between models.
However, DBT projects can easily get out of hand. I often see entire data platforms defined in a single DBT project (a monolithic repository). In some organizations this doesn't cause much harm, but in many it becomes a nightmare to maintain. This is especially true when the data ecosystem is large, and when data development is federated, with data products maintained by separate teams. Also see my article on similar pitfalls with DAGs.
dbt Labs has been thinking about this, and there are some major features in preview to support multi-repo, federated project development. However, these features will only be available to Enterprise customers.
Multi-repo Strategy
The good news is that if you are running DBT Core, or want to leverage existing features, there are options. DBT is built on Python and is easily extensible with packages. Packages are useful for utility functions, data quality checks, and general code reusability, but they can also be used to import and reference the models of other DBT projects.
How to
For demonstration purposes I've created two repos:
- Project A - parent repo, vanilla DBT "hello world".
- Project B - child repo that inherits from Project A.
For both projects you will need to set up a virtual environment and install the appropriate dbt adapter package (dbt-snowflake, dbt-redshift, etc.).
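A minimal setup sketch (the Snowflake adapter is just an example; substitute the adapter for your warehouse):

# create and activate an isolated environment
python -m venv .venv
source .venv/bin/activate

# install the adapter for your warehouse
pip install dbt-snowflake   # or dbt-redshift, dbt-postgres, etc.

# confirm dbt and the adapter installed cleanly
dbt --version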
In order for Project B to inherit from Project A, you simply add the parent project to packages.yml, which will be imported when you run dbt deps.
packages:
  - git: "https://github.com/elliottcordo/dbt_poc_a.git"
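If the parent project is under active development, it's worth pinning the import to a tag or commit so upstream changes don't land in your builds unannounced (the revision value below is illustrative):

packages:
  - git: "https://github.com/elliottcordo/dbt_poc_a.git"
    revision: v1.0.0   # a branch, tag, or commit SHA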
You can now add the imported project's models to your dbt_project.yml. You can also override config parameters, which is especially helpful if your parent models exist in a different schema.
models:
  dbt_poc_a:
    # +schema: schemaname
    # Config indicated by + and applies to all files under models/example/
  dbt_poc_b:
    # +schema: schemaname
    # Config indicated by + and applies to all files under models/example/
    +materialized: view
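For example, if the parent team builds its models into a shared schema, the override for the imported package might look like this (the schema name is hypothetical):

models:
  dbt_poc_a:
    +schema: shared_core   # hypothetical schema owned by the parent team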
In the model specification you can now reference the models from Project A, using the two-argument form of ref (package name, then model name).
{{ config(materialized='table') }}

with source_data as (
    select distinct *
    from {{ ref('dbt_poc_a', 'my_first_dbt_model') }}
)

select *
from source_data
When running models in the child project you will most likely want to suppress building the upstream models (on the assumption they are maintained and built by a different team). You can use select and filter expressions in your dbt run command to accomplish this: dbt run --select dbt_poc_b
⚠️ This is something you really need to be careful with! Skipping the upstream models assumes they already exist in the warehouse and are current; if they are stale or missing, your child models will build against bad data or fail outright.
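A couple of selection patterns worth knowing here (the package: selection method scopes a run to a single installed package):

dbt run --select package:dbt_poc_b    # build only the child project's models
dbt run --exclude package:dbt_poc_a   # alternatively, exclude the imported parent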
You can now run dbt docs generate and dbt docs serve to view dependencies and cross-model metadata.
Additional thoughts
Note that this approach alone will not enable safe federated DBT development. Process and culture also come into play in avoiding breaking changes to downstream models. You should also anticipate building a good amount of internal tooling, especially in your CI/CD pipelines.
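As a rough sketch of what that tooling might look like, a hypothetical CI step (GitHub Actions syntax; the adapter choice is an assumption) could build and test only the child project's models on every pull request:

- name: Build and test child project models
  run: |
    pip install dbt-snowflake              # adapter is an assumption
    dbt deps                               # pulls in the parent package
    dbt build --select package:dbt_poc_b   # build and test only this package's models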
And just a reminder: if your data platform is small, your data team is small, and/or all development is centralized, this approach may be premature optimization.