There are many posts about Git branching strategies out there, but they tend to be either light on details or heavy on complexity. My aim here is to define the simplest possible production-grade Git branching strategy for an analytics engineering team. Ideally, nothing could be removed from it and nothing would need to be added. If you disagree, leave a comment down below!
The simplest feature branching flow
The absolute simplest feature branching flow is described very well in this official dbt article. There is a main branch off of which you create your feature branches. The main branch corresponds to the production schema, and pull requests from feature branches ideally build into temporary schemas. In these temporary schemas, only modified models should run, with state deferral to main (aka slim CI).
Another name for this branching flow methodology is trunk-based development.
Consolidating models from multiple pull requests in one schema
Ideally, your data visualization tool should be dynamic enough to easily switch between different schemas in your data warehouse. That way, users trying to do user acceptance testing (UAT) can just point the data viz tool to the pull request schema containing the change they're reviewing.
However, if your data visualization tool doesn't support easily switching between schemas (e.g. Tableau), the best you can do for user acceptance testing (UAT) is to consolidate just certain models in a single schema. The simplest way to perform this consolidation is to add the following logic to your generate_schema_name macro:
...
{%- if target.name == "pull-request" and "show-in-uat" in node.tags -%}
uat
{%- else -%}
...
Let's break down how the above code works.
- It checks whether we're running in a pull request. To make this check work, you'll have to set the target name to "pull-request" in your CI job definition for pull requests in dbt Cloud.
- It checks whether the model is tagged "show-in-uat".
- If both checks pass, it sets the schema name to "uat".
Feel free to change any of the above with your own target name, tag name, and UAT schema name.
It's also important to turn on state deferral for pull request jobs so that only modified models will run. Then, as long as different pull requests modify different models tagged "show-in-uat", they can all coexist in the "uat" schema. Models without that tag are still built in their corresponding pull request schemas, isolated from one another.
If you're modifying a model that someone else is UATing in a different PR and don't want to cause a conflict, you can just remove the tag from your model, and it won't overwrite their UAT table/view.
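To make the coexistence rule concrete, here is a minimal Python sketch that mirrors the branch in the macro snippet above. It is only a model of the logic, not dbt code: the function name resolve_schema and the schema names dbt_cloud_pr_101, dbt_cloud_pr_102, and analytics are made up for illustration.

```python
from typing import Iterable


def resolve_schema(target_name: str, tags: Iterable[str], default_schema: str) -> str:
    """Mirror the macro's branch: tagged models in a PR run go to the shared schema.

    default_schema stands in for whatever the rest of the macro would return
    (hypothetical; the real macro's other branches are not reproduced here).
    """
    if target_name == "pull-request" and "show-in-uat" in set(tags):
        return "uat"
    return default_schema


# Two pull requests, each touching a different tagged model: both land in "uat".
print(resolve_schema("pull-request", ["show-in-uat"], "dbt_cloud_pr_101"))
print(resolve_schema("pull-request", ["show-in-uat"], "dbt_cloud_pr_102"))

# An untagged model stays isolated in its own pull request schema.
print(resolve_schema("pull-request", [], "dbt_cloud_pr_101"))

# Outside of pull request runs, the tag has no effect.
print(resolve_schema("prod", ["show-in-uat"], "analytics"))
```

The third call is also what happens after you remove the tag to avoid overwriting someone else's UAT table: the model falls back to the pull request schema.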
Adding a pre-production environment
Starting with one main branch for production and doing all your testing in feature branches/pull requests will probably work just fine for small to medium sized organizations. Larger organizations may need additional environments. However, that doesn't mean that a radically different Git branching flow is needed!
By default, trunk-based development advocates for release branches. However, I believe that breeding all those branches is overkill for data teams, and instead advocate for the simpler release from trunk methodology.
If we want to have a pre-production environment, we can still utilize the main branch for both the production and the pre-production environments by tagging commits that are ready for production release.
This way, the latest commit in main is always pushed to pre-production environment #1, whatever you want to call it. When the team feels confident that the change can be pushed to production, they tag that commit with a production release version number, and a separate CI process that watches for tags then pushes the changes to the production environment.
Now you have your temporary schemas, one for each pull request, the 'bleeding edge' main that points to the pre-production environment, and the production environment that only gets updated when a new version is tagged in main.
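The routing described above can be sketched as a small Python function that a CI script might call with the Git ref that triggered it. This is a sketch under assumptions: the function name target_environment, the environment labels, and the vMAJOR.MINOR.PATCH tag format are illustrative choices, not something dbt Cloud or any particular CI system prescribes.

```python
import re
from typing import Optional

# Assumed tag convention for production releases: v1.2.3
RELEASE_TAG_REF = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")


def target_environment(git_ref: str) -> Optional[str]:
    """Map the Git ref that triggered CI to the environment to deploy.

    Returns None for refs that should not trigger a deployment,
    such as feature branches (those are handled by pull request CI).
    """
    if git_ref == "refs/heads/main":
        return "pre-production"  # latest commit on main, the 'bleeding edge'
    if RELEASE_TAG_REF.match(git_ref):
        return "production"  # a commit tagged with a release version
    return None


print(target_environment("refs/heads/main"))
print(target_environment("refs/tags/v1.4.0"))
print(target_environment("refs/heads/my-feature"))
```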
Note that the CI that's built into dbt Cloud can support the basic feature branching flow out of the box, but it doesn't support git tag release strategies. This pushes folks unnecessarily into creating multiple branches for multiple environments in situations where simple tags would have served them just fine.
One workaround is to manually update the environment's "custom branch" in dbt Cloud settings every time there's a new release.
The other option is to do the same thing, but automatically via the API as soon as a commit in the main branch is tagged. There's an existing project that can be used as a reference. I'll update the post if I get around to creating an automated process myself.
Adding a second pre-production environment
For some organizations, one pre-production environment is not enough, and they insist on two. This is still easy to do! We just have to utilize release candidate tags for the new pre-production environment.
Suppose our pre-pre-production environment is named TEST, and our pre-production environment is named STAGE. TEST corresponds to the latest commit in the main branch - that's the 'bleeding edge'. STAGE corresponds to the latest release candidate tag on the main branch. In semantic versioning, this would be achieved by adding the suffix -rc.N to the name of the release it's targeting. For example, if our goal is to create production release v12.0.0, our STAGE environment commits would be tagged v12.0.0-rc.1, then v12.0.0-rc.2, and so on. Suppose on v12.0.0-rc.5 we finally feel confident enough to push to production. We would then add the tag v12.0.0 to the same commit, which would constitute a full release and then be automatically deployed to production.
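As a rough illustration of the tagging scheme, the Python sketch below classifies a tag as a STAGE or PROD deployment and picks the latest release candidate for a given release. The helper names environment_for_tag and latest_release_candidate are hypothetical, and the regular expression only covers the vX.Y.Z and vX.Y.Z-rc.N shapes used in this post, not the full semantic versioning grammar.

```python
import re
from typing import Iterable, Optional

# Matches v12.0.0 and v12.0.0-rc.5; group 4 is the release candidate number.
TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-rc\.(\d+))?$")


def environment_for_tag(tag: str) -> Optional[str]:
    """Release candidate tags deploy to STAGE, full release tags to PROD."""
    match = TAG.match(tag)
    if match is None:
        return None  # not a release tag, so nothing to deploy
    return "STAGE" if match.group(4) is not None else "PROD"


def latest_release_candidate(tags: Iterable[str], release: str) -> Optional[str]:
    """Return the highest-numbered rc tag for a release such as 'v12.0.0'."""
    best_tag, best_number = None, -1
    for tag in tags:
        match = TAG.match(tag)
        if match is None or match.group(4) is None:
            continue
        if not tag.startswith(release + "-rc."):
            continue
        number = int(match.group(4))  # compare numerically, so rc.10 > rc.2
        if number > best_number:
            best_tag, best_number = tag, number
    return best_tag


tags = ["v11.3.0", "v12.0.0-rc.1", "v12.0.0-rc.2", "v12.0.0-rc.10"]
print(environment_for_tag("v12.0.0-rc.5"))
print(environment_for_tag("v12.0.0"))
print(latest_release_candidate(tags, "v12.0.0"))
```

Comparing the rc number as an integer matters once a release needs ten or more candidates, since a plain string sort would place rc.10 before rc.2.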
Need more environments/branches/options?
There are many Git branching models and variations to choose from. See this overview to learn more. Do you believe you've found an even simpler flow? Let me know in the comments!