You may be a data engineer or data quality engineer who uses the dynamic ETL tool dbt from Fishtown Analytics. If so, it's possible that you've incorporated a few automated schema tests into your dbt project. dbt's built-in testing framework is a powerful yet light way to validate data transformations, sources, and the data itself.
As you likely know, you can do even more with dbt and its related tools. Packages in dbt are a nice way to add useful and more complex functionality to your data transformation project with what are essentially dependencies.
In this case, we'll focus on the dbt_utils package, which is an aptly named utilities package officially made and maintained by the same folks who develop dbt. I'll show you how to add the package if you don't already know how, and then share some examples of dbt_utils automated schema tests I find useful.
Note: One great advantage of dbt_utils schema tests is that they can test at the model or column level.
Let's do this!
Create a new file at the root level in your dbt project, and name it packages.yml.
Get the latest package code from this official page, then copy and paste it into your packages.yml file.
Today the code is:
packages: - package: fishtown-analytics/dbt_utils version: 0.7.0
Open your Command Line or Terminal wherever you control your dbt project and send commands. Run the following command to install the dbt_utils package on your machine so you can use it.
Much like dbt's built-in testing, dbt_utils automated schema tests should be placed under a column or model being tested under the "tests" configuration. That's essentially how they are integrated!
Now you can use a whole host of new dbt automated schema tests and even more in the dbt_utils package. For now I'll show you a couple of my favorite dbt_utils schema tests.
Recency is a configurable schema test that checks to see if the timestamp column referenced was changed within the timeframe. I.E. you can see if the model loads every day, hour, week, etc.
models: - name: model_name tests: - dbt_utils.recency: datepart: day field: created_at interval: 1
I find equal_rowcount to be a fantastic test for checking to make sure nothing changed in my sources during ingest or transformation. If I expect two models to contain the same number of rows, it's an easy reference to both from the target model.
models: - name: model_name tests: - dbt_utils.equal_rowcount: compare_model: ref('other_table_name')
dbt_utils is a truly powerful package that can add a lot of functionality to your project, and it costs nothing to install. Other schema tests include validation for cardinality, unique with SQL conditions, values not accepted, row equality, row count, and even more.
I hope this was useful for you, and that you are excited by the prospect of adding more cool functions to your dbt project like I am! If you'd like to share more of your experience with dbt or data engineering, please feel free to comment or get in touch with me at any time. I have a dbtTestExamples repository on GitHub that is open for anyone to try their hand at dbt tests using Google BigQuery.