DEV Community

Cover image for The Practical Guide to Utilizing DBT Packages for Data Transformation
iamtodor for FreshBooks

Posted on • Edited on

The Practical Guide to Utilizing DBT Packages for Data Transformation

Table of content

What are packages

dbt packages are collections of macros, models, and other resources that are used to extend the functionality of dbt. Packages can be used to share common code and resources across multiple dbt projects, and can be published and installed from the dbt Hub, from GitHub or can be stored locally and installed by specifying the path to the project.

In dbt, libraries like these are called packages.

Why use it

dbt packages are so powerful because so many of the analytic problems we encountered are shared across organizations.

There are a few general benefits to using packages in dbt:

  1. Reusability: packages allow you to reuse code across multiple projects and models. This can save you a lot of time and effort, as you don't have to copy and paste the same code into multiple places.

  2. Collaboration: packaging your models in a package allows multiple people to work on the same models at the same time. You can use version control systems like git to manage changes to the models, and use tools like the dbt test command to ensure that the models are correct and reliable.

  3. Sharing: packaging your models or macros in a package allows you to share them with others. You can publish your package on the dbt Hub or on GitHub, and others can install and use your models in their own dbt projects.

  4. Managing: packages make it easier to manage your codebase. You can use version control to track changes to your package, and you can easily install and update packages in your dbt project.

  5. Modularity: packaging your models in a package allows you to break your data pipeline into smaller, more manageable pieces, which are easier to understand and maintain. This could make it streamline development and upkeep your dbt project over time. This is especially useful if you are working on a large project with many different models and transformations.

Overall, using packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt.

For example, if your aim is to extract the day of the week, there is no sense to reinvent the wheel and develop this macro on your own. Rather, we might want to find the right package and make use of it {{ dbt_date.day_of_week(column_name) }}.

Local packages

In dbt, you can use local packages to organize and reuse code within a single dbt project. Local packages are stored within your project directory and are only available to the models in that project. The best use-case for local packages is some module that you want to live in the same repository, nearby to the main project.

To create a reusable local package do the following:

  1. Consider you have the following dbt project dir structure
>>> tree -L 1 .
.
├── data
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
└── target
Enter fullscreen mode Exit fullscreen mode
  1. Create packages dir, so here we would put our first local package.
mkdir packages
Enter fullscreen mode Exit fullscreen mode
  1. Create packages.yml file, so here we would link our first local package.
touch packages.yml
Enter fullscreen mode Exit fullscreen mode
  1. Before moving on please verify that you have the following dir structure
>>> tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
Enter fullscreen mode Exit fullscreen mode
  1. Jump into packages dir and init your package with the name local_utils. The name of package is arbitrary.
cd packages
dbt init local_utils
Enter fullscreen mode Exit fullscreen mode

It will create a package with the following structure:

>>> tree local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
Enter fullscreen mode Exit fullscreen mode
  1. Next, you need to change the project name in dbt_project.yml from my_new_project to a meaningful and self-explainable name. This name will be used further as macro or model references. Let's call it local_utils.
  2. Specify our local package in the before-mentioned packages.yml as follows:
packages:
  - local: /opt/dbt/packages/local_utils/
Enter fullscreen mode Exit fullscreen mode

Make sure that you provide your absolute path to the packages. Otherwise, it would not work.

Save the packages.yml file and run the dbt deps command to install the package. This will link the package and make it available to your dbt models.

Here is an example of what the dbt deps command might look like when you install your local package:

>>> dbt deps
Running with dbt=0.21.1
Installing /opt/dbt/packages/local_utils/
  Installed from <local @ /opt/dbt/packages/local_utils/>
Enter fullscreen mode Exit fullscreen mode

Now you can observe a newly created dbt_modules dir, that contains binary file local_utils. It means than our local package is ready to be used.

>>> tree dbt_modules
dbt_modules
└── utils -> /opt/dbt/packages/local_utils/
Enter fullscreen mode Exit fullscreen mode

Dbt hub packages

In dbt, you can use packages from the dbt Hub to share your code with others and to reuse code from other users in your own projects. The dbt Hub is a community-driven library of packages that you can use to extend the functionality of dbt.

Probably, the best examples of third-party packages driven by the community would be:

There are a few benefits to using dbt Hub packages:

  1. Reusable code: dbt Hub packages allow you to reuse code that has been shared by other users, teams and companies. This can save you a lot of time and effort, as you don't have to write the same logic from scratch, test and maintain it.

  2. The major advantage of any open-source:

  • Community support: When you use packages from the dbt Hub, you can benefit from the support and expertise of the dbt community. If you have questions or run into issues with a package, you can ask for help on the dbt community forums or Slack channel.
  • Collaboration: By sharing your own packages on the dbt Hub, you can make your code available to other users. This can help to foster collaboration and improve the overall quality of your code.

Overall, using dbt Hub packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt, and to collaborate with others in the dbt community.

To install a package from the dbt Hub in your dbt project, you will need to add the package to your packages.yml file.

Here is the basic process:

  1. Go to the dbt Hub and search for the package you want to install.
  2. Click on the package to view its details.
  3. Copy the package name and version from the installation instructions.
  4. Open your packages.yml file and add the package name and version to the packages list. It should look something like this:
packages:
  - name: dbt-labs/dbt_utils
    version: 0.7.6
Enter fullscreen mode Exit fullscreen mode

Save the packages.yml file and run the dbt deps command to install the package. This will download the package and make it available to your dbt models.

Here is an example of what the dbt deps command might look like when you install dbt hub package:

>>> dbt deps
Installing dbt-labs/dbt_utils@0.7.6
  Installed from version 0.7.6
Enter fullscreen mode Exit fullscreen mode

Unlike of local package, hub package was downloaded to dbt_modules dir physically.

>>> tree -L 2 dbt_modules
dbt_modules
└── dbt_utils
    ├── CHANGELOG.md
    ├── LICENSE
    ├── README.md
    ├── RELEASE.md
    ├── dbt_project.yml
    ├── docker-compose.yml
    ├── etc
    ├── integration_tests
    ├── macros
    └── run_test.sh
Enter fullscreen mode Exit fullscreen mode

Verify packages are installed

To verify that a package is installed in your dbt project, you can check the packages.yml file and run the dbt deps command.

  1. Check the packages.yml file: This file lists all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed in the packages list, then it is installed.

  2. Run the dbt deps command:

    1. This command will show you a list of all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed, then it is installed.
    2. In the root dbt project dir, you observe a new dir dbt_modules/ which contains the compiled packages that are ready to be used. NOTE: dir dbt_modules/ has to be added to .gitignore.
>>> tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
Enter fullscreen mode Exit fullscreen mode

If your packages.yml file contains package that is not installed then you would not be able to run any dbt command:

>>> dbt list
Encountered an error:
Compilation Error
  dbt found 1 package(s) specified in packages.yml, but only 0 package(s) installed in dbt_modules. Run dbt deps to install package dependencies.
Enter fullscreen mode Exit fullscreen mode

So this is our guarantee that in runtime we would not have any issues related to the package installation.

Macros usage

In dbt, you can use packages to define custom macros that can be called from your dbt models. Here is an example of how you might use a package to define a custom macro.

Here are a few examples of how you might use macros in dbt to perform common data transformations.

For instance, lets create the following macros in our local package under local_utils/macros/cents_to_dollars.sql

{% macro cents_to_dollars(column_name, precision=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ precision }})
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

Next we can call our macros as {{ local_utils.cents_to_dollars(your_column_name) }}. The local_utils package names comes from the name in our package dbt_project.yml file.

Usage a macros from dbt hub packages is pretty much the same. Imagine we want to generate a surrogate key based on a few columns. This is the functional dbt-utils we previously installed provides: {{ dbt_utils.generate_surrogate_key([field_a, field_b[,...]]) }}

So the macros usage pattern from the third-party package is {{ package_name.macros_name() }}.

Models usage

As we created our own local package local_utils as a prerequisite, which has the following structure:

>>> tree packages/local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
Enter fullscreen mode Exit fullscreen mode

The most important function in dbt is ref(); its impossible to build even moderately complex models without it. ref() is how you reference one model within another inside your package. Here is how this looks in practice:

select
    *
from
    {{ref("model_a")}}
Enter fullscreen mode Exit fullscreen mode

So if we would like to reference my_first_dbt_model from my_second_dbt_model within local_utils package then we do the following:

select
    *
from
    {{ ref("my_first_dbt_model") }}
Enter fullscreen mode Exit fullscreen mode

If we want to reference my_first_dbt_model from our main project then we need to slightly change the way we call it. There is also a two-argument variant of the ref function. With this variant, you can pass both a package name and model name to ref to avoid ambiguity:

select
    *
from
    {{ ref("package_name", "model_name") }}
Enter fullscreen mode Exit fullscreen mode

Our particular case would be as:

select
    *
from
    {{ ref("local_utils", "my_first_dbt_model") }}
Enter fullscreen mode Exit fullscreen mode

Note: The package_name should only include the name of the package, not the maintainer. For example, if we use the dbt-labs/dbt-utils package, type dbt-utils in that argument, and not dbt-labs/dbt-utils.

dbt_modules under the hood

The dbt_modules directory is a directory that is used by dbt to store packages and their models. When you install a package using the dbt deps command, the package and its models are downloaded and stored in the dbt_modules directory.

The dbt_modules directory is located in the root directory of your dbt project. It contains subdirectories for each installed package, and each package directory contains the packages models, macros, and other resources.

The way dbt installs local and dbt hub packages is different.

Considering to have the following package.yml content:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.7.6
  - local: /opt/dbt/packages/local_utils/
Enter fullscreen mode Exit fullscreen mode

You would have the following modules under generated dbt_modules dir:

>>> tree -L 2 dbt_modules
dbt_modules
├── dbt_utils
│   ├── CHANGELOG.md
│   ├── LICENSE
│   ├── README.md
│   ├── RELEASE.md
│   ├── dbt_project.yml
│   ├── docker-compose.yml
│   ├── etc
│   ├── integration_tests
│   ├── macros
│   └── run_test.sh
└── utils -> /opt/dbt/packages/local_utils/
Enter fullscreen mode Exit fullscreen mode

As I mentioned before, it creates a symlink for local packages, and for the dbt hub package, it simply copies all the needed files in the same name folder.

We use Google Cloud Composer to orchestrate all the transformation jobs. We basically copy our project to GCP bucket with gsutil -m rsync. Unfortunately, it does not support symbolic links:

Since gsutil rsync is intended to support data operations (like moving a data set to the cloud for computational processing) and it needs to be compatible both in the cloud and across common operating systems, there are no plans for gsutil rsync to support operating system-specific file types like symlinks.
Enter fullscreen mode Exit fullscreen mode

Taken from gutils rsync's documentation Be Careful When Synchronizing Over Os-Specific File Types (Symlinks, Devices, Etc.)

The possible solution is to compress everything locally to archive, copy it to the bucket, and then unpack it to composer:

tar -czvf dbt-project.tar.gz dbt-project
gsutil -m rsync dbt-project.tar.gz gs://$BUCKET/prefix
Enter fullscreen mode Exit fullscreen mode

Here’s what those switches actually mean:

-c: Create an archive.
-z: Compress the archive with gzip.
-v: Display progress in the terminal while creating the archive, also known as “verbose” mode. The v is always optional in these commands, but it’s helpful.
-f: Allows you to specify the filename of the archive.
Enter fullscreen mode Exit fullscreen mode

These are pitfalls that we met when working with dbt packages.

Disclaimer

  • All this experience applies to dbt v0.21.1
  • I am aware of since v1.0 they changed the default value to dbt_packages instead of dbt_modules
  • I like to think that most of the guide still appears to be applicable to the latest dbt version

Contact

If you found this article helpful, I invite you to connect with me on LinkedIn. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.

Top comments (0)