TLDR;
I used dbt macros and the `on-run-start` hook to create UDFs in BigQuery.
I recently had a use case where we were migrating a customer warehouse to dbt. UDFs were widely used, and we wanted to preserve them to make the migration easier.
However, dbt does not have any special way of managing UDFs. This means that you need to write the DDL (data definition language) to `CREATE` and `DROP` the UDFs yourself 😱
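For reference, this is the kind of DDL you would otherwise run by hand. A minimal sketch, assuming a made-up dataset `my_dataset` and a throwaway UDF:

```sql
-- Create (or update) a scalar SQL UDF in BigQuery by hand
CREATE OR REPLACE FUNCTION my_dataset.to_upper(s STRING)
RETURNS STRING AS (UPPER(s));

-- ...and remember to clean it up yourself too
DROP FUNCTION IF EXISTS my_dataset.to_upper;
```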
Here is a solution that I borrowed and adjusted from dbt's forum:
The code
`dbt/macros/udfs/clean_dates.sql`:
```sql
{% macro create_clean_dates() %}
CREATE OR REPLACE FUNCTION
{{ target.schema }}.clean_dates(d STRING)
RETURNS DATE AS (
    SAFE.PARSE_DATE(
        "%Y/%m/%d",
        REGEXP_REPLACE(d, r"\-", "/")
    )
)
{% endmacro %}
```
- One file per UDF makes it easier for a developer to quickly find the code when they need to make changes.
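Once the hook below has run, any model can call the UDF by its schema-qualified name. A minimal sketch, with a made-up model and source:

```sql
-- models/staging/stg_orders.sql (hypothetical model)
SELECT
    order_id,
    -- clean_dates is created by the on-run-start hook before any model builds
    {{ target.schema }}.clean_dates(raw_order_date) AS order_date
FROM {{ source('raw', 'orders') }}
```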
`dbt/macros/create_udfs.sql`:
```sql
{% macro create_udfs() %}
create schema if not exists {{ target.schema }};
{{ create_clean_dates() }};
...
{% endmacro %}
```
- I use PyCharm's `Edit -> Sort Lines` to keep the UDF macro calls sorted alphabetically.
- I add the semicolon `;` here, but you can do that in each UDF macro instead. It doesn't matter which you pick, as long as you are consistent.
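For clarity, this is roughly what dbt renders and sends to BigQuery at the start of a run, assuming the target schema is `dbt`:

```sql
create schema if not exists dbt;
CREATE OR REPLACE FUNCTION
dbt.clean_dates(d STRING)
RETURNS DATE AS (
    SAFE.PARSE_DATE(
        "%Y/%m/%d",
        REGEXP_REPLACE(d, r"\-", "/")
    )
);
```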
`dbt/dbt_project.yml`:
```yaml
...
on-run-start:
  - '{{ create_udfs() }}'
```
- These two lines make sure that the UDFs are created before any dbt model is run or built. The only downside is that it adds some time to every dbt build.
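A quick way to confirm that the hook worked is to call the UDF directly after a `dbt run`, again assuming the target schema is `dbt`:

```sql
SELECT dbt.clean_dates('2021-01-15');  -- returns DATE '2021-01-15'
SELECT dbt.clean_dates('not a date');  -- returns NULL thanks to SAFE.PARSE_DATE
```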
Naming conventions
We have implemented the following naming conventions to make support easier. For the example UDF `clean_dates`, we do the following:
- The file has the same name as the UDF: `clean_dates.sql`.
- The macro is called `create_clean_dates`, as this makes it easier to distinguish the UDF from the macro.
- All UDFs are created in the same dataset, `target.schema`, which will default to `dbt`
. This makes it easier to find and reference them.
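Putting these conventions together, adding a second UDF would look like this. A hypothetical sketch (the `parse_amount` function is made up):

```sql
-- dbt/macros/udfs/parse_amount.sql (hypothetical)
{% macro create_parse_amount() %}
CREATE OR REPLACE FUNCTION
{{ target.schema }}.parse_amount(s STRING)
RETURNS NUMERIC AS (
    -- strip currency symbols and thousands separators before casting
    SAFE_CAST(REGEXP_REPLACE(s, r"[$,]", "") AS NUMERIC)
)
{% endmacro %}
```

Then register it with one extra line in `create_udfs`: `{{ create_parse_amount() }};`.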
References:
https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18