DEV Community

Ivan N
Ivan N

Posted on

Managing UDFs in dbt

TLDR;
I used dbt's macro and on-run-end hook to create UDFs in BigQuery.

I rencetly had a uses case where we were migrating a customer warehouse to dbt. UDFs were widely used and we wanted to preserve them to make the migration easier.

However dbt does not have any special way of managing UDFs. This mean that you need to write the DDL (data definition language) toCREATE andDROP the UDFs yourself 😱

Here is a solution that I borrowed and adjusted from dbt's forum:

The code

dbt/macros/udfs/clean_dates.sql:

{% macro create_clean_dates() %}
    CREATE OR REPLACE FUNCTION 
    {{ target.schema }}.clean_dates(d STRING) 
    RETURNS DATE AS (
    SAFE.PARSE_DATE(
    "%Y/%m/%d",
    REGEXP_REPLACE(d, r"\-", "/")
    )
    )
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode
  • one file per UDF makes it easier for a developer to quickly find the code when they need to make changes

dbt/macros/create_udfs.sql:

{% macro create_udfs() %}

create schema if not exists {{target.schema}};
{{create_clean_dates()}};
...
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode
  • I use PyCharm's Edit-> Sort Lines to keep UDFs sorted alphabetically
  • I add the semicolon ; here but you can do that in the macro as well, doesn't matter, as long as you are consistent.

dbt/dbt_project.yml:

...
on-run-start:
  - '{{ create_udfs() }}'
Enter fullscreen mode Exit fullscreen mode
  • These two lines make sure that the UDFs are created before any dbt models are run/built. The only downside is that it adds some time to your dbt build.

Naming conventions

We have implemented the following naming conventions to make support easier:

For the example in our UDF clean_dates we do the following:

  1. File has the same name as the UDF: clean_dates.sql
  2. The macro is called create_clean_dates as it makes it easier to distinguish the UDF from the macro
  3. All UDFs are created in the same dataset target.schema, which will default to dbt. This makes it easier to

References:
https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18

Top comments (1)

Collapse
 
ramcdfe1985 profile image
Ramiro Cevallos

Hi Ivan, thank you for this post!

I wonder if you have tried creating a UDF that queries another available source in your project using an input parameter. For example, I am trying to adapt a SQL Server UDF that reads a table 'Projects' from my raw database and returns a Boolean result based on the input variable 'id'. The function is the following:

CREATE FUNCTION fn_ProjectIsAvailable
(
    @Id BIGINT
)
RETURNS BIT
AS
BEGIN

RETURN ISNULL((SELECT TOP 1 1
            FROM Projects p WHERE p.Id= @Id
            ),0)

END
Enter fullscreen mode Exit fullscreen mode

I have tried adapting this UDF following your blog post. I am using DBT using SparkSQL for Databricks. I created the following macros:

my_project/macros/udfs/fn_ProjectIsAvailable.sql
my_project/macros/create_udfs.sql

{% macro create_fn_ProjectIsAvailable() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.fn_ProjectIsAvailable(Id BIGINT)
RETURNS BIT
AS

COALESCE
(
    (
        SELECT 1
        FROM {{ source('raw_db', 'projects') }} p 
        WHERE 
            p.Id = {{ Id }}
        LIMIT 1
    ), 0
)

{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

where:
{{ source('raw_db', 'projects') }} is the same Project table from the SQL Server database that I already ingested into a Databricks schema 'raw_db' using DBT. It is basically a transactional catalogue of projects with their ids.

and

{% macro create_udfs() %}

create schema if not exists {{target.schema}};
{{create_fn_ProjectIsAvailable()}};

{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

I also added this entry to my dbt_project.yml file:

on-run-start:
  - '{{ create_udfs() }}'
Enter fullscreen mode Exit fullscreen mode

Now, I would like to use this UDF/macro in a new DBT model where I intend to read my source {{ source('raw_db', 'projects') }} in this way:

my_project/models/dwh_Projects.sql

SELECT
{{target.schema}}.fn_ProjectIsAvailable(Id) AS IsProjectAvailable
FROM {{ source('raw_db', 'projects') }}
Enter fullscreen mode Exit fullscreen mode

Unfortunately, when I compile the preview on this model, I realised that the 'Id' variable is not being parsed to this query:

SELECT
COALESCE
(
    (
        SELECT 1
        FROM `raw_db`.`projects` p 
        WHERE 
            p.Id = 
        LIMIT 1
    ), 0
) AS IsProjectAvailable
FROM  `raw_db`.`projects`
Enter fullscreen mode Exit fullscreen mode

Please notice how the p.Id = receives a blank.

I don't really know how to solve this and I would really appreciate your feedback. Many thanks.