Ivan N

Posted on Aug 4, 2023

Managing UDFs in dbt

#dbt #sql #bigquery #gcp

TLDR;
I used dbt's macro and on-run-end hook to create UDFs in BigQuery.

I rencetly had a uses case where we were migrating a customer warehouse to dbt. UDFs were widely used and we wanted to preserve them to make the migration easier.

However dbt does not have any special way of managing UDFs. This mean that you need to write the DDL (data definition language) toCREATE andDROP the UDFs yourself 😱

Here is a solution that I borrowed and adjusted from dbt's forum:

The code

dbt/macros/udfs/clean_dates.sql:

{% macro create_clean_dates() %}
    CREATE OR REPLACE FUNCTION 
    {{ target.schema }}.clean_dates(d STRING) 
    RETURNS DATE AS (
    SAFE.PARSE_DATE(
    "%Y/%m/%d",
    REGEXP_REPLACE(d, r"\-", "/")
    )
    )
{% endmacro %}

one file per UDF makes it easier for a developer to quickly find the code when they need to make changes

dbt/macros/create_udfs.sql:

{% macro create_udfs() %}

create schema if not exists {{target.schema}};
{{create_clean_dates()}};
...
{% endmacro %}

I use PyCharm's Edit-> Sort Lines to keep UDFs sorted alphabetically
I add the semicolon ; here but you can do that in the macro as well, doesn't matter, as long as you are consistent.

dbt/dbt_project.yml:

...
on-run-start:
  - '{{ create_udfs() }}'

These two lines make sure that the UDFs are created before any dbt models are run/built. The only downside is that it adds some time to your dbt build.

Naming conventions

We have implemented the following naming conventions to make support easier:

For the example in our UDF clean_dates we do the following:

File has the same name as the UDF: clean_dates.sql
The macro is called create_clean_dates as it makes it easier to distinguish the UDF from the macro
All UDFs are created in the same dataset target.schema, which will default to dbt. This makes it easier to

References:
https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18

Top comments (1)

Ramiro Cevallos • Nov 24 '23

Hi Ivan, thank you for this post!

I wonder if you have tried creating a UDF that queries another available source in your project using an input parameter. For example, I am trying to adapt a SQL Server UDF that reads a table 'Projects' from my raw database and returns a Boolean result based on the input variable 'id'. The function is the following:

CREATE FUNCTION fn_ProjectIsAvailable
(
    @Id BIGINT
)
RETURNS BIT
AS
BEGIN

RETURN ISNULL((SELECT TOP 1 1
            FROM Projects p WHERE p.Id= @Id
            ),0)

END

I have tried adapting this UDF following your blog post. I am using DBT using SparkSQL for Databricks. I created the following macros:

my_project/macros/udfs/fn_ProjectIsAvailable.sql
my_project/macros/create_udfs.sql

{% macro create_fn_ProjectIsAvailable() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.fn_ProjectIsAvailable(Id BIGINT)
RETURNS BIT
AS

COALESCE
(
    (
        SELECT 1
        FROM {{ source('raw_db', 'projects') }} p 
        WHERE 
            p.Id = {{ Id }}
        LIMIT 1
    ), 0
)

{% endmacro %}

where:
{{ source('raw_db', 'projects') }} is the same Project table from the SQL Server database that I already ingested into a Databricks schema 'raw_db' using DBT. It is basically a transactional catalogue of projects with their ids.

and

{% macro create_udfs() %}

create schema if not exists {{target.schema}};
{{create_fn_ProjectIsAvailable()}};

{% endmacro %}

I also added this entry to my dbt_project.yml file:

on-run-start:
  - '{{ create_udfs() }}'

Now, I would like to use this UDF/macro in a new DBT model where I intend to read my source {{ source('raw_db', 'projects') }} in this way:

my_project/models/dwh_Projects.sql

SELECT
{{target.schema}}.fn_ProjectIsAvailable(Id) AS IsProjectAvailable
FROM {{ source('raw_db', 'projects') }}

Unfortunately, when I compile the preview on this model, I realised that the 'Id' variable is not being parsed to this query:

SELECT
COALESCE
(
    (
        SELECT 1
        FROM `raw_db`.`projects` p 
        WHERE 
            p.Id = 
        LIMIT 1
    ), 0
) AS IsProjectAvailable
FROM  `raw_db`.`projects`

Please notice how the p.Id = receives a blank.

I don't really know how to solve this and I would really appreciate your feedback. Many thanks.

DEV Community

Managing UDFs in dbt

The code

Naming conventions

Top comments (1)

Read next

PL/SQL Developer, interview questions

.NET8/EF8 - Exception: Microsoft.Data.SqlClient.SqlException (0x80131904): Incorrect syntax near '$

How to Delete Duplicate Data from a MySQL Database

Common Table Expression (CTE) in PLSQL | Part 1