Fernando Ultremare

BigDAG: A Simple Tool for Managing BigQuery Workflows

Hey everyone,

I wanted to share a tool I've been working on called BigDAG. It was born out of my own frustration with managing data transformations in BigQuery, and I figured others might find it useful too.

The Core Problem: BigQuery Complexity

BigQuery is fantastic for data analysis, but when you start building real pipelines, things can get messy. You end up with:

  • SQL Files Everywhere: Queries for tables, views, and transformations scattered across different files.
  • Manual Dependency Tracking: Figuring out the correct order to run scripts becomes a chore.
  • Repetitive bq Commands: Creating datasets, tables, and views involves running similar commands repeatedly.
  • Deployment Headaches: Pushing changes to BigQuery can be a manual and error-prone process.

I needed a way to bring some order to this chaos, to define my workflows as code, and to automate the deployment process. That's the motivation behind BigDAG.

BigDAG: A Practical Solution

BigDAG is a Python tool that helps you manage data workflows in BigQuery. It's not trying to be a full-fledged data orchestration platform; it's a practical tool to solve a specific problem. Here's the basic idea:

  1. Organize Your Files: You structure your SQL scripts, schema definitions, and external table definitions in a folder. This folder represents your data pipeline.
  2. Dependency Detection: BigDAG automatically detects dependencies between your SQL scripts by parsing the queries. If a view references another table, BigDAG knows that table needs to be created first.
  3. Manual Overrides: If automatic detection isn't enough, you can specify dependencies in a deps.yaml file.
  4. Simple CLI: BigDAG provides a command-line interface (CLI) to execute your workflows. You can create, update, or delete objects in BigQuery with a single command.
  5. Basic Templating: BigDAG uses simple templating to inject your project ID and dataset name into your SQL queries.
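
To make the templating idea concrete, here's roughly what that substitution step amounts to. This is not BigDAG's actual code, just a minimal illustration of the concept:

# Illustrative only: a stand-in for the templating step, which swaps the
# {{project_id}} and {{dataset}} placeholders in a SQL string.
def render_sql(sql: str, project_id: str, dataset: str) -> str:
    return (
        sql.replace("{{project_id}}", project_id)
           .replace("{{dataset}}", dataset)
    )

query = "SELECT * FROM `{{project_id}}.{{dataset}}.trusted_sales`"
print(render_sql(query, "my-gcp-project", "analytics"))
# -> SELECT * FROM `my-gcp-project.analytics.trusted_sales`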

Key Features

  • Dependency Management: Automatically figures out the order to run your scripts.
  • Code-Based Workflows: Define your pipelines using files and YAML.
  • CLI for Deployment: Deploy changes to BigQuery with a single command.
  • Dry Run Option: See the commands that would be executed before running them (a quick sketch of the idea follows this list).
  • Recreation Support: Easily recreate your entire dataset and all objects.
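
To give a feel for what the dry-run option is doing for you, here's a rough Python sketch of the pattern: assemble the bq command, then print it instead of running it. The exact commands BigDAG generates will differ, and the create_view helper below is made up for illustration.

# Illustrative only: the shape of a "dry run" option, not BigDAG's actual code.
import shlex
import subprocess

def create_view(dataset: str, name: str, query: str, dry_run: bool = True) -> None:
    # "bq mk --use_legacy_sql=false --view <query> <dataset>.<view>" creates a view.
    cmd = [
        "bq", "mk",
        "--use_legacy_sql=false",
        "--view", query,
        f"{dataset}.{name}",
    ]
    if dry_run:
        print(shlex.join(cmd))  # show what would run, change nothing
    else:
        subprocess.run(cmd, check=True)

create_view("my_dataset", "trusted_sales", "SELECT * FROM my_dataset.raw_sales")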

A Simple DAG Example

Let's look at a simple example of how you might structure your DAG folder:

my_dag/
├── raw/
│   └── sales.sheet.def.json
│   └── sales.sheet.schema.json
├── trusted/
│   └── sales.view.sql
└── refined/
    └── monthly_sales.table.sql

Here's what the files might contain:

my_dag/raw/sales.sheet.def.json:

{
    "sourceFormat": "GOOGLE_SHEETS",
    "sourceUris": [
      "https://docs.google.com/spreadsheets/d/FAKE_SPREADSHEET_ID"
    ],
    "googleSheetsOptions": {
        "range": "sales!A1:Z",
        "skipLeadingRows": 1
    }
}

my_dag/raw/sales.sheet.schema.json:

{
  "fields": [
    {"name": "sale_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "product_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "amount", "type": "NUMERIC", "mode": "REQUIRED"}
  ]
}

my_dag/trusted/sales.view.sql:

-- SQL script for creating the sales view
SELECT
    *
FROM
    `{{project_id}}.{{dataset}}.raw_sales`;

my_dag/refined/monthly_sales.table.sql:

-- SQL script for creating the monthly_sales table
SELECT
    DATE_TRUNC(sale_date, MONTH) AS month,
    SUM(amount) AS total_sales
FROM
    `{{project_id}}.{{dataset}}.trusted_sales`
GROUP BY
    month;

In this example:

  • sales.sheet.def.json and sales.sheet.schema.json define an external table based on a Google Sheet.
  • sales.view.sql creates a view on top of the raw sales data.
  • monthly_sales.table.sql creates a table based on the trusted sales view.

BigDAG will automatically infer that monthly_sales.table.sql depends on trusted_sales, and that trusted_sales depends on raw_sales.
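
If you're curious how that kind of inference can work, here's a rough, self-contained sketch of the idea in Python: scan each query for references to other objects in the same dataset, then order everything with a topological sort. This is only an illustration of the concept, not BigDAG's actual implementation:

# Illustrative only: one way dependency inference over SQL files can work.
# Scan each query for `{{dataset}}.<name>` references, then topologically
# sort the objects so upstream tables/views are created first.
import re
from graphlib import TopologicalSorter

queries = {
    "raw_sales": None,  # external table, no SQL
    "trusted_sales": "SELECT * FROM `{{project_id}}.{{dataset}}.raw_sales`",
    "monthly_sales": (
        "SELECT DATE_TRUNC(sale_date, MONTH) AS month, SUM(amount) AS total_sales "
        "FROM `{{project_id}}.{{dataset}}.trusted_sales` GROUP BY month"
    ),
}

REF = re.compile(r"\{\{dataset\}\}\.(\w+)")

# Map each object to the set of other objects it references.
graph = {
    name: set(REF.findall(sql or "")) & queries.keys()
    for name, sql in queries.items()
}

print(list(TopologicalSorter(graph).static_order()))
# -> ['raw_sales', 'trusted_sales', 'monthly_sales']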

How to Use It (Quickly)

  1. Install:

    git clone https://github.com/feroult/bigdag.git
    cd bigdag
    pip install -e .
    
  2. Set up your DAG folder: Organize your SQL scripts and definitions.

  3. Run the CLI:

    bigdag --folder path/to/your/dag --project your_project_id --dataset your_dataset_name
    

The Goal: Making Life Easier

BigDAG is about making data workflows more manageable. It's about:

  • Reducing Manual Errors: Automating repetitive tasks to avoid mistakes.
  • Improving Workflow Clarity: Making it easier to understand and maintain your data pipelines.
  • Saving Time: Spending less time on infrastructure and more time on analysis.

Check it Out

If you're dealing with similar challenges in BigQuery, you might find BigDAG useful. It's open-source, so feel free to take a look, contribute, or just use it as is.

https://github.com/feroult/bigdag

Hope it helps!
