What's the Concept?
A pile of CREATE OR REPLACE TABLE queries in a folder is a maintenance nightmare. Which runs first? What depends on what? How do I test a change before deploying it? How do I roll back?
dbt (data build tool) is the answer the industry standardized on around 2020. It's simple: write each table definition as a .sql file with templating, declare its dependencies, and the tool builds a DAG, runs them in order, and tests them along the way.
GCP also ships Dataform, a native equivalent integrated with BigQuery. The mental model is identical; the choice is mostly about whether you want a cloud-managed UI (Dataform) or an open-source ecosystem (dbt). For this course we describe dbt because the patterns are more widely transferable.
How It Works
A dbt project mirrors the medallion layout:
```
dbt-project/
├── models/
│   ├── staging/          ← silver models (one .sql per silver table)
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   │   └── stg_tickets.sql
│   └── marts/            ← gold models (one .sql per agent use case)
│       ├── billing_agent_context.sql
│       └── support_agent_context.sql
├── tests/
│   ├── orders_no_negative_amounts.sql
│   └── customers_unique_id.sql
└── dbt_project.yml
```

Each model is a SELECT statement; the surrounding CREATE TABLE is generated:
```sql
-- models/staging/stg_orders.sql
{{ config(
    materialized='table',
    partition_by={'field': 'created_at', 'data_type': 'timestamp'},
    cluster_by=['customer_id']
) }}

WITH parsed AS (
    SELECT
        JSON_VALUE(payload, '$.id') AS order_id,
        JSON_VALUE(payload, '$.customer') AS customer_id,
        SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
        TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
        _ingestion_timestamp AS _ingested_at
    FROM {{ source('bronze', 'stripe_charges_raw') }}
    WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)

SELECT * FROM parsed
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
```

Gold models reference silver models with `{{ ref('stg_orders') }}` instead of hard-coded table names. dbt parses these references, builds the DAG, and runs models in the correct order. If `stg_orders.sql` fails, dependent gold models are skipped — no half-built warehouse.
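As a sketch of what a gold model might look like, here is a hypothetical mart that joins the two staging models via `ref()` (the aggregate columns are illustrative assumptions, not part of the example project):

```sql
-- models/marts/billing_agent_context.sql (illustrative sketch)
{{ config(materialized='table') }}

SELECT
    c.customer_id,
    COUNT(o.order_id)   AS order_count,
    SUM(o.amount_cents) AS lifetime_spend_cents,
    MAX(o.created_at)   AS last_order_at
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
    USING (customer_id)
GROUP BY c.customer_id
```

Because both upstream tables are referenced through `ref()`, dbt knows this model must run after `stg_customers` and `stg_orders` — no manual ordering required.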
Tests are SQL queries that return rows when something is wrong:
```sql
-- tests/orders_no_negative_amounts.sql
SELECT order_id FROM {{ ref('stg_orders') }}
WHERE amount_cents < 0
```

`dbt test` runs all tests; any row returned is a failure. Standard tests (`unique`, `not_null`, `relationships`) are declared in YAML, not SQL.
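Those YAML-declared standard tests look roughly like this — a minimal `schema.yml` sketch for the staging models above (file placement and column list assumed for illustration):

```yaml
# models/staging/schema.yml (illustrative sketch)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique        # one row per order after deduplication
          - not_null
      - name: customer_id
        tests:
          # every order must point at a known customer
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Declaring these in YAML keeps the common cases terse; custom SQL tests like the one above cover everything else.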
Why It Matters
- The DAG is explicit. New engineers can see the entire transformation pipeline in one diagram. Implicit-dependency bugs disappear.
- Tests catch regressions before production. A silver-layer schema change that breaks gold gets flagged in the CI run, not at 3am.
- Backfills become safe. `dbt run --select stg_orders+` rebuilds `stg_orders` and everything downstream of it, in order.
- The warehouse becomes a code repository. PRs, code review, history, blame — all the engineering practices apply.
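The DAG idea is simple enough to sketch. Each model lists the models it `ref()`s, and a topological sort yields a safe build order — a minimal illustration of the principle (not dbt's actual implementation), using the hypothetical dependency map from the example project:

```python
from graphlib import TopologicalSorter

# Hypothetical map of model -> models it ref()s, mirroring the
# example project above (dbt derives this by parsing the SQL files).
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "billing_agent_context": {"stg_orders", "stg_customers"},
    "support_agent_context": {"stg_customers"},
}

# static_order() emits each model only after all of its dependencies,
# so staging models always run before the marts that ref() them.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

The same structure is what makes `--select stg_orders+` work: the selector is just a graph traversal from `stg_orders` through its descendants.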
Key Technical Details
- dbt ships as the free open-source dbt Core CLI and the paid dbt Cloud service. The CLI plus a GitHub Action for CI is the cheap production setup.
- Dataform (Google's native) is free with BigQuery and offers similar features with tighter GCP integration; pick it if you want fewer moving parts.
- Models default to `view` materialization. For anything queried more than once, use `table` (full rebuild) or `incremental` (append/merge new data only).
- Source freshness checks (`dbt source freshness`) detect upstream pipeline lag — useful as an early warning.
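The `incremental` materialization in the first bullet typically pairs dbt's `is_incremental()` macro with the `{{ this }}` relation. A sketch, reusing the staging example's column names:

```sql
-- Sketch of an incremental model (names reuse the staging example)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT * FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only pull rows newer than what's already loaded;
-- unique_key makes dbt merge on order_id instead of blindly appending.
WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`) the `WHERE` clause is skipped and the table is built from scratch; afterwards each run processes only new rows.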
Common Misconceptions
"dbt is just SQL with extra steps." It's SQL with a dependency graph, a test framework, documentation generation, and CI integration. The "extra steps" are the actual engineering practice.
"We're too small for dbt." dbt is more useful at small scale than at large — that's when changes are most frequent and least documented. Start as soon as you have more than ~5 transformation queries.
"dbt and Dataform are competing tools." They serve the same role with different ecosystems. Pick one based on team preference; don't run both.
Connections to Other Concepts
- `04-incremental-and-idempotent-pipelines` — How dbt's `incremental` materialization works in practice.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer` — Wiring dbt into Airflow.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring` — dbt tests as the first tier of data quality monitoring.
Further Reading
- dbt Labs, dbt Best Practices Guide — Community-maintained, current, updated continuously. The closest thing to canonical for project structure, naming, and testing strategy.
- Tristan Handy et al., dbt official documentation and "dbt Fundamentals" course (free) — Start here if you've never used dbt.
- Google Cloud, "Dataform overview" + "Dataform SQLX reference" — The native equivalent inside BigQuery; free, tighter integration than dbt.
- dbt Semantic Layer (GA 2024) — Worth knowing about: declarative metric definitions on top of dbt models, queryable from BI tools and (now) agent tool calls. Maps onto the agent-mart pattern in this course's Module 05.
- Google Cloud, "BigQuery managed Apache Iceberg tables" (GA 2024) — Iceberg-formatted tables natively in BigQuery, queryable from Spark/Trino too. Worth knowing about if you'll ever need cross-engine reads on the same gold tables.