What's the Concept?
A pile of CREATE OR REPLACE TABLE queries in a folder is a maintenance nightmare. Which runs first? What depends on what? How do I test a change before deploying it? How do I roll back?
dbt (data build tool) is the answer the industry standardized on around 2020. It's simple: write each table definition as a .sql file with templating, declare its dependencies, and the tool builds a DAG, runs them in order, and tests them along the way.
GCP also ships Dataform, a native equivalent integrated with BigQuery. The mental model is identical; the choice is mostly about whether you want a cloud-managed UI (Dataform) or an open-source ecosystem (dbt). For this course we describe dbt because the patterns are more widely transferable.
How It Works
A dbt project mirrors the medallion layout:
```
dbt-project/
├── models/
│   ├── staging/          ← silver models (one .sql per silver table)
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   │   └── stg_tickets.sql
│   └── marts/            ← gold models (one .sql per agent use case)
│       ├── billing_agent_context.sql
│       └── support_agent_context.sql
├── tests/
│   ├── orders_no_negative_amounts.sql
│   └── customers_unique_id.sql
└── dbt_project.yml
```

Each model is a SELECT statement; the surrounding CREATE TABLE is generated:
```sql
-- models/staging/stg_orders.sql
{{ config(
    materialized='table',
    partition_by={'field': 'created_at', 'data_type': 'timestamp'},
    cluster_by=['customer_id']
) }}

WITH parsed AS (
    SELECT
        JSON_VALUE(payload, '$.id') AS order_id,
        JSON_VALUE(payload, '$.customer') AS customer_id,
        SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
        TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
        _ingestion_timestamp AS _ingested_at
    FROM {{ source('bronze', 'stripe_charges_raw') }}
    WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)

SELECT * FROM parsed
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
```

Gold models reference silver models with `{{ ref('stg_orders') }}` instead of hard-coded table names. dbt parses these references, builds the DAG, and runs models in the correct order. If `stg_orders.sql` fails, dependent gold models are skipped — no half-built warehouse.
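As a sketch of what a gold model might look like, here is a hypothetical mart that joins the two staging models via `ref()` (the aggregate columns are illustrative assumptions, not part of the example project):

```sql
-- models/marts/billing_agent_context.sql (illustrative sketch)
{{ config(materialized='table') }}

SELECT
    c.customer_id,
    COUNT(o.order_id)   AS order_count,
    SUM(o.amount_cents) AS lifetime_spend_cents,
    MAX(o.created_at)   AS last_order_at
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
    USING (customer_id)
GROUP BY c.customer_id
```

Because both upstream tables are referenced through `ref()`, dbt knows this model must run after `stg_customers` and `stg_orders` — no manual ordering required.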
Tests are SQL queries that return rows when something is wrong:
```sql
-- tests/orders_no_negative_amounts.sql
SELECT order_id FROM {{ ref('stg_orders') }}
WHERE amount_cents < 0
```

`dbt test` runs all tests; any row returned is a failure. Standard tests (`unique`, `not_null`, `relationships`) are declared in YAML, not SQL.
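Those YAML-declared standard tests look roughly like this — a minimal `schema.yml` sketch for the staging models above (file placement and column list assumed for illustration):

```yaml
# models/staging/schema.yml (illustrative sketch)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique        # one row per order after deduplication
          - not_null
      - name: customer_id
        tests:
          # every order must point at a known customer
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Declaring these in YAML keeps the common cases terse; custom SQL tests like the one above cover everything else.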
Why It Matters
- The DAG is explicit. New engineers can see the entire transformation pipeline in one diagram. Implicit-dependency bugs disappear.
- Tests catch regressions before production. A silver-layer schema change that breaks gold gets flagged in the CI run, not at 3am.
- Backfills become safe. `dbt run --select stg_orders+` rebuilds `stg_orders` and everything downstream of it, in order.
- The warehouse becomes a code repository. PRs, code review, history, blame — all the engineering practices apply.
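The DAG idea is simple enough to sketch. Each model lists the models it `ref()`s, and a topological sort yields a safe build order — a minimal illustration of the principle (not dbt's actual implementation), using the hypothetical dependency map from the example project:

```python
from graphlib import TopologicalSorter

# Hypothetical map of model -> models it ref()s, mirroring the
# example project above (dbt derives this by parsing the SQL files).
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "billing_agent_context": {"stg_orders", "stg_customers"},
    "support_agent_context": {"stg_customers"},
}

# static_order() emits each model only after all of its dependencies,
# so staging models always run before the marts that ref() them.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

The same structure is what makes `--select stg_orders+` work: the selector is just a graph traversal from `stg_orders` through its descendants.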
Key Technical Details
- dbt ships as the free open-source dbt Core CLI and the paid dbt Cloud service. The CLI plus a GitHub Action for CI is the cheap production setup.
- Dataform (Google's native) is free with BigQuery and offers similar features with tighter GCP integration; pick it if you want fewer moving parts.
- Models default to `view` materialization. For anything queried more than once, use `table` (full rebuild) or `incremental` (append/merge new data only).
- Source freshness checks (`dbt source freshness`) detect upstream pipeline lag — useful as an early warning.
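The `incremental` materialization in the first bullet typically pairs dbt's `is_incremental()` macro with the `{{ this }}` relation. A sketch, reusing the staging example's column names:

```sql
-- Sketch of an incremental model (names reuse the staging example)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT * FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only pull rows newer than what's already loaded;
-- unique_key makes dbt merge on order_id instead of blindly appending.
WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`) the `WHERE` clause is skipped and the table is built from scratch; afterwards each run processes only new rows.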
Common Misconceptions
"dbt is just SQL with extra steps." It's SQL with a dependency graph, a test framework, documentation generation, and CI integration. The "extra steps" are the actual engineering practice.
"We're too small for dbt." dbt is more useful at small scale than at large — that's when changes are most frequent and least documented. Start as soon as you have more than ~5 transformation queries.
"dbt and Dataform are competing tools." They serve the same role with different ecosystems. Pick one based on team preference; don't run both.
Connections to Other Concepts
- `04-incremental-and-idempotent-pipelines` — How dbt's `incremental` materialization works in practice.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer` — Wiring dbt into Airflow.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring` — dbt tests as the first tier of data quality monitoring.
Further Reading
- dbt Labs, dbt Best Practices Guide — Community-maintained, current, updated continuously. The closest thing to canonical for project structure, naming, and testing strategy.
- Tristan Handy et al., dbt official documentation and "dbt Fundamentals" course (free) — Start here if you've never used dbt.
- Google Cloud, "Dataform overview" + "Dataform SQLX reference" — The native equivalent inside BigQuery; free, tighter integration than dbt.
- dbt Semantic Layer (GA 2024) — Worth knowing about: declarative metric definitions on top of dbt models, queryable from BI tools and (now) agent tool calls. Maps onto the agent-mart pattern in this course's Module 05.
- Google Cloud, "BigQuery managed Apache Iceberg tables" (GA 2024) — Iceberg-formatted tables natively in BigQuery, queryable from Spark/Trino too. Worth knowing about if you'll ever need cross-engine reads on the same gold tables.