What's the Concept?

For twenty years the canonical downstream consumer of a data pipeline was a BI tool — Looker, Tableau, a Mode dashboard — and the engineering tradeoffs reflected that. Tables were optimized for aggregate queries, joins, and human eyeballs. Latency budgets were "a dashboard should load in under 5 seconds." The pipeline's job was to make the analyst's query fast.

For an AI agent, the downstream consumer is an LLM that calls a retrieval tool, gets back rows or chunks, and has to fit the answer into a context window before it can reason. The constraints shift in three specific ways:

  1. Token budget replaces render budget. A 200-row query result is fine for a dashboard but almost unusable as agent context. Pipelines now optimize for "smallest useful payload."
  2. Semantics replace dimensions. The agent doesn't filter by "region = APAC" — it asks "what did customers in Asia complain about last week?" The pipeline needs to expose both structured columns and embedded text.
  3. Freshness pressure changes shape. A dashboard re-runs nightly. An agent's tool call might fire any time the user types. The cache-invalidation rules tighten.
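The token-budget constraint in point 1 can be sketched in a few lines. This is a hypothetical helper, not a real library: it uses a crude 4-characters-per-token heuristic in place of a real tokenizer, and the row shape is invented for illustration.

```python
import json

def fit_to_budget(rows, max_tokens=4000):
    """Keep rows, in order, until the estimated token budget is exhausted."""
    kept, used = [], 0
    for row in rows:
        cost = len(json.dumps(row)) // 4  # rough token estimate
        if used + cost > max_tokens:
            break
        kept.append(row)
        used += cost
    return kept

# A 500-row result that would be fine on a dashboard...
rows = [{"id": i, "note": "complaint text " * 20} for i in range(500)]
# ...gets trimmed to the "smallest useful payload" before reaching the agent.
payload = fit_to_budget(rows, max_tokens=1000)
```

In practice you would pre-summarize rather than truncate, but the shape of the constraint is the same: the pipeline, not the agent, enforces the budget.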

How It Works

The warehouse-to-agent translation, side by side:

| Classic warehouse | Agent-ready warehouse |
| --- | --- |
| Star schema, denormalized for joins | Wider, agent-ready tables with embedded context |
| Daily/hourly batch refresh | Mixed: batch for history, streaming for "now" |
| Output → dashboard | Output → tool call response (JSON or rows) |
| Optimized for SELECT … GROUP BY | Optimized for SELECT … WHERE id IN (...) + ANN |
| Quality = "the numbers tie out" | Quality = "the agent gives the right answer" |

The data engineer's job in this world isn't to abandon the warehouse — it's to add new endpoints. The BI marts stay where they are. Alongside them you add agent marts: tables shaped for retrieval, embeddings co-located with rows, summaries pre-computed, freshness contracts honored.
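A minimal sketch of such an agent mart, using SQLite to stand in for BigQuery. All table and column names are illustrative; the point is the shape: structured columns, source text, a pre-computed summary, and the embedding all co-located in one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent_mart_tickets (
        ticket_id   INTEGER PRIMARY KEY,
        region      TEXT,
        opened_at   TEXT,
        body        TEXT,   -- original source text
        summary     TEXT,   -- pre-computed, token-cheap answer
        embedding   BLOB    -- vector co-located with the row it describes
    )
""")
conn.execute(
    "INSERT INTO agent_mart_tickets VALUES (?, ?, ?, ?, ?, ?)",
    (1, "APAC", "2024-05-01", "Long complaint about shipping delays...",
     "Shipping delay complaint", b"\x00" * 16),
)

# The agent's tool call is a narrow lookup, not a wide scan:
row = conn.execute(
    "SELECT summary FROM agent_mart_tickets WHERE ticket_id IN (1)"
).fetchone()
```

The retrieval query returns the cheap summary column, not the full body, which is what keeps the payload inside the agent's token budget.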

Why It Matters

  • You inherit decades of practice. Dimensional modeling, slowly-changing dimensions, conformed dimensions — all still apply. Don't reinvent.
  • The pipeline becomes the product API. An agent's tool call to BigQuery is a public interface in everything but name. Schema changes are breaking changes.
  • The same dataset can serve both consumers. A well-designed silver-layer fact table serves the analyst's dashboard and the agent's retrieval tool. Don't fork your pipeline per consumer.
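The "one dataset, two consumers" point can be shown concretely. This sketch again uses SQLite as a warehouse stand-in, with an invented fact table: the dashboard runs an aggregate over it, the agent's tool runs a point lookup against the very same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?)",
    [(1, "APAC", 120.0), (2, "APAC", 80.0), (3, "EMEA", 50.0)],
)

# BI consumer: aggregate query for the dashboard.
dashboard = conn.execute(
    "SELECT region, SUM(amount) FROM fact_orders GROUP BY region ORDER BY region"
).fetchall()

# Agent consumer: narrow point lookup for a tool call.
tool_response = conn.execute(
    "SELECT order_id, amount FROM fact_orders WHERE order_id IN (1, 3)"
).fetchall()
```

Neither consumer required a fork of the table; only the access pattern differs.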

Key Technical Details

  • A reasonable agent-mart target is ≤200 rows or ≤4,000 tokens per tool call. Pre-aggregate or pre-summarize anything larger.
  • Co-locate the embedding column with the source text in the same table. Avoid one-table-for-text, another-table-for-vectors splits — they break consistency guarantees during refresh.
  • Pre-compute "the answer the agent will probably want." If users repeatedly ask for top-N or summaries, materialize those.
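The last bullet, materializing the probable answer, might look like this. The tables and the "top 2 topics per region" rule are invented for illustration; SQLite (3.25+, for window functions) stands in for the warehouse, where this would be a scheduled materialization job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complaints (region TEXT, topic TEXT, n INT)")
conn.executemany("INSERT INTO complaints VALUES (?, ?, ?)", [
    ("APAC", "shipping", 40), ("APAC", "billing", 25), ("APAC", "login", 5),
    ("EMEA", "billing", 30),
])

# Materialization step: pre-compute top-N topics per region so the agent's
# tool reads a tiny table instead of scanning raw complaints at call time.
conn.execute("""
    CREATE TABLE top_complaints AS
    SELECT region, topic, n FROM (
        SELECT region, topic, n,
               ROW_NUMBER() OVER (PARTITION BY region ORDER BY n DESC) AS rk
        FROM complaints
    ) WHERE rk <= 2
""")

top = conn.execute(
    "SELECT topic FROM top_complaints WHERE region = 'APAC' ORDER BY n DESC"
).fetchall()
```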

Common Misconceptions

"We need a separate database for agents." You don't. A row in BigQuery with a VECTOR column is a perfectly usable retrieval primitive. Resist the urge to bolt on a separate vector DB until you've actually hit BigQuery's limits.
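To make "a row with a vector column is a usable retrieval primitive" concrete, here is a toy sketch: SQLite holds the rows and their packed vectors, and brute-force cosine similarity in Python stands in for the warehouse's vector search. The 3-dimensional vectors and document texts are invented; real embeddings would come from a model.

```python
import math
import sqlite3
import struct

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INT, text TEXT, embedding BLOB)")
for doc_id, text, vec in [
    (1, "shipping delays in Asia", (0.9, 0.1, 0.0)),
    (2, "billing dispute",         (0.0, 1.0, 0.2)),
]:
    conn.execute("INSERT INTO docs VALUES (?, ?, ?)",
                 (doc_id, text, struct.pack("3f", *vec)))

query = (1.0, 0.0, 0.0)  # pretend-embedding of the user's question
ranked = sorted(
    ((cosine(query, struct.unpack("3f", emb)), text)
     for _, text, emb in conn.execute("SELECT * FROM docs")),
    reverse=True,
)
best = ranked[0][1]
```

Brute force is fine until row counts justify an ANN index; the architectural point is that the vectors never leave the table their source rows live in.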

"The agent will just write SQL." Tool-calling agents that emit free-form SQL against your warehouse work in demos and fail in production — they hit cost, performance, and safety walls. Pre-built tools with constrained queries are the production pattern.
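A constrained tool in this sense is just a parameterized query behind a narrow function signature. In this sketch (SQLite standing in for the warehouse, names invented), the agent chooses only a whitelisted region and a capped limit; no SQL it writes ever reaches the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INT, region TEXT, summary TEXT)")
conn.executemany("INSERT INTO tickets VALUES (?, ?, ?)", [
    (1, "APAC", "shipping delay"), (2, "EMEA", "billing issue"),
])

ALLOWED_REGIONS = {"APAC", "EMEA", "AMER"}

def recent_tickets(region: str, limit: int = 20):
    """The only query the agent can run; free-form SQL never reaches the warehouse."""
    if region not in ALLOWED_REGIONS:
        raise ValueError(f"unknown region: {region}")
    limit = min(limit, 200)  # hard cap keeps the payload agent-sized
    return conn.execute(
        "SELECT id, summary FROM tickets WHERE region = ? LIMIT ?",
        (region, limit),
    ).fetchall()

rows = recent_tickets("APAC", limit=5)
```

The validation and the hard row cap are what hold the cost and safety walls in place; exposing a dozen such functions beats exposing one free-form SQL endpoint.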

Further Reading

  • Ralph Kimball, "The Data Warehouse Toolkit" (3rd ed.) — Still the reference on dimensional modeling.
  • Google Cloud, "Modern Data Stack on Google Cloud" whitepaper — How GCP frames the warehouse layers.
  • LangChain, "Anatomy of an Agent Retrieval Tool" — Practical patterns for shaping retrieval responses.