Course · 8 modules · 33 lessons · 122 min

Data Engineering for AI Agents on GCP

The pattern that makes agents trustworthy: ingest external data into a Cloud Storage lake, refine it through BigQuery, and serve it to agents via structured and semantic retrieval. End-to-end on Google Cloud, from raw bytes to agent context — with a curated 2024–2026 research reading list.

Foundations
№ 01 · The Agent–Data Flywheel · 3 min
AI agents are only as good as the data they can reach at inference time, so the real product moat is the pipeline that keeps that data fresh, clean, and shaped for retrieval.

№ 02 · From Warehouse to Agent Context · 3 min
The classic data warehouse pattern still applies for AI agents, but the "consumer" is now an LLM with a token budget instead of a BI dashboard with a render budget.

№ 03 · The Medallion Pattern: Bronze, Silver, Gold · 3 min
A three-tier lake-and-warehouse layout — bronze for raw, silver for cleaned, gold for serve-ready — that keeps refinement steps inspectable and reversible.

№ 04 · The GCP Data Stack at a Glance · 3 min
A map of the GCP services this course relies on, what each does, and where it sits in the bronze → silver → gold flow.
Ingestion Patterns
№ 01 · Batch Ingestion From APIs · 3 min
A scheduled puller that hits an external HTTP API, paginates through the response, and lands raw JSON in GCS — the simplest and most common ingestion shape.

№ 02 · Event Streams with Pub/Sub · 4 min
Pub/Sub is the buffered, at-least-once event bus that lets producers fire events any time and consumers process them on their own schedule.

№ 03 · Change Data Capture from Databases · 3 min
CDC reads a source database's transaction log directly, turning every insert/update/delete into a stream of events the warehouse can mirror in near-real-time.

№ 04 · Files and Bulk Loads Into GCS · 3 min
File-based ingestion — partners drop CSV/Parquet/JSON files into a bucket, you pick them up and land them into bronze — is the boring, durable ingestion pattern that handles most enterprise data exchange.
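The batch shape in № 01 — pull, paginate, serialize, land in bronze — fits in a few lines. The sketch below is illustrative rather than course code: the `next_cursor` field, the `fetch_page` callable, and the JSONL landing format are assumptions about a typical cursor-paginated API.

```python
import json
from typing import Callable, Iterator, Optional

def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Walk a cursor-paginated API: keep calling fetch_page until no next_cursor remains."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break

def to_jsonl(records) -> str:
    """Serialize records as newline-delimited JSON — the raw shape to land in a bronze GCS object."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Stubbed two-page API for illustration; a real puller would issue HTTP requests here.
pages = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
    "p2": {"items": [{"id": 3}]},
}
payload = to_jsonl(paginate(lambda c: pages[c]))
```

The payload string is what a real job would write to GCS unchanged — bronze stores bytes as received, and all typing happens later in the warehouse.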
The Raw Data Lake
№ 01 · Cloud Storage as a Lake · 3 min
GCS is the GCP-native object store, and with a consistent path convention plus lifecycle rules it functions as the bronze layer of your warehouse without any other moving parts.

№ 02 · Bucket Layouts and Partitioning · 3 min
How you arrange files inside GCS buckets determines what's queryable, what's cheap to query, and what becomes a refactor nightmare six months in — pick the layout once and commit.

№ 03 · Schema-on-Read vs Schema-on-Write · 3 min
The choice of when to enforce a schema — at ingest time or at query time — sets the rigidity-versus-flexibility tradeoff for your entire pipeline.

№ 04 · Data Governance From Day One · 3 min
Governance — knowing what data exists, who owns it, who can access it, and how long it lives — costs almost nothing when you set it up at the start and is painful to retrofit later.
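A "consistent path convention" is concrete enough to encode once and reuse everywhere. One common choice (a sketch, not the course's prescribed layout — the `bronze/<source>/<entity>` prefix and `dt=` partition key are assumptions) is Hive-style date partitioning, which BigQuery external tables and lifecycle rules can both key on:

```python
from datetime import date

def bronze_path(source: str, entity: str, run_day: date, batch_id: str) -> str:
    """Hive-style dt= partitioning keeps bronze objects cheap to filter and easy to expire."""
    return f"bronze/{source}/{entity}/dt={run_day:%Y-%m-%d}/{batch_id}.jsonl"

path = bronze_path("docs_api", "pages", date(2026, 1, 15), "batch-0001")
```

Because every writer calls the same function, the layout can't drift — which is exactly the "pick it once and commit" discipline № 02 argues for.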
Refinement in BigQuery
№ 01 · Bronze to Silver: Cleaning and Conforming · 3 min
The bronze → silver step is where shapeless raw payloads become typed, deduplicated, one-row-per-thing tables you can actually rely on.

№ 02 · Silver to Gold: Modeling for Agents · 3 min
Gold tables are purpose-built for one agent use case at a time — joining silver tables, pre-computing the things the agent will ask for, and dropping everything else so retrieval is fast and unambiguous.

№ 03 · dbt for Versioned Transforms · 4 min
dbt (or Google's native Dataform) turns SQL transformations into version-controlled, testable code with dependency graphs — the difference between an ad-hoc warehouse and a maintained one.

№ 04 · Incremental and Idempotent Pipelines · 4 min
Once your tables grow past a hundred million rows, full rebuilds get too expensive — incremental and idempotent pipelines update only what changed, safely re-runnable on any failure.
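The idempotency in № 04 usually hinges on BigQuery's `MERGE` statement keyed on a natural key: re-running the same staging batch updates rows to identical values and inserts nothing new. A minimal sketch of assembling such a statement — the table and column names are hypothetical, not from the course:

```python
def merge_sql(target: str, staging: str, key: str, cols: list[str]) -> str:
    """Build an idempotent BigQuery MERGE: update matched rows, insert unmatched ones."""
    set_clause = ", ".join(f"T.{c} = S.{c}" for c in cols)
    all_cols = [key, *cols]
    return (
        f"MERGE `{target}` T USING `{staging}` S ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({', '.join(all_cols)}) "
        f"VALUES ({', '.join('S.' + c for c in all_cols)})"
    )

sql = merge_sql("silver.orders", "bronze_stage.orders", "order_id", ["status", "updated_at"])
```

In practice dbt and Dataform generate statements of this shape for you when a model is configured as incremental with a unique key; the point of the sketch is only to show why a keyed `MERGE` is safe to re-run.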
Serving Data to Agents
№ 01 · Structured Retrieval: BigQuery as a Tool · 4 min
The most reliable way to feed an agent is to expose a small set of narrowly-scoped functions that query gold tables — predictable, cheap, and auditable.

№ 02 · Semantic Retrieval: Embeddings and Vector Search · 5 min
Embeddings turn arbitrary text into 768- or 1536-dim vectors; cosine-similarity search over those vectors finds rows by meaning, not by exact keys — essential when the agent doesn't know what to ask for by name.

№ 03 · Hybrid Retrieval: Structured Plus Semantic · 4 min
Real-world agent queries always involve both kinds of constraint — "find similar tickets *for this customer*" — so the production pattern is structured filters first, semantic ranking second.

№ 04 · The Retrieval Contract Between Pipeline and Agent · 4 min
The set of tools you expose to the agent — names, parameter schemas, response shapes, freshness guarantees — is a formal API; design it as one and version it as one.
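The "structured filters first, semantic ranking second" order in № 03 can be shown in miniature with in-memory rows — a toy sketch with 2-dim vectors standing in for real 768-dim embeddings and a hypothetical `customer_id`/`embedding` row shape; in production this would be a single BigQuery query with a `WHERE` clause plus a vector-distance `ORDER BY`:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(rows, query_vec, customer_id, top_k=2):
    """Structured filter first (cheap, exact), then semantic ranking over the survivors."""
    candidates = [r for r in rows if r["customer_id"] == customer_id]
    ranked = sorted(candidates, key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return ranked[:top_k]

rows = [
    {"customer_id": "acme", "ticket": "login fails", "embedding": [1.0, 0.0]},
    {"customer_id": "acme", "ticket": "billing bug", "embedding": [0.0, 1.0]},
    {"customer_id": "other", "ticket": "login fails", "embedding": [1.0, 0.0]},
]
hits = hybrid_search(rows, query_vec=[0.9, 0.1], customer_id="acme")
```

Filtering before ranking matters for cost as well as correctness: the expensive similarity computation only ever runs over the rows the structured constraint already admitted.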
Pipeline Orchestration
№ 01 · Orchestrating with Cloud Composer · 3 min
Cloud Composer is GCP's managed Airflow, the standard tool for stitching together ingestion, transformation, and refresh jobs into a single dependable DAG.

№ 02 · Dataflow for Heavy Transforms · 3 min
Dataflow is GCP's managed Apache Beam runner — the right tool for streaming transformations, very large batch jobs, or anything stateful that BigQuery SQL can't express cleanly.

№ 03 · Dataform and BigQuery-Native Pipelines · 3 min
Dataform is GCP's built-in dbt — version-controlled SQL pipelines that run entirely inside BigQuery, with scheduling, dependencies, and testing as a managed service.

№ 04 · Event-Driven Pipelines with Eventarc · 3 min
Eventarc is GCP's event routing service — it lets cloud events (GCS object uploads, Pub/Sub messages, BigQuery job completions) directly trigger Cloud Run, Cloud Functions, or Workflows without a cron in the middle.
Operating the System
№ 01 · Observability and Data Quality Monitoring · 4 min
An agent pipeline you can't see is an agent pipeline that breaks silently — and silent breakage is much worse than the loud kind because the agent keeps confidently answering with stale or wrong data.

№ 02 · Cost Control on BigQuery and Vertex AI · 4 min
The two recurring expenses in an agent data stack are BigQuery query bytes and embedding API calls — both blow up by orders of magnitude when left unchecked, and both are easy to bound with a few discipline patterns.

№ 03 · IAM and Security for Agent Data Paths · 4 min
An agent that queries the warehouse is just another service principal — give it the least-privilege account it needs, scope its access to gold tables only, and audit every call.

№ 04 · Handling PII and Redaction Pipelines · 4 min
PII the agent doesn't need shouldn't reach the agent at all; pipelines that detect, redact, or tokenize sensitive data before it enters gold tables are how you keep both compliant and useful.
Capstone: Product Docs to Agent
№ 01 · Capstone: Overview and Architecture · 4 min
Build a complete pipeline that ingests a product's documentation, refines it into an agent-ready corpus on GCP, and exposes a retrieval tool an agent can call — exercising every layer in the course.

№ 02 · Step One: Ingesting Product Docs · 4 min
A Cloud Run service, scheduled hourly, clones the docs repo, packages each Markdown file as a JSONL record, and writes the batch into the bronze GCS bucket.

№ 03 · Step Two: Refining Into an Agent-Ready Corpus · 4 min
Two Dataform models — one silver, one gold-with-embeddings — turn the bronze JSONL into a chunked, embedded, filterable BigQuery table the retrieval tool can query directly.

№ 04 · Step Three: Wiring the Retrieval Tool · 5 min
A Cloud Run service exposes a single `search_docs(query, top_k, section)` HTTP endpoint that runs hybrid retrieval over `gold.docs_chunks` and returns a typed JSON response — the agent's actual interface to the data pipeline.

№ 05 · State of the Practice & Further Reading · 9 min
A curated, annotated reading list — the books, papers, blog posts, and GCP product launches that are actively reshaping how data engineering for AI agents is done as of 2026.
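The `search_docs(query, top_k, section)` signature named in Step Three is a retrieval contract, and contracts start with validation at the edge. A sketch of what that request-normalization layer might look like — the defaults, the [1, 20] clamp on `top_k`, and the error messages are assumptions, not the capstone's actual implementation:

```python
def validate_search_request(params: dict) -> dict:
    """Normalize raw search_docs(query, top_k, section) parameters into a typed request."""
    query = params.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    top_k = int(params.get("top_k", 5))
    top_k = max(1, min(top_k, 20))  # assumed bounds: clamp to [1, 20] to cap query cost
    section = params.get("section")  # optional structured filter over gold.docs_chunks
    if section is not None and not isinstance(section, str):
        raise ValueError("section must be a string when provided")
    return {"query": query.strip(), "top_k": top_k, "section": section}

req = validate_search_request({"query": "  how do I rotate keys?  ", "top_k": "50"})
```

Validating and clamping here, before any BigQuery call, is what makes the tool safe to hand to an LLM: whatever the model passes, the warehouse only ever sees bounded, well-typed queries.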