What's the Concept?
GCP has a lot of services, and most data engineering tasks can be done five different ways. To keep this course coherent we pick one canonical option per job. You can substitute later — the patterns transfer — but the lessons assume the stack below.
How It Works
The reference stack for this course:
LAYER           SERVICE                       ROLE IN THE PIPELINE
─────           ───────                       ────────────────────
Ingest          Cloud Run + APIs              custom HTTP pullers
                Pub/Sub                       event streams + buffering
                Datastream                    CDC from Postgres/MySQL
                Storage Transfer Service      bulk file moves into GCS
Bronze          Cloud Storage (GCS)           raw, immutable, partitioned
Silver / Gold   BigQuery                      SQL-native warehouse
                Dataflow                      streaming / heavy transforms
                Dataform (or dbt)             versioned SQL pipelines
Embed / Serve   Vertex AI Embeddings API      text-embedding-gecko etc.
                BigQuery vector search        in-warehouse ANN retrieval
                Vertex AI Vector Search       managed ANN at scale
Orchestrate     Cloud Composer (Airflow)      cross-service DAGs
                Cloud Scheduler + Eventarc    lighter event/cron triggers
Operate         Cloud Logging + Monitoring    pipeline observability
                IAM + VPC-SC                  security perimeter
                Cloud DLP                     PII detection / redaction

For most of the course you'll touch four services in earnest: GCS (bronze landing), BigQuery (silver / gold + structured retrieval), Vertex AI (embeddings + vector search), and Cloud Composer or Cloud Run (the thing that runs the pipeline). The rest are situational.
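To make the bronze layer's "raw, immutable, partitioned" convention concrete, here is one way to build date-partitioned GCS object paths. The `bronze/<source>/dt=YYYY-MM-DD/` layout is an assumed convention for this sketch, not a GCS requirement:

```python
from datetime import datetime, timezone

def bronze_path(bucket: str, source: str, ingested_at: datetime, filename: str) -> str:
    """Build a Hive-style partitioned GCS path for raw landed files.

    Layout (assumed convention): gs://<bucket>/bronze/<source>/dt=YYYY-MM-DD/<filename>
    Partitioning by ingest date keeps bronze append-only: a re-ingest lands
    in a new partition instead of overwriting earlier objects.
    """
    dt = ingested_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"gs://{bucket}/bronze/{source}/dt={dt}/{filename}"

path = bronze_path("my-lake", "orders_api",
                   datetime(2024, 5, 1, 12, tzinfo=timezone.utc),
                   "page-0001.json")
# -> gs://my-lake/bronze/orders_api/dt=2024-05-01/page-0001.json
```

Hive-style `dt=` partitions have the side benefit that BigQuery external tables and load jobs can recognize the partition key directly from the path.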
Why It Matters
- GCP's strongest opinion is that BigQuery is the warehouse. This is unusually concentrated — on AWS the equivalent role splits across Redshift, Athena, and S3 — and it's an advantage. Most of your transformation logic lives in one place.
- Vertex AI is the embedding + vector tier, not a separate database. Treating it as a managed feature of the warehouse, not a competing system, simplifies the architecture a lot.
- You can do the entire pipeline on the cheap, then scale. GCS, BigQuery on-demand, and Vertex AI embeddings all have generous free tiers. Production-grade stacks usually start under $50/month and only grow when usage justifies it.
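A back-of-envelope estimator makes the "start cheap, then scale" claim tangible. The default rates below are assumptions for illustration only; real BigQuery and GCS pricing varies by region and changes over time, so check the current pricing pages:

```python
def monthly_cost_estimate(tb_scanned: float, gb_stored: float,
                          price_per_tb_scanned: float = 6.25,
                          price_per_gb_stored: float = 0.02) -> float:
    """Rough monthly USD cost for the two biggest line items in this stack:
    BigQuery on-demand queries (per TB scanned) and GCS standard storage
    (per GB-month). Rates are ASSUMED placeholders, not quoted prices.
    """
    return tb_scanned * price_per_tb_scanned + gb_stored * price_per_gb_stored

# A small pipeline: 2 TB scanned per month, 500 GB sitting in the lake.
cost = monthly_cost_estimate(2, 500)  # 12.50 + 10.00 = 22.50
```

The point of the arithmetic: query scanning, not storage, dominates once the pipeline is active, which is why the partitioning advice below matters.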
Key Technical Details
- BigQuery's on-demand pricing is per-TB scanned, not per-table-stored. Partitioning + clustering is how you keep that bill rational.
- Pub/Sub guarantees at-least-once delivery; downstream processors need to be idempotent (we'll cover this in Module 02).
- Cloud Composer is a managed Airflow, with the same DAG semantics. It has cold-start latency and a non-trivial minimum cost; for small pipelines, Cloud Scheduler + Cloud Run is often cheaper.
- Vertex AI's text-embedding-gecko produces 768-dimensional vectors; BigQuery stores embeddings in ARRAY<FLOAT64> columns (up to 2048 dimensions for indexed search) and accelerates them with IVF or TreeAH (ScaNN-based) vector indexes.
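The at-least-once point is worth seeing in code. A minimal sketch of idempotent processing: deduplicate on the Pub/Sub message ID before applying any side effect, so a redelivery becomes a no-op. The in-memory set here stands in for a durable dedupe store (a BigQuery table, Firestore, etc.), and all names are illustrative:

```python
processed_ids: set[str] = set()  # stand-in for a durable dedupe store
results: list[dict] = []         # stand-in for the downstream side effect

def handle(message_id: str, payload: dict) -> bool:
    """Apply payload at most once per message_id; return True if applied.

    Pub/Sub is at-least-once, so the same message_id can arrive twice.
    Checking before applying makes the duplicate delivery harmless.
    """
    if message_id in processed_ids:
        return False  # duplicate delivery: ack and skip
    results.append(payload)
    processed_ids.add(message_id)
    return True

handle("m-1", {"order": 42})
handle("m-1", {"order": 42})  # redelivery: skipped, results unchanged
```

In production the membership check and the write should be atomic (e.g. a MERGE keyed on message_id), otherwise a crash between the two steps reintroduces the duplicate.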
Common Misconceptions
"You need Dataflow for everything." You don't. Dataflow is the right tool for high-volume streaming or complex stateful transforms. For most batch SQL transforms, BigQuery itself plus Dataform/dbt is simpler and cheaper.
"You need a separate vector database." Not at the scale of typical agent workloads. BigQuery's native vector search handles tens of millions of vectors comfortably; you only graduate to Vertex AI Vector Search when query latency or scale forces it.
Connections to Other Concepts
- Course 02-ingestion-patterns/* — Detailed coverage of the ingest services.
- Course 05-serving-data-to-agents/* — Vertex AI Embeddings + BigQuery ANN as the retrieval tier.
- Course 06-pipeline-orchestration/* — Composer, Dataflow, Eventarc compared.
Further Reading
- Google Cloud, "Data Analytics" product family overview.
- "Google Cloud Reference Architectures: Data lake to data warehouse" — Official narrative for this stack.
- Valliappa Lakshmanan & Jordan Tigani, "Google BigQuery: The Definitive Guide" (O'Reilly) — Deep reference on the warehouse layer.