What's the Concept?
This capstone walks through a single, end-to-end pipeline: external product docs (Markdown files in a GitHub repo) flowing into a Cloud Storage bronze layer, refining through silver and gold layers in BigQuery, being embedded into vectors, and surfacing through a tool an agent can call. The scope is deliberately small — one source, one entity, one agent use case — so every step is concrete and runnable.
The scenario: you're building a product-support agent for "Brain Drip" itself. The agent should be able to answer questions like "How do I configure the search index?" by consulting the actual docs in the repo. That means the docs need to be ingested, refined, embedded, and queryable.
How It Works
The full pipeline:
GitHub repo: braindrip-docs
  /docs/**/*.md   (source of truth)
                    │
                    │  hourly batch puller
                    │  (Cloud Run + Cloud Scheduler)
                    ▼
┌──────────────────────────────────────────────┐
│ GCS bronze                                   │
│   source=github/entity=docs/                 │
│   ingestion_date=2026-05-14/                 │
│   page=00000.jsonl.gz                        │
└──────────────────────────────────────────────┘
                    │
                    │  Dataform (BigQuery)
                    ▼
┌──────────────────────────────────────────────┐
│ BigQuery silver.docs                         │
│   one row per doc, parsed frontmatter,       │
│   cleaned markdown body                      │
└──────────────────────────────────────────────┘
                    │
                    │  Dataform (BigQuery)
                    ▼
┌──────────────────────────────────────────────┐
│ BigQuery gold.docs_chunks                    │
│   one row per ~500-token chunk,              │
│   768-dim embedding (ARRAY<FLOAT64>),        │
│   metadata for hybrid retrieval              │
└──────────────────────────────────────────────┘
                    │
                    │  Cloud Run tool service
                    ▼
┌──────────────────────────────────────────────┐
│ Agent retrieval tool                         │
│   search_docs(query, top_k)                  │
│   hybrid: structured filter + VECTOR_SEARCH  │
└──────────────────────────────────────────────┘
                    │
                    │  (lives outside the course scope —
                    │   the agent itself is in a sibling course)
                    ▼
┌──────────────────────────────────────────────┐
│ Anthropic / Vertex AI agent runtime          │
└──────────────────────────────────────────────┘

The next three lessons walk through each major segment:
- Lesson 02 — Ingesting product docs (Cloud Run puller, GCS bronze layout, hourly schedule).
- Lesson 03 — Refining into an agent-ready corpus (silver + gold + embeddings in Dataform).
- Lesson 04 — Wiring the retrieval tool (Cloud Run service, hybrid search query, tool schema).
By the end you'll have a working pattern you can transplant onto any docs-flavored use case: knowledge bases, runbooks, FAQs, support articles.
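The first segment — the hourly puller landing docs in the bronze layer — can be sketched in a few lines. This is a sketch under stated assumptions, not the course's actual implementation: the record fields and the sample doc are invented for illustration, and the real GCS upload (via the `google-cloud-storage` client) is left as a comment. The path layout follows the bronze convention in the diagram.

```python
# Sketch of the bronze write step of the hourly puller (Lesson 02).
# Record fields and the sample doc are illustrative assumptions; the
# object path follows the source=/entity=/ingestion_date= convention.
import gzip
import json
from datetime import date

def bronze_object_path(ingestion_date: date, page: int) -> str:
    """Hive-style partition path: source=github/entity=docs/..."""
    return (
        "source=github/entity=docs/"
        f"ingestion_date={ingestion_date.isoformat()}/"
        f"page={page:05d}.jsonl.gz"
    )

def encode_page(records: list[dict]) -> bytes:
    """One JSON object per line, gzip-compressed, ready for GCS upload."""
    lines = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
    return gzip.compress(lines.encode("utf-8"))

# In the real puller you'd upload with google-cloud-storage, e.g.:
#   bucket.blob(bronze_object_path(date.today(), 0)).upload_from_string(
#       encode_page(records), content_type="application/gzip")
records = [{"path": "docs/search/index.md", "body": "# Search Index\n..."}]
path = bronze_object_path(date(2026, 5, 14), 0)
```

The Hive-style `key=value` path segments are what let BigQuery external tables and Dataform treat the bronze layer as partitioned data without any extra bookkeeping.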
Why This Specific Project?
Three reasons documentation is a good capstone target:
- It exercises both retrieval modes. Docs need semantic search (the agent doesn't know what's in them by ID) and structured filtering (by section, version, tags) — exactly the hybrid pattern.
- The refinement step is non-trivial but bounded. Chunking, frontmatter parsing, link rewriting — enough complexity to be educational, not so much it dominates the project.
- The output is immediately useful. A docs-retrieval agent is something every product team wants. The capstone produces a real artifact, not a toy.
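The frontmatter parsing and chunking called "non-trivial but bounded" above look roughly like this. A deliberately naive sketch: "500 tokens" is approximated as 500 whitespace-delimited words, and the frontmatter parser only handles flat `key: value` lines — the real pipeline (Lesson 03) would use a proper tokenizer and a YAML parser.

```python
# Naive version of the silver/gold refinement step (Lesson 03):
# split frontmatter from body, then chunk to ~500 "tokens"
# (approximated here as whitespace-delimited words).

def split_frontmatter(doc: str) -> tuple[dict, str]:
    """Parse a simple flat `key: value` frontmatter block delimited by ---."""
    meta: dict = {}
    body = doc
    if doc.startswith("---\n"):
        header, _, body = doc[4:].partition("\n---\n")
        for line in header.splitlines():
            key, _, value = line.partition(":")
            if key.strip():
                meta[key.strip()] = value.strip()
    return meta, body.strip()

def chunk(body: str, max_words: int = 500) -> list[str]:
    """Fixed-size word-count chunks; no overlap, no sentence awareness."""
    words = body.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

The naivety is the point: this splitter is the one the Further Reading section suggests graduating to a sentence-aware or semantic chunker before production.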
Why It Matters
- It collapses theory into a runnable example. Every concept in the course shows up in this pipeline, but the scope is small enough to keep in your head.
- It's the smallest meaningful agent stack. One source, one entity, one tool — but everything that's there is what you'd ship in production.
- It's transferable. Swap "docs" for "support tickets," "product specs," "internal wiki" — the architecture stays identical.
Key Technical Details
- The full pipeline can be built for under $20/month total at the scale of "a few thousand doc pages."
- Total code: roughly 200 lines of Python (the puller), 150 lines of SQL (the Dataform models), 80 lines of Python (the tool service). Plus IaC.
- End-to-end freshness target: hourly. From "doc merged to main" to "agent can quote it" is ~5–60 minutes.
- This pipeline has been the bootstrap for several real Brain Drip experiments — including the search index that already powers some of these lessons.
Common Misconceptions
"I'd need a real agent framework to validate this." No — the tool can be exercised with curl or a Python script. The agent framework is downstream and orthogonal to the data work.
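Disproving that misconception takes only stdlib Python: the tool is just an HTTP endpoint. The Cloud Run URL below is hypothetical, and the request body shape (`query`, `top_k`) is assumed from the tool signature in the diagram.

```python
# Exercising the search_docs tool without any agent framework.
# The service URL is a hypothetical placeholder.
import json
import urllib.request

def search_docs_request(query: str, top_k: int = 5) -> urllib.request.Request:
    """Build a POST to the retrieval tool; urlopen(req) would send it."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        "https://search-docs-xyz.a.run.app/search",  # hypothetical URL
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = search_docs_request("How do I configure the search index?")
# urllib.request.urlopen(req) would return the top-k chunks as JSON.
```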
"This is just RAG." It's the data engineering substrate for RAG. The retrieval bit is the visible six inches of pipe; the hundred feet buried behind it are what this capstone teaches.
Connections to Other Concepts
- 02-step-one-ingesting-product-docs — Putting Module 02 patterns into practice.
- 03-step-two-refining-into-an-agent-ready-corpus — Module 04 + Module 05 in code.
- 04-step-three-wiring-the-retrieval-tool — Module 05's contract pattern made real.
- 05-state-of-the-practice-and-further-reading — A curated reading list covering current research and recent GCP product launches; use it as the bibliography for the whole course.
Further Reading
- Anthropic, "Introducing Contextual Retrieval" (Sept 2024) — Before deploying this capstone in production, add their context-prefix step to the chunking. Single biggest retrieval quality win available. https://www.anthropic.com/research/contextual-retrieval
- Anthropic, "Building Effective Agents" (Dec 2024) — The agent-side patterns this capstone's retrieval tool slots into. https://www.anthropic.com/research/building-effective-agents
- Anthropic Tool Use Cookbook + Vertex AI Agent Builder docs — Two current references for how the agent runtime calls a tool of the shape this capstone builds.
- LlamaIndex / LangChain chunking documentation — Production chunkers (sentence-aware, semantic, hierarchical) you'd graduate this lesson's naive splitter to.
- The Brain Drip GitHub repo — The docs we're ingesting are the actual markdown files this site is built from.
- Module 08, lesson 05-state-of-the-practice-and-further-reading — Full curated bibliography.