What's the Concept?

Bronze docs are useless to an agent as-is — they're long Markdown files. The refinement step does three things:

  1. Parse — extract frontmatter (title, tags, section) and clean the markdown body.
  2. Chunk — split each doc into ~500-token chunks with ~50 tokens of overlap (roughly 2,000 and 200 characters at ~4 characters per token), so retrieval can return a precise passage instead of a whole file.
  3. Embed — compute one vector per chunk using Vertex AI.

The output is gold.docs_chunks: one row per chunk, with text, metadata, and a vector column ready for hybrid retrieval.
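
Concretely, that contract looks roughly like this (a sketch of the resulting schema; the Dataform models below are what actually create the table):

CREATE TABLE gold.docs_chunks (
  chunk_id      STRING,            -- SHA256(doc_path :: chunk_offset)
  doc_path      STRING,
  title         STRING,
  summary       STRING,
  chunk_offset  INT64,
  chunk_text    STRING,
  embedding     ARRAY<FLOAT64>,    -- 768-dim, from text-embedding-005
  content_hash  STRING,            -- gates incremental re-embedding
  _commit       STRING,
  _chunked_at   TIMESTAMP
)
PARTITION BY DATE(_chunked_at)
CLUSTER BY doc_path;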

How It Works

Two Dataform models do all of this.

Silver: silver_docs.sqlx — one row per doc, cleaned:

config {
  type: "table",
  tags: ["capstone:docs"],
  bigquery: { partitionBy: "DATE(_pulled_at)" }
}
 
SELECT
  doc_path,
  -- Pull out the **One-Line Summary** line and the H1 title;
  -- (?m) lets ^ match line starts, since the H1 follows the frontmatter
  REGEXP_EXTRACT(content, r"\\*\\*One-Line Summary\\*\\*:\\s*(.+?)\\n")  AS summary,
  REGEXP_EXTRACT(content, r"(?m)^#\\s+(.+?)\\n")                          AS title,
  -- Strip code fences for embedding purposes (keep for retrieval response)
  REGEXP_REPLACE(content, r"```[\\s\\S]*?```", "")                       AS body_for_embed,
  content                                                                AS body_full,
  content_hash,
  _commit,
  _pulled_at
FROM ${ref({schema: "bronze", name: "github_docs_raw"})}
WHERE content IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY doc_path ORDER BY _pulled_at DESC) = 1
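
To see what those regexes extract, here's a quick standalone check against a literal doc (plain SQL, so the backslashes are single rather than the doubled SQLX/JS escaping above; the doc shape is a hypothetical example):

SELECT
  REGEXP_EXTRACT(doc, r"(?m)^#\s+(.+?)\n")                     AS title,
  REGEXP_EXTRACT(doc, r"\*\*One-Line Summary\*\*:\s*(.+?)\n")  AS summary
FROM UNNEST([
  "---\ntags: [auth]\n---\n# Key Rotation\n**One-Line Summary**: How to rotate service keys.\n\nBody...\n"
]) AS doc
-- → title = "Key Rotation", summary = "How to rotate service keys."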

Gold: gold_docs_chunks.sqlx — chunked and embedded:

config {
  type: "incremental",
  uniqueKey: ["chunk_id"],
  tags: ["capstone:docs"],
  bigquery: { partitionBy: "DATE(_chunked_at)", clusterBy: ["doc_path"] }
}
 
WITH base AS (
  SELECT
    s.doc_path, s.title, s.summary, s.content_hash, s._commit,
    -- Naive chunker: split into ~2000-char chunks with 200-char overlap
    -- Production would use a smarter sentence-aware splitter
    STRUCT(
      chunk_offset AS offset,
      SUBSTR(s.body_for_embed, chunk_offset, 2200) AS text
    ) AS chunk
  FROM ${ref("silver_docs")} s,
  UNNEST(GENERATE_ARRAY(1, LENGTH(s.body_for_embed), 2000)) AS chunk_offset
 
  ${when(incremental(), `
    WHERE s.content_hash NOT IN (
      SELECT content_hash FROM ${self()}
      WHERE content_hash IS NOT NULL  -- one NULL would make NOT IN drop every row
    )
  `)}
),
with_ids AS (
  SELECT
    *,
    TO_HEX(SHA256(CONCAT(doc_path, "::", CAST(chunk.offset AS STRING))))
      AS chunk_id
  FROM base
  WHERE LENGTH(chunk.text) > 200      -- skip near-empty trailing chunks
),
embedded AS (
  -- BigQuery has no LATERAL join; ML.GENERATE_EMBEDDING instead takes a
  -- query as input, passes its columns through, and appends
  -- ml_generate_embedding_result (one batched call for all new chunks)
  SELECT *
  FROM ML.GENERATE_EMBEDDING(
    MODEL ${ref({schema: "embedding_models", name: "text_embedding_005"})},
    (SELECT *, chunk.text AS content FROM with_ids),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT
  chunk_id,
  doc_path,
  title,
  summary,
  chunk.offset       AS chunk_offset,
  chunk.text         AS chunk_text,
  ml_generate_embedding_result AS embedding,
  content_hash,
  _commit,
  CURRENT_TIMESTAMP() AS _chunked_at
FROM embedded

The incremental logic — WHERE s.content_hash NOT IN (SELECT content_hash FROM ${self()}) — is the critical cost control. After the initial run, only chunks whose underlying doc has changed get re-embedded. A typical hourly refresh re-embeds 0–20 chunks instead of all 5,000.
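
Before a run, you can check what the gate will let through (a standalone sanity query; table paths assume the bronze/silver/gold datasets used above, and the IS NOT NULL guard matters because a single NULL hash makes NOT IN filter out every row):

-- How much would the next run re-embed?
SELECT
  COUNT(*)                                                 AS docs_to_reembed,
  SUM(CAST(CEIL(LENGTH(body_for_embed) / 2000) AS INT64))  AS approx_chunks
FROM silver.silver_docs
WHERE content_hash NOT IN (
  SELECT content_hash
  FROM gold.docs_chunks
  WHERE content_hash IS NOT NULL
)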

Dataform scheduling glues this together — both models tagged capstone:docs, both running hourly, with the embedding model itself versioned as a separate Dataform model. In GCP Dataform the schedule lives in a workflow configuration rather than in the repo's workflow_settings.yaml; schematically:

# workflow configuration (schematic; exact fields vary by Dataform version)
schedules:
  - name: capstone_docs
    cron: "10 * * * *"
    target:
      tags: ["capstone:docs"]
    notifications:
      onFailure: ["data-alerts@myco.com"]

Why It Matters

  • Chunking is a product decision, not an implementation detail. Chunks too small → no context; too large → low retrieval precision. ~500 tokens with ~50 overlap is a defensible default.
  • Incremental embedding is the only way the bill scales. Re-embedding every chunk every hour is technically possible and financially insane. Hash-gated incremental is the production pattern.
  • The gold table is the contract. Tomorrow's retrieval tool reads from gold.docs_chunks. Its shape is the agent's API.

Key Technical Details

  • This naive chunker splits on character boundaries; production-grade chunkers split on paragraphs or sentences (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter). Worth upgrading once the rest works.
  • text-embedding-005 produces 768-dim vectors and costs ~$0.20 for the initial run, then near-zero per hourly refresh.
  • BigQuery's VECTOR_SEARCH requires a vector index for ANN; create it after gold is populated: CREATE VECTOR INDEX docs_chunks_idx ON gold.docs_chunks(embedding) OPTIONS (index_type = 'IVF', distance_type = 'COSINE'). A retrieval sketch follows this list.
  • Capture the embedding model version in the gold table so re-embedding can be triggered when you upgrade. Without it, mixed-vintage embeddings degrade search quality silently.
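
For reference, the retrieval query this table feeds looks roughly like the sketch below. The question text and top_k are placeholders, and the model path assumes the embedding_models.text_embedding_005 reference above compiles to that name:

-- Embed the user's question, then ANN-search the chunk vectors
SELECT
  base.doc_path,
  base.chunk_text,
  distance
FROM VECTOR_SEARCH(
  TABLE gold.docs_chunks, 'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL embedding_models.text_embedding_005,
      (SELECT "how do I rotate service keys?" AS content),  -- placeholder question
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
)
ORDER BY distance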

Common Misconceptions

"Chunking is solved." It's not — chunk strategy meaningfully affects retrieval quality, and the right strategy is domain-specific. Treat the chunker as a hyperparameter you'll iterate on.

"Re-embed everything on schema change." Often you don't need to. If only the metadata schema changed (new tag column), keep existing embeddings. Re-embed only on changes to the text being embedded.

"Bigger chunks are better." Bigger chunks mean each retrieved result carries more context — but also more noise, and you can fit fewer of them in the agent's context budget. Test on real queries; don't pick on intuition alone.

Connections to Other Concepts

  • Upstream: bronze.github_docs_raw is the raw ingest this step refines; content_hash, _commit, and _pulled_at all originate there.
  • Downstream: tomorrow's retrieval tool reads gold.docs_chunks, so its columns (chunk_text, embedding, doc_path) are the agent's API.
  • The hash-gated incremental pattern generalizes: gate any expensive per-row step (here, embedding) on a hash of exactly its inputs.

Further Reading

  • LlamaIndex docs on chunking strategies — Sentence-aware, semantic, and hierarchical chunkers explained.
  • "Lost in the Middle" (Liu et al., 2023) — Why chunk placement in the agent's prompt matters as much as the chunk itself.
  • Vertex AI's "text-embedding-005" model card — Detailed performance characteristics.