The Two-Phase Query
A RAG query is two operations, glued together with a prompt:
- Retrieve. Embed the question (with task_type=RETRIEVAL_QUERY), then SELECT ... ORDER BY embedding <=> %s LIMIT K to get the K nearest chunks.
- Generate. Build a prompt that says: "Here is some context. Use only this context to answer the question. If the context doesn't contain the answer, say so." Send it to Gemini.
That's the whole game. Everything else — re-ranking, query rewriting, hybrid search — is optimization. Get the two-line version working first.
The Retrieval Function
# src/retrieve.py
"""Find the K most similar chunks to a question."""
from .db import get_conn
from .embeddings import embed


def retrieve(question: str, k: int = 5) -> list[dict]:
    """Return the K chunks closest to the question."""
    [query_vec] = embed([question], task="RETRIEVAL_QUERY")
    query_str = "[" + ",".join(f"{x:.7f}" for x in query_vec) + "]"
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(
            """
            SELECT
                id,
                source,
                chunk_index,
                content,
                embedding <=> %s AS distance
            FROM chunks
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_str, query_str, k),
        )
        rows = cur.fetchall()
    return [
        {
            "id": r[0],
            "source": r[1],
            "chunk_index": r[2],
            "content": r[3],
            "distance": float(r[4]),
        }
        for r in rows
    ]

A few things worth knowing:
- <=> is cosine distance in pgvector. It pairs with the vector_cosine_ops operator class we built our HNSW index against. If you used <-> (L2) or <#> (negative inner product) you'd skip the index entirely.
- The vector appears twice in the query: once in the SELECT (to return the distance) and once in the ORDER BY (to drive the index). Postgres won't dedupe that for you.
- Lower distance = more similar. Cosine distance ranges from 0 (identical direction) to 2 (opposite). Anything under ~0.4 is usually relevant; over ~0.7 is usually noise.
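To make those distance numbers concrete, here is a minimal pure-Python sketch of the quantity <=> reports: one minus the cosine similarity. The function name cosine_distance is illustrative and not part of pgvector, which computes this inside Postgres.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity, the quantity pgvector's <=> reports."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Three reference points on the 0-to-2 scale.
same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # same direction, near 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated, near 1
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction, near 2
print(same, orthogonal, opposite)
```

Only direction matters, not magnitude, which is why [1, 2] and [2, 4] come out at a distance of essentially zero.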
Smoke Test the Retriever
# scripts/test_retrieve.py
from src.retrieve import retrieve

results = retrieve("What is retrieval-augmented generation?", k=3)
for r in results:
    print(f"[{r['distance']:.3f}] {r['source']}#{r['chunk_index']}")
    print(r['content'][:200], "\n")

uv run python -m scripts.test_retrieve

You should see three results, each with a distance, source filename, chunk index, and a preview. Eyeball the previews: they should look topically relevant. If they don't, your ingest data probably doesn't cover the question yet. Embed more text.
The Generation Step
Now the LLM. We use google-genai because it works against Vertex AI without a separate API key: the same Application Default Credentials (ADC) that talk to Cloud SQL also authenticate Gemini.
# src/generate.py
"""Ask Gemini to answer a question using only the provided context."""
from google import genai
from google.genai.types import GenerateContentConfig

from .config import load_config

_cfg = load_config()
_client = genai.Client(vertexai=True, project=_cfg.project, location=_cfg.region)

MODEL = "gemini-flash-latest"

SYSTEM_PROMPT = """You answer questions from the provided context.
Rules:
- Use only the context. Do not use outside knowledge.
- If the context does not contain the answer, reply exactly: "I don't know based on the provided documents."
- Quote short phrases from the context when helpful.
- Cite sources at the end as: Sources: <filename>#<chunk_index>, ...
"""


def generate_answer(question: str, chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"[{c['source']}#{c['chunk_index']}]\n{c['content']}" for c in chunks
    )
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
    response = _client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.1,
        ),
    )
    return response.text or "I don't know based on the provided documents."

Why these choices:
- gemini-flash-latest: fast and cheap, well-suited to RAG where the model is doing synthesis, not deep reasoning. Swap in gemini-pro-latest later if you want longer/sharper answers and don't mind paying ~10× more per query.
- temperature=0.1: RAG answers should be deterministic; you don't want creativity. Keep it just above 0 so the model doesn't get stuck.
- Strict system prompt: every sentence in there is fighting a specific failure mode: hallucinating ("Use only the context"), making up a confident wrong answer ("If the context does not contain the answer..."), and unsourced claims ("Cite sources at the end").
The [<source>#<chunk_index>] format above the chunk is what teaches the model to cite. Gemini is good at echoing the format it sees in the context.
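Because the model echoes labels it saw in the context, you can check its citations mechanically. The sketch below rebuilds the same labels generate_answer uses and flags any cited label that was never retrieved. The helper names (build_context, cited_labels, unknown_citations) are hypothetical, and the parsing assumes the model followed the "Sources: ..." format from the system prompt, which it may not always do.

```python
import re


def build_context(chunks: list[dict]) -> str:
    """Label each chunk with [source#chunk_index], as generate_answer does."""
    return "\n\n---\n\n".join(
        f"[{c['source']}#{c['chunk_index']}]\n{c['content']}" for c in chunks
    )


def cited_labels(answer: str) -> list[str]:
    """Extract labels from the last 'Sources: a#0, b#1' line, if any."""
    matches = re.findall(r"Sources:\s*(.+)", answer)
    if not matches:
        return []
    labels = (part.strip().rstrip(".") for part in matches[-1].split(","))
    return [label for label in labels if label]


def unknown_citations(answer: str, chunks: list[dict]) -> list[str]:
    """Return cited labels that do not match any retrieved chunk."""
    known = {f"{c['source']}#{c['chunk_index']}" for c in chunks}
    return [label for label in cited_labels(answer) if label not in known]


# Illustrative data only; the second citation is deliberately bogus.
chunks = [
    {"source": "rag-wikipedia.txt", "chunk_index": 0,
     "content": "RAG combines retrieval with generation."},
]
answer = (
    "RAG combines retrieval with generation. "
    "Sources: rag-wikipedia.txt#0, made-up.txt#3"
)
print(unknown_citations(answer, chunks))
```

A non-empty result means the model cited something it was never shown, which is worth logging even when the answer itself reads well.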
The Top-Level ask
One function that ties the two together. This is what the API endpoint will call.
# src/ask.py
"""End-to-end: question → retrieved chunks → grounded answer."""
from .generate import generate_answer
from .retrieve import retrieve


def ask(question: str, k: int = 5) -> dict:
    chunks = retrieve(question, k=k)
    answer = generate_answer(question, chunks)
    return {
        "question": question,
        "answer": answer,
        "sources": [
            {"source": c["source"], "chunk_index": c["chunk_index"], "distance": c["distance"]}
            for c in chunks
        ],
    }

Try It
# scripts/test_ask.py
import json

from src.ask import ask

print(json.dumps(ask("What is retrieval-augmented generation?"), indent=2))

uv run python -m scripts.test_ask

You should get something like:
{
  "question": "What is retrieval-augmented generation?",
  "answer": "Retrieval-augmented generation (RAG) is a technique that... [...] Sources: rag-wikipedia.txt#0, rag-wikipedia.txt#1",
  "sources": [
    {"source": "rag-wikipedia.txt", "chunk_index": 0, "distance": 0.142},
    {"source": "rag-wikipedia.txt", "chunk_index": 1, "distance": 0.218}
  ]
}

If you see "I don't know based on the provided documents." and the distances are above ~0.6, the retriever is failing to find good chunks, usually because the question is about something you didn't ingest. Try a question grounded in your sample text.
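That check can be written down as a small triage helper over the dict that ask() returns. This is a sketch only: the diagnose name and its return strings are made up for illustration, and the 0.6 cutoff is the rough rule of thumb from above, not a calibrated threshold.

```python
REFUSAL = "I don't know based on the provided documents."


def diagnose(result: dict, max_distance: float = 0.6) -> str:
    """Classify an ask() result as ok, a retrieval miss, or a generation miss."""
    distances = [s["distance"] for s in result["sources"]]
    best = min(distances) if distances else None
    # No chunk is close enough: the corpus likely doesn't cover the question.
    if best is None or best > max_distance:
        return "retrieval: no chunk is close to the question"
    # Close chunks exist, yet the model still refused to answer.
    if REFUSAL in result["answer"]:
        return "generation: close chunks were found but the model declined"
    return "ok"


# Illustrative result: every retrieved chunk is far from the question.
miss = {
    "answer": REFUSAL,
    "sources": [{"source": "notes.txt", "chunk_index": 4, "distance": 0.81}],
}
print(diagnose(miss))
```

A "retrieval" verdict points at ingest coverage or chunking; a "generation" verdict points at the prompt or at chunks that are close but incomplete.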
When the Answer Looks Wrong
Three knobs cover ~90% of RAG quality issues:
| Symptom | Knob |
|---|---|
| Model invents facts | Lower temperature, tighten system prompt, add "if unsure, say I don't know" |
| Model misses obvious answers | Increase k (try 8 or 10), check the chunk size in step 5 — small chunks fragment context |
| Top-K results aren't relevant | Re-check task_type matches between ingest and query, increase chunk overlap |
Don't reach for re-rankers or hybrid search until those three are dialed in. Tuning them usually solves the problem on its own.
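Before raising k for good, it helps to see whether a larger k brings in more relevant chunks or just more noise. The sketch below counts, for each k, how many returned chunks fall under a distance cutoff. sweep_k and fake_retrieve are hypothetical names; fake_retrieve stands in for src.retrieve.retrieve so the example runs without a database, and its distances are invented.

```python
from typing import Callable


def sweep_k(
    retrieve_fn: Callable[..., list[dict]],
    question: str,
    ks: tuple[int, ...] = (3, 5, 8, 10),
    max_distance: float = 0.4,
) -> dict[int, int]:
    """For each k, count how many returned chunks fall under max_distance."""
    counts: dict[int, int] = {}
    for k in ks:
        chunks = retrieve_fn(question, k=k)
        counts[k] = sum(1 for c in chunks if c["distance"] <= max_distance)
    return counts


def fake_retrieve(question: str, k: int = 5) -> list[dict]:
    """Stand-in retriever with made-up distances, sorted nearest first."""
    distances = [0.12, 0.25, 0.31, 0.38, 0.55, 0.71, 0.80, 0.84, 0.90, 0.93]
    return [{"distance": d} for d in distances[:k]]


counts = sweep_k(fake_retrieve, "What is retrieval-augmented generation?")
print(counts)  # {3: 3, 5: 4, 8: 4, 10: 4}
```

Here the count stops growing after k=5, so a larger k would only pad the prompt with distant chunks. In that situation, chunk size or overlap is the better knob to turn.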
What You Have Now
- A retrieval function backed by an HNSW index — millisecond latency even with hundreds of thousands of chunks
- A generation function with a strict system prompt that forces grounding
- A single ask() that does the whole thing in two API calls
- A working end-to-end pipeline. Locally. As yourself.
Next: wrap it in HTTP and put it on the internet.
Reference: pgvector distance operators · google-genai SDK on Vertex AI · Gemini system instructions