The Goal

By the end of this blueprint you will have:

  • A Cloud SQL Postgres instance with the pgvector extension turned on, holding your documents as searchable embeddings.
  • A Python ingest script that reads a folder of .txt files, chunks them, embeds each chunk with Vertex AI, and writes the results into Postgres.
  • A FastAPI service that takes a question, finds the most relevant chunks, and asks Gemini to write the answer grounded in those chunks.
  • That same service deployed to Cloud Run behind a public HTTPS URL.

You will end up curling your own Cloud Run URL like:

curl -X POST https://rag-xxxxx-uc.a.run.app/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is our refund policy?"}'

…and getting back an answer pulled directly from the docs you ingested.

Architecture

One database. One LLM provider. One runtime. That's the point.

Why This Stack

ChoiceWhy
Cloud SQL for PostgreSQLA standard Postgres database. You can query it with psql, back it up, and migrate it like any other database. No new query language to learn.
pgvectorA Postgres extension. Vector search lives in the same database as your rows. One backup, one connection string, one set of credentials.
Vertex AI (text-embedding-005 + Gemini Flash)First-party Google models. You authenticate once with your Google account — no separate API keys for an embedding provider and a chat provider.
Cloud SQL Python ConnectorConnects to Cloud SQL using your Google identity. No public IP, no firewall rules, no proxy binaries to download.
FastAPIThe minimum Python web framework that makes a JSON endpoint feel boring. Two decorators and you have a service.
Cloud RunOne command to deploy. Scales to zero when nobody's asking. Same runtime in dev and prod.

What is not in the stack: a separate vector database (Pinecone, Qdrant, Weaviate), a separate auth system, a Docker compose file, a Kubernetes cluster. Cloud SQL is the storage, Cloud Run is the compute, Vertex AI is the brain.

What RAG Actually Is

RAG — Retrieval-Augmented Generation — is a two-phase pattern:

  1. Indexing (offline). Read your documents, split them into chunks (a few sentences each), turn each chunk into a vector with an embedding model, and write the vectors into a database that can do similarity search.
  2. Querying (online). When a user asks a question, embed the question, find the chunks whose vectors are closest, and pass those chunks plus the question to an LLM. The LLM writes the answer using only what you handed it.

That second step is what makes the answer grounded. Without RAG, the model answers from its training data, which doesn't know about your refund policy. With RAG, it answers from the actual policy text — pulled fresh from your database every time.

Making It Context-Aware

The naive version of RAG above will work, but it has a ceiling. The same question from a backend engineer and a finance lead probably wants different chunks back. A question asked today probably wants more recent docs than ones from 2019. And a chunk like "the limit is 30 per minute" is meaningless without knowing which API and which tier it's about.

"Context-aware" is the umbrella term for the three things you do to break through that ceiling:

  1. Task-typed embeddings. Use one task type when embedding documents and a different one when embedding the question. The embedding model produces slightly different vectors for "this is content to be retrieved" vs. "this is a question doing the retrieving" — and using the right one on each side measurably improves recall. We do this in Step 5.
  2. Metadata-filtered retrieval. Store structured fields alongside every chunk (category, tags, publish date) and add a WHERE clause to the retrieval query. Postgres pushes the filter down to the index, so this costs almost nothing at query time but lets you scope a question to "only the backend team's docs, only from the last 90 days." Step 7 covers this end-to-end.
  3. Contextual chunking (optional). Before embedding a chunk, prepend a short summary of where that chunk lives in the document — section, surrounding topic, whatever helps. The chunk's embedding now encodes its context, not just its words. Sidebar in Step 7.

You don't need all three to ship — the basic pipeline in Steps 5 and 6 is genuinely useful on its own. But the gap between a demo and a real internal tool is almost always one of these three.

What We're Optimizing For

Two things:

  • You should be able to delete the whole project in one command. Every GCP resource we create lives under one project ID. gcloud projects delete cleans everything up — instance, secrets, service, the works.
  • The code should be production-shaped, not toy-shaped. That means using IAM-based DB auth instead of passwords in env vars, using Secret Manager for anything sensitive, and using the same container in dev and prod. You shouldn't have to throw any of this away when you're ready to ship for real.

The Companion Repo

Every step below is also captured in a runnable repo: github.com/maraja/rag-on-cloud-sql-pgvector. You can either follow the blueprint and build the code yourself, or clone the repo and use the blueprint as commentary while you wire it up to your own GCP project. Same files either way.

What's Coming

Nine short steps:

  1. What we're building (you're here)
  2. Set up your Google Cloud project — install gcloud, create a project, enable billing
  3. Create the Cloud SQL instance — Postgres with pgvector turned on
  4. Project scaffold + connect from your laptop — Python project, talk to the DB
  5. Ingest — chunk, embed, insert
  6. Query — similarity search + Gemini answer
  7. Make it context-aware — metadata schema, filtered retrieval, contextual chunking
  8. Wrap in FastAPI + deploy to Cloud Run — one command, public URL
  9. What's next — auth, evals, scaling, IaC

Cost Heads-Up

We use the smallest Cloud SQL instance available (db-f1-micro, ~$8/month if you leave it on 24/7). Cloud Run scales to zero, so it costs nothing while idle. Vertex AI embeddings are roughly $0.025 per million characters; Gemini Flash is fractions of a cent per query.

Stop the Cloud SQL instance when you're not using it (covered in Step 3) and the cost drops to storage only — pennies per month. Delete the project when you're done and it goes to zero.


Reference: Cloud SQL for PostgreSQL · pgvector on Cloud SQL · Vertex AI text embeddings · Cloud Run quickstart