Add the FastAPI Layer
Three endpoints: a health check, the ask endpoint, and an ingest endpoint we can hit after the service is up.
# src/api.py
"""FastAPI surface for Cloud Run."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from .ask import ask as run_ask
from .ingest import gather_chunks, insert_chunks
from .embeddings import embed
app = FastAPI(title="rag-on-gcp")
class AskRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000)
    k: int = Field(5, ge=1, le=20)
@app.get("/")
def health() -> dict:
    return {"status": "ok"}
@app.post("/ask")
def ask(req: AskRequest) -> dict:
    try:
        return run_ask(req.question, k=req.k)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.post("/admin/ingest")
def ingest() -> dict:
    """Re-ingest everything under data/. Useful for re-running after a deploy."""
    chunks = gather_chunks()
    if not chunks:
        return {"ingested": 0, "message": "No .txt files in data/."}
    rows = []
    BATCH = 50
    for start in range(0, len(chunks), BATCH):
        batch = chunks[start : start + BATCH]
        vectors = embed([c[2] for c in batch], task="RETRIEVAL_DOCUMENT")
        rows.extend((s, i, c, v) for (s, i, c), v in zip(batch, vectors))
    insert_chunks(rows)
    return {"ingested": len(rows)}

In a real product /admin/ingest would live behind auth (we cover that in Step 9). For now, --allow-unauthenticated plus the obscurity of your service URL is the only thing protecting it. Don't put real customer data here yet.
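The batching loop inside ingest() is easier to reason about in isolation. Below is a minimal sketch, assuming each chunk is a (source, index, text) tuple (inferred from how the endpoint unpacks them, not confirmed by the ingest module). The names batched and fake_embed are hypothetical; fake_embed stands in for embed() so the example runs without Vertex AI.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def fake_embed(texts: Sequence[str]) -> list[list[float]]:
    """Hypothetical stand-in for embed(): one 1-dimensional vector per text."""
    return [[float(len(t))] for t in texts]


# Assumed chunk shape: (source, index, text).
chunks = [("notes.txt", i, f"chunk {i}") for i in range(120)]

rows = []
for batch in batched(chunks, 50):
    # One embedding call per batch, mirroring the endpoint's loop.
    vectors = fake_embed([text for _, _, text in batch])
    rows.extend((src, idx, text, vec) for (src, idx, text), vec in zip(batch, vectors))

batch_sizes = [len(b) for b in batched(chunks, 50)]
print(batch_sizes)  # [50, 50, 20]
print(len(rows))    # 120
```

The last batch is simply shorter than the rest, so no padding or special-casing is needed, and every chunk ends up paired with exactly one vector.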
Run it locally one last time before deploying:
uv run uvicorn src.api:app --reload --port 8080

In another terminal:
curl -X POST http://localhost:8080/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?"}'

Same JSON shape as Step 6. Good. Kill the dev server.
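If you would rather exercise the endpoint from Python than from curl, the same request can be sketched with the standard library. This is a minimal illustration, not part of the tutorial's codebase: build_ask_request and ask_service are hypothetical helper names, and the payload mirrors the question and k fields of AskRequest.

```python
import json
import urllib.request


def build_ask_request(base_url: str, question: str, k: int = 5) -> urllib.request.Request:
    """Build (but do not send) the POST /ask request."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_service(base_url: str, question: str, k: int = 5, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    req = build_ask_request(base_url, question, k)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the dev server running, ask_service("http://localhost:8080", "What is RAG?") should return the same JSON as the curl call. Keeping request construction separate from sending makes the payload easy to inspect without a live server.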
The Dockerfile
Cloud Run accepts any container that listens on $PORT. Keep it short.
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install uv for fast dependency resolution at build time.
RUN pip install --no-cache-dir uv
# Install deps first so this layer caches when only your code changes.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
# Then copy the app.
COPY src ./src
COPY data ./data
# Cloud Run sets PORT. Default to 8080 for local docker runs.
ENV PORT=8080
# sh expands $PORT at runtime, so uvicorn binds to whatever port Cloud Run injects.
CMD ["sh", "-c", "uv run uvicorn src.api:app --host 0.0.0.0 --port $PORT"]

A .dockerignore keeps the build context lean:
.venv
__pycache__
.env
.git
*.pyc
.pytest_cache

Create a Service Account for the Service
Cloud Run services run as a service account. We make one with exactly the permissions this service needs — nothing more.
export SA_NAME="rag-runtime"
export SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
# 1. Create the account
gcloud iam service-accounts create $SA_NAME \
  --display-name="RAG service runtime"
# 2. Talk to Cloud SQL
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/cloudsql.client"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/cloudsql.instanceUser"
# 3. Call Vertex AI (embeddings + Gemini)
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/aiplatform.user"
# 4. Read the secret we made in Step 3
gcloud secrets add-iam-policy-binding db-connection-name \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/secretmanager.secretAccessor"

Now create the Postgres IAM user for that service account. It is the same IAM-auth pattern we used for ourselves, applied to a non-human identity:
gcloud sql users create $SA_EMAIL \
  --instance=rag-db \
  --type=cloud_iam_service_account

(Note: Postgres limits usernames to 63 characters, so Cloud SQL names IAM service-account users without the .gserviceaccount.com suffix. The DB user actually ends up being rag-runtime@${PROJECT_ID}.iam. We pass the right value via DB_USER below.)
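The naming rule in the note can be sketched as a small helper. This is an illustration under stated assumptions: iam_db_user is a hypothetical name, my-project is a placeholder project ID, and the 63-character check reflects Postgres's identifier limit rather than anything Cloud SQL's API is confirmed to enforce in this exact way.

```python
SA_SUFFIX = ".gserviceaccount.com"
PG_MAX_IDENTIFIER_LEN = 63  # Postgres identifier limit (NAMEDATALEN - 1)


def iam_db_user(sa_email: str) -> str:
    """Derive the Postgres username for an IAM service account.

    Drops the .gserviceaccount.com suffix, matching the DB_USER value
    the tutorial passes to the service.
    """
    user = sa_email.removesuffix(SA_SUFFIX)
    if len(user) > PG_MAX_IDENTIFIER_LEN:
        raise ValueError(f"{user!r} exceeds {PG_MAX_IDENTIFIER_LEN} characters")
    return user


# "my-project" is a placeholder project ID.
print(iam_db_user("rag-runtime@my-project.iam.gserviceaccount.com"))
# rag-runtime@my-project.iam
```

Deriving the value from the email in one place avoids the two strings drifting apart if the service account is ever renamed.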
Deploy
One command. Cloud Build packages the Dockerfile, pushes to Artifact Registry, and rolls out to Cloud Run.
# The DB_USER for the service account
export RUNTIME_DB_USER="${SA_NAME}@${PROJECT_ID}.iam"
gcloud run deploy rag \
  --source . \
  --region=$REGION \
  --service-account=$SA_EMAIL \
  --allow-unauthenticated \
  --memory=1Gi \
  --cpu=1 \
  --timeout=120 \
  --max-instances=3 \
  --set-env-vars="GOOGLE_CLOUD_PROJECT=${PROJECT_ID}" \
  --set-env-vars="GOOGLE_CLOUD_REGION=${REGION}" \
  --set-env-vars="INSTANCE_CONNECTION_NAME=${INSTANCE_CONNECTION_NAME}" \
  --set-env-vars="DB_NAME=rag" \
  --set-env-vars="DB_USER=${RUNTIME_DB_USER}"

The first deploy takes ~3 minutes: Cloud Build needs to build the image. Subsequent deploys are ~1 minute.
When it finishes you'll see:
Service URL: https://rag-xxxxxxxx-uc.a.run.app

Save that.
export SERVICE_URL=$(gcloud run services describe rag --region=$REGION --format='value(status.url)')

Hit Your Live Service
Health check first:
curl $SERVICE_URL/
# {"status":"ok"}

Re-ingest into the live database (the container ships with data/):
curl -X POST $SERVICE_URL/admin/ingest
# {"ingested": 17}

Ask a question:
curl -X POST $SERVICE_URL/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?"}' | jq

Same JSON shape as your local run in Step 6. Now coming from the public internet, backed by your Cloud SQL instance, generated by Vertex AI Gemini.
What's Actually Happening Underneath
Every auth hop uses the service account's IAM token. No passwords, no API keys, no shared secrets. The only sensitive thing that crosses the wire is the question itself.
Streaming Logs
When something looks wrong:
gcloud run services logs tail rag --region=$REGION

Hit your endpoint in another terminal and watch the requests roll in.
Cost in Production
| Resource | At rest | With moderate traffic |
|---|---|---|
| Cloud Run | $0 (scales to zero) | ~$0.40/M requests + per-second CPU/RAM |
| Cloud SQL db-f1-micro | ~$8/mo (always-on) or ~$0.15/mo (stopped) | Same (billed by the second) |
| Vertex AI embeddings | $0 | $0.025 per 1M characters |
| Vertex AI Gemini Flash | $0 | $0.075 per 1M input tokens, $0.30 per 1M output |
| Egress | $0 | ~$0 (same region as DB) |
A bursty hobby app with a few hundred questions a day rounds to "the cost of leaving Cloud SQL running."
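To make that claim concrete, here is a rough back-of-the-envelope estimate. The prices are copied from the table above; the traffic figures (300 questions a day, about 100 characters embedded per question, about 2,000 input and 300 output tokens per Gemini call) are assumptions chosen for illustration, and Cloud Run's per-second CPU/RAM charge is left out because the table does not price it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD prices copied from the cost table; they may change over time."""
    cloud_sql_monthly: float = 8.00
    run_per_million_requests: float = 0.40
    embed_per_million_chars: float = 0.025
    gemini_in_per_million_tokens: float = 0.075
    gemini_out_per_million_tokens: float = 0.30


def monthly_cost(
    questions_per_day: int,
    chars_per_question: int,
    input_tokens: int,
    output_tokens: int,
    days: int = 30,
    prices: Prices = Prices(),
) -> dict[str, float]:
    """Estimate monthly spend per line item (excludes Cloud Run CPU/RAM time)."""
    n = questions_per_day * days
    million = 1_000_000
    return {
        "cloud_sql": prices.cloud_sql_monthly,
        "cloud_run_requests": n / million * prices.run_per_million_requests,
        "embeddings": n * chars_per_question / million * prices.embed_per_million_chars,
        "gemini_input": n * input_tokens / million * prices.gemini_in_per_million_tokens,
        "gemini_output": n * output_tokens / million * prices.gemini_out_per_million_tokens,
    }


# Assumed workload, not measured traffic.
estimate = monthly_cost(
    questions_per_day=300, chars_per_question=100, input_tokens=2000, output_tokens=300
)
total = sum(estimate.values())
for item, usd in estimate.items():
    print(f"{item:>20}: ${usd:.4f}")
print(f"{'total':>20}: ${total:.2f}")
```

Under these assumptions the always-on Cloud SQL instance accounts for roughly $8 of a total near $10, with Gemini tokens making up most of the remainder.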
Update the Service
Anytime you change code, re-ingest data, or bump a dependency:
gcloud run deploy rag --source . --region=$REGION

Cloud Run does zero-downtime traffic shifting on every revision. Old container stays up until the new one passes health checks.
What You Have Now
- A FastAPI app with three endpoints
- A small Dockerfile that builds in seconds and runs anywhere
- A dedicated service account with least-privilege IAM
- A public HTTPS URL that answers questions from your documents
You shipped a RAG app on Google Cloud. Last step covers what comes next.
Reference: Deploy from source to Cloud Run · Cloud Run service identity · Cloud SQL connector with Cloud Run · IAM database authentication for service accounts