Step 8: Wrap in FastAPI, Deploy to Cloud Run — Build a Context-Aware RAG App on Google Cloud

Add the FastAPI Layer

Three endpoints: a health check, the ask endpoint, and an ingest endpoint we can hit after the service is up.

# src/api.py
"""FastAPI surface for Cloud Run."""
 
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
 
from .ask import ask as run_ask
from .ingest import gather_chunks, insert_chunks
from .embeddings import embed
 
app = FastAPI(title="rag-on-gcp")
 
 
class AskRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000)
    k: int = Field(5, ge=1, le=20)
 
 
@app.get("/")
def health() -> dict:
    return {"status": "ok"}
 
 
@app.post("/ask")
def ask(req: AskRequest) -> dict:
    try:
        return run_ask(req.question, k=req.k)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
 
 
@app.post("/admin/ingest")
def ingest() -> dict:
    """Re-ingest everything under data/. Useful for re-running after a deploy."""
    chunks = gather_chunks()
    if not chunks:
        return {"ingested": 0, "message": "No .txt files in data/."}
 
    rows = []
    BATCH = 50
    for start in range(0, len(chunks), BATCH):
        batch = chunks[start : start + BATCH]
        vectors = embed([c[2] for c in batch], task="RETRIEVAL_DOCUMENT")
        rows.extend((s, i, c, v) for (s, i, c), v in zip(batch, vectors))
 
    insert_chunks(rows)
    return {"ingested": len(rows)}
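
The batching loop in the ingest endpoint is easier to reason about in isolation. The sketch below pulls it into a pure function and swaps the real embed call for a hypothetical `fake_embed`, so it runs without Vertex AI. It assumes each chunk is a `(source, index, text)` tuple, which is how the endpoint unpacks them; the helper names `batched` and `build_rows` are illustrative and not part of the project.

```python
from typing import Callable, Iterator, Sequence

# Assumed chunk layout: (source, index, text), matching how the endpoint unpacks chunks.
Chunk = tuple[str, int, str]
Row = tuple[str, int, str, list[float]]


def batched(items: Sequence[Chunk], size: int) -> Iterator[Sequence[Chunk]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def build_rows(
    chunks: Sequence[Chunk],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 50,
) -> list[Row]:
    """Embed chunk text batch by batch and pair each vector with its chunk."""
    rows: list[Row] = []
    for batch in batched(chunks, batch_size):
        vectors = embed_fn([text for _, _, text in batch])
        # zip() silently truncates on a length mismatch, so check explicitly.
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors than inputs")
        rows.extend((src, idx, text, vec) for (src, idx, text), vec in zip(batch, vectors))
    return rows


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for the real embedder: one tiny vector per text."""
    return [[float(len(t))] for t in texts]


chunks = [("a.txt", i, f"chunk {i}") for i in range(120)]
rows = build_rows(chunks, fake_embed)
print(len(rows))  # 120 rows, built from batches of 50, 50, and 20
```

The explicit length check is a design choice rather than something the endpoint does: without it, a short response from the embedder would drop chunks without any error.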

In a real product, /admin/ingest would live behind auth (we cover that in Step 9). For now we deploy with --allow-unauthenticated, so the obscurity of your service URL is the only thing protecting it. Don't put real customer data here yet.

Run it locally one last time before deploying:

uv run uvicorn src.api:app --reload --port 8080

In another terminal:

curl -X POST http://localhost:8080/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?"}'

Same JSON shape as Step 6. Good. Kill the dev server.
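If you would rather call the endpoint from Python than from curl, a standard-library sketch might look like the following. The function name `build_ask_request` is hypothetical; the length and `k` bounds mirror the `AskRequest` model above, so bad input fails on the client before a round trip.

```python
import json
import urllib.request


def build_ask_request(base_url: str, question: str, k: int = 5) -> urllib.request.Request:
    """Build the POST /ask request, mirroring AskRequest's validation bounds."""
    if not 1 <= len(question) <= 2000:
        raise ValueError("question must be 1 to 2000 characters")
    if not 1 <= k <= 20:
        raise ValueError("k must be between 1 and 20")
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ask_request("http://localhost:8080", "What is RAG?")
print(req.get_method(), req.full_url)  # POST http://localhost:8080/ask

# To actually send it while the dev server is running:
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(json.loads(resp.read()))
```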

The Dockerfile

Cloud Run accepts any container that listens on $PORT. Keep it short.

# Dockerfile
FROM python:3.12-slim
 
WORKDIR /app
 
# Install uv for fast dependency resolution at build time.
RUN pip install --no-cache-dir uv
 
# Install deps first so this layer caches when only your code changes.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
 
# Then copy the app.
COPY src ./src
COPY data ./data
 
# Cloud Run sets PORT. Default to 8080 for local docker runs.
ENV PORT=8080
 
# uvicorn reads $PORT at runtime.
CMD ["sh", "-c", "uv run uvicorn src.api:app --host 0.0.0.0 --port $PORT"]
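
The shell form above leans on `$PORT` expansion. If you ever start the server from Python instead, the same "use PORT, fall back to 8080" rule could be sketched like this. `resolve_port` is a hypothetical helper, not something the project defines.

```python
import os
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ, default: int = 8080) -> int:
    """Return the port from PORT, falling back to `default` when it is unset or empty."""
    raw = env.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError if PORT is not a number
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port


print(resolve_port({}))                # 8080
print(resolve_port({"PORT": "9000"}))  # 9000

# Possible use, assuming uvicorn is installed:
#   import uvicorn
#   uvicorn.run("src.api:app", host="0.0.0.0", port=resolve_port())
```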

A .dockerignore keeps the build context lean:

.venv
__pycache__
.env
.git
*.pyc
.pytest_cache

Create a Service Account for the Service

Cloud Run services run as a service account. We make one with exactly the permissions this service needs — nothing more.

export SA_NAME="rag-runtime"
export SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
 
# 1. Create the account
gcloud iam service-accounts create $SA_NAME \
  --display-name="RAG service runtime"
 
# 2. Talk to Cloud SQL
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/cloudsql.client"
 
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/cloudsql.instanceUser"
 
# 3. Call Vertex AI (embeddings + Gemini)
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/aiplatform.user"
 
# 4. Read the secret we made in Step 3
gcloud secrets add-iam-policy-binding db-connection-name \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/secretmanager.secretAccessor"

Now create the Postgres IAM user for that service account — same IAM-auth pattern we used for ourselves, applied to a non-human identity:

gcloud sql users create $SA_EMAIL \
  --instance=rag-db \
  --type=cloud_iam_service_account

(Note: Postgres limits identifiers to 63 characters, so Cloud SQL stores IAM service-account usernames without the .gserviceaccount.com suffix. The DB user actually ends up being rag-runtime@${PROJECT_ID}.iam. We pass the right value via DB_USER below.)
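The suffix handling in that note is easy to get wrong by hand, so here is a small sketch of the mapping from service-account email to database username. The function name and the `my-project` project ID are made up for illustration; the 63-character limit is Postgres's default identifier length.

```python
SUFFIX = ".gserviceaccount.com"
MAX_IDENTIFIER_LENGTH = 63  # Postgres default limit for role names


def db_user_for_service_account(email: str) -> str:
    """Derive the Cloud SQL IAM database username from a service-account email."""
    if not email.endswith(SUFFIX):
        raise ValueError(f"not a service-account email: {email!r}")
    user = email[: -len(SUFFIX)]
    if len(user) > MAX_IDENTIFIER_LENGTH:
        raise ValueError(f"username exceeds {MAX_IDENTIFIER_LENGTH} characters: {user!r}")
    return user


# "my-project" is a placeholder project ID.
print(db_user_for_service_account("rag-runtime@my-project.iam.gserviceaccount.com"))
# rag-runtime@my-project.iam
```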

Deploy

One command. Cloud Build builds the image from the Dockerfile and pushes it to Artifact Registry, then Cloud Run rolls out a new revision.

# The DB_USER for the service account
export RUNTIME_DB_USER="${SA_NAME}@${PROJECT_ID}.iam"
 
gcloud run deploy rag \
  --source . \
  --region=$REGION \
  --service-account=$SA_EMAIL \
  --allow-unauthenticated \
  --memory=1Gi \
  --cpu=1 \
  --timeout=120 \
  --max-instances=3 \
  --set-env-vars="GOOGLE_CLOUD_PROJECT=${PROJECT_ID}" \
  --set-env-vars="GOOGLE_CLOUD_REGION=${REGION}" \
  --set-env-vars="INSTANCE_CONNECTION_NAME=${INSTANCE_CONNECTION_NAME}" \
  --set-env-vars="DB_NAME=rag" \
  --set-env-vars="DB_USER=${RUNTIME_DB_USER}"

The first deploy takes ~3 minutes: Cloud Build needs to build the image. Subsequent deploys are ~1 minute.
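A missing environment variable usually shows up only when the first request fails. One option, sketched here as an assumption about how you might structure startup rather than something the project already does, is to check the five variables from the deploy command up front. `missing_settings` and `require_settings` are hypothetical names.

```python
from typing import Mapping

# The variables passed via --set-env-vars in the deploy command.
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_REGION",
    "INSTANCE_CONNECTION_NAME",
    "DB_NAME",
    "DB_USER",
)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_settings(env: Mapping[str, str]) -> dict[str, str]:
    """Return the required settings, or raise with every missing name listed."""
    missing = missing_settings(env)
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Placeholder values for illustration only.
example_env = {
    "GOOGLE_CLOUD_PROJECT": "my-project",
    "GOOGLE_CLOUD_REGION": "us-central1",
    "DB_NAME": "rag",
}
print(missing_settings(example_env))  # ['INSTANCE_CONNECTION_NAME', 'DB_USER']
```

Calling `require_settings(os.environ)` at import time would make a misconfigured revision fail to start, and Cloud Run then keeps traffic on the previous revision.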

When it finishes you'll see:

Service URL: https://rag-xxxxxxxx-uc.a.run.app

Save that.

export SERVICE_URL=$(gcloud run services describe rag --region=$REGION --format='value(status.url)')

Hit Your Live Service

Health check first:

curl $SERVICE_URL/
# {"status":"ok"}

Re-ingest into the live database (the container ships with data/):

curl -X POST $SERVICE_URL/admin/ingest
# {"ingested": 17}

Ask a question:

curl -X POST $SERVICE_URL/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?"}' | jq

Same JSON shape as your local run in Step 6. Now coming from the public internet, backed by your Cloud SQL instance, generated by Vertex AI Gemini.

What's Actually Happening Underneath

Every auth hop uses the service account's IAM token. No passwords, no API keys, no shared secrets. The only sensitive thing that crosses the wire is the question itself.
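For the curious, this is roughly what Google's client libraries do on your behalf inside the container: they ask the local metadata server for a short-lived access token for the attached service account and send it as a bearer token. The sketch below only builds the request and parses a sample response; the app's own code never needs to do this by hand, and the helper names are hypothetical.

```python
import json
import urllib.request

# Metadata server endpoint available inside Cloud Run and other Google Cloud runtimes.
METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request() -> urllib.request.Request:
    """Build the metadata-server request; the Metadata-Flavor header is mandatory."""
    return urllib.request.Request(METADATA_TOKEN_URL, headers={"Metadata-Flavor": "Google"})


def parse_access_token(body: bytes) -> str:
    """Extract the access token from the metadata server's JSON response."""
    return json.loads(body)["access_token"]


def bearer_header(access_token: str) -> dict[str, str]:
    """Format the token the way Google APIs expect it."""
    return {"Authorization": f"Bearer {access_token}"}


# Sample response body with a made-up token, for illustration only.
sample = b'{"access_token": "ya29.example", "expires_in": 3599, "token_type": "Bearer"}'
print(bearer_header(parse_access_token(sample)))
# {'Authorization': 'Bearer ya29.example'}
```

The token expires after about an hour and is refreshed automatically by the libraries, which is why there is no long-lived credential to store or rotate.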

Streaming Logs

When something looks wrong:

gcloud beta run services logs tail rag --region=$REGION

Hit your endpoint in another terminal and watch the requests roll in.

Cost in Production

Resource	At rest	With moderate traffic
Cloud Run	$0 (scales to zero)	~$0.40/M requests + per-second CPU/RAM
Cloud SQL `db-f1-micro`	~$8/mo (always-on) or ~$0.15/mo (stopped)	Same — billed by the second
Vertex AI embeddings	$0	$0.025 per 1M characters
Vertex AI Gemini Flash	$0	$0.075 per 1M input tokens, $0.30 per 1M output
Egress	$0	~$0 (same region as DB)

A bursty hobby app with a few hundred questions a day rounds to "the cost of leaving Cloud SQL running."
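To check that claim against the table, here is a back-of-the-envelope estimate. The rates come from the table above; the per-question sizes (100 characters embedded, 2,000 input tokens, 300 output tokens) are assumptions you should replace with your own numbers, and Cloud Run's per-second CPU/RAM charge is left out.

```python
# Rates from the table above, in USD.
CLOUD_RUN_PER_M_REQUESTS = 0.40
EMBEDDING_PER_M_CHARS = 0.025
GEMINI_INPUT_PER_M_TOKENS = 0.075
GEMINI_OUTPUT_PER_M_TOKENS = 0.30
CLOUD_SQL_ALWAYS_ON = 8.00


def monthly_cost(
    questions_per_day: int,
    question_chars: int = 100,   # assumed average question length
    input_tokens: int = 2000,    # assumed prompt size including retrieved chunks
    output_tokens: int = 300,    # assumed answer size
    days: int = 30,
) -> dict[str, float]:
    """Estimate monthly cost per line item, excluding Cloud Run CPU/RAM time."""
    n = questions_per_day * days
    costs = {
        "cloud_run_requests": n / 1e6 * CLOUD_RUN_PER_M_REQUESTS,
        "embeddings": n * question_chars / 1e6 * EMBEDDING_PER_M_CHARS,
        "gemini_input": n * input_tokens / 1e6 * GEMINI_INPUT_PER_M_TOKENS,
        "gemini_output": n * output_tokens / 1e6 * GEMINI_OUTPUT_PER_M_TOKENS,
        "cloud_sql": CLOUD_SQL_ALWAYS_ON,
    }
    costs["total"] = sum(costs.values())
    return costs


estimate = monthly_cost(questions_per_day=300)
for item, usd in estimate.items():
    print(f"{item:>20}: ${usd:.4f}")
```

Under these assumptions, 300 questions a day comes to roughly $10 a month, of which $8 is the always-on Cloud SQL instance.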

Update the Service

Anytime you change code, update the files under data/, or bump a dependency:

gcloud run deploy rag --source . --region=$REGION

Cloud Run shifts traffic with zero downtime on every revision. The old revision keeps serving until the new one passes its startup checks.

What You Have Now

A FastAPI app with three endpoints
A small Dockerfile that builds in seconds and runs anywhere
A dedicated service account with least-privilege IAM
A public HTTPS URL that answers questions from your documents

You shipped a RAG app on Google Cloud. Last step covers what comes next.

Reference: Deploy from source to Cloud Run · Cloud Run service identity · Cloud SQL connector with Cloud Run · IAM database authentication for service accounts