What an open GPU endpoint actually costs, how to keep a stranger from running up your bill, how to lock it down for production, and the two commands that take it all back to zero.
Before this
A deployed service from Step 4
What It Costs
Cloud Run GPU is billed per second while an instance is awake, and there is no always-free GPU tier — the free monthly allotment covers CPU, memory, and requests, but every GPU-second is billed from the first one.
| State | Cost |
| --- | --- |
| Idle (scaled to zero) | $0: no instance, no GPU, no charge |
| Awake and serving | ~$1.40 per active hour: roughly $0.67 for the GPU plus ~$0.75 for the 8 vCPU / 32 GiB alongside it |
| Pinned awake 24/7 | ~$800–1,000/month (don't do this for a demo) |
The whole point of scale-to-zero: a GPU instance shuts down after a short idle period (roughly 10–15 minutes) with no traffic. So a build-test-demo-teardown session — even with a few hundred requests — lands in the low single-dollar range. The danger isn't normal use; it's an instance left awake, or one woken repeatedly by traffic you didn't expect.
Per-second prices vary by region and change over time. Check the live Cloud Run pricing page for exact figures before relying on them.
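To make the arithmetic concrete, here is a minimal sketch that turns awake time into an estimated bill. It assumes the approximate ~$1.40 per awake hour quoted above as a flat rate; `RATE_PER_HOUR` and `estimate_cost` are illustrative names, not part of any Google tooling, and the rate should be replaced with the live price for your region.

```shell
# Approximate all-in rate while the instance is awake (GPU + vCPU + memory).
# This is the rough figure from this section, not a live price.
RATE_PER_HOUR=1.40

# estimate_cost AWAKE_MINUTES
# Prints the estimated cost in dollars for that many minutes of awake time.
estimate_cost() {
  awk -v minutes="$1" -v rate="$RATE_PER_HOUR" \
    'BEGIN { printf "%.2f\n", minutes / 60 * rate }'
}

# A 45-minute demo plus a 15-minute idle tail before scale-to-zero.
estimate_cost 60

# The same instance pinned awake for a 30-day month.
estimate_cost 43200
```

Count the idle tail as awake time: the instance keeps billing for the 10–15 minutes it takes to scale to zero after the last request.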
The Open-Endpoint Risk
--allow-unauthenticated means anyone with the URL can run inference on your GPU. There's no key to leak because there's no key at all. For a workshop or a personal demo that's fine — but understand what you've exposed, and bound it.
The guardrails you already have, and what each actually does:
--max-instances 1 — your hard ceiling. No matter how much traffic arrives, at most one GPU instance ever exists. This is the single most important cost control. (Cloud Run's default is 100; never leave that on a public GPU service.)
Scale-to-zero: as long as you don't set --min-instances, an idle service costs nothing, and abuse stops billing a short while (~10–15 minutes) after the last request.
--timeout + --concurrency — cap how long and how many requests a single instance handles, limiting the blast radius of one abuser.
A budget alert — set one now:
# In the console: Billing → Budgets & alerts → Create budget
But know its limit: a budget alert only emails you after spend crosses a threshold. It does not cap or stop anything. A real hard stop requires wiring the budget to a Pub/Sub topic and a function that disables billing — out of scope here, but Google documents it.
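If you would rather script the budget than click through the console, the sketch below shows one possible CLI equivalent using `gcloud billing budgets create`. The function name, the display name, the 10USD default, and the 50%/90% thresholds are all example values chosen for illustration; the billing account ID is something you look up yourself (for instance with `gcloud billing accounts list`).

```shell
# create_demo_budget BILLING_ACCOUNT_ID PROJECT_ID [AMOUNT]
# Creates a budget scoped to one project that alerts at 50% and 90% of AMOUNT.
# The display name and the 10USD default are arbitrary example values.
create_demo_budget() {
  gcloud billing budgets create \
    --billing-account="$1" \
    --display-name="gemma-demo-budget" \
    --budget-amount="${3:-10USD}" \
    --filter-projects="projects/$2" \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.9
}
```

Call it as `create_demo_budget <billing-account-id> "$PROJECT_ID"`. Like the console version, this only sends alerts; it does not stop spend.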
What doesn't help: a shared "API key" checked in your page's JavaScript. Anyone can read it straight from the page source, so it stops nobody. Real auth has to live server-side — which means the next section.
Locking It Down for Production
When you're past the demo, pick one:
Option A — make it private again (simplest). Drop public access and require a Google identity token. Now only callers you grant roles/run.invoker can reach it:
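# Remove the public (allUsers) binding so unauthenticated calls are rejected.
gcloud run services remove-iam-policy-binding gemma --region $REGION \
  --member="allUsers" --role="roles/run.invoker"
# Grant invoke permission to a specific identity.
gcloud run services add-iam-policy-binding gemma --region $REGION \
  --member="user:you@gmail.com" --role="roles/run.invoker"
# Then call it with a token:
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  "$SERVICE_URL/v1/chat/completions" -d '{...}'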
A browser can't hold that token safely, so this means putting a small backend of your own in front — which is the honest answer for a real product.
Option B — keep it browser-callable but guarded. Front the service with an external HTTPS load balancer, lock direct *.run.app ingress (--ingress=internal-and-cloud-load-balancing), and attach Cloud Armor rate-limiting rules. More moving parts, but it keeps the public-browser shape while throttling abuse.
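As a rough sketch of the Cloud Run and Cloud Armor half of Option B, the function below restricts ingress and attaches a per-IP throttle. It assumes the external HTTPS load balancer already exists and that you pass in the name of its global backend service; building the load balancer itself (serverless NEG, URL map, proxy, forwarding rule) is not shown. The policy name `gemma-throttle`, the priority `1000`, and the limit of 30 requests per minute are example values, not recommendations from this guide.

```shell
# lock_down_with_armor BACKEND_SERVICE
# Assumes REGION is set and BACKEND_SERVICE is the load balancer's
# global backend service (a hypothetical name you supply).
lock_down_with_armor() {
  # 1. Refuse traffic that bypasses the load balancer via the *.run.app URL.
  gcloud run services update gemma --region "$REGION" \
    --ingress internal-and-cloud-load-balancing || return 1

  # 2. Create a Cloud Armor security policy (example name).
  gcloud compute security-policies create gemma-throttle || return 1

  # 3. Throttle each client IP to 30 requests per 60 seconds; excess gets HTTP 429.
  gcloud compute security-policies rules create 1000 \
    --security-policy gemma-throttle \
    --src-ip-ranges "*" \
    --action throttle \
    --rate-limit-threshold-count 30 \
    --rate-limit-threshold-interval-sec 60 \
    --conform-action allow \
    --exceed-action deny-429 \
    --enforce-on-key IP || return 1

  # 4. Attach the policy to the load balancer's backend service.
  gcloud compute backend-services update "$1" \
    --security-policy gemma-throttle --global
}
```

Nothing runs until you call `lock_down_with_armor <backend-service-name>`. Each step returns early on failure so a half-applied configuration is easier to spot.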
For most "I just want a demo" cases, neither is needed — you tear it down instead.
Teardown
Opened a new terminal since Step 4? Re-set your vars first:
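If the original values are no longer at hand, one way to rebuild them is sketched below. It assumes the project from Step 4 is still your active gcloud project, that the service is named `gemma` as in the commands in this section, and that you remember which region you deployed to. `reset_vars` is an illustrative helper name; it covers only the three variables this page references and may not match every variable Step 4 defined.

```shell
# reset_vars REGION
# Re-creates the variables used by the commands in this section.
reset_vars() {
  REGION="$1"
  # The currently active gcloud project.
  PROJECT_ID="$(gcloud config get-value project)"
  # The deployed service's URL, read back from Cloud Run.
  SERVICE_URL="$(gcloud run services describe gemma \
    --region "$REGION" --format 'value(status.url)')"
  export REGION PROJECT_ID SERVICE_URL
}
```

Run it with the region you used in Step 4, for example `reset_vars <your-region>`, then check the result with `echo "$PROJECT_ID $REGION $SERVICE_URL"` before deleting anything.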
# 1. Delete the service. This stops all GPU/CPU billing immediately.
gcloud run services delete gemma --region $REGION --quiet
# 2. Delete the image so it stops costing Artifact Registry storage.
# (Cloud Run's --source build creates a repo named "cloud-run-source-deploy".)
gcloud artifacts repositories delete cloud-run-source-deploy \
  --location $REGION --quiet
Or nuke everything — service, image, project, the lot — in one shot:
gcloud projects delete $PROJECT_ID
That sends the project to a 30-day recovery window (undo with gcloud projects undelete), then it's gone for good. It's the cleanest "I'm done" button there is.
Where to Go From Here
Swap the model size. Change gemma4:e4b to gemma4:e2b (lighter/faster) in the Dockerfile and your requests, then redeploy. Same shape, different trade-off.
Higher throughput? When one Ollama instance isn't enough, vLLM serves the same OpenAI-compatible API with much higher concurrency. Google ships a prebuilt vLLM image for Gemma. The webpage doesn't change.
One-click path. Google AI Studio has a "Deploy to Cloud Run" button for Gemma that does a version of all this for you, which is handy once you understand what it's doing under the hood.
A real frontend. The streaming loop in Step 5 drops straight into a React/Next.js component — same fetch + getReader logic.
Key Takeaways
A GPU is one flag on Cloud Run. --gpu 1 --gpu-type nvidia-l4 is the whole story; the rest is a normal deploy.
Bake the model into the image for fast, repeatable cold starts — viable for any Gemma variant under ~10 GB.
OLLAMA_ORIGINS is the CORS switch that makes a browser call work with no proxy. --allow-unauthenticated is the open switch. Together they're the entire "callable from a webpage" trick.
Open + GPU = real money exposed. --max-instances 1 and scale-to-zero bound it; a budget alert only warns you; teardown is the true off switch.
The blueprint is the unit of work. What you built in six steps — open model, serverless GPU, public URL, streaming webpage — is a complete, demoable system. Everything above is where you take it next.