The Goal
By the end of this blueprint you will have:
- A Cloud Run service with an NVIDIA L4 GPU attached, running Gemma 4 behind a single HTTPS URL.
- That URL deployed open and unauthenticated — anyone, including a webpage's JavaScript, can call it with no token.
- A one-file webpage (index.html, no framework, no build step) that POSTs a message and streams the reply token by token.
You will end up calling your own Cloud Run URL like this:
curl https://gemma-xxxxxxxx-uc.a.run.app/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma4:e4b","messages":[{"role":"user","content":"Explain CORS in one sentence."}]}'
…and getting an answer back from a model you are hosting: not OpenAI, not Anthropic, not Google's API. Then you'll open a webpage that does the same thing, live, in front of you.
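The webpage built later streams its reply, and the same behavior can be previewed from a terminal. The following is a minimal sketch, assuming the service honors the OpenAI-style "stream": true field on /v1/chat/completions. SERVICE_URL, chat_payload, and stream_chat are illustrative names, not part of the companion repo, and the URL is the same placeholder used above.

```shell
# Placeholder URL (assumption): substitute the one printed by your own deploy.
SERVICE_URL="${SERVICE_URL:-https://gemma-xxxxxxxx-uc.a.run.app}"

# Build an OpenAI-style chat body with streaming turned on.
# Only backslashes and double quotes are escaped, so this suits short
# one-line prompts; it is not a general-purpose JSON encoder.
chat_payload() {
  printf '{"model":"gemma4:e4b","stream":true,"messages":[{"role":"user","content":"%s"}]}' \
    "$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
}

# -N disables curl's output buffering so chunks print as they arrive.
stream_chat() {
  curl -sN "$SERVICE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")"
}

# Usage (needs a live service): stream_chat "Explain CORS in one sentence."
```

Nothing runs until stream_chat is called, so the payload can be inspected on its own with chat_payload "some prompt" before any GPU time is spent.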
Architecture
One container. One GPU. One public URL. The browser talks to it directly — there is no backend of your own in the middle.
Why This Stack
| Choice | Why |
|---|---|
| Gemma 4 | Google's open-weights model (Apache 2.0 — full commercial freedom). It runs entirely on your GPU; there's no upstream API to authenticate against or get rate-limited by. |
| Ollama | The simplest way to serve an open model over HTTP. It exposes an OpenAI-compatible API (/v1/chat/completions) and handles CORS for you with one environment variable. |
| Cloud Run + L4 GPU | A GPU is now a pair of flags on Cloud Run (--gpu 1 --gpu-type nvidia-l4). It scales to zero when idle, so you pay only while an instance is awake, and it's one deploy command with no Kubernetes. |
| --allow-unauthenticated | Makes the URL callable by anyone, which is exactly what a browser needs (browser JS can't hold a private Google credential safely). |
| OLLAMA_ORIGINS | The single setting that lets a webpage on a different origin call the service. No proxy, no CORS middleware of your own. |
| A static index.html | No server, no framework. Open it locally or drop it on GitHub Pages. The whole frontend is ~40 lines. |
What is not in this stack: a Python backend, a vector database, an API gateway, an auth system, a Docker install on your laptop. The model server is the backend, and the browser calls it directly.
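To make the table concrete, here is a hedged sketch of how those choices might come together in one deploy command. It only prints the command rather than running it. The service name, the region, the --source . build, and the --cpu and --memory sizes are assumptions for illustration; the deploy.sh in the companion repo is the authoritative version.

```shell
# Illustrative defaults (assumptions), overridable from the environment.
SERVICE="${SERVICE:-gemma}"
REGION="${REGION:-us-central1}"

# Print a candidate deploy command without executing it.
# --gpu / --gpu-type attach the L4, --max-instances 1 is the hard cap,
# --allow-unauthenticated is what makes the URL callable from a browser.
deploy_command() {
  echo "gcloud run deploy $SERVICE --source . --region $REGION" \
       "--gpu 1 --gpu-type nvidia-l4 --cpu 8 --memory 32Gi --no-cpu-throttling" \
       "--max-instances 1 --allow-unauthenticated"
}

deploy_command
```

Printing first keeps the flags reviewable; once they look right for your project, the same line can be pasted into a terminal.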
What "Open and Unauthenticated" Really Means
This is the part most tutorials skip, so let's be honest up front.
--allow-unauthenticated means anyone who knows your URL can run inference on your GPU, and that GPU costs real money per second while it's awake. That's the price of letting a public webpage call it with no key. It's perfect for a demo, a workshop, or a personal project — and a bad idea to leave running unattended.
This blueprint deliberately diverges from Google's official Gemma-on-Cloud-Run tutorial, which deploys privately (--no-allow-unauthenticated). We're going open on purpose because the goal is "call it from a webpage with zero auth." We bound the risk with a hard one-instance cap, scale-to-zero, and a teardown step — and Step 6 covers the cost math, the abuse mitigations, and how to lock it down for production.
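As a preview of the teardown half of that risk bound, the sketch below wraps the service deletion in a dry-run guard. SERVICE, REGION, CONFIRM, and the teardown function are hypothetical names chosen for this example; the teardown commands the blueprint actually uses are the ones in Step 6.

```shell
# Illustrative defaults (assumptions), overridable from the environment.
SERVICE="${SERVICE:-gemma}"
REGION="${REGION:-us-central1}"

# Delete the Cloud Run service only when CONFIRM=yes is set;
# otherwise just print the command that would run.
teardown() {
  if [ "${CONFIRM:-no}" = "yes" ]; then
    gcloud run services delete "$SERVICE" --region "$REGION" --quiet
  else
    echo "gcloud run services delete $SERVICE --region $REGION --quiet"
  fi
}

# Usage: teardown              (dry run, prints the command)
#        CONFIRM=yes teardown  (actually deletes the service)
```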
The Companion Repo
Every file below is also in a runnable repo: github.com/maraja/deploy-gemma-to-cloud-run. It's tiny on purpose — a Dockerfile, a deploy.sh, and web/index.html. Clone it and run two commands, or follow the blueprint and type it out yourself. Same result either way.
git clone https://github.com/maraja/deploy-gemma-to-cloud-run.git
cd deploy-gemma-to-cloud-run
What's Coming
Six short steps:
- What we're building (you're here)
- Set up Google Cloud: install gcloud, create a project, enable three APIs, pick a GPU region
- Bake Gemma into a container: a six-line Dockerfile that bundles the model and turns on CORS
- Deploy as an open API: one gcloud run deploy with a GPU and --allow-unauthenticated, then prove it's live
- Call it from a webpage: a single streaming index.html, hosted anywhere
- Costs, safety & teardown: what it costs, how to bound abuse, and the commands that take it all back to zero
Cost Heads-Up
Cloud Run GPUs are billed per second while an instance is awake, and there is no always-free GPU tier — every GPU-second costs from the first one. The good news: the service scales to zero, so once it's been idle for ~10–15 minutes it shuts down and the GPU cost stops.
A realistic build-and-demo session — deploy, run a few dozen requests, tear it down — costs in the low single-dollar range. Leaving an L4 instance pinned awake 24/7 would run roughly $800+/month, which is why Step 6 ends with a teardown command. Set a budget alert before you start.
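The two figures above can be sanity-checked against each other with a little arithmetic. This sketch back-derives an hourly rate from the rough $800/month always-on figure, treating a month as 720 hours, and then prices a short session at that rate. The derived rate is an estimate implied by this article's own number, not a quoted price; gpu_cost and implied_hourly_rate are illustrative helper names.

```shell
# Dollars spent for a given hourly rate (USD) and number of awake hours.
gpu_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Hourly rate implied by roughly $800 per always-on month (30 days * 24 h = 720 h).
implied_hourly_rate() {
  awk 'BEGIN { printf "%.2f\n", 800 / 720 }'
}

implied_hourly_rate                        # prints 1.11
gpu_cost "$(implied_hourly_rate)" 2        # two awake hours: prints 2.22
```

At about $1.11 per awake hour, a two-hour build-and-demo session lands near $2, which is consistent with the low single-dollar estimate.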
Reference: Gemma 4 model card · Cloud Run GPU overview · Run Gemma on Cloud Run with Ollama (official tutorial) · Ollama OpenAI compatibility