Step 1: What We're Building — Deploy Gemma as an Open API on Cloud Run

The Goal

By the end of this blueprint you will have:

A Cloud Run service with an NVIDIA L4 GPU attached, running Gemma 4 behind a single HTTPS URL.
That URL deployed open and unauthenticated — anyone, including a webpage's JavaScript, can call it with no token.
A one-file webpage (index.html, no framework, no build step) that POSTs a message and streams the reply token by token.

You will end up calling your own Cloud Run URL like this:

curl https://gemma-xxxxxxxx-uc.a.run.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4:e4b","messages":[{"role":"user","content":"Explain CORS in one sentence."}]}'

…and getting an answer back from a model you are hosting — not OpenAI, not Anthropic, not Google's API. Then you'll open a webpage that does the same thing, live, in front of you.
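For comparison, here is a minimal JavaScript sketch of the same request, roughly as a page script might build it. The helper names `buildChatRequest` and `extractReply` are hypothetical, and the response shape (`choices[0].message.content`) assumes the endpoint follows the OpenAI chat-completions format that Ollama's compatibility layer advertises.

```javascript
// Hypothetical helper: builds the URL and fetch options for one chat turn.
// baseUrl is a placeholder for your own Cloud Run URL.
function buildChatRequest(baseUrl, userText, model = "gemma4:e4b") {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    options: {
      method: "POST",
      // No Authorization header: the service is deployed unauthenticated.
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userText }],
      }),
    },
  };
}

// Hypothetical helper: pulls the assistant text out of a non-streaming reply.
// Assumes an OpenAI-style body; returns "" if the expected fields are missing.
function extractReply(responseJson) {
  return responseJson?.choices?.[0]?.message?.content ?? "";
}

// Usage (needs a live service, so it is left commented out):
// const { url, options } = buildChatRequest(
//   "https://gemma-xxxxxxxx-uc.a.run.app",
//   "Explain CORS in one sentence."
// );
// const reply = extractReply(await (await fetch(url, options)).json());
```

Note that the request carries nothing secret, which is what makes it safe to ship inside a public page.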

Architecture

One container. One GPU. One public URL. The browser talks to it directly — there is no backend of your own in the middle.
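Because the page consumes the response itself, it also has to split the stream into tokens. The sketch below assumes the service emits OpenAI-style server-sent events when the request sets `"stream": true` (lines of the form `data: {...}`, ending with `data: [DONE]`). `createDeltaParser` and `streamChat` are illustrative names and may differ from what the companion repo uses.

```javascript
// Buffers raw text chunks and returns the content tokens found in complete lines.
// A chunk boundary can fall mid-line, so the trailing partial line is kept for later.
function createDeltaParser() {
  let buffer = "";
  let finished = false;
  return {
    push(chunkText) {
      buffer += chunkText;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // possibly incomplete last line
      const tokens = [];
      for (const raw of lines) {
        const line = raw.trim();
        if (!line.startsWith("data:")) continue; // skip blank separator lines
        const payload = line.slice(5).trim();
        if (payload === "[DONE]") {
          finished = true;
          break;
        }
        const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
        if (delta) tokens.push(delta);
      }
      return tokens;
    },
    get done() {
      return finished;
    },
  };
}

// Sketch of the browser side: POST with stream: true and hand each token to onToken.
// Not invoked here because it needs a live service.
async function streamChat(url, body, onToken) {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ...body, stream: true }),
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const parser = createDeltaParser();
  while (!parser.done) {
    const { value, done } = await reader.read();
    if (done) break;
    for (const token of parser.push(decoder.decode(value, { stream: true }))) {
      onToken(token);
    }
  }
}
```

Keeping the parsing separate from the network call means the token-splitting logic can be exercised with plain strings, without a running GPU.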

Why This Stack

Choice	Why
Gemma 4	Google's open-weights model (Apache 2.0 — full commercial freedom). It runs entirely on your GPU; there's no upstream API to authenticate against or get rate-limited by.
Ollama	The simplest way to serve an open model over HTTP. It exposes an OpenAI-compatible API (`/v1/chat/completions`) and handles CORS for you with one environment variable.
Cloud Run + L4 GPU	A GPU is now just two flags on Cloud Run (`--gpu 1 --gpu-type nvidia-l4`). The service scales to zero when idle, so you pay only while an instance is awake, and the whole thing is one deploy command with no Kubernetes.
`--allow-unauthenticated`	Makes the URL callable by anyone, which is exactly what a browser needs (browser JS can't hold a private Google credential safely).
`OLLAMA_ORIGINS`	The single setting that lets a webpage on a different origin call the service. No proxy, no CORS middleware of your own.
A static `index.html`	No server, no framework. Open it locally or drop it on GitHub Pages. The whole frontend is ~40 lines.

What is not in this stack: a Python backend, a vector database, an API gateway, an auth system, a Docker install on your laptop. The model server is the backend, and the browser calls it directly.
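To make "a different origin" concrete: browsers compare the scheme, host, and port of the page against those of the URL being called. The sketch below illustrates that browser-side rule using the standard `URL` class. It is not a model of how Ollama matches `OLLAMA_ORIGINS`, and the page URL is a made-up example.

```javascript
// Two URLs share an origin only if scheme, host, and port all match.
function isCrossOrigin(pageUrl, apiUrl) {
  return new URL(pageUrl).origin !== new URL(apiUrl).origin;
}

// A page on GitHub Pages calling the Cloud Run service: different hosts.
console.log(
  isCrossOrigin(
    "https://example.github.io/demo/",
    "https://gemma-xxxxxxxx-uc.a.run.app/v1/chat/completions"
  )
); // true

// Two paths on the same host: same origin, so CORS does not apply.
console.log(isCrossOrigin("https://example.com/a", "https://example.com/b")); // false
```

Any page that is not served from the Cloud Run URL itself falls into the first case, which is why the CORS setting matters for this setup.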

What "Open and Unauthenticated" Really Means

This is the part most tutorials skip, so let's be honest up front.

--allow-unauthenticated means anyone who knows your URL can run inference on your GPU, and that GPU costs real money per second while it's awake. That's the price of letting a public webpage call it with no key. It's perfect for a demo, a workshop, or a personal project — and a bad idea to leave running unattended.

This blueprint deliberately diverges from Google's official Gemma-on-Cloud-Run tutorial, which deploys privately (--no-allow-unauthenticated). We're going open on purpose because the goal is "call it from a webpage with zero auth." We bound the risk with a hard one-instance cap, scale-to-zero, and a teardown step — and Step 6 covers the cost math, the abuse mitigations, and how to lock it down for production.

The Companion Repo

Every file below is also in a runnable repo: github.com/maraja/deploy-gemma-to-cloud-run. It's tiny on purpose — a Dockerfile, a deploy.sh, and web/index.html. Clone it and run two commands, or follow the blueprint and type it out yourself. Same result either way.

git clone https://github.com/maraja/deploy-gemma-to-cloud-run.git
cd deploy-gemma-to-cloud-run

What's Coming

Six short steps:

1. What we're building (you're here)
2. Set up Google Cloud: install gcloud, create a project, enable three APIs, pick a GPU region
3. Bake Gemma into a container: a six-line Dockerfile that bundles the model and turns on CORS
4. Deploy as an open API: one gcloud run deploy with a GPU and --allow-unauthenticated, then prove it's live
5. Call it from a webpage: a single streaming index.html, hosted anywhere
6. Costs, safety & teardown: what it costs, how to bound abuse, and the commands that take it all back to zero

Cost Heads-Up

Cloud Run GPUs are billed per second while an instance is awake, and there is no always-free GPU tier: you pay from the very first GPU-second. The good news is that the service scales to zero, so after roughly 10–15 minutes of idleness the instance shuts down and the GPU charges stop.

A realistic build-and-demo session — deploy, run a few dozen requests, tear it down — costs in the low single-dollar range. Leaving an L4 instance pinned awake 24/7 would run roughly $800+/month, which is why Step 6 ends with a teardown command. Set a budget alert before you start.
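As a back-of-the-envelope check, the rough $800/month always-on figure implies a per-second rate, and from that you can estimate a session. The sketch below is arithmetic only: it assumes a 30-day month, uses the approximate figure above rather than a published price, and folds the idle tail before scale-to-zero into the awake time. The function names are illustrative.

```javascript
const SECONDS_PER_MONTH = 30 * 24 * 3600; // assumes a 30-day month

// Per-second rate implied by an always-on monthly figure.
// The default of 800 is the rough estimate from the text, not an official price.
function impliedRatePerSecond(monthlyAlwaysOnUsd = 800) {
  return monthlyAlwaysOnUsd / SECONDS_PER_MONTH;
}

// Estimated cost of one session with a single instance:
// active time plus the idle tail before the instance scales to zero.
function estimateSessionCostUsd(activeMinutes, idleTailMinutes = 15, monthlyAlwaysOnUsd = 800) {
  const awakeSeconds = (activeMinutes + idleTailMinutes) * 60;
  return awakeSeconds * impliedRatePerSecond(monthlyAlwaysOnUsd);
}

// 45 active minutes + 15 idle minutes = one hour awake in total.
console.log(estimateSessionCostUsd(45).toFixed(2)); // 1.11
```

Under these assumptions an hour awake lands at about a dollar, consistent with the low single-dollar estimate for a short session. Actual bills itemize GPU, CPU, and memory separately, so treat this as an order-of-magnitude guide and check the current pricing page.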

Reference: Gemma 4 model card · Cloud Run GPU overview · Run Gemma on Cloud Run with Ollama (official tutorial) · Ollama OpenAI compatibility