Step 4: Deploy as an Open API

Deploy

Run everything from the directory that contains your Dockerfile. The first build bakes a 9.6 GB model into the image, which can run past Cloud Build's 10-minute default timeout, so raise the timeout first:

gcloud config set builds/timeout 3600

Then deploy:

gcloud run deploy gemma \
  --source . \
  --region $REGION \
  --allow-unauthenticated \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-gpu-zonal-redundancy \
  --cpu 8 \
  --memory 32Gi \
  --no-cpu-throttling \
  --max-instances 1 \
  --concurrency 4 \
  --timeout 600 \
  --port 8080 \
  --startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=1,timeoutSeconds=240,periodSeconds=240

What each flag does:

Flag	Why
`--source .`	Cloud Build packages your `Dockerfile`, pushes the image to Artifact Registry, and deploys it — no local Docker needed.
`--allow-unauthenticated`	The open switch. Grants the public the right to invoke the service. This is what lets a webpage call it with no token.
`--gpu 1 --gpu-type nvidia-l4`	Attaches one NVIDIA L4 (24 GB VRAM). That's the whole "add a GPU" story.
`--no-gpu-zonal-redundancy`	Uses the cheaper, self-serve GPU mode — and it's the mode the auto-granted quota covers, so your first deploy just works.
`--cpu 8 --memory 32Gi`	An L4 requires at least 4 CPU / 16 GiB; 8 / 32 is the recommended pairing for an LLM.
`--no-cpu-throttling`	Keeps the CPU allocated for the instance's whole life (required billing model for GPU).
`--max-instances 1`	Your cost ceiling. At most one GPU instance ever exists. The default is 100 — never leave that on a public GPU service.
`--concurrency 4`	One instance serves up to 4 requests at once (matches `OLLAMA_NUM_PARALLEL`).
`--timeout 600`	Allow up to 10 minutes per request, so long generations don't get cut off.
`--startup-probe ...`	The model takes time to load on boot; this long probe tells Cloud Run "don't declare it dead while it's still warming up."

Notice that the command has no --set-env-vars flag: OLLAMA_HOST, OLLAMA_ORIGINS, and the rest are baked into the image. The deploy command only provisions infrastructure and opens the door.
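The deploy command expands $REGION, so an empty value only surfaces as a confusing gcloud error. A small guard can catch that before any build starts. This is a minimal sketch; `require_var` is a hypothetical helper name, not part of gcloud.

```shell
# Hypothetical helper (not a gcloud feature): fail fast when a required
# shell variable is unset or empty. Works in POSIX sh and bash.
require_var() {
  name="$1"
  # Indirectly read the variable whose name was passed in.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "error: $name is not set" >&2
    return 1
  fi
}

# Usage sketch: only deploy when REGION has a value.
#   require_var REGION && gcloud run deploy gemma --source . --region "$REGION" ...
```

Chaining the guard with && means the slow build is never started with a missing region.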

First Deploy Takes a While

The first deploy does three slow things: Cloud Build downloads the 9.6 GB model and bakes the image, pushes it to Artifact Registry, then Cloud Run provisions a GPU. Budget 10–15 minutes for the first run; later deploys are faster.

We raised builds/timeout above so this large bake doesn't hit the 10-minute default. If the build still times out, the builder was likely just slow, so rerun the deploy.

When it finishes you'll see:

Service URL: https://gemma-xxxxxxxx-uc.a.run.app

Save it:

export SERVICE_URL="$(gcloud run services describe gemma --region "$REGION" --format='value(status.url)')"
echo "$SERVICE_URL"
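If the describe call fails (wrong region or service name, for example), SERVICE_URL ends up empty and every later curl check fails in a misleading way. A quick sanity check on the saved value might look like the sketch below; `is_https_url` is a hypothetical helper, and it only checks the shape of the string, not that the service answers.

```shell
# Hypothetical helper: succeed only when the argument is a non-empty
# string that starts with https:// and has something after the scheme.
is_https_url() {
  case "$1" in
    https://?*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage sketch:
#   is_https_url "$SERVICE_URL" || echo "SERVICE_URL looks wrong: '$SERVICE_URL'" >&2
```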

If `--allow-unauthenticated` Is Refused

On a personal Google account this just works. But if your account belongs to an organization (a Workspace/company domain), an admin may have turned on Domain Restricted Sharing — an org policy that forbids granting access to allUsers. The deploy then fails with an IAM policy error.

The clean workaround is to make the service public without an allUsers IAM grant, using the newer invoker-check flag (which Domain Restricted Sharing does not block):

gcloud run services update gemma --region $REGION --no-invoker-iam-check

If even that is blocked, your org has locked down public services on purpose — use a personal account for this demo, or follow Step 6 to run it privately with IAM instead.

Prove It's Live

Three checks. Run them in order.

1. Public and reachable: no auth header, expect 200. This very first request can wake a cold instance and wait through the startup probe, so it may take up to a minute (occasionally longer). Pass --max-time 300 so curl keeps waiting for up to five minutes and then gives up, rather than hanging indefinitely:

curl -i --max-time 300 "$SERVICE_URL"
# HTTP/2 200 ... "Ollama is running"

A 403 here means the service is still private (the allUsers grant didn't take — see above).
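For scripting this check, curl can print just the status code instead of the full response. The sketch below pairs that with a small lookup of the outcomes described above; `probe_status` and `explain_status` are hypothetical helper names, and the wording of the messages is illustrative.

```shell
# Fetch only the HTTP status code of a URL, with the same 300 s cap as above.
# curl prints 000 when it gets no HTTP response at all.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 300 "$1"
}

# Map a status code to the meaning described in this step.
explain_status() {
  case "$1" in
    200) echo "public and reachable" ;;
    403) echo "still private: the public-access grant did not take" ;;
    000) echo "no HTTP response: curl could not connect or timed out" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage sketch:
#   explain_status "$(probe_status "$SERVICE_URL")"
```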

2. CORS is on — the preflight a browser sends. Expect 204 with access-control-allow-origin: *:

curl -i -X OPTIONS "$SERVICE_URL/v1/chat/completions" \
  -H "Origin: https://example.com" \
  -H "Access-Control-Request-Method: POST" \
  -H "Access-Control-Request-Headers: content-type"

If access-control-allow-origin is missing, OLLAMA_ORIGINS didn't make it into the image — recheck the Dockerfile and redeploy.
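To turn the preflight into a pass/fail check, the response headers can be piped through a case-insensitive match. This is a rough sketch; `has_cors_header` is a hypothetical helper that only tests whether the header is present, not what value it carries.

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed
# if an access-control-allow-origin header is present (any case).
has_cors_header() {
  grep -qi '^access-control-allow-origin:'
}

# Usage sketch: -D - writes the response headers to stdout, -o /dev/null drops the body.
#   curl -s -o /dev/null -D - -X OPTIONS "$SERVICE_URL/v1/chat/completions" \
#     -H "Origin: https://example.com" \
#     -H "Access-Control-Request-Method: POST" \
#     -H "Access-Control-Request-Headers: content-type" \
#     | has_cors_header && echo "CORS header present"
```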

3. Inference — the real thing:

curl "$SERVICE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4:e4b","messages":[{"role":"user","content":"Say hi in one short sentence."}],"stream":false}'

The first request after the service has scaled to zero wakes a GPU instance and loads the model, which can take 30–60 seconds. Every request after that is fast, and the instance stays warm until it has been idle for ~10–15 minutes.

You should get back a JSON object with choices[0].message.content — a reply from a Gemma model you're hosting, reachable by anyone on the internet.
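To see only the reply text instead of the whole JSON object, the response can be piped through a small filter. The sketch below assumes python3 is installed locally (jq would work just as well); `extract_reply` is a hypothetical helper name.

```shell
# Hypothetical helper: read a chat-completions JSON response on stdin
# and print choices[0].message.content. Assumes python3 is on PATH.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage sketch: add -s so curl's progress meter stays out of the way.
#   curl -s "$SERVICE_URL/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d '{"model":"gemma4:e4b","messages":[{"role":"user","content":"Say hi in one short sentence."}],"stream":false}' \
#     | extract_reply
```

If the service returns an error body instead of a completion, the filter exits with a Python traceback, which is a reasonable signal to rerun the request without the filter and read the raw response.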

What's Actually Happening

What You Have Now

A live Cloud Run service with an L4 GPU running Gemma 4
A public HTTPS URL that anyone can call with no key
Confirmed: it's reachable, CORS-ready, and answering

Next: the webpage that talks to it.

Reference: Deploy from source to Cloud Run · Cloud Run GPU configuration · Allowing public (unauthenticated) access · Domain Restricted Sharing