The Endpoint
Ollama exposes an OpenAI-compatible API, so the request shape is the one every LLM tutorial already uses:
POST https://gemma-xxxxxxxx-uc.a.run.app/v1/chat/completions
Content-Type: application/json
{ "model": "gemma4:e4b", "messages": [...], "stream": true }
Two things to get right:
- The /v1 prefix is mandatory. Posting to the bare host gives a 404.
- model must be the exact tag you baked (gemma4:e4b), or you get 404 model not found.
With stream: true, the response is Server-Sent Events: a series of data: {json} lines, each carrying a token at choices[0].delta.content, ending with a literal data: [DONE].
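To make that wire format concrete, here is a minimal sketch of handling a single line of the stream. The helper name parseSseLine is hypothetical (it is not part of Ollama or of the page below), and the sample lines are hand-written to match the shape described above rather than captured from a real response.

```javascript
// Hypothetical helper: classify one line of an OpenAI-style SSE stream.
// Returns { done: true } for the [DONE] sentinel, { token } for a content
// delta, and null for anything else (blank lines, comments, empty deltas).
function parseSseLine(line) {
  const trimmed = line.trim();
  if (!trimmed.startsWith("data:")) return null;
  const payload = trimmed.slice("data:".length).trim();
  if (payload === "[DONE]") return { done: true };
  try {
    const token = JSON.parse(payload).choices?.[0]?.delta?.content;
    return token ? { token } : null;
  } catch {
    return null; // not valid JSON, so skip it
  }
}

// Hand-written sample lines in the shape described above (an assumption,
// not real server output).
const sample = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
];

let text = "";
for (const line of sample) {
  const event = parseSseLine(line);
  if (event?.done) break;
  if (event?.token) text += event.token;
}
console.log(text); // prints "Hello"
```

The page in the next section does the same thing inline; pulling it into a function just makes the three cases (token, end marker, ignorable line) easier to see.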
The Page
Create web/index.html:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Ask Gemma</title>
<style>
body { font-family: system-ui, sans-serif; max-width: 640px; margin: 3rem auto; padding: 0 1rem; }
h1 { font-weight: 650; }
.row { display: flex; gap: .5rem; }
input { flex: 1; padding: .6rem .8rem; font-size: 1rem; border: 1px solid #ccc; border-radius: 8px; }
button { padding: .6rem 1.1rem; font-size: 1rem; border: 0; border-radius: 8px; background: #111; color: #fff; cursor: pointer; }
button:disabled { opacity: .5; cursor: default; }
#out { white-space: pre-wrap; margin-top: 1.25rem; padding: 1rem; min-height: 3rem;
background: #f6f6f6; border-radius: 8px; line-height: 1.5; }
</style>
</head>
<body>
<h1>Ask Gemma 💎</h1>
<div class="row">
<input id="q" value="Write a haiku about serverless GPUs." />
<button id="send">Send</button>
</div>
<div id="out"></div>
<script>
// ── Edit these two lines ────────────────────────────────
const ENDPOINT = "https://SERVICE-URL/v1/chat/completions"; // your Cloud Run URL + /v1/chat/completions
const MODEL = "gemma4:e4b"; // must match the tag you baked
// ────────────────────────────────────────────────────────
const q = document.getElementById("q");
const out = document.getElementById("out");
const send = document.getElementById("send");
send.onclick = async () => {
send.disabled = true;
out.textContent = "";
try {
const res = await fetch(ENDPOINT, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: MODEL,
messages: [{ role: "user", content: q.value }],
stream: true,
}),
});
if (!res.ok) { out.textContent = `HTTP ${res.status}: ${await res.text()}`; return; }
// EventSource can't POST, so read the response stream and parse SSE ourselves.
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { value, done } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop(); // keep the trailing partial line
for (const line of lines) {
const t = line.trim();
if (!t.startsWith("data:")) continue;
const payload = t.slice(5).trim();
if (payload === "[DONE]") return;
try {
const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
if (delta) out.textContent += delta;
} catch { /* ignore keep-alives / partial chunks */ }
}
}
} catch (err) {
out.textContent = `Request failed: ${err.message}`; // network or CORS failure
} finally {
send.disabled = false;
}
};
</script>
</body>
</html>
Replace SERVICE-URL with your Cloud Run host and you're done. That's the entire frontend.
How the Streaming Works
The interesting part is the loop that reads the stream:
- Because the request body sets stream: true, fetch() resolves as soon as the response headers arrive, not when the model has finished. res.body.getReader() then yields the bytes as they arrive.
- We can't use the browser's built-in EventSource because it only does GET, and we need to POST the messages. So we read the raw stream and split the SSE lines ourselves.
- Chunks can split mid-line, so we buffer on \n and keep the last partial line for the next read.
- Each complete data: line is one token's worth of JSON; we pull choices[0].delta.content and append it. The text appears token by token.
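The buffering step is the easiest one to get wrong, so here it is in isolation. This is only a sketch: createLineBuffer is a hypothetical name, and the two chunks are made up to simulate a network that cuts one SSE line in half.

```javascript
// Hypothetical helper: reassemble complete lines from arbitrarily split chunks.
// Each call returns the lines completed so far and holds back the remainder.
function createLineBuffer() {
  let buffer = "";
  return function push(chunk) {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // last element is the unfinished line (or "")
    return lines;
  };
}

// Simulated chunks (an assumption for illustration): the first one ends
// in the middle of a JSON payload.
const push = createLineBuffer();
const first = push('data: {"choices":[{"delta":{"con');
const second = push('tent":"Hi"}}]}\n\ndata: [DONE]\n');

console.log(first.length);  // prints 0, nothing is complete yet
console.log(second.length); // prints 3: the data line, a blank line, and [DONE]
```

Without the buffer, the first chunk would be handed to JSON.parse as a broken payload and that token would be lost.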
Why CORS Just Works
When the page calls the service from a different origin, the browser first sends a preflight OPTIONS request. Because you baked OLLAMA_ORIGINS=* into the image, Ollama answers it with Access-Control-Allow-Origin: *, and the browser proceeds to the real POST. No proxy, no CORS code of your own. (You already confirmed this with check #2 in Step 4.)
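If it helps to see the browser's decision spelled out, the sketch below models the origin comparison for a request sent without credentials. It is deliberately simplified: a real preflight also checks the allowed methods and headers, and corsAllows is a hypothetical name, not a browser API.

```javascript
// Simplified model of the browser's origin check for a request without
// credentials. pageOrigin is what the browser sends in the Origin header;
// allowOrigin is the Access-Control-Allow-Origin value from the response
// (null here means the header was absent).
function corsAllows(pageOrigin, allowOrigin) {
  if (allowOrigin === null) return false;
  return allowOrigin === "*" || allowOrigin === pageOrigin;
}

console.log(corsAllows("https://you.github.io", "*"));                     // prints true
console.log(corsAllows("https://you.github.io", "https://you.github.io")); // prints true
console.log(corsAllows("https://evil.example", "https://you.github.io"));  // prints false
console.log(corsAllows("https://you.github.io", null));                    // prints false
```

With OLLAMA_ORIGINS=* the first case applies to every page, which is why no extra CORS code is needed.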
Host It Anywhere
The page is fully static, so it can live wherever:
- Open it locally. Double-click index.html. A file:// page sends Origin: null, and because you baked OLLAMA_ORIGINS=* (which matches null), the service still accepts it. This is the fastest way to test. (If you later lock OLLAMA_ORIGINS to a specific https:// origin, file:// stops working; host the page instead.)
- GitHub Pages. Push the repo, enable Pages, and it's live at https://you.github.io/your-repo. Free and public.
- Vercel / Netlify / Cloudflare Pages. Drag the folder in. Equivalent.
All of these serve over HTTPS, and Cloud Run is always HTTPS — so there's no mixed-content problem either way.
Lock CORS to Your Site (Optional)
OLLAMA_ORIGINS=* lets any page call your service. Once you know where the page lives, narrow it to that exact origin (scheme + host, no trailing slash, no path) and redeploy:
# Edit the Dockerfile: ENV OLLAMA_ORIGINS=https://you.github.io
gcloud run deploy gemma --source . --region $REGION
Now only your page can make browser calls. (It doesn't stop a determined caller with curl; that's what Step 6 is about. But it does stop other people's webpages from quietly using your GPU.)
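A trailing slash or a path is the usual mistake here. As a rough pre-deploy check, the sketch below uses the standard URL class to test whether a string is exactly an origin. The function name isBareOrigin is hypothetical, and the check is stricter than strictly necessary (it also rejects uppercase hosts and explicit default ports).

```javascript
// Hypothetical pre-deploy check: is this string exactly scheme + host
// (+ optional non-default port), with no trailing slash and no path?
function isBareOrigin(value) {
  try {
    return new URL(value).origin === value;
  } catch {
    return false; // not a parseable absolute URL
  }
}

console.log(isBareOrigin("https://you.github.io"));           // prints true
console.log(isBareOrigin("https://you.github.io/"));          // prints false
console.log(isBareOrigin("https://you.github.io/your-repo")); // prints false
console.log(isBareOrigin("you.github.io"));                   // prints false
```

For a GitHub Pages project site this means the value is https://you.github.io, not the full https://you.github.io/your-repo address of the page.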
What You Have Now
- A one-file webpage that streams answers from your model
- Confirmed end-to-end: browser → open Cloud Run GPU → Gemma → streamed tokens, with zero backend of your own
You've built the whole thing. Last step: keep it from costing you anything you didn't intend.
Reference: Ollama OpenAI compatibility · Using the Fetch API & ReadableStream (MDN) · Server-Sent Events (MDN) · GitHub Pages quickstart