The Endpoint

Ollama exposes an OpenAI-compatible API, so the request shape is the one every LLM tutorial already uses:

POST https://gemma-xxxxxxxx-uc.a.run.app/v1/chat/completions
Content-Type: application/json
 
{ "model": "gemma4:e4b", "messages": [...], "stream": true }

Two things to get right:

  • The /v1 prefix is mandatory. Posting to the bare host gives a 404.
  • model must be the exact tag you baked (gemma4:e4b), or you get 404 model not found.
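The /v1 prefix is easy to drop when pasting a URL by hand. As a minimal sketch (the helper name chatCompletionsUrl is hypothetical, not part of Ollama or of the page below), the endpoint can be derived from the bare service URL:

```javascript
// Hypothetical helper: build the chat-completions endpoint from a bare service URL.
function chatCompletionsUrl(serviceUrl) {
  // Strip trailing slashes so the result never contains "//v1/...".
  const base = serviceUrl.replace(/\/+$/, "");
  return `${base}/v1/chat/completions`;
}

console.log(chatCompletionsUrl("https://gemma-xxxxxxxx-uc.a.run.app/"));
// prints https://gemma-xxxxxxxx-uc.a.run.app/v1/chat/completions
```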

With stream: true, the response is Server-Sent Events: a series of data: {json} lines, each carrying a token at choices[0].delta.content, ending with a literal data: [DONE].
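To make the wire format concrete, here is a small sketch that classifies one line of the stream at a time. The function name parseSseLine and the sample lines are illustrative assumptions; real chunks follow the OpenAI chunk shape and carry more fields (such as id and model) than shown here.

```javascript
// Sketch: classify one line of the stream. Returns
//   { done: true }          for the terminating "data: [DONE]"
//   { done: false, text }   for a chunk that carries text
//   null                    for anything else (blank lines, comments, chunks without content)
function parseSseLine(line) {
  const t = line.trim();
  if (!t.startsWith("data:")) return null;
  const payload = t.slice("data:".length).trim();
  if (payload === "[DONE]") return { done: true };
  try {
    const text = JSON.parse(payload).choices?.[0]?.delta?.content;
    return text ? { done: false, text } : null;
  } catch {
    return null; // not valid JSON, so skip it
  }
}

// Illustrative lines only; events are separated by blank lines.
const sample = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
];

let answer = "";
for (const line of sample) {
  const event = parseSseLine(line);
  if (event === null) continue;
  if (event.done) break;
  answer += event.text;
}
console.log(answer); // prints "Hello"
```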

The Page

Create web/index.html:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <title>Ask Gemma</title>
  <style>
    body { font-family: system-ui, sans-serif; max-width: 640px; margin: 3rem auto; padding: 0 1rem; }
    h1 { font-weight: 650; }
    .row { display: flex; gap: .5rem; }
    input { flex: 1; padding: .6rem .8rem; font-size: 1rem; border: 1px solid #ccc; border-radius: 8px; }
    button { padding: .6rem 1.1rem; font-size: 1rem; border: 0; border-radius: 8px; background: #111; color: #fff; cursor: pointer; }
    button:disabled { opacity: .5; cursor: default; }
    #out { white-space: pre-wrap; margin-top: 1.25rem; padding: 1rem; min-height: 3rem;
           background: #f6f6f6; border-radius: 8px; line-height: 1.5; }
  </style>
</head>
<body>
  <h1>Ask Gemma 💎</h1>
  <div class="row">
    <input id="q" value="Write a haiku about serverless GPUs." />
    <button id="send">Send</button>
  </div>
  <div id="out"></div>
 
  <script>
    // ── Edit these two lines ────────────────────────────────
    const ENDPOINT = "https://SERVICE-URL/v1/chat/completions"; // your Cloud Run URL + /v1/chat/completions
    const MODEL = "gemma4:e4b";                                 // must match the tag you baked
    // ────────────────────────────────────────────────────────
 
    const q = document.getElementById("q");
    const out = document.getElementById("out");
    const send = document.getElementById("send");
 
    send.onclick = async () => {
      send.disabled = true;
      out.textContent = "";
      try {
        const res = await fetch(ENDPOINT, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            model: MODEL,
            messages: [{ role: "user", content: q.value }],
            stream: true,
          }),
        });
        if (!res.ok) { out.textContent = `HTTP ${res.status}: ${await res.text()}`; return; }
 
        // EventSource can't POST, so read the response stream and parse SSE ourselves.
        const reader = res.body.getReader();
        const decoder = new TextDecoder();
        let buffer = "";
        while (true) {
          const { value, done } = await reader.read();
          if (done) break;
          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split("\n");
          buffer = lines.pop();                 // keep the trailing partial line
          for (const line of lines) {
            const t = line.trim();
            if (!t.startsWith("data:")) continue;
            const payload = t.slice(5).trim();
            if (payload === "[DONE]") return;
            try {
              const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
              if (delta) out.textContent += delta;
            } catch { /* ignore keep-alives / partial chunks */ }
          }
        }
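      } catch (err) {
        // fetch rejects on network or CORS failures; show that instead of failing silently
        out.textContent += ` [request failed: ${err.message}]`;
      } finally {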
        send.disabled = false;
      }
    };
  </script>
</body>
</html>

Replace SERVICE-URL with your Cloud Run host and you're done. That's the entire frontend.

How the Streaming Works

The interesting part is the loop that reads the stream:

  • With stream: true in the request body, the server starts responding right away, so fetch() resolves as soon as the headers arrive, well before the model is finished. res.body.getReader() then yields the body as chunks of bytes as they arrive.
  • We can't use the browser's built-in EventSource because it only does GET, and we need to POST the messages. So we read the raw stream and split SSE lines ourselves.
  • Chunks can split mid-line, so we buffer on \n and keep the last partial line for the next read. Each complete data: line carries one small JSON chunk, typically a single token; we pull choices[0].delta.content and append it. The text appears token by token.
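The buffering step can be tried in isolation. The sketch below (makeLineSplitter is a hypothetical name) feeds byte chunks through the same decode-and-split logic as the page, with the cut placed in the middle of a multi-byte character to show why decode() is called with { stream: true }.

```javascript
// Sketch: turn arbitrary byte chunks into complete text lines.
function makeLineSplitter() {
  const decoder = new TextDecoder();
  let buffer = "";
  // Feed one Uint8Array chunk; get back the lines it completed.
  return function feed(chunk) {
    // { stream: true } makes the decoder hold back an incomplete multi-byte character.
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // last element is the unfinished line (possibly "")
    return lines;
  };
}

const bytes = new TextEncoder().encode("data: 💎\ndata: [DONE]\n");
// "data: " is 6 bytes and the emoji is 4, so cutting at byte 8 lands mid-character.
const feed = makeLineSplitter();
const first = feed(bytes.slice(0, 8));
const second = feed(bytes.slice(8));
console.log(first.length); // prints 0: no line is complete yet
console.log(second);       // the two complete lines: "data: 💎" and "data: [DONE]"
```

Without the buffer, the first chunk would be parsed as a truncated line and silently dropped.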

Why CORS Just Works

When the page calls the service from a different origin with Content-Type: application/json, the browser first sends a preflight OPTIONS request. Because you baked OLLAMA_ORIGINS=* into the image, Ollama answers it with Access-Control-Allow-Origin: *, and the browser proceeds to the real POST. No proxy, no CORS code of your own. (You already confirmed this with check #2 in Step 4.)
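To see which cross-origin requests get preflighted, here is a deliberately simplified sketch of the browser's decision. It only looks at the method and header names plus the Content-Type value; the real Fetch algorithm has more cases (header length limits, Range, upload listeners), so treat needsPreflight as an illustration, not a browser API.

```javascript
// Simplified sketch of the "does this cross-origin request need a preflight?" rule.
// Assumption: only method and headers are considered.
const SIMPLE_METHODS = new Set(["GET", "HEAD", "POST"]);
const SIMPLE_HEADERS = new Set(["accept", "accept-language", "content-language", "content-type"]);
const SIMPLE_CONTENT_TYPES = new Set([
  "application/x-www-form-urlencoded",
  "multipart/form-data",
  "text/plain",
]);

function needsPreflight(method, headers = {}) {
  if (!SIMPLE_METHODS.has(method.toUpperCase())) return true;
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (!SIMPLE_HEADERS.has(key)) return true;
    if (key === "content-type") {
      // Compare the media type only, ignoring parameters such as charset.
      const mediaType = value.split(";")[0].trim().toLowerCase();
      if (!SIMPLE_CONTENT_TYPES.has(mediaType)) return true;
    }
  }
  return false;
}

console.log(needsPreflight("POST", { "Content-Type": "application/json" })); // true
console.log(needsPreflight("POST", { "Content-Type": "text/plain" }));       // false
```

The page's request falls in the first case, which is why the OPTIONS round trip happens before any tokens flow.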

Host It Anywhere

The page is fully static, so it can live wherever:

  • Open it locally. Double-click index.html. A file:// page sends Origin: null, and because you baked OLLAMA_ORIGINS=* (which matches null), the service still accepts it — the fastest way to test. (If you later lock OLLAMA_ORIGINS to a specific https:// origin, file:// stops working; host the page instead.)
  • GitHub Pages. Push the repo and enable Pages. If you publish from the repo root, the page is live at https://you.github.io/your-repo/web/ (it sits in the web/ folder). Free and public.
  • Vercel / Netlify / Cloudflare Pages. Drag the folder in. Equivalent.

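The hosted options all serve over HTTPS, and Cloud Run is always HTTPS, so there's no mixed-content problem. Mixed-content blocking only applies when an HTTPS page requests plain HTTP, so a local file:// page calling the HTTPS endpoint is fine too.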

Lock CORS to Your Site (Optional)

OLLAMA_ORIGINS=* lets any page call your service. Once you know where the page lives, narrow it to that exact origin (scheme + host, no trailing slash, no path) and redeploy:

# Edit the Dockerfile: ENV OLLAMA_ORIGINS=https://you.github.io
gcloud run deploy gemma --source . --region $REGION

Now only your page can make browser calls. (It doesn't stop a determined caller with curl — that's what Step 6 is about — but it stops other people's webpages from quietly using your GPU.)
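If you are unsure what the exact origin string is, the standard URL class can derive it from the page's address. A minimal sketch (originOf is a hypothetical helper; the URLs are the placeholders used above):

```javascript
// Sketch: reduce a page URL to its origin (scheme + host + non-default port).
function originOf(pageUrl) {
  return new URL(pageUrl).origin;
}

console.log(originOf("https://you.github.io/your-repo/")); // prints https://you.github.io
console.log(originOf("file:///home/me/web/index.html"));   // prints null (the string "null")
```

The first result is the value that would go into OLLAMA_ORIGINS; the second shows why a file:// page has no specific origin to lock to.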

What You Have Now

  • A one-file webpage that streams answers from your model
  • Confirmed end-to-end: browser → open Cloud Run GPU → Gemma → streamed tokens, with zero backend of your own

You've built the whole thing. Last step: keep it from costing you anything you didn't intend.


Reference: Ollama OpenAI compatibility · Using the Fetch API & ReadableStream (MDN) · Server-Sent Events (MDN) · GitHub Pages quickstart