Step 6: Serve the Winner — Quantize & Run a Model

GGUF → Ollama (local)

Ollama reads GGUF directly. Point a Modelfile at your quantized file and create a named model:

cat > Modelfile <<'EOF'
FROM ./models/qwen3-4b-Q4_K_M.gguf
PARAMETER temperature 0.7
EOF
 
ollama create qwen3-4b-q4 -f Modelfile
ollama run qwen3-4b-q4 "Give me three uses for a local model."

That's it — the model is now a first-class Ollama model with the same OpenAI-compatible API as any other. (The Ollama blueprint covers calling it over HTTP.)

AWQ → vLLM (GPU serving)

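vLLM reads the quantization method from the checkpoint's saved config, so the --quantization awq flag below is explicit rather than required. It serves the model behind an OpenAI-compatible API: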

pip install vllm
vllm serve models/qwen3-4b-awq --quantization awq --port 8000
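vLLM needs some time to load the weights before it accepts requests, so a script that fires the first request immediately can fail. A minimal sketch of a readiness check, assuming the server answers GET /v1/models once it is up; the function names, the attempt count, and the delay are illustrative choices, not part of vLLM:

```python
import time
import urllib.request


def server_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections are all
        # OSError subclasses; treat any of them as "not ready yet".
        return False


def wait_until_ready(base_url: str, attempts: int = 60, delay_s: float = 5.0,
                     probe=server_ready) -> bool:
    """Call `probe(base_url)` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # hypothetical back-off; tune to your model size
    return False
```

Calling wait_until_ready("http://localhost:8000/v1") before the first request keeps a launch script from racing the model load.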

$ curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"models/qwen3-4b-awq",
       "messages":[{"role":"user","content":"hi"}]}' | jq -r '.choices[0].message.content'
Hey — what are we building?

Same request shape as OpenAI, so any client library points at it by changing the base URL.
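To make the base-URL point concrete, here is a small standard-library sketch that sends the same request as the curl call above. The helper names build_chat_request and chat are hypothetical, and the URLs in the usage note assume the vLLM port from this step and Ollama's default local port:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to {base_url}/chat/completions with a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout_s: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the vLLM server from this step running, chat("http://localhost:8000/v1", "models/qwen3-4b-awq", "hi") should return the reply text. Against a default Ollama install, the same function would be called as chat("http://localhost:11434/v1", "qwen3-4b-q4", "hi").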

Which one to serve

The choice follows the drip's rule of thumb, now grounded in your own benchmark:

Local / laptop / CPU-friendly / one user → the GGUF via Ollama. Simplest, runs anywhere, great single-stream speed.
GPU box / many concurrent users / throughput matters → the AWQ via vLLM. Its continuous batching and paged KV cache make it the throughput winner under load, and the 4-bit weights fit on GPUs where the FP16 checkpoint would not.

Both expose the same OpenAI-compatible chat API, so you can develop against Ollama locally and deploy AWQ + vLLM in production by changing only the base URL and model name, not the client code.
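One way to keep that switch out of the client code is to hold the two settings in configuration. The sketch below is an illustration under assumptions: the LLM_BACKEND variable, the Backend class, and the "local"/"prod" labels are hypothetical names, the Ollama URL assumes its default port, and the production URL reuses localhost:8000 from this step as a placeholder for your real host.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Backend:
    base_url: str  # root of the OpenAI-compatible API
    model: str     # model name the server expects in the request body


BACKENDS = {
    # Ollama on its default port, serving the model created above.
    "local": Backend("http://localhost:11434/v1", "qwen3-4b-q4"),
    # vLLM as launched in this step; replace the host for a real deployment.
    "prod": Backend("http://localhost:8000/v1", "models/qwen3-4b-awq"),
}


def select_backend(env: Optional[Mapping[str, str]] = None) -> Backend:
    """Pick a backend from LLM_BACKEND (hypothetical variable), defaulting to local."""
    env = os.environ if env is None else env
    name = env.get("LLM_BACKEND", "local")
    if name not in BACKENDS:
        raise ValueError(f"unknown LLM_BACKEND {name!r}; expected one of {sorted(BACKENDS)}")
    return BACKENDS[name]
```

The client then reads select_backend().base_url and select_backend().model, and moving from laptop to GPU box becomes a change to one environment variable.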

The whole loop, closed

You took one FP16 checkpoint, quantized it two ways, measured that the quality cost was small and the memory win was large, and served the result behind a standard endpoint. That's quantization in practice: not a theory about bits, but a model that now fits — and runs faster — where it couldn't before.

Reference: Ollama Modelfile · vLLM AWQ serving · Run an Open Model with Ollama (blueprint)