// blueprint · working code · est. 2026 · no ads · no tracking
Blueprint · intermediate · 7 steps

Quantize & Run a Model

Take one Hugging Face model, quantize it two ways — GGUF (llama.cpp) and AWQ — benchmark VRAM, speed, and quality side by side, then serve the winner via Ollama or vLLM. Companion build to the Quantization drip.


All steps

Step 1: What We're Building · 2 min
Take one Hugging Face model and quantize it two ways: GGUF (llama.cpp, for local) and AWQ (INT4, for GPU serving). Then benchmark VRAM, speed, and quality side by side, and serve the winner.

Step 2: Setup · 1 min
Build llama.cpp (for the GGUF path), install AutoAWQ (for the GPU path), and download the FP16 base model once; both paths quantize from the same local checkpoint.

Step 3: Quantize to GGUF · 1 min
Convert the FP16 checkpoint to a GGUF file, then quantize it to Q4_K_M, the local-inference sweet spot, with llama.cpp. No GPU required.

Step 4: Quantize to AWQ · 2 min
Run activation-aware INT4 quantization with AutoAWQ: it calibrates on a little real text to find the salient weights, protects them, and quantizes the rest. This is the GPU-serving path.

Step 5: Benchmark · 2 min
The point of the whole build: put FP16, GGUF Q4_K_M, and AWQ side by side and measure on-disk size, VRAM, tokens/sec, and a quality proxy, so the choice is a table, not a guess.

Step 6: Serve the Winner · 2 min
Put the quant behind an OpenAI-compatible endpoint (Ollama for the GGUF, local; vLLM for the AWQ, GPU serving) and call it like any other model.

Step 7: What's Next · 2 min
You shrank one model 3×, kept its quality, and served it two ways. Here's how to push further (smaller quants, KV-cache quantization, and better calibration), plus the habit that keeps you out of trouble.
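
The on-disk size column in Step 5 and the "3×" figure in Step 7 can be sanity-checked with weight-only arithmetic before running anything. The sketch below is an illustration, not the blueprint's own benchmark script. The bits-per-weight values for the two quantized formats and the 7-billion-parameter count are assumptions chosen for illustration, and the estimate ignores the KV cache, activations, and any tensors a real quantizer leaves in higher precision, so measured files and VRAM will come out larger.

```python
# Back-of-the-envelope weight size per format.
# All bits-per-weight values except FP16 are ASSUMPTIONS for illustration.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    # Assumed average: K-quants mix block types and store per-block scales.
    "GGUF Q4_K_M": 4.85,
    # Assumed: 4-bit weights plus a 16-bit scale and 4-bit zero point
    # shared by each group of 128 weights (4 + 20/128).
    "AWQ INT4": 4.0 + 20.0 / 128.0,
}


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GiB, for a given parameter count."""
    return n_params * bits_per_weight / 8 / 2**30


def size_table(n_params: float):
    """Return (format, GiB, shrink factor vs. FP16) for each format."""
    fp16_gib = weight_gib(n_params, BITS_PER_WEIGHT["FP16"])
    rows = []
    for name, bpw in BITS_PER_WEIGHT.items():
        gib = weight_gib(n_params, bpw)
        rows.append((name, gib, fp16_gib / gib))
    return rows


# Hypothetical 7-billion-parameter model.
for name, gib, shrink in size_table(7e9):
    print(f"{name:<12} {gib:6.2f} GiB  {shrink:4.1f}x smaller than FP16")
```

Under these assumptions both 4-bit formats land a little above a 3× reduction, which is in line with the figure quoted in Step 7. The measured numbers from the benchmark step remain the ones to trust.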