Two moves: convert, then quantize
GGUF is a container format; quantization is a separate step applied on top of it. First convert the Hugging Face checkpoint to an unquantized F16 GGUF, then squeeze that down to a 4-bit quant level.
# 1. HF safetensors → GGUF (F16, still full size)
python llama.cpp/convert_hf_to_gguf.py \
models/qwen3-4b-fp16 \
--outfile models/qwen3-4b-f16.gguf \
--outtype f16
# 2. quantize the GGUF to Q4_K_M
./llama.cpp/build/bin/llama-quantize \
models/qwen3-4b-f16.gguf \
models/qwen3-4b-Q4_K_M.gguf \
Q4_K_M
$ ls -lh models/*.gguf
7.6G qwen3-4b-f16.gguf
2.4G qwen3-4b-Q4_K_M.gguf
Roughly a 3× smaller file on disk, in one command.
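The "3×" figure follows directly from the two sizes in the listing. Here is a minimal sketch of that arithmetic, assuming the F16 file stores 16 bits per weight and ignoring metadata overhead. The variable names are illustrative only.

```shell
# Sizes copied from the ls output above (in GB).
f16_gb=7.6
q4_gb=2.4

# How many times smaller the quantized file is.
ratio=$(awk -v a="$f16_gb" -v b="$q4_gb" 'BEGIN { printf "%.1f", a / b }')

# Rough effective bits per weight, assuming F16 = 16 bits per weight.
bpw=$(awk -v a="$f16_gb" -v b="$q4_gb" 'BEGIN { printf "%.1f", 16 * b / a }')

echo "compression ratio: ${ratio}x"
echo "effective bits per weight: ~${bpw}"
```

The estimate lands a little above 4 bits, which is consistent with some tensors being stored at higher precision than the bulk of the weights.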
Which quant level?
llama-quantize offers a whole menu. The ones worth knowing:
| Level | Bits (approx) | Use it when |
|---|---|---|
| Q8_0 | 8 | You want near-lossless and have the RAM. |
| Q5_K_M | ~5.5 | Quality-first; a bit larger than 4-bit, and a common "safe" pick. |
| Q4_K_M | ~4.5 | The default sweet spot — best size/quality tradeoff for most local use. |
| Q3_K_M | ~3.5 | Squeeze onto tiny hardware; quality starts to show. |
Q4_K_M is the one to reach for first — the _K_M variants use k-quants that (like AWQ) spend more bits on the parts of each layer that matter, so they beat a flat 4-bit round. Try Q5_K_M too and let Step 5's benchmark decide.
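To compare levels side by side, one option is to produce both quants from the same F16 file in a loop. This is a sketch under assumptions, not a prescribed workflow: it reuses the paths from the commands earlier in this step, `out_name` is a hypothetical helper, and it falls back to printing the commands if the binary or source file is not where it expects.

```shell
# Sketch: build several quant levels from the same F16 GGUF.
# QUANTIZE and SRC default to the paths used earlier; override them as needed.
QUANTIZE="${QUANTIZE:-./llama.cpp/build/bin/llama-quantize}"
SRC="${SRC:-models/qwen3-4b-f16.gguf}"

out_name() {
  # models/qwen3-4b-f16.gguf + Q5_K_M -> models/qwen3-4b-Q5_K_M.gguf
  printf '%s-%s.gguf' "${SRC%-f16.gguf}" "$1"
}

for level in Q4_K_M Q5_K_M; do
  out="$(out_name "$level")"
  if [ -x "$QUANTIZE" ] && [ -f "$SRC" ]; then
    "$QUANTIZE" "$SRC" "$out" "$level"
  else
    # Binary or source file missing: show what would run instead of failing.
    echo "dry run: $QUANTIZE $SRC $out $level"
  fi
done
```

Keeping the F16 file around means each extra level costs only one more quantize run, not another conversion.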
Quick smoke test
./llama.cpp/build/bin/llama-cli \
-m models/qwen3-4b-Q4_K_M.gguf \
-p "Explain quantization in one sentence." -n 80 --no-display-prompt
You should get a coherent sentence from a 2.4 GB file that started at 7.6 GB. That's the whole GGUF path, and it's why Ollama and LM Studio ship models this way. Next, the AWQ path for GPU serving.
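If the smoke test fails to load the model, a cheap first check is whether the output file is a GGUF container at all. GGUF files begin with the four ASCII bytes `GGUF`, so a sketch like the following can catch a truncated or mislabeled file. The `is_gguf` helper and the `MODEL` variable are hypothetical names, and the check says nothing about whether the tensors inside are valid.

```shell
is_gguf() {
  # True if the file's first four bytes are the ASCII magic "GGUF".
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Defaults to the quantized file produced earlier in this step.
model="${MODEL:-models/qwen3-4b-Q4_K_M.gguf}"

if is_gguf "$model"; then
  echo "ok: $model starts with the GGUF magic"
else
  echo "warning: $model is missing or does not look like a GGUF file"
fi
```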
Reference: llama.cpp quantize · GGUF k-quants · Quantization — §03 The formats (drip)