Step 3: Quantize to GGUF — Quantize & Run a Model

Two moves: convert, then quantize

GGUF is a container; quantization is a separate step on top of it. First convert the Hugging Face checkpoint to an unquantized F16 GGUF, then squeeze that down to a 4-bit quant level.

# 1. HF safetensors → GGUF (F16, still full size)
python llama.cpp/convert_hf_to_gguf.py \
  models/qwen3-4b-fp16 \
  --outfile models/qwen3-4b-f16.gguf \
  --outtype f16
 
# 2. quantize the GGUF to Q4_K_M
./llama.cpp/build/bin/llama-quantize \
  models/qwen3-4b-f16.gguf \
  models/qwen3-4b-Q4_K_M.gguf \
  Q4_K_M

$ ls -lh models/*.gguf
7.6G  qwen3-4b-f16.gguf
2.4G  qwen3-4b-Q4_K_M.gguf

The quantize command alone leaves a file roughly 3× smaller on disk.
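
The ratio is easy to check from the two sizes in the listing. Here is a minimal Python sketch that also backs out the average bits per weight, assuming the F16 file stores about 16 bits per weight. The helper names are illustrative and not part of llama.cpp.

```python
# Sizes copied from the `ls -lh` listing above, as printed.
F16_SIZE_G = 7.6
Q4_K_M_SIZE_G = 2.4


def compression_ratio(original: float, quantized: float) -> float:
    """How many times smaller the quantized file is."""
    return original / quantized


def effective_bits_per_weight(original: float, quantized: float,
                              original_bits: float = 16.0) -> float:
    """Average bits per weight, assuming the original stores `original_bits` each."""
    return original_bits * quantized / original


ratio = compression_ratio(F16_SIZE_G, Q4_K_M_SIZE_G)
bits = effective_bits_per_weight(F16_SIZE_G, Q4_K_M_SIZE_G)
print(f"{ratio:.1f}x smaller, about {bits:.1f} bits per weight on average")
# prints: 3.2x smaller, about 5.1 bits per weight on average
```

The average comes out a little above the nominal ~4.5 bits for Q4_K_M. That is plausible: the rounded `ls` sizes are coarse, and a mixed quant level does not store every tensor at the same width.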

Which quant level?

llama-quantize offers a whole menu. The ones worth knowing:

Level	Bits (approx)	Use it when
Q8_0	8	You want near-lossless and have the RAM.
Q5_K_M	~5.5	Quality first, one step up from 4-bit; a common "safe" pick.
Q4_K_M	~4.5	The default sweet spot — best size/quality tradeoff for most local use.
Q3_K_M	~3.5	Squeeze onto tiny hardware; quality loss starts to show.
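
To turn the table into a rough sizing rule, scale the F16 file size by bits/16. This is a back-of-the-envelope sketch: the bit counts are the approximate ones from the table, and the function names and the 3G budget are made up for illustration.

```python
from typing import Optional

# Approximate bits per weight, taken from the table above.
APPROX_BITS = {"Q8_0": 8.0, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q3_K_M": 3.5}

F16_SIZE_G = 7.6  # size of the F16 GGUF from the earlier listing


def estimate_size_g(level: str, f16_size_g: float = F16_SIZE_G) -> float:
    """Scale the F16 size by bits/16. A rough estimate, not a measurement."""
    return f16_size_g * APPROX_BITS[level] / 16.0


def largest_level_that_fits(budget_g: float) -> Optional[str]:
    """Return the highest-bit level whose estimated file fits in budget_g."""
    for level in sorted(APPROX_BITS, key=APPROX_BITS.get, reverse=True):
        if estimate_size_g(level) <= budget_g:
            return level
    return None


for level in APPROX_BITS:
    print(f"{level:7s} ~{estimate_size_g(level):.1f}G")
print("Fits in 3G:", largest_level_that_fits(3.0))
```

Treat the result as optimistic. The estimate for Q4_K_M is about 2.1G, while the real file above came out at 2.4G, and running the model needs memory for the context (KV cache) on top of the weights.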

Q4_K_M is the one to reach for first. The _K_M variants are k-quants: in the same spirit as AWQ, they protect the parts of each layer that matter most (here by giving them more bits), so they beat a flat 4-bit round. Try Q5_K_M too and let Step 5's benchmark decide.
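
To get a feel for why a flat 4-bit round loses, here is a toy Python sketch comparing one scale for a whole tensor against one scale per block of 32 values. It is not the k-quant algorithm, only the basic block-scaling idea underneath it; the function names and the synthetic weights are invented for the example.

```python
import math


def quantize_absmax(values, qmax=7):
    """Symmetric round-to-nearest with a single scale; ints lie in [-qmax, qmax]."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def quantize_blockwise(values, block=32, qmax=7):
    """Same rounding, but each block of `block` values gets its own scale."""
    out = []
    for start in range(0, len(values), block):
        out.extend(quantize_absmax(values[start:start + block], qmax))
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "layer": 256 small weights with one large outlier.
weights = [0.1 * math.sin(i) for i in range(256)]
weights[5] = 8.0

flat_err = mse(weights, quantize_absmax(weights))
block_err = mse(weights, quantize_blockwise(weights))
print(f"flat 4-bit MSE:      {flat_err:.6f}")
print(f"blockwise 4-bit MSE: {block_err:.6f}")
```

With a single scale, the outlier stretches the grid so far that every small weight rounds to zero. Per-block scales confine that damage to the one block containing the outlier. Real k-quants build on this with super-blocks and quantized scales, and the _M mixes keep some tensors at a higher bit width.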

Quick smoke test

./llama.cpp/build/bin/llama-cli \
  -m models/qwen3-4b-Q4_K_M.gguf \
  -p "Explain quantization in one sentence." -n 80 --no-display-prompt

You should get a coherent sentence from a 2.4GB file that started at 7.6GB. That's the whole GGUF path — and it's why Ollama and LM Studio ship models this way. Next, the AWQ path for GPU serving.
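
If you would rather script the smoke test than eyeball it, a thin Python wrapper can do it. This is a sketch under assumptions: it reuses the exact flags from the command above, `looks_coherent` is a deliberately crude placeholder check, and some llama-cli builds may start an interactive session, hence the closed stdin and the timeout.

```python
import os
import subprocess

LLAMA_CLI = "./llama.cpp/build/bin/llama-cli"  # path used in this tutorial


def smoke_test_cmd(model_path, prompt, n_tokens=80):
    """Build the same argv as the smoke test command above."""
    return [LLAMA_CLI, "-m", model_path, "-p", prompt,
            "-n", str(n_tokens), "--no-display-prompt"]


def looks_coherent(text, min_words=5):
    """Very crude check: output contains at least a few words."""
    return len(text.split()) >= min_words


if __name__ == "__main__":
    cmd = smoke_test_cmd("models/qwen3-4b-Q4_K_M.gguf",
                         "Explain quantization in one sentence.")
    if os.path.exists(LLAMA_CLI):
        # Close stdin so the process cannot wait for interactive input.
        result = subprocess.run(cmd, capture_output=True, text=True,
                                stdin=subprocess.DEVNULL, timeout=300)
        print("OK" if looks_coherent(result.stdout) else "Suspicious output")
    else:
        print("llama-cli not found; would run:", " ".join(cmd))
```

A word count only catches empty or truncated output. Judging actual quality is a job for a proper benchmark.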
