What to measure
Four numbers per format:
- On-disk size —
ls -lh/du -sh. Already visible: ~7.6GB → ~2.4GB. - VRAM at load — peak GPU memory holding the model + a short generation.
- Tokens/sec — decode throughput on a fixed prompt.
- Quality — a proxy you trust: perplexity on a held-out text, or accuracy on a small task set. Never skip this one.
A quick quality + speed probe
For the FP16 and AWQ (transformers-loadable) formats, a short script gives you VRAM, speed, and perplexity together:
# bench.py
import time, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def bench(path, quant=None, label=""):
torch.cuda.reset_peak_memory_stats()
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
path, device_map="cuda",
torch_dtype=torch.float16 if quant is None else "auto",
)
# speed: time 128 fresh tokens
ids = tok("Explain quantization.", return_tensors="pt").to("cuda")
t0 = time.time()
out = model.generate(**ids, max_new_tokens=128, do_sample=False)
dt = time.time() - t0
toks = out.shape[1] - ids.input_ids.shape[1]
# quality: perplexity on a fixed passage
text = open("wiki_sample.txt").read()[:4000]
enc = tok(text, return_tensors="pt").to("cuda")
with torch.no_grad():
ppl = torch.exp(model(**enc, labels=enc.input_ids).loss).item()
vram = torch.cuda.max_memory_allocated() / 1e9
print(f"{label:14} | vram {vram:4.1f}GB | {toks/dt:5.1f} tok/s | ppl {ppl:5.2f}")
del model; torch.cuda.empty_cache()
bench("models/qwen3-4b-fp16", label="FP16")
bench("models/qwen3-4b-awq", quant="awq", label="AWQ INT4")For the GGUF, llama.cpp has its own tools — llama-bench for tokens/sec and llama-perplexity for ppl:
./llama.cpp/build/bin/llama-bench -m models/qwen3-4b-Q4_K_M.gguf
./llama.cpp/build/bin/llama-perplexity -m models/qwen3-4b-Q4_K_M.gguf -f wiki_sample.txtThe table you're after
Filled in, it looks roughly like this (your numbers will vary by GPU):
| Format | Size | VRAM | tok/s | Perplexity |
|---|---|---|---|---|
| FP16 | 7.6GB | ~9.0GB | 38 | 8.10 |
| GGUF Q4_K_M | 2.4GB | ~3.3GB | 61 | 8.34 |
| AWQ INT4 | 2.5GB | ~3.6GB | 72 | 8.21 |
Read it the way the drip predicts: both 4-bit formats cut memory ~3× and run faster (decode is bandwidth-bound), and the quality cost is small — AWQ's perplexity sits closer to FP16 than the flat quant, because it protected the salient weights. Lower perplexity is better; a jump of a few tenths is typical and usually imperceptible in use.
The rule
If perplexity barely moves and VRAM drops 3×, ship the quant. If perplexity jumps a lot, step up a level (Q4_K_M → Q5_K_M, or check your AWQ calibration data). Measure, then decide — that's the entire discipline.
Reference: llama-bench · llama-perplexity · Quantization (drip)