Step 5: Benchmark — Quantize & Run a Model

What to measure

Four numbers per format:

On-disk size — ls -lh / du -sh. Already visible: ~7.6GB → ~2.4GB.
VRAM at load — peak GPU memory holding the model + a short generation.
Tokens/sec — decode throughput on a fixed prompt.
Quality — a proxy you trust: perplexity on a held-out text, or accuracy on a small task set. Never skip this one.
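On-disk size can also be collected from Python, so that all four numbers come out of one script. A minimal sketch, assuming each model is either a directory of files (like the FP16 and AWQ paths used below) or a single file (like the GGUF). `size_gb` is a hypothetical helper name, not part of any library. It reports apparent size in decimal gigabytes, which can differ slightly from what `du -sh` prints.

```python
import os


def size_gb(path):
    """Total apparent size of a file or directory tree, in decimal GB (1e9 bytes)."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 1e9
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9
```

Calling `size_gb("models/qwen3-4b-fp16")` and `size_gb("models/qwen3-4b-Q4_K_M.gguf")` would fill the Size column without leaving the script.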

A quick quality + speed probe

For the FP16 and AWQ (transformers-loadable) formats, a short script gives you VRAM, speed, and perplexity together:

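# bench.py
import gc
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def bench(path, quant=None, label=""):
    torch.cuda.reset_peak_memory_stats()
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path, device_map="cuda",
        # FP16 checkpoint loads as float16; a quantized one picks its own dtype
        torch_dtype=torch.float16 if quant is None else "auto",
    )
    ids = tok("Explain quantization.", return_tensors="pt").to("cuda")
    # warm-up: keep one-time CUDA start-up cost out of the timed run
    model.generate(**ids, max_new_tokens=8, do_sample=False)
    torch.cuda.synchronize()
    # speed: time 128 fresh tokens
    t0 = time.perf_counter()
    out = model.generate(**ids, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    dt = time.perf_counter() - t0
    toks = out.shape[1] - ids.input_ids.shape[1]  # can be < 128 if EOS comes early
    # VRAM: read the peak here (model + short generation), before the
    # perplexity pass, whose full-sequence logits would inflate it
    vram = torch.cuda.max_memory_allocated() / 1e9
    # quality: perplexity on a fixed passage (first 4000 characters)
    with open("wiki_sample.txt", encoding="utf-8") as f:
        text = f.read()[:4000]
    enc = tok(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        ppl = torch.exp(model(**enc, labels=enc.input_ids).loss).item()
    print(f"{label:14} | vram {vram:4.1f}GB | {toks/dt:5.1f} tok/s | ppl {ppl:5.2f}")
    # free the model so the next run starts from a clean GPU
    del model
    gc.collect()
    torch.cuda.empty_cache()


bench("models/qwen3-4b-fp16", label="FP16")
bench("models/qwen3-4b-awq", quant="awq", label="AWQ INT4")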
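The perplexity line in bench.py is the exponential of the mean per-token cross-entropy that the model returns as `loss`. The same arithmetic can be sketched in plain Python. The `perplexity` helper and the token probabilities here are illustrative only, not output from the benchmark.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that gives probability 0.25 to every correct token is as
# uncertain as a uniform pick among 4 options, so its perplexity is 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

This is why lower is better: a perplexity of 8 means the model is, on average, about as unsure as if it were choosing uniformly among 8 tokens.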

For the GGUF, llama.cpp has its own tools — llama-bench for tokens/sec and llama-perplexity for ppl:

./llama.cpp/build/bin/llama-bench -m models/qwen3-4b-Q4_K_M.gguf
./llama.cpp/build/bin/llama-perplexity -m models/qwen3-4b-Q4_K_M.gguf -f wiki_sample.txt

The table you're after

Filled in, it looks roughly like this (your numbers will vary by GPU):

Format	Size	VRAM	tok/s	Perplexity
FP16	7.6GB	~9.0GB	38	8.10
GGUF Q4_K_M	2.4GB	~3.3GB	61	8.34
AWQ INT4	2.5GB	~3.6GB	72	8.21

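Read it the way the drip predicts. Both 4-bit formats shrink the file about 3× and peak VRAM about 2.5 to 2.7× (VRAM shrinks less than the file, largely because the KV cache and activations stay in higher precision). Both also run faster, because decode is bandwidth-bound: fewer bytes per weight means fewer bytes to stream for each token. The quality cost is small: AWQ's perplexity sits closer to FP16 than the flat quant's, because it protected the salient weights. Lower perplexity is better; a rise of a few tenths is typical and usually imperceptible in use. One caveat: llama-perplexity and bench.py window the text differently, so for a strict comparison measure the GGUF against a baseline run through the same tool.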
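To see exactly what the table says, the ratios can be computed rather than eyeballed. A short sketch using the example figures from the table above. The `compare` helper and its field names are made up for illustration.

```python
# (size GB, VRAM GB, tok/s, perplexity), copied from the example table
rows = {
    "FP16":        (7.6, 9.0, 38, 8.10),
    "GGUF Q4_K_M": (2.4, 3.3, 61, 8.34),
    "AWQ INT4":    (2.5, 3.6, 72, 8.21),
}


def compare(rows, baseline="FP16"):
    """Express every non-baseline row as ratios and deltas against the baseline."""
    b_size, b_vram, b_tps, b_ppl = rows[baseline]
    result = {}
    for name, (size, vram, tps, ppl) in rows.items():
        if name == baseline:
            continue
        result[name] = {
            "size_x": round(b_size / size, 2),    # how many times smaller on disk
            "vram_x": round(b_vram / vram, 2),    # how many times less peak VRAM
            "speed_x": round(tps / b_tps, 2),     # how many times faster decode
            "ppl_delta": round(ppl - b_ppl, 2),   # perplexity rise (lower is better)
        }
    return result


for name, stats in compare(rows).items():
    print(name, stats)
```

With the example numbers, the GGUF comes out about 3.17× smaller on disk with a perplexity rise of 0.24, and AWQ about 3.04× smaller with a rise of 0.11.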

The rule

If perplexity barely moves and VRAM drops 3×, ship the quant. If perplexity jumps a lot, step up a level (Q4_K_M → Q5_K_M, or check your AWQ calibration data). Measure, then decide — that's the entire discipline.
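The rule can be written down as a small function so the decision is the same every time you rerun the benchmark. This is only a sketch: the thresholds (`max_ppl_rise`, `min_vram_ratio`) are assumed values chosen for illustration, not figures from this step, and should be tuned against your own task. The "undecided" outcome is also an addition, covering the case the rule does not address.

```python
def decide(base_ppl, quant_ppl, base_vram_gb, quant_vram_gb,
           max_ppl_rise=0.05, min_vram_ratio=2.0):
    """Apply the ship / step-up rule. Both thresholds are hypothetical defaults."""
    ppl_rise = (quant_ppl - base_ppl) / base_ppl   # relative perplexity increase
    vram_ratio = base_vram_gb / quant_vram_gb      # how many times less VRAM
    if ppl_rise > max_ppl_rise:
        return "step up"      # quality cost too high: try a higher-precision level
    if vram_ratio >= min_vram_ratio:
        return "ship"         # quality holds and the memory saving is real
    return "undecided"        # quality holds but little memory saved


print(decide(8.10, 8.21, 9.0, 3.6))  # AWQ row from the example table
```

With the example AWQ numbers the relative perplexity rise is about 1.4% and VRAM drops 2.5×, so under these assumed thresholds the function returns "ship".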

Reference: llama-bench · llama-perplexity · Quantization (drip)