The Goal
The companion drip, RLVR & Process Rewards, makes the argument: where correctness is checkable, you don't need a learned reward model — a verifier is a reward you can't game. This blueprint runs that loop end to end on a small model.
By the end you'll have:
- A 4-bit base model + LoRA adapters loaded with Unsloth, trainable on one GPU.
- A verifiable reward: extract the model's final answer and compare it to the gold answer from GSM8K, scoring 1 for a match and 0 otherwise, plus a small format reward.
- A GRPO training loop (TRL's GRPOTrainer) that samples a group of answers per question, scores them with the verifier, and pushes the policy toward the ones that check out.
- A before/after eval: pass@1 on held-out grade-school math, so the lift is a number, not a vibe.
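To make the verifiable reward concrete, here is a minimal sketch of the extract-and-compare idea. It assumes GSM8K's convention that each gold solution ends with `#### <number>`, and it assumes a hypothetical prompt format in which the model finishes with `Answer: <number>`. The function names and the 0.1 format bonus are illustrative choices, not the companion repo's code, and these per-sample functions would still need wrapping to fit whatever batch interface the trainer expects.

```python
import re
from typing import Optional

# Hypothetical convention: the model is prompted to end with "Answer: <number>".
ANSWER_RE = re.compile(r"Answer:\s*\$?(-?[\d,]*\.?\d+)")


def extract_gold(solution: str) -> str:
    """GSM8K gold solutions end with '#### <number>'; keep only the number."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str) -> Optional[str]:
    """Return the last 'Answer: <number>' in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def correctness_reward(completion: str, gold_solution: str) -> float:
    """1.0 if the predicted number equals the gold number, else 0.0."""
    pred = extract_prediction(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(extract_gold(gold_solution)) else 0.0
    except ValueError:
        # Unparseable gold or prediction counts as a miss.
        return 0.0


def format_reward(completion: str) -> float:
    """Small bonus (illustrative value) for following the answer format."""
    return 0.1 if ANSWER_RE.search(completion) else 0.0
```

Comparing as floats rather than strings means `1,000`, `1000`, and `1000.0` all count as the same answer, which avoids penalizing harmless formatting differences.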
Why this is the whole RLVR recipe in miniature
Everything that makes 2026's reasoning models work is here, just small:
No learned reward model (the verifier is the reward). No value network (GRPO uses the group mean as the baseline). That's why this fits on one GPU.
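The group-mean baseline can be sketched in a few lines. This is a simplified illustration, not TRL's implementation: each sampled answer's advantage is its reward minus the mean reward of its group. Dividing by the group's standard deviation is a common variant and is included here as an assumption, as is the `eps` value.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Advantages for one question's group of sampled answers.

    Baseline = group mean, so no value network is needed.
    Scaling by the group std is an assumed, commonly used variant.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; the verifier passed two of them.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
# Verified answers get positive advantages, failed ones negative.
```

If every answer in a group gets the same reward (all correct or all wrong), every advantage is zero and that question contributes no learning signal, which is one reason the base model needs to solve problems at least some of the time.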
What you'll need
| Choice | Why |
|---|---|
| Unsloth | 2× faster, ~50% less VRAM LoRA training, with fast vLLM-backed generation built in — GRPO samples a lot, so generation speed dominates. |
| TRL GRPOTrainer | The reference GRPO implementation; you supply reward functions and it handles sampling, advantages, and the update. |
| A small instruct base (e.g. Qwen3-4B-Instruct or Llama-3.2-3B-Instruct) | Big enough to sometimes solve GSM8K (RLVR sharpens latent ability), small enough to train on a T4/L4. |
| GSM8K | Grade-school math word problems with clean gold answers — the canonical verifiable-reward dataset. |
A reality check up front
RLVR sharpens what the base model can already sometimes do — it raises pass@1 toward pass@k. On a 3–4B model you'll see a real, measurable jump on GSM8K (often low-double-digit points), not GPT-4-level math. That's the honest scope of a single-GPU run, and it's enough to see the mechanism work — which is the point.
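To see the gap between pass@1 and pass@k as numbers, the sketch below uses the standard unbiased pass@k estimator: given `n` sampled answers of which `c` are correct, it estimates the chance that at least one of `k` draws is correct. The function name and the example counts are illustrative, not results from this blueprint.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k wrong samples: any k draws must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 8 samples for one question, 2 of them correct.
p1 = pass_at_k(n=8, c=2, k=1)  # 0.25
p4 = pass_at_k(n=8, c=2, k=4)  # about 0.786
```

In this made-up case the model already finds the answer in most 4-sample groups but only a quarter of the time on a single try. That headroom between pass@1 and pass@k is what the training run aims to close.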
The companion repo
Runnable version: github.com/maraja/train-a-reasoner-with-grpo. Follow the blueprint or clone and run train.py / eval.py.
What's coming
Eight steps:
- What we're building (you're here)
- Setup — GPU, Unsloth, TRL, load a 4-bit base + LoRA
- The dataset — GSM8K, prompt format, keep the gold answers
- The verifiable reward — extract the answer, compare, plus a format reward
- GRPO config — group size, generation, the trainer
- Train — run it, and what to watch on the reward curve
- Evaluate — pass@1 before vs after
- What's next — process rewards, stronger verifiers, scaling, export
Reference: RLVR & Process Rewards (drip) · Unsloth GRPO guide · TRL GRPOTrainer · GSM8K