Train a Reasoner with GRPO

Teach a small model to reason with RLVR — GRPO against a verifiable math reward using Unsloth + TRL, LoRA on a single GPU, with pass@1 measured before and after so you can watch the accuracy line move. Companion build to the RLVR & Process Rewards drip.

All steps

Step 1: What We're Building (3 min)
A small instruct model fine-tuned to *reason* with RLVR (GRPO against a verifiable math reward, LoRA on a single GPU), with pass@1 measured before and after so you can watch the accuracy line actually move.

Step 2: Setup (1 min)
Install Unsloth + TRL, load a small instruct model in 4-bit with fast generation, and attach LoRA adapters, the trainable surface for GRPO.

Step 3: The Dataset (2 min)
Load GSM8K, wrap each question in a system prompt that asks for a `<reasoning>`/`<answer>` structure, and keep the gold numeric answer alongside. That gold value is what the verifier checks against.

Step 4: The Verifiable Reward (2 min)
The heart of RLVR: reward functions that extract the model's final answer and compare it to the gold value (correctness), plus a small reward for following the format so the answer is always findable.

Step 5: GRPO Config (2 min)
Wire the trainer settings (the group size, meaning how many answers per question, plus generation limits and the learning rate) into TRL's `GRPOConfig` and `GRPOTrainer`, passing both reward functions.

Step 6: Train (2 min)
Run the loop, and read the two signals that tell you it's working: mean reward trending up and completion length settling into real reasoning rather than rambling.

Step 7: Evaluate (2 min)
Measure pass@1 on held-out GSM8K with the adapters off and on, using the same verifier you trained with, now applied honestly to a test split, so the lift is a number.

Step 8: What's Next (2 min)
You ran the full RLVR loop on one GPU and moved pass@1 with nothing but a verifier and GRPO. Here's how to push it further, and how to keep the checker honest.
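
As a preview of the verifier idea in Step 4, here is a minimal sketch of the two rewards in plain Python. It is not the blueprint's actual code: the function names, the regexes, and the reward magnitudes (1.0 for a correct answer, 0.1 for following the format) are illustrative assumptions, and the helpers score one completion at a time, so they would likely need a thin wrapper to fit a trainer's batched reward interface.

```python
import re
from typing import Optional

# Hypothetical patterns for the <reasoning>/<answer> structure named in Step 3.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
FORMAT_RE = re.compile(
    r"^\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text inside the first <answer> tag, lightly normalized."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    # Drop thousands separators and currency signs so "1,000" matches "1000".
    return match.group(1).replace(",", "").replace("$", "").strip()


def correctness_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted answer equals the gold value numerically, else 0.0."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    try:
        return 1.0 if float(predicted) == float(gold) else 0.0
    except ValueError:
        # Non-numeric answer (or gold): treat as incorrect rather than crash.
        return 0.0


def format_reward(completion: str) -> float:
    """Small bonus (assumed 0.1) when the whole completion follows the format."""
    return 0.1 if FORMAT_RE.match(completion) else 0.0
```

Keeping the format bonus much smaller than the correctness reward is a design choice in this sketch: it nudges the model toward findable answers without letting it collect most of the reward for well-formed but wrong output.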