RLVR & Process Rewards
Don’t reward what looks right — reward what checks out. The recipe behind 2026’s reasoning models isn’t a bigger reward model; it’s throwing the reward model away wherever a cheap verifier can take its place.
§ 00 · THE PROXY PROBLEM
Optimize a proxy hard enough and it stops being one
Classic RLHF trains a reward model to imitate human preferences, then optimizes the policy against that model’s score. (A reward model is trained on human preference comparisons to predict a scalar “goodness” score for an output; that score is the reward signal during RL fine-tuning.) It works, until it works too well. The reward model is only an approximation of what humans actually want, and a capable policy will find the places where the approximation is generous.
This is reward hacking: a policy maximizes the measured reward without achieving the intended goal, exploiting gaps between the reward proxy and the real objective. It’s not a bug you can prompt your way out of; it’s the predictable result of optimizing a proxy. OpenAI’s reward-model overoptimization study measured the shape directly: push RL against a learned reward and true quality rises, peaks, and then declines even as the reward model’s score keeps going up. The gap between the two curves is the model gaming the referee.
§ 01 · REWARDS YOU CAN CHECK
Swap the reward model for a verifier
The insight behind RLVR (reinforcement learning with verifiable rewards) is almost embarrassingly simple: in any domain where you can check a final answer, you don’t need a learned reward model at all. Math has a ground-truth answer. Code has unit tests. A formal proof has a checker. Use the verifier as the reward (1 if it passes, 0 if it doesn’t) and the proxy problem evaporates, because the reward is the objective, not a stand-in for it.
This is the engine behind the 2026 reasoning-model wave. DeepSeekMath and then DeepSeek-R1 showed you can elicit long, correct chains of thought using little more than answer-checking as the reward; Tülu 3 made RLVR a named, reproducible stage in an open post-training recipe. With a learned reward, the reward and accuracy curves split; with a verifiable checker they’re the same line.
Figure (illustrative): A learned reward model is a proxy. The policy learns to maximize the proxy, and past a point that means exploiting the RM's blind spots: the reward line keeps rising while the accuracy line peaks and falls. The gap between them is reward hacking. Curves are illustrative of reward overoptimization, not a specific run.
The catch is right there in the name — verifiable. RLVR only works where correctness is cheap to check. That’s a real constraint, and §04 is about its edges. But where it applies, it removes the single most expensive and most gameable component of the RLHF stack.
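The verifier-as-reward idea can be sketched in a few lines. This is a minimal illustration, not any lab’s actual checker: the names `normalize_answer` and `verifiable_reward` are hypothetical, and the sketch assumes the final answer has already been extracted from the model’s output as a short numeric string.

```python
from fractions import Fraction
from typing import Optional


def normalize_answer(text: str) -> Optional[Fraction]:
    """Parse strings like '3/4', '0.75', or '1,000' into an exact Fraction.

    Returns None when the text is not a plain number (assumed format).
    """
    try:
        return Fraction(text.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the answer matches the reference exactly, else 0.0."""
    got = normalize_answer(model_answer)
    want = normalize_answer(reference)
    if got is None or want is None:
        return 0.0  # unparseable answers earn nothing
    return 1.0 if got == want else 0.0


print(verifiable_reward("0.75", "3/4"))
print(verifiable_reward("0.7", "3/4"))
```

Comparing exact fractions rather than raw strings means `0.75` and `3/4` count as the same answer, while anything unparseable scores 0. A real checker would also need answer extraction, units, and symbolic forms, and a checker that is too lenient becomes a proxy again.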
§ 02 · DENSER SIGNAL: PROCESS REWARDS
Score the steps, not just the answer
Answer-checking gives one bit of signal per attempt: right or wrong. For a twenty-step derivation that lands on the wrong number, that single bit can’t say where it went wrong. A process reward model (PRM) is a reward model, or verifier, that scores each intermediate reasoning step rather than only the final answer. That gives dense, per-step credit assignment, turning one late bit into a signal at every point in the chain.
OpenAI’s “Let’s Verify Step by Step” showed process supervision outperforms outcome supervision on hard math — dense credit assignment is worth the extra labeling. The 2026 move is to generate the step labels rather than hand-annotate them: roll out many continuations from each step and score a step by how often it leads to a correct final answer. Verifiable outcomes bootstrap a process reward — no human step-labeling required.
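The rollout-based labeling described above can be sketched as follows. Everything here is a toy under stated assumptions: `rollout` stands in for sampling a completion from the policy given a partial solution, `is_correct` stands in for the answer verifier, and those names, along with `mc_step_scores`, are hypothetical.

```python
from typing import Callable, List, Sequence


def mc_step_scores(
    steps: Sequence[str],
    rollout: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> List[float]:
    """Score each step by the fraction of rollouts from it that end correct."""
    scores = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]  # the solution up to and including step i
        wins = sum(is_correct(rollout(prefix)) for _ in range(n_rollouts))
        scores.append(wins / n_rollouts)
    return scores


# Toy stand-ins (assumptions for illustration, not a real policy or verifier).
def toy_rollout(prefix: Sequence[str]) -> str:
    # Pretend the policy reaches the right answer unless a bad step is present.
    return "5" if any("BAD" in step for step in prefix) else "4"


def toy_is_correct(answer: str) -> bool:
    return answer == "4"


steps = ["compute 2 + 2", "BAD: conclude the sum is 5", "report 5"]
scores = mc_step_scores(steps, toy_rollout, toy_is_correct, n_rollouts=4)
print(scores)  # [1.0, 0.0, 0.0]
```

The score drops at the first step from which correct completions become rare, which is the per-step signal a PRM can then be trained to predict. With a real stochastic policy the scores are noisy estimates, and the cost grows with the number of steps times the number of rollouts.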
§ 03 · GRPO: THE CHEAP WAY TO RUN IT
Drop the critic; let the group be the baseline
Standard PPO needs a second network, a value model typically about the same size as the policy, to estimate a baseline for the advantage. That roughly doubles memory and adds its own training instability. GRPO (Group Relative Policy Optimization, from DeepSeekMath) deletes it: sample a group of answers to the same prompt, score them all with the verifier, and use the group’s mean score as the baseline. An answer that beats its group’s average gets a positive advantage; one below it gets a negative advantage.
The payoff is practical: no value network means roughly half the memory and one fewer thing to tune, which is a large part of why RLVR became something a small team — or a single GPU — can run. The companion blueprint uses exactly this: GRPO, a group of sampled answers, a math checker for the reward.
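The group-baseline step can be sketched as below. This is a minimal illustration of the advantage computation only, not a full GRPO trainer, and the function name and the `normalize` flag are hypothetical. The text above describes mean-centering; the DeepSeekMath formulation additionally divides by the group’s standard deviation, which is included here as an option. The choice of population standard deviation and the small `eps` are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float],
    normalize: bool = True,
    eps: float = 1e-6,
) -> List[float]:
    """Advantage of each sampled answer relative to its own group.

    rewards: verifier scores for several answers to the SAME prompt.
    """
    baseline = mean(rewards)  # the group mean replaces the value network
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids dividing by zero (assumed value)
    return [c / scale for c in centered]


# Four answers to one prompt; the verifier passed the first and the last.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0], normalize=False))
# [0.5, -0.5, -0.5, 0.5]
```

A group in which every answer passes, or every answer fails, yields all-zero advantages and therefore no learning signal for that prompt.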
§ 04 · WHERE IT WORKS, WHERE IT BREAKS
Verifiable is a feature and a fence
RLVR’s strength is its boundary. It shines wherever a cheap, trustworthy checker exists: math, code, formal logic, structured extraction, anything with tests. It struggles where “correct” is a matter of taste or can’t be checked cheaply: essay quality, tone, safety nuance, open-ended design. For those, a learned reward model is still the tool. The frontier is hybrid: verifiable rewards where you can check, a reward model (ideally gated by whatever checks you do have) where you can’t.
Two failure modes deserve respect even inside the verifiable zone. First, a weak verifier is a new proxy: tests that only cover the happy path get gamed just like an RM, and the model writes code that passes the tests and nothing else. Second, verifiable-reward training sharpens what a model can already sometimes do more than it teaches genuinely new skills. It raises pass@1 toward the model’s pass@k, which is powerful but not unbounded. Check your checker, and don’t expect RL to conjure capability that isn’t latent in the base model.
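To make the pass@1 versus pass@k comparison concrete, here is a sketch of the unbiased pass@k estimator commonly used in code-generation evaluation: sample n answers, count the c that pass the verifier, and estimate the chance that at least one of k draws passes. It is brought in for illustration and is not taken from the sources this article cites; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    # 1 minus the probability that all k draws come from the n - c failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 of 10 sampled answers passed the verifier.
p1 = pass_at_k(10, 3, 1)  # chance a single sample passes
p5 = pass_at_k(10, 3, 5)  # chance at least one of 5 samples passes
print(round(p1, 3), round(p5, 3))
```

The gap between `p1` and `p5` is the headroom the text describes: training on a verifiable reward tends to move single-sample accuracy toward what the model could already reach with several tries.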
§ · FURTHER READING
References & deeper sources
- (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning (introduces GRPO) · arXiv:2402.03300
- (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL · arXiv:2501.12948
- (2023). Let's Verify Step by Step (process reward models) · arXiv:2305.20050
- (2024). Tülu 3: Pushing Frontiers in Open Language Model Post-Training (RLVR) · arXiv:2411.15124
- (2022). Scaling Laws for Reward Model Overoptimization · arXiv:2210.10760
- (2026). Blueprint: Train a Reasoner with GRPO · Brain Drip Blueprints
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.