Step 4: The Verifiable Reward — Train a Reasoner with GRPO

Reward functions, not a reward model

In TRL, a GRPO reward function takes the batch of completions (and any pass-through dataset columns) and returns a list of floats — one score per completion. There is no learned reward model here. The "reward" is a few lines of Python that check the answer. That's the RLVR move made concrete.

Correctness — the verifier

Extract what's inside <answer>...</answer>, normalize it, and compare to the gold value. Match → high reward; miss → zero.

# rewards.py
import re
 
def extract_answer(text: str) -> str:
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    if not m:
        return ""
    # keep it comparable: strip $, commas, whitespace
    return m.group(1).strip().replace(",", "").replace("$", "")
 
def correctness_reward(completions, answer, **kwargs):
    responses = [c[0]["content"] for c in completions]
    got = [extract_answer(r) for r in responses]
    return [2.0 if g == a else 0.0 for g, a in zip(got, answer)]

answer arrives automatically — it's the dataset column from Step 3. This function is the verifier: deterministic, ungameable in the way a learned RM isn't, because it compares to ground truth.

Format — make the answer findable

A pure correctness reward is sparse early on, when the model rarely gets the format right and the verifier can't even locate an answer to grade. A small, cheap format reward bootstraps the structure so the correctness signal has something to bite on.

FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
 
def format_reward(completions, **kwargs):
    responses = [c[0]["content"] for c in completions]
    return [0.5 if FORMAT_RE.search(r) else 0.0 for r in responses]

Keep the format reward small relative to correctness (0.5 vs 2.0). It's scaffolding — you want the model chasing correct, with format as a nudge, not the main prize. Over-weighting format is its own tiny reward hack: a model that produces beautiful empty structure.

Test the reward in isolation

Always unit-test a verifier before you train against it — a buggy reward trains a confident wrong model.

if __name__ == "__main__":
    good = [[{"content": "<reasoning>2+2</reasoning>\n<answer>72</answer>"}]]
    bad = [[{"content": "<reasoning>hmm</reasoning>\n<answer>13</answer>"}]]
    print("correct:", correctness_reward(good, answer=["72"]))  # [2.0]
    print("wrong:  ", correctness_reward(bad, answer=["72"]))   # [0.0]
    print("format: ", format_reward(good))                       # [0.5]

$ python rewards.py
correct: [2.0]
wrong:   [0.0]
format:  [0.5]

That's the entire "reward model" — and because it's ground truth, the reward-hacking gap from the drip's lab never opens up (as long as the checker is honest, which for exact-match arithmetic it is).

Reference: TRL — using reward functions · RLVR & Process Rewards (drip) · Verifying AI Code (drip)