The one knob that defines GRPO: the group
GRPO's baseline is the group mean: for each prompt it samples num_generations answers, scores them all, and sets each answer's advantage to its score minus the group's average. So the group size is the central hyperparameter. Too small (2–3) and the baseline is noisy, because a mean over two or three samples swings with every lucky or unlucky answer; too large and each step is slow, because every prompt costs that many generations. Eight is the common starting point.
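To make the baseline concrete, here is a minimal sketch of the group-relative advantage in plain Python. The helper name `group_advantages` and the example scores are illustrative assumptions, not TRL's code; the optional division by the group's standard deviation is a normalization some GRPO implementations apply and is included only as an option.

```python
from statistics import mean, pstdev


def group_advantages(rewards, scale=False, eps=1e-4):
    """Hypothetical helper: advantage = reward minus the group's mean reward.

    With scale=True the centered rewards are also divided by the group's
    standard deviation (an optional normalization, assumed here).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale:
        return centered
    spread = pstdev(rewards)
    return [c / (spread + eps) for c in centered]


# One prompt, a group of 8 sampled answers, each already scored.
rewards = [2.5, 0.5, 0.0, 2.5, 0.5, 0.5, 0.0, 2.5]
advantages = group_advantages(rewards)
print(advantages)

# A group where every answer scores the same carries no learning signal.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])
print(flat)
```

Answers above the group mean get a positive advantage and answers below it get a negative one, so the advantages within a group always sum to zero. When every answer in a group scores the same, all advantages are zero and that prompt contributes nothing to the update.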
# config.py
from trl import GRPOConfig


def make_config(max_seq: int = 2048) -> GRPOConfig:
    return GRPOConfig(
        # --- the group ---
        num_generations=8,  # answers sampled per prompt
        # --- lengths ---
        max_prompt_length=256,
        max_completion_length=max_seq - 256,
        # --- optimization ---
        learning_rate=5e-6,  # RL wants a *small* LR
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        optim="adamw_8bit",
        # --- batching (per device) ---
        per_device_train_batch_size=8,  # a multiple of num_generations
        gradient_accumulation_steps=1,
        # --- run length ---
        max_steps=300,  # a few hundred steps shows the lift
        logging_steps=1,
        save_steps=100,
        # --- generation temperature for exploration ---
        temperature=1.0,
        use_vllm=True,  # Unsloth's fast generation path
        output_dir="grpo-reasoner",
    )

Two things worth calling out. The learning rate is tiny (5e-6): RL fine-tuning nudges the policy; a big LR collapses it. And per_device_train_batch_size should be a multiple of num_generations so groups stay whole within a batch.
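The batch-size rule is simple arithmetic, sketched below. `prompts_per_batch` is a hypothetical helper, not part of TRL, and it assumes the per-device batch counts sampled answers rather than prompts, as the text above implies.

```python
def prompts_per_batch(per_device_train_batch_size: int, num_generations: int) -> int:
    """Hypothetical helper (not TRL API): whole groups that fit in one batch.

    Assumes the batch is counted in sampled answers, so each prompt
    occupies num_generations slots.
    """
    if num_generations < 2:
        raise ValueError("a group needs at least 2 answers for a mean baseline")
    if per_device_train_batch_size % num_generations != 0:
        raise ValueError(
            f"batch of {per_device_train_batch_size} cannot be split into "
            f"whole groups of {num_generations}"
        )
    return per_device_train_batch_size // num_generations


print(prompts_per_batch(8, 8))   # 1
print(prompts_per_batch(16, 8))  # 2
```

Under that assumption, the config above processes one prompt per device per step: all 8 slots in the batch are answers to the same question.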
The trainer
GRPOTrainer takes the model, the tokenizer, your reward functions (as a list — they're summed), and the dataset.
# train.py (part 1)
from trl import GRPOTrainer

from model import model, tokenizer
from data import build_dataset
from rewards import correctness_reward, format_reward
from config import make_config

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward, correctness_reward],  # summed per completion
    args=make_config(),
    train_dataset=build_dataset("train"),
)

Passing multiple reward functions is how you compose signals: here total reward per answer is format (0 or 0.5) + correctness (0 or 2.0). You can log and weight them separately, but the trainer just needs the list; it sums them into the scalar each answer's advantage is computed from.
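As an illustration of that sum, here is a self-contained sketch. The two reward functions below are stand-ins written for this example, not the ones in rewards.py: the `<answer>` tag format, the `answer` column name, and the `total_rewards` helper are all assumptions. They follow the usual shape of a TRL reward function, which receives the completions plus extra keyword arguments and returns one float per completion.

```python
import re

ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # assumed format


def format_reward(completions, **kwargs):
    """Stand-in: 0.5 if the completion contains an <answer>...</answer> tag."""
    return [0.5 if ANSWER_TAG.search(c) else 0.0 for c in completions]


def correctness_reward(completions, answer, **kwargs):
    """Stand-in: 2.0 if the tagged answer matches the reference exactly."""
    scores = []
    for completion, expected in zip(completions, answer):
        match = ANSWER_TAG.search(completion)
        got = match.group(1).strip() if match else None
        scores.append(2.0 if got == expected else 0.0)
    return scores


def total_rewards(reward_funcs, completions, weights=None, **columns):
    """Hypothetical helper: weighted sum of every reward function, per completion."""
    weights = weights or [1.0] * len(reward_funcs)
    per_func = [f(completions, **columns) for f in reward_funcs]
    return [
        sum(w * s for w, s in zip(weights, scores))
        for scores in zip(*per_func)
    ]


completions = ["<answer>42</answer>", "<answer>41</answer>", "the answer is 42"]
totals = total_rewards(
    [format_reward, correctness_reward], completions, answer=["42"] * 3
)
print(totals)  # [2.5, 0.5, 0.0]
```

The three totals cover the cases that matter: right format and right answer (2.5), right format but wrong answer (0.5), and no recognizable format (0.0). Keeping the correctness reward larger than the format reward means a well-formatted wrong answer never outranks a correct one.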