Three Agent Papers, April 2026
Three papers landed in a single month that materially change how production teams should think about agents. Hyperagents (Meta FAIR) keeps latency roughly constant under fan-out. Recursive Language Models (MIT) formalizes the retry-and-refine loops teams were already running. GMPO (Microsoft / ICLR 2026) swaps one operator in the RLHF objective and reports a 13% improvement on agent reasoning benchmarks.
§ 00 · THREE PAPERS, ONE MONTH
Why these are worth reading together
April 2026 was, for production-agent practitioners, the kind of month that makes the rest of the year feel slow. Three papers landed from three different labs, each addressing a different layer of the agent stack: orchestration, the model loop itself, and the training objective behind it. The three are technically independent but complementary in practice: in principle you could stack all three on a single agent and improve each layer separately.
Production teams won’t adopt all three this quarter. But the direction is worth committing to memory. The summaries below are short on purpose — enough to know what each paper does, when it would help you, and what the interactive lab beneath it makes concrete.
§ 01 · HYPERAGENTS (Meta FAIR)
One agent runs a hundred sub-agents in parallel
The architecture: a planner agent decomposes the top-level goal into independent subtasks. Up to a hundred sub-agents execute those subtasks in parallel, each in its own context window. An aggregator agent collects the sub-agent outputs and produces the final result. The novelty isn’t any individual piece — fan-out / fan-in is older than the LLM era — it’s the observation that latency stays roughly constant as work scales, because the bottleneck is the slowest sub-agent plus the aggregator, not the sum of the sub-agents.
Concretely, the four moves in the paper:
- Plan the work: the planner decomposes the goal into independent subtasks, emitted as JSON.
- Fan out — each subtask gets its own ephemeral context.
- Merge the fleet — an aggregator combines outputs, resolves conflicts, produces the final answer.
- Degrade gracefully — some sub-agents will fail; the architecture tolerates partial completion rather than retrying the whole fleet.
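The four moves above can be sketched with Python's `asyncio`. This is a minimal illustration, not the paper's implementation: `plan`, `run_subagent`, `fan_out`, `aggregate`, and `run` are hypothetical names, the sub-agent is a stand-in that sleeps instead of calling a model, and the failure condition (any subtask containing "bad") exists only to exercise the partial-completion path.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubtaskResult:
    subtask: str
    output: Optional[str]        # None when the sub-agent failed
    error: Optional[str] = None


def plan(goal: str, items: List[str]) -> List[str]:
    """Hypothetical planner: one independent subtask per input item."""
    return [f"{goal}: {item}" for item in items]


async def run_subagent(subtask: str) -> str:
    """Stand-in for a model call made in a fresh, ephemeral context."""
    await asyncio.sleep(0.01)    # simulated model latency
    if "bad" in subtask:         # simulated failure, for illustration only
        raise RuntimeError("sub-agent failed")
    return f"done({subtask})"


async def fan_out(subtasks: List[str], max_parallel: int = 100) -> List[SubtaskResult]:
    """Run every subtask concurrently, capped at max_parallel in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(subtask: str) -> SubtaskResult:
        async with sem:
            try:
                return SubtaskResult(subtask, await run_subagent(subtask))
            except Exception as exc:
                # Record the failure instead of cancelling the whole fleet.
                return SubtaskResult(subtask, None, str(exc))

    return await asyncio.gather(*(guarded(s) for s in subtasks))


def aggregate(results: List[SubtaskResult]) -> dict:
    """Merge whatever succeeded and report what did not."""
    ok = [r.output for r in results if r.output is not None]
    failed = [r.subtask for r in results if r.output is None]
    return {"outputs": ok, "failed": failed, "complete": not failed}


def run(goal: str, items: List[str]) -> dict:
    return aggregate(asyncio.run(fan_out(plan(goal, items))))
```

Calling `run("summarize", ["a", "bad", "c"])` returns two outputs plus one entry under `failed`. Catching exceptions inside each sub-agent wrapper, rather than letting `asyncio.gather` propagate the first one, is what lets the aggregator work with a partial fleet.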
Where this matters most: tasks with clear decomposition boundaries (analyze 50 docs, classify 1000 tickets, summarize 80 transcripts) where end-to-end latency is the bottleneck. Where it matters least: tasks with strong sequential dependencies, where each step needs the previous step’s output.
Hyperagent latency stays roughly constant as fan-out grows — the bottleneck is the slowest sub-agent plus the aggregator, not the sum. Caveat: some sub-agents will fail; the architecture has to degrade gracefully, not retry the whole fleet.
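The latency claim is simple arithmetic, and a toy model makes it concrete. The sketch below assumes every sub-agent truly runs concurrently and that the aggregator cost is a fixed constant; both are simplifications, since real deployments hit rate limits and an aggregator reading more outputs may take longer. The function names and the uniform 1 to 3 second latency range are illustrative choices, not figures from the paper.

```python
import random


def sequential_latency(sub_latencies, aggregator):
    """One agent doing every subtask in turn: latencies add up."""
    return sum(sub_latencies) + aggregator


def fanout_latency(sub_latencies, aggregator):
    """Fully parallel fleet: wait only for the slowest sub-agent."""
    return max(sub_latencies) + aggregator


def simulate(n, seed=0, aggregator=2.0):
    """Draw n sub-agent latencies in [1, 3] seconds (assumed range)."""
    rng = random.Random(seed)
    latencies = [rng.uniform(1.0, 3.0) for _ in range(n)]
    return (
        sequential_latency(latencies, aggregator),
        fanout_latency(latencies, aggregator),
    )


if __name__ == "__main__":
    for n in (10, 100, 1000):
        seq, par = simulate(n)
        print(f"n={n:5d}  sequential={seq:8.1f}s  fan-out={par:4.1f}s")
```

Under these assumptions the fan-out figure can never exceed 5 seconds (the 3-second worst-case sub-agent plus the 2-second aggregator), while the sequential figure grows linearly with `n`.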
§ 02 · RECURSIVE LANGUAGE MODELS (MIT)
The model calls itself on its own output
Recursive Language Models (RLMs) formalize a pattern production teams were already using under various names — “reflect and revise”, “self-critique”, “refine loop.” The model produces a draft. It then critiques the draft against the original goal. It revises. The loop runs until a convergence criterion fires.
The recursive framing replaces hand-written retry-and-refine scaffolding with a single primitive. The model is the loop; each recursive call is the model reading its previous output and emitting either a better version or a signal that the current version is good enough.
Four operational details:
- The base case is the initial draft, produced from the original prompt.
- The recursion is the critique-then-refine step: the model reads its previous output and produces a new one.
- The stopping criterion is the only parameter that matters: convergence of the output, a maximum recursion depth, or a quality threshold.
- When to use it — reasoning-heavy tasks where the first draft is rarely the best one. Not for latency-critical paths; recursion adds latency proportional to depth.
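The loop described above fits in one recursive function. This is a sketch under stated assumptions rather than the paper's algorithm: `refine`, `toy_critique`, and `toy_revise` are hypothetical names, and the toy critic and reviser are deterministic stand-ins for what would be two calls to the same model in practice. All three stopping criteria from the list appear as explicit checks.

```python
from typing import Callable, Tuple

# (goal, draft) -> (score in [0, 1], feedback text)
Critic = Callable[[str, str], Tuple[float, str]]
# (goal, draft, feedback) -> revised draft
Reviser = Callable[[str, str, str], str]


def refine(goal: str, draft: str, critique: Critic, revise: Reviser,
           max_depth: int = 4, threshold: float = 0.9,
           depth: int = 0) -> Tuple[str, int]:
    """Critique the draft, then recurse on a revision until a stop fires."""
    score, feedback = critique(goal, draft)
    # Stop 1: quality threshold reached. Stop 2: depth budget spent.
    if score >= threshold or depth >= max_depth:
        return draft, depth
    new_draft = revise(goal, draft, feedback)
    # Stop 3: convergence, the revision changed nothing.
    if new_draft == draft:
        return draft, depth
    return refine(goal, new_draft, critique, revise,
                  max_depth, threshold, depth + 1)


def toy_critique(goal: str, draft: str) -> Tuple[float, str]:
    """Toy critic: score is the fraction of goal words present in the draft."""
    required = goal.split()
    missing = [w for w in required if w not in draft.split()]
    score = 1.0 - len(missing) / len(required)
    return score, (missing[0] if missing else "")


def toy_revise(goal: str, draft: str, feedback: str) -> str:
    """Toy reviser: append the word the critic flagged as missing."""
    return f"{draft} {feedback}".strip()
```

With the toy stand-ins, `refine("alpha beta gamma", "alpha", toy_critique, toy_revise)` returns `("alpha beta gamma", 2)`: two revisions, then the threshold check ends the recursion. Returning the depth alongside the draft makes the added latency visible to the caller.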
The architectural payoff: recursion replaces hardcoded retry logic. The model becomes its own quality gate, instead of you writing if-then code to decide when to retry. Combine with the verification cascade from the Verifying AI Code drip and the recursion becomes self-validating — the model only stops recursing once its own critic is satisfied.
§ 03 · GMPO (Microsoft / ICLR 2026)
Replace the mean. 13% better.
The cleanest one-line summary of any RLHF paper this year. GRPO (Shao et al., 2024) maximizes the arithmetic mean of token-level rewards across a generated sequence. Outlier tokens with anomalously high reward produce extreme importance-sampling ratios, and the policy gradient update either collapses or oscillates. GMPO swaps in the geometric mean. Because the geometric mean averages in log space, a single extreme token moves it far less than it moves the arithmetic mean, and the policy update stays stable.
The four moves:
- Geometric mean, not arithmetic. One operator swap in the loss function.
- Stable updates. Outliers no longer dominate the gradient.
- The numbers — 13% improvement on agent reasoning benchmarks across the suite Microsoft published.
- Plug-and-play swap. Drop-in replacement in existing RLHF stacks; no retraining of the base model required.
For production teams, GMPO is downstream of you — it affects the next generation of models, not your code. But knowing that the next round of agent-trained models will be measurably better at reasoning tasks for one cheap operator change is the kind of context that should affect your roadmap timing.
In the interactive lab, dragging one outlier reward upward makes the arithmetic mean blow up while the geometric mean barely moves. In GRPO, that outlier produces extreme importance-sampling ratios and the policy update collapses; GMPO replaces the operator and damps the outlier.
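The outlier effect is easy to check numerically. The sketch below compares the two means on a toy sequence of ten token-level values, nine of them at 1.0 and one pushed to 100.0. It illustrates only the operator swap: the values are assumed positive (as importance-sampling ratios are), and the advantages and clipping that a real policy-optimization objective would include are omitted.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """Computed in log space; every value must be strictly positive."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


if __name__ == "__main__":
    baseline = [1.0] * 10
    with_outlier = [1.0] * 9 + [100.0]   # one anomalous token

    for name, xs in (("baseline", baseline), ("outlier", with_outlier)):
        print(f"{name:9s} arithmetic={arithmetic_mean(xs):6.2f} "
              f"geometric={geometric_mean(xs):5.2f}")
```

One outlier at 100 lifts the arithmetic mean from 1.0 to 10.9, while the geometric mean rises only to 100^(1/10), about 1.58. The log-space form also shows why: the outlier contributes log(100) / 10 to the exponent instead of 100 / 10 to the sum.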
§ 04 · WHAT THEY SHARE
Three different shapes, one consistent move
Read the three together and a pattern emerges. Each paper identifies a place where production teams were already improvising — fan-out for parallel work, retry loops for refinement, ad-hoc reward shaping for stable training — and replaces it with a clean primitive. Hyperagents formalizes decomposition. RLMs formalize iteration. GMPO formalizes outlier handling. The improvement in each case isn’t the invention of a new technique; it’s the crystallization of a hack into something with a name, a spec, and an interface.
That’s the consistent move: the research is catching up to what production teams already do. The papers worth tracking in the next few months are the ones that name the next layer of hacks practitioners are already running.
§ · FURTHER READING
References & deeper sources
- (2026). Hyperagents — One Agent Runs a Hundred Sub-Agents in Parallel · arXiv, April 2026
- (2026). Recursive Language Models · arXiv, April 2026
- (2026). GMPO — Geometric-Mean Policy Optimization for Agent Reasoning · ICLR 2026
- Shao et al. (2024). DeepSeekMath — introducing GRPO (Group Relative Policy Optimization) · arXiv:2402.03300
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.