Harness Engineering
When your agent fails, the model is rarely the problem. The harness — the code wrapping the model call — is where reliability is decided. Five layers cover most of it, and Datadog’s March 2026 data tells you which ones matter most.
§ 00 · THE HARNESS IS THE SYSTEM
The model is a function. The harness is the system.
Most production AI outages don’t look like model failures. They look like one of: the model returned a 429, the retry loop ran forever, the cost dashboard exploded, a single tenant starved everyone else, a downstream tool failed and the agent looped on it until the token budget was gone. None of these are problems with the model. They’re problems with what surrounds the model.
Sarah Chen’s April 2026 essay made the framing explicit: the model is a stateless function with no error handling and no guarantees about availability. Everything else — retries, circuit breakers, rate budgets, fallback routing, observability, cost controls — lives in the harness. The harness is where reliability is decided. The model is just one of the dependencies it has to keep alive.
This essay walks through the five layers that show up in every production harness worth copying. Each layer addresses a different failure mode. None of them is sufficient on its own; stacked together, all five make a 4% upstream error rate invisible to the user. The interactive lab in §08 lets you watch the stack work end-to-end.
§ 01 · RETRIES WITH SANE BACKOFF
The cheapest reliability layer, and the easiest to break
Bounded retries handle the “the upstream blipped” case. Three attempts max. Exponential backoff (250ms, 1s, 4s feels right). Jitter on every delay so a thundering herd of clients doesn’t retry in lockstep. Retry only on the codes that make retry sensible — 429, 503, network timeouts — not on 400-class errors that will always fail.
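// Resolves after `ms` milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// `isRetryable` is supplied by your codebase: true for 429, 503, and network
// timeouts; false for 400-class errors that will always fail.
async function withRetry<T>(fn: () => Promise<T>, max = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < max; i++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err)) throw err;
      lastErr = err;
      if (i === max - 1) break; // out of attempts: don't sleep before giving up
      const base = 250 * Math.pow(4, i); // 250ms, 1s, then 4s if max is raised
      const jitter = Math.random() * base * 0.25; // up to +25% so clients desynchronize
      await sleep(base + jitter);
    }
  }
  throw lastErr;
}

The mistake most teams make: unbounded retries, or retries that don’t know what to retry. An agent that loops on a 400 Bad Request 50 times costs you 50× the tokens without any chance of succeeding. Cap the count, gate on the error type, and trust the next layer.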
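The retry gate is only as good as its error classifier. Below is a minimal sketch of the `isRetryable` check the snippet above relies on. The error shape is an assumption: it supposes your HTTP client attaches a numeric `status` to failed responses and that network failures carry a Node-style string `code`. Adjust both to whatever your client actually throws.

```typescript
// Hypothetical error shape; not tied to any particular HTTP client.
interface MaybeHttpError {
  status?: number; // HTTP status of a failed response, if any
  code?: string;   // Node-style network error code, if any
}

// Statuses worth retrying: rate limited, temporarily unavailable.
const RETRYABLE_STATUS = new Set([429, 503]);
// Network-level failures worth retrying: timeout, connection reset.
const RETRYABLE_CODES = new Set(["ETIMEDOUT", "ECONNRESET"]);

function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as MaybeHttpError;
  // An HTTP status, when present, decides the question on its own:
  // a 400 stays non-retryable no matter what else is attached.
  if (typeof e.status === "number") return RETRYABLE_STATUS.has(e.status);
  return typeof e.code === "string" && RETRYABLE_CODES.has(e.code);
}
```

Defaulting to "not retryable" for anything unrecognized is a deliberate choice here: an unknown error retried three times costs three times the tokens, while an unknown error surfaced once costs one.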
§ 02 · CIRCUIT BREAKERS
Stop retrying once the upstream is clearly down
Retries handle a single bad request. They don’t handle the case where 30% of your requests are failing because the upstream is genuinely struggling. In that scenario, every retry adds load to a system that’s already overloaded: textbook load amplification. The circuit breaker pattern, described in Michael Nygard’s Release It! and popularized by Netflix’s Hystrix library (circa 2012), is now standard in agent harnesses and fixes this.
The state machine is simple. The breaker has three states:
- Closed. Normal operation. Requests flow through. Failures and successes are tallied in a rolling 60-second window.
- Open. Once the failure rate exceeds a threshold (25% is a reasonable default), the breaker opens. Subsequent requests fail fast without ever calling the upstream. This is what stops the retry storm.
- Half-open. After a cool-down (30–60 seconds), one probe request is allowed through. If it succeeds, the breaker closes and traffic resumes. If it fails, the breaker stays open and the cool-down resets.
Per-tool circuit breakers — not one global breaker — are the right granularity for agentic systems. The agent might be using five different MCP servers, three of which are healthy. A global breaker would stall the whole agent because one tool is down. Per-tool breakers let the agent route around the failure and keep working.
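The three states and the per-tool granularity can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the class name, the `minSamples` guard (which keeps a handful of early failures from tripping the breaker), and the injected `now` clock are all choices made for this example.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  windowMs: number;         // rolling window for the failure tally
  failureThreshold: number; // open when failures / total exceeds this
  minSamples: number;       // assumption: ignore windows with too few calls
  coolDownMs: number;       // how long to stay open before probing
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private outcomes: { at: number; ok: boolean }[] = [];
  private openedAt = 0;
  private probeInFlight = false;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions, now: () => number = Date.now) {
    this.opts = opts;
    this.now = now;
  }

  // True if the caller may hit the upstream right now.
  allow(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.opts.coolDownMs) return false; // fail fast
      this.state = "half-open";
      this.probeInFlight = false;
    }
    // Half-open: exactly one probe goes through at a time.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  // Report the outcome of a call that allow() let through.
  record(ok: boolean): void {
    const t = this.now();
    if (this.state === "half-open") {
      this.probeInFlight = false;
      if (ok) {
        this.state = "closed";
        this.outcomes = [];
      } else {
        this.state = "open"; // probe failed: restart the cool-down
        this.openedAt = t;
      }
      return;
    }
    if (this.state === "open") return; // late result from before the trip
    this.outcomes.push({ at: t, ok });
    this.outcomes = this.outcomes.filter((o) => t - o.at <= this.opts.windowMs);
    const failures = this.outcomes.filter((o) => !o.ok).length;
    const enough = this.outcomes.length >= this.opts.minSamples;
    if (enough && failures / this.outcomes.length > this.opts.failureThreshold) {
      this.state = "open";
      this.openedAt = t;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}

// One breaker per tool, so one failing tool does not stall the others.
const DEFAULTS: BreakerOptions = {
  windowMs: 60000,
  failureThreshold: 0.25,
  minSamples: 20,
  coolDownMs: 30000,
};
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(tool: string): CircuitBreaker {
  let b = breakers.get(tool);
  if (!b) {
    b = new CircuitBreaker(DEFAULTS);
    breakers.set(tool, b);
  }
  return b;
}
```

A caller would check `breakerFor(toolName).allow()` before invoking a tool and call `record(...)` with the outcome afterward. When `allow()` returns false, the harness fails fast and the agent can route around the tool.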
To feel why the breaker matters, play an incident. Set how badly the upstream is degraded and how many times you retry, then toggle the breaker and step through the 60-second window.
With the breaker off, a 70% failure rate and 3 attempts pushes 2.19× the load onto an upstream that is already struggling — every doomed request retried in full. Flip the breaker on and it trips after the first bad window, collapsing windows 2–6 to a single half-open probe each: the load multiplier falls toward 1× and the wasted-token bill all but vanishes. Numbers are an illustrative 1,000-request incident, not measured production data.
Retries help one bad request, but during a real incident they multiply load on the system that is already struggling — the breaker is what converts a self-amplifying outage into a bounded, cheap fast-fail. That is the exact intuition behind the §08 quiz answer.
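The 2.19× figure above falls out of a short geometric sum. The sketch below assumes every failure is retryable and that attempts fail independently with the same probability, a simplification that real incidents violate. The function name is made up for this example.

```typescript
// Expected upstream calls per logical request when each attempt fails with
// probability `failureRate` and the client makes up to `maxAttempts` tries.
// Attempt k happens only if the previous k attempts all failed, so the
// expected count is 1 + p + p^2 + ... + p^(maxAttempts - 1).
function loadMultiplier(failureRate: number, maxAttempts: number): number {
  let expected = 0;
  let pReach = 1; // probability that this attempt happens at all
  for (let k = 0; k < maxAttempts; k++) {
    expected += pReach;
    pReach *= failureRate;
  }
  return expected;
}

// 70% failure rate, 3 attempts: 1 + 0.7 + 0.49 = 2.19 calls per request.
const duringIncident = loadMultiplier(0.7, 3);
```

The same sum shows why the multiplier is harmless in normal operation: when failures are rare the extra terms are tiny, so retries only become expensive exactly when the upstream can least afford them.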
§ 03 · CAPACITY ENGINEERING
Treat LLM capacity like any other constrained resource
Here’s the number that should change how you think about AI reliability. In February 2026, Datadog found ~5% of all LLM call spans errored, and roughly 60% of those errors were rate limits — capacity, not model quality. By March the dataset logged nearly 8.4 million rate-limit errors in a single month (about a third of all LLM errors that month). Not model errors. Not network errors. Rate limits, across production AI workloads.
Your prompt is fine. Your throughput is the bottleneck. The team that ships the most reliable agent isn’t the one with the best prompt — it’s the one that thinks about their token budget the way a database engineer thinks about connection pools. Three patterns matter:
- Per-key (sub-key) rate limits. Your provider gives you a top-level quota. You partition it among tenants / features / cron jobs so no single consumer can starve the rest. A noisy batch job at 2am shouldn’t take down interactive traffic.
- Backpressure on the queue. When you can’t serve, reject the work immediately rather than queueing it for later. A queue that grows unbounded during a capacity incident becomes a multi-hour outage even after the incident ends.
- Token-aware admission. Reject incoming requests if the expected token cost would push you past your current per-minute budget. Better to fail one request fast than to fail every request later.
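Per-key budgets and token-aware admission can share one small mechanism. The sketch below uses a token-bucket refill, which is one common way to meter a per-minute budget and is an assumption of this example, not something the patterns above prescribe. `TokenBudget`, `tryAdmit`, and the injected `now` clock are hypothetical names.

```typescript
// A per-consumer budget of LLM tokens per minute, refilled continuously.
class TokenBudget {
  private available: number;
  private lastRefill: number;
  private readonly perMinute: number;
  private readonly now: () => number;

  constructor(perMinute: number, now: () => number = Date.now) {
    this.perMinute = perMinute;
    this.now = now;
    this.available = perMinute; // start with a full budget
    this.lastRefill = now();
  }

  // Admit the request only if its estimated token cost fits the remaining
  // budget; otherwise reject immediately instead of queueing it.
  tryAdmit(estimatedTokens: number): boolean {
    const t = this.now();
    const refill = ((t - this.lastRefill) / 60000) * this.perMinute;
    this.available = Math.min(this.perMinute, this.available + refill);
    this.lastRefill = t;
    if (estimatedTokens > this.available) return false;
    this.available -= estimatedTokens;
    return true;
  }
}

// Partition a top-level quota so a batch job cannot starve interactive traffic.
// The split below is illustrative only.
const interactiveBudget = new TokenBudget(80000);
const batchBudget = new TokenBudget(20000);
```

A rejected request returns in microseconds with a clear reason, which is the backpressure behavior described above: the caller learns right away that capacity is gone, and no backlog builds up to be worked off after the incident ends.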
§ 04 · BOUNDED SCOPE
The agent that refuses things is the one that ships
Reliability isn’t just “does it return successfully under load.” It’s also “does it return the right thing.” The Data Science Collective’s April 2026 piece on bounded-scope agents reframed this: the best production agents are narrow, and they know what they don’t own.
A support agent handles tickets. It doesn’t touch billing. The boundary is the safety mechanism. When a user asks the support agent to refund a charge, the right behavior is to refuse and route to the billing system — not to attempt the refund and hope nothing breaks.
The implementation pattern is an allow-list of actions enforced at the harness level, not at the model level. The model can’t hallucinate its way past it because the tool invocation goes through a router that checks the action against the allow-list before any side effect runs.
// Actions this agent may take. Anything else is refused before side effects run.
const SUPPORT_AGENT_ALLOWS = new Set([
  "read_ticket",
  "search_kb",
  "create_internal_note",
  "escalate_to_billing",
]);

// Tool registry, defined elsewhere in the harness.
declare const TOOLS: Record<string, (args: unknown) => Promise<unknown>>;

async function callTool(toolName: string, args: unknown) {
  // Enforce scope in the router, not in the prompt.
  if (!SUPPORT_AGENT_ALLOWS.has(toolName)) {
    return {
      ok: false,
      error: `Tool '${toolName}' not in this agent's scope. Use escalate_to_billing.`,
    };
  }
  return await TOOLS[toolName](args);
}

Refusal rate becomes a quality metric. An agent that never refuses is suspicious: it’s either operating with too broad a scope or it’s papering over things it shouldn’t. An agent that refuses cleanly and routes the request elsewhere is operating inside its lane.
§ 05 · MODEL ROUTING
Sonnet for the hard parts. Haiku for the rest.
Most production agents pay frontier prices for tasks a smaller model handles fine. Classifying a support ticket. Summarizing a meeting transcript. Extracting a structured field from a form. These don’t need Sonnet. Sending every request to Sonnet is the “everything is a select * query” of production AI — it works, and it’s 10× more expensive than it needs to be.
Model routing is the harness layer that decides per request which tier to call. The classic shape:
- Classify the request (a small cheap model labels difficulty: trivial / standard / complex).
- Route by tier — Haiku for trivial, Sonnet for standard, Opus for complex.
- Quality fallback: if Haiku’s output fails an eval check, retry with Sonnet. The fallback is rare if the classifier is honest.
- Measure per tier. Track cost and quality separately for each routing tier so you can tune the classifier’s thresholds.
The published economics: teams commonly report ~70% cost reduction from honest routing, with little measurable quality loss on the routed tiers. RouteLLM reports up to 75% at 95% of frontier quality, and tools like LiteLLM ship fallback routing on 429/5xx out of the box. The quality fallback catches the edge cases the classifier gets wrong.
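The classify, route, and fall back steps above can be sketched with the model calls injected as plain functions. Everything named here is hypothetical: `classify`, `callModel`, and `passesEval` stand in for whatever classifier, client, and eval check you run, and the tier labels are shorthand for model families, not API model IDs. Escalating one tier on a failed eval, including from Sonnet to Opus, is an extension of the Haiku-to-Sonnet fallback described above.

```typescript
type Difficulty = "trivial" | "standard" | "complex";
type Tier = "haiku" | "sonnet" | "opus";

// Route by tier: cheapest model that the difficulty label allows.
const TIER_FOR: Record<Difficulty, Tier> = {
  trivial: "haiku",
  standard: "sonnet",
  complex: "opus",
};

// Quality fallback: where to go when a tier's output fails the eval check.
const ESCALATE: Record<Tier, Tier | null> = {
  haiku: "sonnet",
  sonnet: "opus",
  opus: null, // nothing above the top tier
};

interface RouterDeps {
  classify: (prompt: string) => Promise<Difficulty>;
  callModel: (tier: Tier, prompt: string) => Promise<string>;
  passesEval: (output: string) => boolean;
}

interface RoutedResult {
  tier: Tier;         // tier that produced the returned output
  output: string;
  escalated: boolean; // true if the quality fallback fired
}

async function routeRequest(prompt: string, deps: RouterDeps): Promise<RoutedResult> {
  const first = TIER_FOR[await deps.classify(prompt)];
  const output = await deps.callModel(first, prompt);
  if (deps.passesEval(output)) return { tier: first, output, escalated: false };

  const next = ESCALATE[first];
  if (next === null) return { tier: first, output, escalated: false };

  // One escalation step only, so a bad request cannot climb tiers forever.
  const retried = await deps.callModel(next, prompt);
  return { tier: next, output: retried, escalated: true };
}
```

Logging `tier` and `escalated` per request gives the per-tier cost and quality numbers needed to tune the classifier: a high escalation rate on one tier suggests the classifier is labeling requests as easier than they are.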
§ 06 · PROMPT CACHING
90% discount most teams skip
Anthropic’s prompt caching, plus the equivalents at OpenAI and Google, changes the economics of long system prompts. Mark a block as cacheable with one flag (cache_control: { type: "ephemeral" }) and subsequent calls reading the same prefix cost ~10% of the normal input price, a 90% discount. The first write costs 1.25×, so one write plus one read costs 1.35× against 2× uncached: caching pays off from the first reuse within the cache lifetime. The mechanisms differ across providers: OpenAI applies caching automatically for a 50% discount with no flag, while Anthropic’s explicit cache_control breakpoints earn the steeper ~90% cut.
The pattern that fits most production agents:
- The system prompt + AGENTS.md + any stable retrieved context sits at the start of the request, marked cacheable.
- The user’s current turn and any volatile context comes after the cache breakpoint.
- Verify cache hits via the API’s response metadata: you’ll see a cache_read_input_tokens field that distinguishes cached from uncached input tokens.
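The layout above can be sketched as a plain request body in the shape of Anthropic’s Messages API: stable blocks first, a `cache_control` marker on the last stable block, and the volatile user turn after it. The helper names are hypothetical, the `model` string is left to the caller, and the hit-share calculation assumes that `input_tokens` counts only uncached input. Check that assumption against the current API docs before relying on it.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface MessagesRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Stable layers (system prompt, AGENTS.md, stable retrieved context) go first.
// The breakpoint sits on the last stable block, so everything up to and
// including it is cacheable. The user turn comes after the breakpoint.
function buildCachedRequest(
  model: string,
  stableLayers: string[],
  userTurn: string
): MessagesRequest {
  const system = stableLayers.map((text): TextBlock => ({ type: "text", text }));
  const last = system[system.length - 1];
  if (last) last.cache_control = { type: "ephemeral" };
  return {
    model,
    max_tokens: 1024, // illustrative value
    system,
    messages: [{ role: "user", content: userTurn }],
  };
}

// Usage fields as they appear in the response metadata.
interface Usage {
  input_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Share of input tokens served from cache, assuming the three fields are
// disjoint counts (an assumption of this sketch).
function cacheReadShare(u: Usage): number {
  const read = u.cache_read_input_tokens ?? 0;
  const written = u.cache_creation_input_tokens ?? 0;
  const total = u.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

Tracking `cacheReadShare` over time is a cheap regression check: if it drops after a deploy, something volatile (a timestamp, a request ID) has probably crept into the prefix ahead of the breakpoint.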
For an agent making 1M calls a day with 8K of stable prefix each, do the arithmetic: 1M × 30 days = 30M calls/month, and 8K tokens × 30M calls = 240,000 MTok of prefix. At Sonnet’s $3/MTok input rate that prefix costs ~$720K/month at full price; cache reads at 10% bring it to ~$72K, so the cache saves on the order of $648K/month — large enough that the exact assumptions, not the headline figure, are what matter. Layer multiple breakpoints (system prompt → KB → recent context → current turn) to maximize cache hits across different stability tiers.
Rather than trust any single headline figure, derive your own. Size each prompt layer, mark it stable or volatile, place the breakpoint, and compute the monthly cost.

With the defaults (system + AGENTS.md + KB = 7.7K stable tokens, breakpoint after KB, 1M calls/day on Anthropic) the cache turns a ~$693K/mo prefix bill into ~$69K in cache reads, about $624K/mo saved on the prefix alone. The full-prompt arithmetic for the same defaults comes to $149,963 cached vs $765,000 full, or $615,038 saved/mo. Those figures are consistent with an 8.5K-token prompt (the 7.7K stable prefix plus about 0.8K of volatile tokens billed at full price) and the 25% write premium amortized over ~20 reuses. Pricing is illustrative at $3/MTok Sonnet input.

Savings are determined entirely by how much stable prefix sits before the breakpoint: a single volatile layer ahead of the breakpoint zeroes out the discount.
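To compute the number from your own assumptions, the arithmetic fits in one function. The accounting below is one model that reproduces the figures in this section: stable tokens are billed at the read multiplier plus the write premium spread over the reuses of each write, and volatile tokens are billed at full price. The field names are made up for this sketch, and real bills depend on cache lifetime and hit rate, which it ignores.

```typescript
interface CacheCostInput {
  callsPerMonth: number;
  stableTokens: number;   // tokens before the cache breakpoint
  volatileTokens: number; // tokens after it, always billed at full price
  pricePerMTok: number;   // input price in dollars per million tokens
  readMultiplier: number; // cache read price as a fraction of input price
  writePremium: number;   // extra fraction paid on a cache write
  reusesPerWrite: number; // how many calls share one write
}

function monthlyInputCost(c: CacheCostInput): { full: number; cached: number; saved: number } {
  // Millions of tokens per month for a given per-call token count.
  const mtok = (tokensPerCall: number) => (tokensPerCall * c.callsPerMonth) / 1e6;

  const full = mtok(c.stableTokens + c.volatileTokens) * c.pricePerMTok;

  // Effective rate on the stable prefix: read price plus amortized write premium.
  const stableRate = c.readMultiplier + c.writePremium / c.reusesPerWrite;
  const cached =
    mtok(c.stableTokens) * c.pricePerMTok * stableRate +
    mtok(c.volatileTokens) * c.pricePerMTok;

  return { full, cached, saved: full - cached };
}

// The 8K-prefix example: about $720K full, $72K cached (write premium ignored).
const headline = monthlyInputCost({
  callsPerMonth: 30e6, stableTokens: 8000, volatileTokens: 0,
  pricePerMTok: 3, readMultiplier: 0.1, writePremium: 0, reusesPerWrite: 1,
});

// The 7.7K-stable example with 0.8K volatile and the premium over 20 reuses:
// about $765,000 full and $149,962.50 cached.
const withPremium = monthlyInputCost({
  callsPerMonth: 30e6, stableTokens: 7700, volatileTokens: 800,
  pricePerMTok: 3, readMultiplier: 0.1, writePremium: 0.25, reusesPerWrite: 20,
});
```

Moving tokens from `stableTokens` to `volatileTokens` shows the sensitivity directly: every token that lands after the breakpoint pays the full input price on every call.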
§ 07 · AGENTS.MD AS THE CONVENTION LAYER
One file. Every tool. Source-controlled.
The harness isn’t just runtime code. It’s also the conventions the agent follows: codebase rules, style, the contract between the agent and the project. There’s a quiet but real win here. OpenAI Codex, Cursor, Factory, Sourcegraph, and Google converged on a single shared spec, AGENTS.md, now stewarded by the Agentic AI Foundation under the Linux Foundation. Codex, Cursor, GitHub Copilot, Gemini CLI, and others read it natively. Claude Code is the notable holdout: it reads CLAUDE.md, but you can @-import AGENTS.md from it so a single file still drives every tool.
The shape is mundane and that’s the point. A markdown file at the root of the repo, four sections deep — the spec (what AGENTS.md must include), the codebase conventions, the constraints the agent must follow, and a list of tools and their access scopes. Most major agent CLIs read it automatically (and the holdouts can be pointed at it with one import line). Commit it, review it, evolve it like any other code artifact.
Treating conventions as code is what makes the rest of the harness possible. A retry policy means nothing if a different engineer’s prompt expects a different error format. Bounded scope means nothing if a new tool gets added without updating the allow-list. AGENTS.md is the shared substrate every other harness layer leans against.
§ 08 · THE FIVE LAYERS IN ONE DIAGRAM
Toggle each one and watch the system change
Before the lab, here is the whole stack as one picture: a single request running the gauntlet of all five gates, each owning one failure mode and each with its own exit path.
Reliability comes from layered gates, each owning one failure mode. Two recover (retries and the fallback model) and two stop traffic (the circuit breaker and per-key rate limits); the cache changes only cost. That is why fallback is the biggest reliability win and the cache pays for the rest.
The lab below ties the layers together. The scenario is a 10 million request/month production agent at the Datadog-observed median of 4% upstream error rate. Toggle each layer on or off and watch three numbers move: visible failure rate (what the user sees), p95 latency, and monthly cost. Sweep the upstream error rate slider to see what each layer does under an incident.
All layers off. This is what shipping straight against the model API looks like under load.
The Datadog State of AI Engineering logged nearly 8.4 million rate-limit errors in a single month (March 2026) — roughly a third of all LLM errors that month — and back in February found that ~60% of LLM errors were rate limits. Either way, capacity, not model quality, is the dominant failure mode, which means per-key budgets and fallback models carry more reliability weight than people realize. Stack all five and a 4% upstream error becomes invisible to the user.
Three things to notice:
- Fallback model is the biggest single reliability win. Routing 429s to Haiku takes the visible failure rate from “perceivable” to “invisible” under typical incident conditions, and cheaply.
- Prompt caching doesn’t change reliability but pays for the rest of the stack. The cost reduction is enough that the harness layers above it are essentially free.
- Per-key rate limits prevent the most common outage class, a single tenant’s burst exhausting your shared quota, without any need for code in the agent itself.
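A back-of-envelope model shows why stacked recovery layers make a 4% upstream error rate hard to see. The sketch assumes attempts and the fallback fail independently, which is optimistic: during a rate-limit incident failures are correlated, and that is exactly the case the breaker and per-key budgets exist for. The function name and the fallback’s error rate are assumptions of this example, not numbers from the lab.

```typescript
// Probability that the user sees a failure: every primary attempt fails,
// and then the fallback model (if any) fails too.
// fallbackErrorRate = 1 means "no fallback configured".
function visibleFailureRate(
  upstreamErrorRate: number,
  attempts: number,
  fallbackErrorRate = 1
): number {
  return Math.pow(upstreamErrorRate, attempts) * fallbackErrorRate;
}

const noHarness = visibleFailureRate(0.04, 1);          // 4 in 100 requests fail
const retriesOnly = visibleFailureRate(0.04, 3);        // about 6.4 in 100,000
const withFallback = visibleFailureRate(0.04, 3, 0.04); // about 2.6 in 1,000,000
```

Sweeping `upstreamErrorRate` toward 0.7 shows where the independent model stops being reassuring: at 70% errors, three attempts still leave about a third of requests failing, which is the regime where the fallback has to carry the load.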
§ · FURTHER READING
References & deeper sources
- (2026). What Is Harness Engineering? · harness-engineering.ai (Apr 2, 2026)
- (2026). State of AI Engineering 2026 · Datadog
- (2018). Release It! Design and Deploy Production-Ready Software (2nd ed.) · Pragmatic Bookshelf
- (2014). CircuitBreaker · martinfowler.com (bliki)
- (2015). Exponential Backoff And Jitter · AWS Architecture Blog
- (2012). Introducing Hystrix for Resilience Engineering · Netflix Technology Blog (Nov 2012)
- (2026). Why AI Agents Keep Failing in Production · Medium — Data Science Collective
- (2026). Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith (AI Agent Clinic, Ep. 1) · Google Developers Blog (Apr 2026)
- (2025). RouteLLM: Learning to Route LLMs with Preference Data · ICLR 2025 — arXiv:2406.18665
- (2026). Fallbacks — Proxy Reliability · LiteLLM Docs
- (2026). Prompt Caching · Claude Docs
- (2024). Prompt Caching in the API · OpenAI
- (2026). Rate Limits · Claude Docs
- (2026). AGENTS.md — Open Format for Guiding Coding Agents · agents.md
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.