Drip · Agents & RAG · 17 min read

Harness Engineering

When your agent fails, the model is rarely the problem. The harness — the code wrapping the model call — is where reliability is decided. Five layers cover most of it, and Datadog’s March 2026 data tells you which ones matter most.

The bottom line. Sarah Chen (April 2026) framed it sharply: the model is a function, the harness is the system. Five layers — bounded retries, circuit breakers, per-key capacity budgets, fallback model routing, and prompt caching — handle most production failure modes. Datadog logged nearly 8.4M rate-limit errors in a single month (March 2026, about a third of all LLM errors), and in February found ~60% of LLM errors were rate limits — capacity, not model quality. The lab below toggles each layer and shows visible failure rate, p95 latency, and monthly cost moving in concert.

§ 00 · THE HARNESS IS THE SYSTEM
The model is a function. The harness is the system.

Most production AI outages don’t look like model failures. They look like one of: the model returned a 429, the retry loop ran forever, the cost dashboard exploded, a single tenant starved everyone else, a downstream tool failed and the agent looped on it until the token budget was gone. None of these are problems with the model. They’re problems with what surrounds the model.

Sarah Chen’s April 2026 essay made the framing explicit: the model is a stateless function with no error handling and no guarantees about availability. Everything else — retries, circuit breakers, rate budgets, fallback routing, observability, cost controls — lives in the harness. The harness is where reliability is decided. The model is just one of the dependencies it has to keep alive.

This essay walks through the five layers that show up in every production harness worth copying. Each layer addresses a different failure mode. None of them is sufficient on its own; all five stacked make a 4% upstream error rate invisible to the user. The interactive lab in §08 lets you watch the stack work end-to-end.

§ 01 · RETRIES WITH SANE BACKOFF
The cheapest reliability layer, and the easiest to break

Bounded retries handle the “the upstream blipped” case. Three attempts max. Exponential backoff (250ms, 1s, 4s feels right). Jitter on every delay so a thundering herd of clients doesn’t retry in lockstep. Retry only on the codes that make retry sensible — 429, 503, network timeouts — not on 400-class errors that will always fail.

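// Helpers the loop relies on. The error shape is an assumption: a numeric `status`
// for HTTP failures and a `code` such as "ETIMEDOUT" for network timeouts.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
function isRetryable(err: unknown): boolean {
  const e = err as { status?: number; code?: string } | null;
  return e?.status === 429 || e?.status === 503 || e?.code === "ETIMEDOUT";
}

async function withRetry<T>(fn: () => Promise<T>, max = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < max; i++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err)) throw err;   // 400-class errors will never succeed
      lastErr = err;
      if (i === max - 1) break;           // final attempt failed: don't sleep, just give up
      const base = 250 * Math.pow(4, i);  // 250ms, 1s, 4s (with max = 3 only the first two are used)
      const jitter = Math.random() * base * 0.25;
      await sleep(base + jitter);
    }
  }
  throw lastErr;
}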

The mistake most teams make: unbounded retries, or retries that don’t know what to retry. An agent that loops on a 400 Bad Request 50 times costs you 50× the tokens without any chance of succeeding. Cap the count, gate on the error type, and trust the next layer.

§ 02 · CIRCUIT BREAKERS
Stop retrying once the upstream is clearly down

Retries handle a single bad request. They don’t handle the case where 30% of your requests are failing because the upstream is genuinely struggling. In that scenario, every retry adds load to a system that’s already overloaded, which is textbook load amplification. The circuit breaker pattern, described in Michael Nygard’s Release It!, popularized by Netflix’s Hystrix library (circa 2012), and now standard in agent harnesses, fixes this.

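The state machine is simple. The breaker has three states:

  1. Closed. Requests flow through normally while the breaker tracks the recent failure rate.
  2. Open. Once the failure rate crosses the threshold, requests fast-fail without calling the upstream.
  3. Half-open. After a cooldown, a single probe request is let through; success closes the breaker, failure reopens it.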

Per-tool circuit breakers — not one global breaker — are the right granularity for agentic systems. The agent might be using five different MCP servers, three of which are healthy. A global breaker would stall the whole agent because one tool is down. Per-tool breakers let the agent route around the failure and keep working.
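A per-tool breaker can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: the class and function names, the 25% threshold, the 20-sample minimum, and the 10-second cooldown are all assumptions, and the failure count is cumulative since the last state change where a real breaker would use a sliding window.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private total = 0;
  private openedAt = 0;
  private readonly failThreshold: number; // open above this failure fraction
  private readonly minSamples: number;    // don't trip on a tiny sample
  private readonly cooldownMs: number;    // time to stay open before probing

  constructor(failThreshold = 0.25, minSamples = 20, cooldownMs = 10000) {
    this.failThreshold = failThreshold;
    this.minSamples = minSamples;
    this.cooldownMs = cooldownMs;
  }

  // Should this call go upstream? `now` is injectable so the logic is testable.
  allow(now: number = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let exactly one probe through
      return true;
    }
    return this.state === "closed";
  }

  // Report the outcome of a call that was allowed through.
  record(ok: boolean, now: number = Date.now()): void {
    if (this.state === "open") return; // ignore late results from in-flight calls
    if (this.state === "half-open") {
      // The probe decides: success closes the breaker, failure reopens it.
      if (ok) this.reset();
      else this.trip(now);
      return;
    }
    this.total++;
    if (!ok) this.failures++;
    if (this.total >= this.minSamples && this.failures / this.total > this.failThreshold) {
      this.trip(now);
    }
  }

  current(): BreakerState {
    return this.state;
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failures = 0;
    this.total = 0;
  }

  private reset(): void {
    this.state = "closed";
    this.failures = 0;
    this.total = 0;
  }
}

// One breaker per tool, created lazily, so one failing tool never blocks the others.
const breakers = new Map<string, CircuitBreaker>();
function breakerFor(tool: string): CircuitBreaker {
  let breaker = breakers.get(tool);
  if (!breaker) {
    breaker = new CircuitBreaker();
    breakers.set(tool, breaker);
  }
  return breaker;
}
```

The harness would call `breakerFor(toolName).allow()` before each tool invocation and `record(...)` afterwards; when `allow()` returns false the agent gets an immediate structured failure and can route around the tool.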

To feel why the breaker matters, play an incident. Set how badly the upstream is degraded and how many times you retry, then toggle the breaker and step through the 60-second window.

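Lab · retry amplification
Play a 60s upstream incident and watch retries multiply load while a breaker caps the damage. Controls: upstream success rate (default 30%), max retry attempts (default 3), breaker open threshold (default 25% failures). Readouts: upstream load multiplier, tokens burned on doomed retries, and requests fast-failed by the breaker. At the defaults, E[attempts] = 2.19×. With the breaker on, the 60s snapshot shows 365 upstream calls in the first 10s window and 1 probe in each of the next five, for a 0.37× load multiplier, $0.00 (0 extra calls) burned on doomed retries, and 828 requests fast-failed.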

With the breaker off, a 70% failure rate and 3 attempts pushes 2.19× the load onto an upstream that is already struggling — every doomed request retried in full. Flip the breaker on and it trips after the first bad window, collapsing windows 2–6 to a single half-open probe each: the load multiplier falls toward 1× and the wasted-token bill all but vanishes. Numbers are an illustrative 1,000-request incident, not measured production data.
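The 2.19× figure follows from a short sum: an attempt happens only if every earlier attempt failed, so with failure probability p and at most n attempts the expected number of upstream calls is 1 + p + p² + … + pⁿ⁻¹. The sketch below assumes attempts fail independently, which is a simplification during a real incident; the function name is illustrative.

```typescript
// Expected upstream calls per request when each attempt fails independently
// with probability `failRate` and we stop at the first success or after
// `maxAttempts` tries.
function expectedAttempts(failRate: number, maxAttempts: number): number {
  let expected = 0;
  for (let k = 0; k < maxAttempts; k++) {
    // Attempt k+1 is made only if the first k attempts all failed.
    expected += Math.pow(failRate, k);
  }
  return expected;
}

// 70% failure rate, 3 attempts: 1 + 0.7 + 0.49
console.log(expectedAttempts(0.7, 3).toFixed(2)); // prints "2.19"
```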

Retries help one bad request, but during a real incident they multiply load on the system that is already struggling — the breaker is what converts a self-amplifying outage into a bounded, cheap fast-fail. That is the exact intuition behind the §08 quiz answer.

§ 03 · CAPACITY ENGINEERING
Treat LLM capacity like any other constrained resource

Here’s the number that should change how you think about AI reliability. In February 2026, Datadog found ~5% of all LLM call spans errored, and roughly 60% of those errors were rate limits — capacity, not model quality. By March the dataset logged nearly 8.4 million rate-limit errors in a single month (about a third of all LLM errors that month). Not model errors. Not network errors. Rate limits, across production AI workloads.

Your prompt is fine. Your throughput is the bottleneck. The team that ships the most reliable agent isn’t the one with the best prompt — it’s the one that thinks about their token budget the way a database engineer thinks about connection pools. Three patterns matter:
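One way to make the connection-pool analogy concrete is a per-key token bucket checked at admission, before any upstream call is made. The sketch below is an illustration under stated assumptions rather than a pattern taken from the essay: `PerKeyBudget`, `tryAdmit`, and the tokens-per-minute figure are hypothetical, and the caller is assumed to have an estimate of each request's token cost.

```typescript
interface Bucket {
  tokens: number;    // tokens currently available to this key
  updatedAt: number; // timestamp (ms) of the last refill
}

class PerKeyBudget {
  private readonly buckets = new Map<string, Bucket>();
  private readonly tokensPerMinute: number;

  constructor(tokensPerMinute: number) {
    this.tokensPerMinute = tokensPerMinute;
  }

  // Returns true and reserves `cost` tokens if the key is within budget,
  // otherwise returns false so the request can be rejected at admission.
  tryAdmit(key: string, cost: number, now: number = Date.now()): boolean {
    const bucket = this.buckets.get(key) ?? { tokens: this.tokensPerMinute, updatedAt: now };
    // Refill continuously, capped at one minute's worth of tokens.
    const refill = ((now - bucket.updatedAt) / 60000) * this.tokensPerMinute;
    bucket.tokens = Math.min(this.tokensPerMinute, bucket.tokens + refill);
    bucket.updatedAt = now;
    this.buckets.set(key, bucket);
    if (cost > bucket.tokens) return false;
    bucket.tokens -= cost;
    return true;
  }
}
```

Because each key has its own bucket, one tenant's burst drains only that tenant's budget and never the shared quota.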

§ 04 · BOUNDED SCOPE
The agent that refuses things is the one that ships

Reliability isn’t just “does it return successfully under load.” It’s also “does it return the right thing.” The Data Science Collective’s April 2026 piece on bounded-scope agents reframed this: the best production agents are narrow, and they know what they don’t own.

A support agent handles tickets. It doesn’t touch billing. The boundary is the safety mechanism. When a user asks the support agent to refund a charge, the right behavior is to refuse and route to the billing system — not to attempt the refund and hope nothing breaks.

The implementation pattern is an allow-list of actions enforced at the harness level, not at the model level. The model can’t hallucinate its way past it because the tool invocation goes through a router that checks the action against the allow-list before any side effect runs.

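// Registry of tool implementations, assumed to be defined elsewhere in the harness.
declare const TOOLS: Record<string, (args: unknown) => Promise<unknown>>;

// Actions this agent may invoke; anything else is refused before any side effect runs.
const SUPPORT_AGENT_ALLOWS = new Set([
  "read_ticket",
  "search_kb",
  "create_internal_note",
  "escalate_to_billing",
]);

async function callTool(toolName: string, args: unknown) {
  if (!SUPPORT_AGENT_ALLOWS.has(toolName)) {
    // Structured refusal the model can read and act on, instead of a thrown error.
    return {
      ok: false,
      error: `Tool '${toolName}' not in this agent's scope. Use escalate_to_billing.`,
    };
  }
  return await TOOLS[toolName](args);
}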

Refusal rate becomes a quality metric. An agent that never refuses is suspicious — it’s either operating with too broad a scope or it’s papering over things it shouldn’t. An agent that refuses cleanly and routes the request elsewhere is operating inside its lane.

§ 05 · MODEL ROUTING
Sonnet for the hard parts. Haiku for the rest.

Most production agents pay frontier prices for tasks a smaller model handles fine. Classifying a support ticket. Summarizing a meeting transcript. Extracting a structured field from a form. These don’t need Sonnet. Sending every request to Sonnet is the “everything is a select * query” of production AI — it works, and it’s 10× more expensive than it needs to be.

Model routing is the harness layer that decides per request which tier to call. The classic shape:

  1. Classify the request (a small cheap model labels difficulty: trivial / standard / complex).
  2. Route by tier — Haiku for trivial, Sonnet for standard, Opus for complex.
  3. Quality fallback: if Haiku’s output fails an eval check, retry with Sonnet. The fallback is rare if the classifier is honest.
  4. Measure per tier. Track cost and quality separately for each routing tier so you can tune the classifier’s thresholds.
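The four steps above can be sketched as a small router. This is a minimal illustration with hypothetical names: the classifier, the per-tier model functions, and the eval check are injected by the caller, and escalating exactly one tier on a failed check is an assumption that generalizes the Haiku-to-Sonnet fallback described in step 3.

```typescript
type Tier = "trivial" | "standard" | "complex";
type ModelCall = (prompt: string) => Promise<string>;

interface Router {
  classify: (prompt: string) => Promise<Tier>; // small, cheap difficulty classifier
  models: Record<Tier, ModelCall>;             // e.g. Haiku / Sonnet / Opus
  passesEval: (output: string) => boolean;     // quality check on the output
}

// Which tier to retry on when the eval check fails; the top tier has nowhere to go.
const ESCALATE: Record<Tier, Tier | null> = {
  trivial: "standard",
  standard: "complex",
  complex: null,
};

async function route(
  router: Router,
  prompt: string,
): Promise<{ tier: Tier; output: string; escalated: boolean }> {
  const tier = await router.classify(prompt);
  const output = await router.models[tier](prompt);
  const next = ESCALATE[tier];
  if (next !== null && !router.passesEval(output)) {
    // Quality fallback: retry once on the next tier up.
    return { tier: next, output: await router.models[next](prompt), escalated: true };
  }
  return { tier, output, escalated: false };
}
```

Returning the tier and the `escalated` flag alongside the output is what makes step 4 possible: cost and quality can be logged per tier, and a rising escalation rate signals a classifier that is routing too aggressively.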

The published economics: most teams report ~70% cost reduction from honest routing, with no measurable quality loss on the routed tiers — RouteLLM reports up to 75% at 95% of frontier quality, and tools like LiteLLM ship fallback routing on 429/5xx out of the box. The fallback catches the edge cases.

§ 06 · PROMPT CACHING
90% discount most teams skip

Anthropic’s prompt caching, plus the equivalents at OpenAI and Google, change the economics of long system prompts. Mark a block as cacheable with one flag (cache_control: { type: "ephemeral" }) and subsequent calls reading the same prefix cost ~10% of the normal input price (a 90% discount; the first write costs 1.25×, so caching pays off once a prefix is reused even a few times). The mechanisms differ across providers: OpenAI applies caching automatically for a 50% discount with no flag, while Anthropic’s explicit cache_control breakpoints earn the steeper ~90% cut.
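As a rough illustration of where the flag goes, the sketch below builds an Anthropic-style Messages request body with one breakpoint at the end of the stable prefix. It only constructs the payload and makes no network call; `buildRequest` and its parameters are hypothetical names, and the model ID is a placeholder to be replaced with a real one.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

function buildRequest(systemPrompt: string, knowledgeBase: string, userTurn: string) {
  const system: TextBlock[] = [
    { type: "text", text: systemPrompt },
    // Breakpoint: everything up to and including this block forms the cached prefix.
    { type: "text", text: knowledgeBase, cache_control: { type: "ephemeral" } },
  ];
  return {
    model: "MODEL_ID_PLACEHOLDER", // placeholder, substitute a real model ID
    max_tokens: 1024,
    system,
    // The volatile part of the prompt comes after the breakpoint and is never cached.
    messages: [{ role: "user" as const, content: userTurn }],
  };
}
```

The ordering is the design choice that matters: stable content first, breakpoint last in the stable section, volatile content after it, so that the prefix stays byte-identical from call to call.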

The pattern that fits most production agents:

For an agent making 1M calls a day with 8K of stable prefix each, do the arithmetic: 1M × 30 days = 30M calls/month, and 8K tokens × 30M calls = 240,000 MTok of prefix. At Sonnet’s $3/MTok input rate that prefix costs ~$720K/month at full price; cache reads at 10% bring it to ~$72K, so the cache saves on the order of $648K/month — large enough that the exact assumptions, not the headline figure, are what matter. Layer multiple breakpoints (system prompt → KB → recent context → current turn) to maximize cache hits across different stability tiers.
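The arithmetic above is easy to parameterize so you can plug in your own volumes and prices. The sketch below reproduces the prefix-only calculation; it deliberately ignores the 1.25× first-write premium and any uncached tokens, and the interface and function names are illustrative.

```typescript
interface CacheScenario {
  callsPerDay: number;
  prefixTokens: number;   // stable tokens before the breakpoint
  pricePerMTok: number;   // input price in dollars per million tokens
  readMultiplier: number; // cached-read price as a fraction of the input price
}

function monthlyPrefixCost(s: CacheScenario, days = 30) {
  const mtok = (s.callsPerDay * days * s.prefixTokens) / 1_000_000;
  const full = mtok * s.pricePerMTok;     // every call pays full input price
  const cached = full * s.readMultiplier; // every call is a cache read
  return { full, cached, saved: full - cached };
}

const example = monthlyPrefixCost({
  callsPerDay: 1_000_000,
  prefixTokens: 8_000,
  pricePerMTok: 3,
  readMultiplier: 0.1,
});
// Rounded to whole dollars: 720000 72000 648000
console.log(Math.round(example.full), Math.round(example.cached), Math.round(example.saved));
```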

Rather than trust any single headline figure, derive your own. Size each prompt layer, mark it stable or volatile, place the breakpoint, and watch the monthly cost — with the live arithmetic shown beneath.

Lab · cache breakpoint planner
Size four prompt layers, place the breakpoint, and derive the monthly savings yourself. Defaults: system prompt 1.5K, AGENTS.md 1.2K, retrieved KB 5.0K, current turn 800 tokens; breakpoint after the retrieved KB; 1.00M calls per day. That makes 7.7K tokens cacheable, 91% of the 8.5K prompt.

Monthly cost with no cache: 8.5K × 30M calls/mo × $3/MTok = $765,000
Monthly cost with cache: $69,300 (7.7K prefix read at 10%) + $8,662.50 (0.25× write premium amortized over 20 reuses) + $72,000 (800 uncached tokens) = $149,962.50, displayed as $149,963
Saved per month: $765,000 − $149,962.50 = $615,037.50, displayed as $615,038 (80% off)

Savings are determined entirely by how much stable prefix sits before the breakpoint. Mark any pre-breakpoint layer volatile and the discount zeroes out. With the defaults (system + AGENTS.md + KB = 7.7K stable, breakpoint after KB, 1M calls/day on Anthropic) the cache turns a ~$693K/mo prefix bill into ~$69K in cache reads, about $624K/mo saved before the write premium; the ~$8.7K/mo premium accounts for the lab’s lower $615,038 figure. Illustrative pricing at $3/MTok Sonnet input; write premium amortized over ~20 reuses.

The savings are determined entirely by how much stable prefix sits before the breakpoint — and a single volatile layer in the wrong spot zeroes out the discount. Compute the number from your own assumptions instead of trusting a round figure.

§ 07 · AGENTS.MD AS THE CONVENTION LAYER
One file. Every tool. Source-controlled.

The harness isn’t just runtime code. It’s also the conventions the agent follows: codebase rules, style, the contract between the agent and the project. There’s a quiet but real win here. OpenAI Codex, Cursor, Factory, Sourcegraph, and Google converged on a single shared spec, AGENTS.md, now stewarded by the Agentic AI Foundation under the Linux Foundation. Codex, Cursor, GitHub Copilot, Gemini CLI, and others read it natively. Claude Code is the notable holdout: it reads CLAUDE.md, but you can @-import AGENTS.md from it so a single file still drives every tool.

The shape is mundane and that’s the point. A markdown file at the root of the repo, four sections deep — the spec (what AGENTS.md must include), the codebase conventions, the constraints the agent must follow, and a list of tools and their access scopes. Most major agent CLIs read it automatically (and the holdouts can be pointed at it with one import line). Commit it, review it, evolve it like any other code artifact.

Treating conventions as code is what makes the rest of the harness possible. A retry policy means nothing if a different engineer’s prompt expects a different error format. Bounded scope means nothing if a new tool gets added without updating the allow-list. AGENTS.md is the shared substrate every other harness layer leans against.

§ 08 · THE FIVE LAYERS IN ONE DIAGRAM
Toggle each one and watch the system change

Before the lab, here is the whole stack as one picture: a single request running the gauntlet of all five gates, each owning one failure mode and each with its own exit path.

Request → five gates → model:

  1. Bounded retries. Transient 429 / 503 / timeout → back off + jitter, retry. Recovers.
  2. Circuit breaker. Sustained >25% fail → fast-fail, no upstream call. Stops traffic.
  3. Per-key budget. Tenant over per-key TPM → reject at admission. Stops traffic.
  4. Fallback routing. Primary 429 / 5xx → reroute to Haiku, request survives. Recovers.
  5. Prompt cache. Stable prefix hit → 10% input cost (cost, not reliability).

Fig 1. A request reaches the model only if every gate passes, or a gate has a safe exit. Two gates recover (retries, fallback), two stop the bleeding (breaker, budget), and one (cache) is purely economic.

Reliability comes from layered gates, each owning one failure mode. Two recover and two stop traffic; the cache changes only cost — which is why fallback is the biggest reliability win and the cache pays for the rest.
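The fallback gate is small enough to sketch in full. This is a minimal illustration, assuming upstream errors carry a numeric HTTP `status`; `withFallback`, `isCapacityError`, and the `ModelFn` type are hypothetical names, and the two model functions stand in for whatever client calls your harness already makes.

```typescript
type ModelFn = (prompt: string) => Promise<string>;

// Assumed error shape: upstream failures expose a numeric HTTP `status`.
function isCapacityError(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status;
  return status === 429 || (status !== undefined && status >= 500 && status < 600);
}

async function withFallback(primary: ModelFn, fallback: ModelFn, prompt: string) {
  try {
    return { output: await primary(prompt), served: "primary" as const };
  } catch (err) {
    // Other 4xx errors mean the request itself is bad and would fail on the fallback too.
    if (!isCapacityError(err)) throw err;
    return { output: await fallback(prompt), served: "fallback" as const };
  }
}
```

Recording `served` on every response gives you the fallback rate, which is worth watching: a sustained rise means the primary is in trouble even though users see nothing.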

The lab below ties the layers together. The scenario is a 10 million request/month production agent at an illustrative 4% upstream error rate, between the roughly 2% the lab marks for Datadog’s March 2026 data and the ~5% Datadog reported for February. Toggle each layer on or off and watch three numbers move: visible failure rate (what the user sees), p95 latency, and monthly cost. Sweep the upstream error rate slider to see what each layer does under an incident.

Lab · reliability stack
10M req/mo at 4.0% upstream error: toggle harness layers and watch visible failure, latency, and cost move. The upstream error rate slider runs from healthy (0.5%) through illustrative load (4%; Datadog Mar-26 ≈ 2%) to incident (15%).

All layers off. This is what shipping straight against the model API looks like under load. Baseline readouts: visible failure 4.00%, p95 latency 2.20s, monthly cost $42,000.

The Datadog State of AI Engineering logged nearly 8.4 million rate-limit errors in a single month (March 2026) — roughly a third of all LLM errors that month — and back in February found that ~60% of LLM errors were rate limits. Either way, capacity, not model quality, is the dominant failure mode, which means per-key budgets and fallback models carry more reliability weight than people realize. Stack all five and a 4% upstream error becomes invisible to the user.

Three things to notice. Fallback model is the biggest single reliability win. Routing 429s to Haiku takes the visible failure rate from “perceivable” to “invisible” under typical incident conditions, and cheaply. Prompt caching doesn’t change reliability but pays for the rest of the stack. The cost reduction is enough that the harness layers above it are essentially free. Per-key rate limits prevent the most common outage class, a single tenant’s burst exhausting your shared quota, without any need for code in the agent itself.

CHECK
An agent in production starts seeing 30% upstream errors during what looks like a provider incident. Which harness layer is most important to have already in place?

§ · FURTHER READING
References & deeper sources

  1. Dr. Sarah Chen (2026). What Is Harness Engineering? · harness-engineering.ai (Apr 2, 2026)
  2. Datadog (2026). State of AI Engineering 2026 · Datadog
  3. Michael T. Nygard (2018). Release It! Design and Deploy Production-Ready Software (2nd ed.) · Pragmatic Bookshelf
  4. Martin Fowler (2014). CircuitBreaker · martinfowler.com (bliki)
  5. Marc Brooker (2015). Exponential Backoff And Jitter · AWS Architecture Blog
  6. Ben Christensen et al. (Netflix) (2012). Introducing Hystrix for Resilience Engineering · Netflix Technology Blog (Nov 2012)
  7. Data Science Collective (2026). Why AI Agents Keep Failing in Production · Medium — Data Science Collective
  8. L. Sala, J. Badish & F. Guan (Google) (2026). Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith (AI Agent Clinic, Ep. 1) · Google Developers Blog (Apr 2026)
  9. Ong et al. (UC Berkeley / Anyscale / LMSYS) (2025). RouteLLM: Learning to Route LLMs with Preference Data · ICLR 2025 — arXiv:2406.18665
  10. LiteLLM (2026). Fallbacks — Proxy Reliability · LiteLLM Docs
  11. Anthropic (2026). Prompt Caching · Claude Docs
  12. OpenAI (2024). Prompt Caching in the API · OpenAI
  13. Anthropic (2026). Rate Limits · Claude Docs
  14. Agentic AI Foundation (Linux Foundation) (2026). AGENTS.md — Open Format for Guiding Coding Agents · agents.md
