The model is the same for everyone. There is no framework. What separates someone who actually gets agent architecture from someone who has only read about it is whether they've ever built the smallest version and watched it run.

You can read about tool bloat or the two-layer sandwich for a week, nod at every diagram, and still not be able to reproduce any of it. Ideas don't transfer at the altitude of diagrams. They transfer once they've passed through your fingers.

So I built four. Each is a tiny companion repo — sixty to eighty lines, no framework, running offline with a mock model so the architecture is the only thing on screen. Each one swaps in a real model with a one-line change. They live in companion-codebases/ in this repo, and each has a Build Along drip that walks the code line by line.

There's one sentence underneath all four, and it's worth saying before any of them: let the model do the single fuzzy thing it's genuinely good at, and wrap it in deterministic code you can trust. Hold onto that. It's the whole essay.

1. Multi-MCP — route first, expose second

The advice is right: build many small MCP servers, not one monolith. Then you connect all of them and hit the catch nobody warned you about — the agent re-reads every tool's description on every single turn.

Three servers, six tools. Fine. Thirty servers, sixty-plus tools, and the model's accuracy at picking the right one falls off a cliff. It's wading through an inventory before it's allowed to act, and most of that inventory is irrelevant to the request in front of it.

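[Figure: without routing vs. with routing. Without routing, the agent sees all three servers (weather: current, forecast; docs: search, get; database: query, schema), so 6 tools sit in context on every turn. With routing, a router picks 1 of 3 servers (weather: current, forecast), so the agent sees 2 tools and the rest stay dark.]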
Three servers is only six tools. Thirty servers is the same picture, sixty tools deep. The router scopes the inventory before the model ever sees it.

The fix isn't a bigger model or a politer system prompt. It's a router that decides which servers are even relevant first, and exposes only their tools:

For every connected server, ask: could this plausibly help with the request in front of me? If not, it has no business in the model's context.
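Here is a minimal sketch of that question as code, not the companion repo's implementation. The server registry, the keyword sets, and the function names are all hypothetical, and plain keyword overlap stands in for whatever actually does the routing (it could just as well be a cheap model call).

```python
import re

# Hypothetical registry: server names, keywords, and tools are illustrative only.
SERVERS = {
    "weather": {
        "keywords": {"weather", "forecast", "rain", "temperature"},
        "tools": ["current", "forecast"],
    },
    "docs": {
        "keywords": {"docs", "documentation", "search", "article"},
        "tools": ["search", "get"],
    },
    "database": {
        "keywords": {"database", "query", "schema", "table", "sql"},
        "tools": ["query", "schema"],
    },
}


def route(request: str) -> list[str]:
    """Return the servers that could plausibly help with this request."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    return [name for name, server in SERVERS.items() if words & server["keywords"]]


def exposed_tools(request: str) -> list[str]:
    """Only tools from routed servers are allowed into the model's context."""
    return [
        f"{name}.{tool}"
        for name in route(request)
        for tool in SERVERS[name]["tools"]
    ]


if __name__ == "__main__":
    # Two tools reach the model; the other four never enter its context.
    print(exposed_tools("Will it rain in Oslo tomorrow?"))
```

The model never appears in this sketch, and that is the point: the inventory is scoped by ordinary code before any prompt is assembled.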

At three servers the win is small. The point is the slope — every server you add is permanent context tax unless a router stands in front of it. The full walkthrough is in the Multi-MCP build; the why is in Multi-MCP Architecture.

2. Eval-harness — prompts are code, so test them like code

A prompt change is a code change with no type checker and no test suite. You feel productive right up until a "small tweak" quietly breaks a case you fixed last week, and nobody notices for a month.

An eval is the missing test suite: a dataset of labeled cases, a thing under test, scorers that decide pass or fail, and a gate that can stop a release.

[Figure: dataset.jsonl (7 labeled cases) → system under test → scorers (exact_match) → gate (threshold 90%). Result: 6/7 = 86% → exit 1. The failing case: expected positive, got negative, for "Not bad at all, honestly impressed."]
The case that earns its keep is the sneaky one — "Not bad at all," which keyword matching reads as negative. A spot-check sails past it. The eval doesn't, and the gate refuses to ship.

The part that turns "we should have evals" into "we do" is one line: the runner exits non-zero below threshold. if rate < THRESHOLD: sys.exit(1).

A pass rate you read is a chart. An exit code is a gate CI can refuse to merge past.
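To make the four parts concrete, here is a rough sketch under stated assumptions, not the harness from the companion repo. Only the "Not bad at all" sentence comes from the example above; the other six cases, the keyword classifier standing in for the system under test, and every function name are made up for illustration.

```python
import sys

THRESHOLD = 0.90

# Illustrative dataset: only the last case is taken from the essay.
CASES = [
    ("I love this product.", "positive"),
    ("Great service, thank you.", "positive"),
    ("This is terrible.", "negative"),
    ("Awful experience, never again.", "negative"),
    ("Works great and arrived early.", "positive"),
    ("The worst purchase I have made.", "negative"),
    ("Not bad at all, honestly impressed.", "positive"),
]

NEGATIVE_WORDS = {"bad", "terrible", "awful", "worst"}


def classify(text: str) -> str:
    """System under test: a naive keyword matcher standing in for prompt + model."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "negative" if words & NEGATIVE_WORDS else "positive"


def exact_match(expected: str, got: str) -> bool:
    """Scorer: pass only when the label matches exactly."""
    return expected == got


def run(cases) -> float:
    """Score every case, print the failures, and return the pass rate."""
    passed = 0
    for text, expected in cases:
        got = classify(text)
        if exact_match(expected, got):
            passed += 1
        else:
            print(f"FAIL expected={expected} got={got} {text!r}")
    return passed / len(cases)


def main() -> None:
    rate = run(CASES)
    print(f"pass rate {rate:.0%} (threshold {THRESHOLD:.0%})")
    if rate < THRESHOLD:
        sys.exit(1)  # the gate: a non-zero exit code that CI can act on
```

Calling main() from the script's entry point is what turns the number into a gate. With these seven cases the keyword matcher trips on "Not bad at all", the rate lands at 6/7, and the process exits 1.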

Build it once and changing a prompt stops feeling like defusing a bomb. The walkthrough is the eval-harness build; the bigger argument is Eval-Driven Development.

3. Agentic-ETL — the two-layer sandwich

Hand an LLM your entire extract-transform-load pipeline and it will map fields beautifully for a hundred rows, then map dollars to cents on row 101 and never tell you. Hand it nothing and you're back to parsers that shatter on the second vendor's CSV.

The shape that survives real data is a sandwich: the fuzzy agent in the middle, deterministic code on both sides.

[Figure: strict edge → fuzzy middle → strict edge. Vendor A (customer_name) and vendor B (firstName + lastName) feed a deterministic normalize step, then the agent maps the fields onto the schema, then a deterministic validate gate sends clean rows to load and the rest to quarantine. The empty-name row is quarantined, not loaded; the gate is plain, boring, and never written by the model.]
The agent does the genuinely hard part — reconciling customer_name and firstName + lastName onto one schema. A boring validation gate decides whether the result is allowed to land.

The agent is allowed to be wrong. It is never allowed to be the thing that decides whether it was wrong.
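A small sketch of the sandwich, assuming a one-field target schema. The function names and sample rows are hypothetical, and mock_agent_map is a hard-coded stand-in for the model call, so the gate's independence from the mapper is the only thing being demonstrated.

```python
def normalize(row: dict) -> dict:
    """Strict edge: deterministic cleanup (trim keys and string values)."""
    return {
        key.strip(): (value.strip() if isinstance(value, str) else value)
        for key, value in row.items()
    }


def mock_agent_map(row: dict) -> dict:
    """Fuzzy middle: stand-in for a model mapping vendor fields onto the schema."""
    if "customer_name" in row:
        name = row["customer_name"]
    else:
        name = f"{row.get('firstName', '')} {row.get('lastName', '')}".strip()
    return {"name": name}


def validate(record: dict) -> bool:
    """Strict edge: a plain gate that no model wrote and no model can overrule."""
    name = record.get("name")
    return isinstance(name, str) and name != ""


def run_pipeline(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Normalize, map, then let the gate decide: load or quarantine."""
    clean, quarantine = [], []
    for row in rows:
        record = mock_agent_map(normalize(row))
        (clean if validate(record) else quarantine).append(record)
    return clean, quarantine


if __name__ == "__main__":
    rows = [
        {"customer_name": " Ada Lovelace "},             # vendor A
        {"firstName": "Grace", "lastName": "Hopper"},    # vendor B
        {"customer_name": "   "},                        # empty name
    ]
    clean, quarantine = run_pipeline(rows)
    print("load:", clean)
    print("quarantine:", quarantine)
```

The mapper can be swapped for a real model without touching validate, which is the property the sandwich exists to protect.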

That's the line that keeps a confidently-wrong mapping out of your production table. The agentic-ETL build shows both failure modes the sandwich prevents at once; the field report is Agentic ETL.

4. Verify-loop — the model writes, the verifier decides

The most expensive AI-coding failure isn't code that breaks loudly. It's code that's almost right — it compiles, it reads clean, and it's subtly wrong. You merge it because nothing screamed.

The pattern that attacks this directly refuses to trust the first draft. Generate, run a verifier, and if it fails, hand the failure back and try again — bounded, then escalate to a human.

[Figure: generate → verify (run the tests) → pass? Yes: accept. No: feedback = report (the failing test output), then back to generate. Fail after N tries: escalate to a human.]
Attempt one concatenates instead of adding; the verifier rejects it and says exactly why. Attempt two passes. No human ever saw the broken draft.

The load-bearing line is feedback = report — the model gets to read its own failing test output and correct course, the same way you would.

Stop asking "is the model good enough?" Start asking "what's my verifier?" The model writes; the verifier decides.
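The loop fits in a few lines. This is a sketch under assumptions, not the companion repo's code: mock_generate is a scripted stand-in for the model (first draft concatenates, second draft adds), the verifier is a single hard-coded test, and MAX_TRIES plays the role of N.

```python
MAX_TRIES = 3  # hypothetical bound; the essay only says "N tries"


def mock_generate(feedback):
    """Stand-in for the model: a buggy first draft, a fixed one once it sees feedback."""
    if feedback is None:
        return "def add(a, b):\n    return str(a) + str(b)\n"
    return "def add(a, b):\n    return a + b\n"


def verify(source: str):
    """Run the draft against a test and return (passed, report)."""
    namespace = {}
    try:
        exec(source, namespace)  # acceptable for a mock; sandbox real model output
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"add(2, 3) raised {exc!r}"
    if result != 5:
        return False, f"add(2, 3) returned {result!r}, expected 5"
    return True, "all tests passed"


def verify_loop():
    """Generate, verify, feed the failure back; escalate after MAX_TRIES."""
    feedback = None
    for attempt in range(1, MAX_TRIES + 1):
        draft = mock_generate(feedback)
        passed, report = verify(draft)
        if passed:
            return draft, attempt
        feedback = report  # the model reads its own failing test output
    raise RuntimeError(f"escalate to a human after {MAX_TRIES} tries: {feedback}")


if __name__ == "__main__":
    draft, attempt = verify_loop()
    print(f"accepted on attempt {attempt}")
    print(draft)
```

The bound matters as much as the feedback: without it, a verifier the model cannot satisfy becomes an infinite loop instead of a ticket for a human.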

The verify-loop build is the whole thing in sixty lines; the cost-of-being-almost-right case is Verifying AI Code.

The one shape

Line the four up and they're the same move wearing four costumes.

[Figure: three layers. Strict code (route · normalize · generate) → the model, one fuzzy step (pick · judge · map · draft) → strict code (expose · validate · verify). Fuzzy middle, strict edges.]
Pick a server, judge an output, map a messy field, draft the code — one ambiguous step the model is good at. Everything around it is deterministic code you can trust.

The model handles the one ambiguous step. Deterministic code handles everything around it — the routing, the gate, the validation, the verifier. That isn't a fence around what agents can do. It's the thing that lets them do it in production instead of in a demo.

Build the small version of each and the pattern stops being a diagram you nodded at. It becomes one you can reach for.