The model is the same for everyone, and no framework is required. What separates someone who actually gets agent architecture from someone who has only read about it is whether they've ever built the smallest version and watched it run.
You can read about tool bloat or the two-layer sandwich for a week, nod at every diagram, and still not be able to reproduce any of it. Ideas don't transfer at the altitude of diagrams. They transfer once they've passed through your fingers.
So I built four. Each is a tiny companion repo — sixty to eighty lines, no framework, running offline with a mock model so the architecture is the only thing on screen. Each one swaps in a real model with a one-line change. They live in companion-codebases/ in this repo, and each has a Build Along drip that walks the code line by line.
There's one sentence underneath all four, and it's worth saying before any of them: let the model do the single fuzzy thing it's genuinely good at, and wrap it in deterministic code you can trust. Hold onto that. It's the whole essay.
1. Multi-MCP — route first, expose second
The advice is right: build many small MCP servers, not one monolith. Then you connect all of them and hit the catch nobody warned you about — the agent re-reads every tool's description on every single turn.
Three servers, six tools. Fine. Thirty servers, sixty-plus tools, and the model's accuracy at picking the right one falls off a cliff. It's wading through an inventory before it's allowed to act, and most of that inventory is irrelevant to the request in front of it.
The fix isn't a bigger model or a politer system prompt. It's a router that first decides which servers are even relevant, then exposes only their tools.
For every connected server, ask: could this plausibly help with the request in front of me? If not, it has no business in the model's context.
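Here is a minimal sketch of that router, not the code from the companion repo. The server names, keywords, and tool names are all hypothetical, and plain keyword overlap stands in for whatever relevance check you would actually use (a small model call, embeddings, hand-written rules).

```python
import re

# Hypothetical registry: server names, keywords, and tools are illustrative only.
SERVERS = {
    "github": {"keywords": {"pr", "issue", "repo", "commit"},
               "tools": ["create_issue", "list_prs"]},
    "calendar": {"keywords": {"meeting", "schedule", "calendar"},
                 "tools": ["create_event", "list_events"]},
    "billing": {"keywords": {"invoice", "refund", "payment"},
                "tools": ["issue_refund", "get_invoice"]},
}


def route(request: str) -> list:
    """Return the servers that could plausibly help with this request.

    Keyword overlap is an assumed stand-in for a real relevance check.
    """
    words = set(re.findall(r"[a-z]+", request.lower()))
    return [name for name, server in SERVERS.items() if words & server["keywords"]]


def exposed_tools(request: str) -> list:
    """Collect only the tools that belong to the routed servers."""
    return [tool for name in route(request) for tool in SERVERS[name]["tools"]]


if __name__ == "__main__":
    # Only the calendar tools would reach the model's context for this request.
    print(exposed_tools("Schedule a meeting with the design team"))
```

The model never sees the tools of a server the router rejected, so adding a fourth or fortieth server costs nothing on requests it cannot help with.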
At three servers the win is small. The point is the slope — every server you add is permanent context tax unless a router stands in front of it. The full walkthrough is in the Multi-MCP build; the why is in Multi-MCP Architecture.
2. Eval-harness — prompts are code, so test them like code
A prompt change is a code change with no type checker and no test suite. You feel productive right up until a "small tweak" quietly breaks a case you fixed last week, and nobody notices for a month.
An eval is the missing test suite: a dataset of labeled cases, a thing under test, scorers that decide pass or fail, and a gate that can stop a release.
The part that turns "we should have evals" into "we do" is one line: the runner exits non-zero when the pass rate falls below the threshold, as in if rate < THRESHOLD: sys.exit(1).
A pass rate you read is a chart. An exit code is a gate CI can refuse to merge past.
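The four parts fit in a few dozen lines. What follows is a sketch under assumptions rather than the companion repo's code: the dataset, the labels, the 0.8 threshold, and the keyword-based classify mock are all invented for illustration, with the mock standing in for a prompt plus a model call.

```python
import sys

THRESHOLD = 0.8  # assumed minimum pass rate; pick your own

# Hypothetical labeled dataset: (input, expected label).
CASES = [
    ("I want my money back", "refund"),
    ("Where is my package?", "shipping"),
    ("Cancel my subscription", "cancel"),
]


def classify(text: str) -> str:
    """Thing under test. A mock standing in for the prompt plus model call."""
    lowered = text.lower()
    if "money back" in lowered:
        return "refund"
    if "package" in lowered:
        return "shipping"
    if "cancel" in lowered:
        return "cancel"
    return "other"


def exact_match(output: str, expected: str) -> bool:
    """Scorer: pass only when the output equals the label."""
    return output == expected


def pass_rate(cases) -> float:
    """Run every case through the thing under test and score it."""
    passed = sum(exact_match(classify(text), expected) for text, expected in cases)
    return passed / len(cases)


def gate(rate: float) -> None:
    """Exit non-zero below threshold so CI can refuse the merge."""
    if rate < THRESHOLD:
        sys.exit(1)


if __name__ == "__main__":
    rate = pass_rate(CASES)
    print(f"pass rate: {rate:.2f}")
    gate(rate)
```

With the mock as written every case passes and the script exits zero. Break one branch of classify and the rate drops to 0.67, the gate fires, and the process exits with status 1.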
Build it once and changing a prompt stops feeling like defusing a bomb. The walkthrough is the eval-harness build; the bigger argument is Eval-Driven Development.
3. Agentic-ETL — the two-layer sandwich
Hand an LLM your entire extract-transform-load pipeline and it will map fields beautifully for a hundred rows, then map dollars to cents on row 101 and never tell you. Hand it nothing and you're back to parsers that shatter on the second vendor's CSV.
The shape that survives real data is a sandwich: the fuzzy agent in the middle, deterministic code on both sides.
The agent maps fields such as customer_name and firstName + lastName onto one schema. A boring validation gate decides whether the result is allowed to land.
The agent is allowed to be wrong. It is never allowed to be the thing that decides whether it was wrong.
That's the line that keeps a confidently-wrong mapping out of your production table. The agentic-ETL build shows both failure modes the sandwich prevents at once; the field report is Agentic ETL.
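A sketch of the sandwich, assuming two made-up vendor exports and a target schema of name plus amount_cents. None of these names come from the companion repo, and mock_agent_map stands in for the model's field mapping.

```python
import csv
import io
from decimal import Decimal

TARGET_FIELDS = {"name", "amount_cents"}  # hypothetical target schema

# Two hypothetical vendor exports with different column names.
VENDOR_A = "customer_name,amount_usd\nAda Lovelace,12.50\n"
VENDOR_B = "firstName,lastName,amount_usd\nGrace,Hopper,7.25\n"


def extract(raw: str) -> list:
    """Deterministic bottom layer: parse the CSV into row dicts."""
    return list(csv.DictReader(io.StringIO(raw)))


def mock_agent_map(row: dict) -> dict:
    """Fuzzy middle: a mock standing in for the model's field mapping."""
    if "customer_name" in row:
        name = row["customer_name"]
    else:
        name = f"{row['firstName']} {row['lastName']}"
    return {"name": name, "amount_cents": int(Decimal(row["amount_usd"]) * 100)}


def validate(source: dict, mapped: dict) -> list:
    """Deterministic top layer: list reasons to reject; empty means it may land."""
    if set(mapped) != TARGET_FIELDS:
        return ["wrong fields"]
    errors = []
    if not str(mapped["name"]).strip():
        errors.append("empty name")
    # Recompute the amount from the source so a dollars/cents slip is caught.
    if mapped["amount_cents"] != int(Decimal(source["amount_usd"]) * 100):
        errors.append("amount mismatch")
    return errors


def run(raw: str, mapper=mock_agent_map):
    """Extract, map, then let the gate decide what lands."""
    landed, rejected = [], []
    for row in extract(raw):
        mapped = mapper(row)
        errors = validate(row, mapped)
        if errors:
            rejected.append((row, errors))
        else:
            landed.append(mapped)
    return landed, rejected


if __name__ == "__main__":
    print(run(VENDOR_A))
    print(run(VENDOR_B))
```

Swap in a mapper that forgets the multiplication by 100 and nothing lands: the gate recomputes the amount from the source row and rejects the mismatch, no matter how confident the mapping looked.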
4. Verify-loop — the model writes, the verifier decides
The most expensive AI-coding failure isn't code that breaks loudly. It's code that's almost right — it compiles, it reads clean, and it's subtly wrong. You merge it because nothing screamed.
The pattern that attacks this directly refuses to trust the first draft. Generate, run a verifier, and if it fails, hand the failure back and try again — bounded, then escalate to a human.
The load-bearing line is feedback = report — the model gets to read its own failing test output and correct course, the same way you would.
Stop asking "is the model good enough?" Start asking "what's my verifier?" The model writes; the verifier decides.
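The loop, sketched with a mock model so it runs offline. The drafts, the add task, and the three-attempt bound are assumptions made for illustration, not the companion repo's code. The verifier here uses exec, which is acceptable for a toy and not something to point at untrusted model output without a sandbox.

```python
from typing import Optional, Tuple

MAX_ATTEMPTS = 3  # assumed bound before escalating to a human

# Hypothetical drafts: the first is almost right, the second is correct.
DRAFTS = [
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]


def mock_generate(task: str, feedback: Optional[str]) -> str:
    """Mock model: a real call would put the feedback into the prompt."""
    return DRAFTS[0] if feedback is None else DRAFTS[1]


def verify(code: str) -> Tuple[bool, str]:
    """Deterministic verifier: run the candidate against a fixed check."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy only; sandbox real model output
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"error: {exc!r}"
    if result != 5:
        return False, f"add(2, 3) returned {result}, expected 5"
    return True, "all checks passed"


def verify_loop(task: str, generate=mock_generate) -> str:
    """Generate, verify, feed the failure back, and stop after a bound."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        code = generate(task, feedback)
        ok, report = verify(code)
        if ok:
            return code
        feedback = report  # the model reads its own failing output
    raise RuntimeError(f"escalate to a human: {feedback}")


if __name__ == "__main__":
    print(verify_loop("write add(a, b)"))
```

The first draft compiles and reads clean, and the verifier still rejects it. The second attempt sees the report and passes. A generator that never fixes the bug runs out of attempts and raises, which is the escalation path.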
The verify-loop build is the whole thing in sixty lines; the cost-of-being-almost-right case is Verifying AI Code.
The one shape
Line the four up and they're the same move wearing four costumes.
The model handles the one ambiguous step. Deterministic code handles everything around it — the routing, the gate, the validation, the verifier. That isn't a fence around what agents can do. It's the thing that lets them do it in production instead of in a demo.
Build the small version of each and the pattern stops being a diagram you nodded at. It becomes one you can reach for.