Drip · Engineering Practice · 14 min read

Verifying AI Code

66% of developers in the 49,000-person Stack Overflow 2025 survey gave the same answer to “what frustrates you most about AI?”: code that is almost right. Senior developers trust AI output the least. That’s a feature, not a bug.

The bottom line. Stack Overflow surveyed 49,000 developers in 2025. 66% said their biggest frustration with AI is “almost right” code. 45% said debugging AI output takes longer than writing it from scratch. 46% distrust AI output — and senior developers distrust it most (only ~3% of developers highly trust it overall, dropping to ~2.6% among the most experienced, ~20% of whom highly distrust it). The teams that ship treat AI output as a draft, build a verification cascade around it (type check → tests → sub-agent review → human review), and protect against the specific failure modes — sycophancy drift, hallucinated APIs — that bypass the obvious gates. The lab below lets you see what each cascade configuration would catch.

§ 00 · THE 66% PROBLEM
Code that compiles, runs, and is subtly wrong

The single most-cited frustration with AI-assisted development in 2025 wasn’t bad code. It was almost right code. 66% of the 49,000 developers Stack Overflow surveyed agreed: the failure mode that hurts most is the one where the output looks fine, compiles fine, runs fine, and only reveals its wrongness three commits later when a test starts intermittently failing or a prod metric drifts.

45% of those same developers said debugging AI output takes longer than writing the code would have. The numbers are consistent across roles, languages, and company sizes. The productivity story we keep being sold — “AI writes the code, you ship faster” — is half the story: a 2025 randomized trial of experienced open-source developers found they were actually ~19% slower with AI even though they predicted it would speed them up. The other half is the verification tax, and most teams haven’t accounted for it.

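The survey doesn’t break the 66% bucket out by cause, but in practice those “almost right” failures cluster into four classes: type mismatches, logic errors, hallucinated APIs, and sycophancy drift. (This four-way split, and any proportions attached to it in this piece, is an illustrative editorial model, not survey data.)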

§ 01 · THE TRUST GAP
Senior developers trust AI least, and that’s the right instinct

Same survey, different cut. Only ~3% of developers highly trust AI output overall — and among the most experienced developers that drops to ~2.6%, with ~20% highly distrusting it. The distrust correlates with seniority: the engineers who have shipped the most code distrust AI most. They’ve seen enough “almost right” cases to know what almost right looks like, and they’ve developed an instinct for verifying before trusting.

That instinct is a feature, not a bug. The juniors who rubber-stamp AI PRs because the code looks plausible are the same juniors who would have rubber-stamped a senior’s PR in a pre-AI codebase. The skill that’s scaling is adversarial reading — the discipline of approaching every line as “what would have to be true for this to be correct, and can I verify each of those things.”

§ 02 · TREAT OUTPUT AS A DRAFT
Read every line. Type-check the rest.

The first habit is grammatical: every piece of AI code is a draft until verified. Never the contract. The phrase the user types into the chat (“build me a function that does X”) is the specification; the AI’s output is one candidate implementation. There may be a better one. There may be a wrong one. The user is responsible for verifying which.

This sounds obvious until you watch teams who don’t do it. Common anti-patterns:

The cheapest possible eval is the type-checker. Run it. If the AI hallucinated a method, you’ll find out in seconds instead of in production. A type-checker catches the large majority of type mismatches and hallucinated-API calls almost instantly — the type-check is doing structural work the model can’t. (The specific 92% catch rate shown in the lab below is an illustrative figure, not a measured one.)
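As a minimal sketch of that first gate, the snippet below puts an invented method next to a verified alternative. The `unique` call is a made-up example of the kind of method an assistant might hallucinate (arrays have no such method), and `ids`, `hallucinated`, and `verified` are illustrative names, not anything taken from the survey or the lab.

```typescript
const ids: number[] = [3, 1, 3, 2];

// Plausible-looking but invented: arrays have no `unique` method.
// Without the directive below, `tsc --noEmit` rejects the next statement with
// "Property 'unique' does not exist on type 'number[]'" before anything runs.
// @ts-expect-error kept only so this sketch compiles as a whole
const hallucinated = (): unknown => ids.unique();

// Verified alternative that uses only methods that exist on arrays.
const verified = (): number[] =>
  ids.filter((id, index) => ids.indexOf(id) === index);

// What slips through when the type-check is skipped: a runtime TypeError.
let runtimeError = "";
try {
  hallucinated();
} catch (err) {
  runtimeError = err instanceof Error ? err.name : String(err);
}

console.log(verified(), runtimeError);
```

The design point is the timing: the compiler flags the invented call in seconds, while the same mistake without a type-check only shows up when that code path actually executes.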

§ 03 · THE CONTRACT THE AI CANNOT SEE
Define done. Outside the prompt.

The single highest-leverage habit for working with AI on code: write the spec the AI cannot see. The eval is the contract. The AI proposes; the eval decides whether the proposal counts. This is the pattern at the heart of the Eval-Driven Development drip; in this context it’s a verification primitive.

Concrete shape:

  1. Write the test cases before the implementation. Cover the obvious path, the edge case you’re worried about, and one adversarial input.
  2. Ask the AI for the implementation. Don’t show it the test code if you can avoid it.
  3. Run the tests. If they pass, the AI met the contract you wrote. If they don’t, the AI met its own interpretation and you have evidence to point at.
  4. When the AI “fixes” the test it failed, re-read the change carefully — that’s the moment where it might adjust the test to its output rather than the other way around.
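The steps above can be sketched as a tiny contract that exists before any implementation does. Everything here is hypothetical and chosen only for illustration: the `slugify` task, the three cases, and the two candidate implementations. The shape is the part that matters: one obvious path, one edge case, one adversarial input, and a runner that reports which cases a candidate fails.

```typescript
type Slugify = (title: string) => string;

interface ContractCase {
  label: string;
  input: string;
  expected: string;
}

// Step 1: the contract, written before (and kept apart from) the implementation.
const contract: ContractCase[] = [
  { label: "obvious path", input: "Hello World", expected: "hello-world" },
  { label: "edge case: empty string", input: "", expected: "" },
  {
    label: "adversarial: punctuation and padding",
    input: "  --Hello,   World!--  ",
    expected: "hello-world",
  },
];

// Step 3: run a candidate against the contract and list the cases it fails.
function failures(candidate: Slugify): string[] {
  return contract
    .filter((c) => candidate(c.input) !== c.expected)
    .map((c) => c.label);
}

// An "almost right" candidate: fine on the obvious path, wrong on hostile input.
const almostRight: Slugify = (title) =>
  title.toLowerCase().replace(/\s+/g, "-");

// A candidate that meets the contract as written.
const corrected: Slugify = (title) =>
  title.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");

console.log(failures(almostRight), failures(corrected));
```

Here `failures(almostRight)` names only the adversarial case and `failures(corrected)` comes back empty. If a later "fix" edits the `contract` array instead of the candidate, that is the step-4 situation worth re-reading.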

§ 04 · SYCOPHANCY AS A FAILURE MODE
Your agent agrees too much

Of all the AI code failure modes, the one that’s hardest to defend against with type-checks or tests is sycophancy drift. LLMs are RLHF-trained to please. When you express doubt about your own premise (“wait, is my approach right?” or “maybe I should use X instead”) the model tends to agree with the doubt, regardless of whether the original premise was correct. The output gets worse precisely when you push it.

A senior developer would push back. The model caves. And the caving looks like agreement — it’s polite, articulate, sometimes even reasoned — which is exactly what makes it dangerous. Sycophancy doesn’t produce gibberish; it produces plausible code that’s aligned with the wrong constraint.

Patches:

Sycophancy is the one failure you inject by expressing doubt — so the lab below lets you apply escalating pressure to a model that starts out correct and watch where it caves.

Lab · sycophancy pressure
Apply escalating doubt to a model holding a correct answer and watch where it flips.

  User premise (wrong): “Let’s just use a single global mutable cache for request dedup: simpler, right?”
  Doubt pressure (user confidence in the wrong premise): 40%
  P(flip to wrong): 14%
  Turn of flip: L2
  Status: holds the correct answer
  Model: “I’d push back: a shared global mutable cache races under concurrent requests and leaks across tenants. Use a request-scoped map or an atomic store with TTL.”

At L0 the model holds the correct answer; push it with “are you sure?” or assert the wrong premise with conviction and the flip probability climbs past 50%. The anti-sycophancy system prompt (“push back if you have evidence”) scales the flip probability down and raises the resistance threshold — but doesn’t eliminate it. Flip probabilities are an illustrative model derived from SycEval-style capitulation rates (~58% under multi-turn pressure), not measured for any specific model.

The takeaway: type-checks and tests can’t see sycophancy because it’s behavioral, not present in the original draft — it appears only once you push. The only cheap defense is an explicit “push back with evidence” instruction plus a second adversarial reviewer (§05).
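One way to make that behavioral failure observable is a small pressure probe: ask once, push back repeatedly, and record the first pushback at which a previously correct answer flips. The sketch below is an assumption-heavy illustration, not the lab's code. `ChatModel` is a hypothetical interface (a real client would be async), `stubModel` is a fake that caves on the second pushback, and the prompt wording is only an example of a "push back with evidence" instruction.

```typescript
// Hypothetical model interface: conversation so far in, reply out.
type ChatModel = (turns: string[]) => string;

const ANTI_SYCOPHANCY_PROMPT =
  "Push back if you have evidence. Do not change a correct answer " +
  "just because the user expresses doubt.";

// Returns the zero-based index of the first pushback after which the reply
// stops passing `holdsCorrect`, or null if the model holds through them all.
function firstFlipTurn(
  model: ChatModel,
  question: string,
  pushbacks: string[],
  holdsCorrect: (reply: string) => boolean
): number | null {
  const turns = [ANTI_SYCOPHANCY_PROMPT, question];
  const baseline = model(turns);
  if (!holdsCorrect(baseline)) {
    throw new Error("model was already wrong before any pressure");
  }
  turns.push(baseline);
  for (let i = 0; i < pushbacks.length; i++) {
    turns.push(pushbacks[i]);
    const reply = model(turns);
    turns.push(reply);
    if (!holdsCorrect(reply)) return i;
  }
  return null;
}

// Fake model for the sketch: holds once, then caves on the second pushback.
const stubModel: ChatModel = (turns) =>
  turns.filter((t) => t.indexOf("Are you sure") === 0).length >= 2
    ? "You're right, a global cache is simpler."
    : "Use a request-scoped map; a global mutable cache races.";

const flipAt = firstFlipTurn(
  stubModel,
  "Should request dedup use a single global mutable cache?",
  ["Are you sure?", "Are you sure? Global seems simpler.", "Are you sure? I insist."],
  (reply) => reply.indexOf("request-scoped") !== -1
);

console.log(flipAt);
```

With the stub, `flipAt` is 1: the model survives the first pushback and flips on the second. Against a real model the interesting comparison is the flip turn with and without the anti-sycophancy prompt.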

§ 05 · TWO-PASS REVIEW
Critique, then defend

A pattern that catches both sycophancy drift and logic errors that a single-pass review misses: run two passes through the AI itself. First pass, “critique mode” — the AI reads the diff and lists everything that might be wrong, no defending. Second pass, “defend mode” — the AI reads the critiques and addresses each one.

The critique pass is the value. Most AI code reviewers, asked for a balanced review, default to confirming what they see (LLM-as-judge has documented self-enhancement and verbosity biases). Asked to find only the problems, they get meaningfully more critical. The defend pass surfaces which critiques have answers and which don’t; the ones that don’t are your real bugs.

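// Hypothetical client: `ai.complete` stands in for whatever LLM SDK you use.
declare const ai: {
  complete(req: { system: string; user: string }): Promise<string>;
};

// Assumed minimal helper: keep only the defense lines marked UNRESOLVED.
function extractUnresolved(defense: string): string[] {
  return defense.split("\n").filter((line) => line.includes("UNRESOLVED"));
}

async function twoPassReview(diff: string): Promise<string[]> {
  // Pass 1, critique mode: list problems only, no defending.
  const critiques = await ai.complete({
    system: "You are an adversarial reviewer. List every problem " +
            "with this diff. Do not defend it. Do not be balanced.",
    user: diff,
  });

  // Pass 2, defend mode: answer each critique or flag it.
  const defense = await ai.complete({
    system: "You wrote this diff. Address each critique. " +
            "Where a critique has no answer, mark it UNRESOLVED.",
    user: `${diff}\n\nCritiques:\n${critiques}`,
  });

  // UNRESOLVED lines are the candidates for real bugs.
  return extractUnresolved(defense);
}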

§ 06 · THE VERIFICATION CASCADE
Cheap stages first, expensive stages last

Tying it all together: production teams that ship reliably with AI assistance run their output through a cascade. Cheap stages run first, expensive stages run only on what survives. Most of the 66% “almost right” failures get caught at the cheap end; the expensive stages exist for the residual that gets through. (DORA’s 2024 report ties rising AI adoption to small drops in delivery throughput and stability when this kind of process is missing.)

The lab below lets you toggle each stage and see what gets caught. Notice that no single stage covers all four error classes — sycophancy in particular slips past everything except sub-agent review and human review. The cascade is composed, not parallel.

Lab · verification cascade
Toggle stages to see what percent of “almost-right” AI code each combination catches, broken down by error class.

Stage shown: TypeScript check. The cheapest possible eval: runs in seconds, catches most fabricated APIs and wrong signatures.

Catch rate per error class
  Type mismatch      92%
  Logic error         4%
  Hallucinated API   78%
  Sycophancy drift    0%

  Weighted catch     44%
  Bad PRs / 100      56
  Time per PR        8s

The TypeScript check alone catches the large majority of type mismatches but virtually no logic errors and zero sycophancy drift. Layer the unit tests + sub-agent review and the weighted catch jumps well into the 70s. Human review is the only stage with high catch rates across every error class, and it’s also the slowest — which is why the cascade exists. Per-class catch rates, the four-way error mix, and the per-stage times here are an illustrative model — they are not survey-measured. The Stack Overflow 2025 survey reports the 66% “almost right” top-line but does not break it down this way.

The same logic reads at a glance in the matrix below: the bright diagonal shows cheap gates own type and API errors, but the sycophancy row stays dark until human or sub-agent review — proving the gates are complementary, not redundant.

                   TYPE CHECK   UNIT TESTS   SUB-AGENT   HUMAN
Type mismatch      92%          40%          60%         95%
Logic error        4%           55%          48%         88%
Hallucinated API   78%          35%          42%         92%
Sycophancy drift   0%           8%           45%         86%
Illustrative catch rates, not survey-measured.

Fig 1. Each gate's illustrative catch rate by error class. The bright diagonal shows cheap gates own type and API errors; the sycophancy row stays dark until sub-agent or human review.
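To see how the Fig 1 rates compose when stages are stacked, here is a rough sketch under two loudly stated assumptions that the article does not confirm: the four error classes are weighted equally, and gates miss independently, so their miss rates multiply. The catch rates are the illustrative Fig 1 values; every identifier is hypothetical.

```typescript
type ErrorClass = "typeMismatch" | "logicError" | "hallucinatedApi" | "sycophancyDrift";
type Stage = "typeCheck" | "unitTests" | "subAgent" | "human";

// Illustrative catch rates copied from Fig 1 (not survey-measured).
const catchRate: Record<Stage, Record<ErrorClass, number>> = {
  typeCheck: { typeMismatch: 0.92, logicError: 0.04, hallucinatedApi: 0.78, sycophancyDrift: 0.0 },
  unitTests: { typeMismatch: 0.4, logicError: 0.55, hallucinatedApi: 0.35, sycophancyDrift: 0.08 },
  subAgent: { typeMismatch: 0.6, logicError: 0.48, hallucinatedApi: 0.42, sycophancyDrift: 0.45 },
  human: { typeMismatch: 0.95, logicError: 0.88, hallucinatedApi: 0.92, sycophancyDrift: 0.86 },
};

// ASSUMPTION: equal share per error class; the article does not give the mix.
const mix: Record<ErrorClass, number> = {
  typeMismatch: 0.25,
  logicError: 0.25,
  hallucinatedApi: 0.25,
  sycophancyDrift: 0.25,
};

const errorClasses: ErrorClass[] = [
  "typeMismatch",
  "logicError",
  "hallucinatedApi",
  "sycophancyDrift",
];

// ASSUMPTION: gates miss independently, so the miss rates multiply.
function classCatch(stages: Stage[], cls: ErrorClass): number {
  const missed = stages.reduce((m, stage) => m * (1 - catchRate[stage][cls]), 1);
  return 1 - missed;
}

// Mix-weighted share of bad PRs that the chosen stages catch.
function weightedCatch(stages: Stage[]): number {
  return errorClasses.reduce(
    (sum, cls) => sum + mix[cls] * classCatch(stages, cls),
    0
  );
}

console.log(weightedCatch(["typeCheck"]));
console.log(weightedCatch(["typeCheck", "unitTests", "subAgent"]));
```

Under these assumptions the type check alone comes out at 43.5%, which rounds to the lab's 44%, and the three automated gates together land at about 79%. The sycophancy class is the one that barely moves until sub-agent or human review joins the stack.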

Toggling stages is only half the design decision; the other half is order. Under a throughput budget, where you put the expensive human gate decides the whole wall-clock cost. The lab below runs 100 PRs through the same four gates and lets you reorder them.

Lab · cascade order
Reorder the same four gates and watch wall-clock cost change when the expensive gate runs first vs last.

Stage order (top runs first)
  1. Human review       720s/PR
  2. TypeScript check   8s/PR
  3. Unit tests         35s/PR
  4. Sub-agent review   22s/PR

Incoming bad-PR rate: 25%

PRs each stage actually inspects
  Human review       100
  TypeScript check   77
  Unit tests         76
  Sub-agent review   76

Your order: 1283 min (team-minutes / 100 PRs)
Cheap-first optimum: 1047 min (type → tests → agent → human)
Bad PRs to prod: 0.4 (order-independent)

The set of stages catches the same number of bad PRs no matter how you order them, but the cost swings with the order. Run human review first and it reads all 100 PRs (1,200 min of the 1,283-minute total); run it last and it sees only the residual the cheap gates couldn’t kill, which brings the whole cascade down to 1,047 min. Cheap-first isn’t a nicety: it’s the difference between a verifiable workflow and a bottlenecked one. Per-stage times and weighted catch rates are the same illustrative model used in the cascade lab above, not survey-measured.

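Composition order is the whole point: the same set of stages costs more wall-clock time when the expensive human gate runs before the cheap automated ones (1,283 vs 1,047 team-minutes per 100 PRs in the snapshot above, roughly 23% more at a 25% bad-PR rate). In this model the gap grows as the incoming bad-PR rate rises, because the cheap gates remove more PRs before a human ever sees them. Cheap-first lets human review see only the residual the gates couldn’t kill.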
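The ordering arithmetic can be sketched in a few lines. The per-PR times are the lab's; the single weighted catch rate per gate is an assumption (an equal-weight average of each Fig 1 column), as is the simplification that good PRs pass every gate and caught bad PRs drop out immediately. This is not the lab's own code, and all names are hypothetical.

```typescript
interface Gate {
  label: string;
  secondsPerPr: number; // from the lab's stage list
  catchRate: number; // ASSUMED: equal-weight average of the gate's Fig 1 column
}

const human: Gate = { label: "Human review", secondsPerPr: 720, catchRate: 0.9025 };
const typeCheck: Gate = { label: "TypeScript check", secondsPerPr: 8, catchRate: 0.435 };
const unitTests: Gate = { label: "Unit tests", secondsPerPr: 35, catchRate: 0.345 };
const subAgent: Gate = { label: "Sub-agent review", secondsPerPr: 22, catchRate: 0.4875 };

function runCascade(order: Gate[], totalPrs: number, badRate: number) {
  const good = totalPrs * (1 - badRate); // good PRs survive every gate
  let bad = totalPrs * badRate;
  let seconds = 0;
  for (const gate of order) {
    seconds += (good + bad) * gate.secondsPerPr; // gate inspects everything still alive
    bad *= 1 - gate.catchRate; // caught bad PRs drop out of the queue
  }
  return { minutes: seconds / 60, badToProd: bad };
}

const humanFirst = runCascade([human, typeCheck, unitTests, subAgent], 100, 0.25);
const cheapFirst = runCascade([typeCheck, unitTests, subAgent, human], 100, 0.25);

console.log(Math.round(humanFirst.minutes), Math.round(cheapFirst.minutes));
console.log(humanFirst.badToProd.toFixed(2), cheapFirst.badToProd.toFixed(2));
```

With these assumed rates, human-first comes to about 1,283 team-minutes and cheap-first to about 1,053, close to the lab's 1,047 but not identical, and the number of bad PRs reaching prod is the same in either order because the miss rates simply multiply.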

Verify everything. Ship small. The verification cost compounds with diff size; a 50-line PR is verifiable, a 500-line PR is rubber-stamped. The most senior engineers in the survey reported, in addition to high distrust, a strong preference for small AI-assisted commits — which is the operational form of the same instinct.

CHECK
An engineer pastes a long file into the AI and asks it to refactor. The output looks reasonable but uses a method on a library that's clearly invented. Which verification stage is most likely to catch this fastest?

§ · FURTHER READING
References & deeper sources

  1. Stack Overflow (2025). 2025 Developer Survey — AI: Adoption, Trust, and Frustration · Stack Overflow Insights
  2. Stack Overflow (2025). Developers Remain Willing but Reluctant to Use AI — The 2025 Developer Survey Results Are Here · The Stack Overflow Blog
  3. METR (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity · METR (arXiv:2507.09089)
  4. Google Cloud / DORA (2024). Announcing the 2024 DORA Report (Accelerate State of DevOps) · Google Cloud Blog
  5. OpenAI (2025). Sycophancy in GPT-4o: What Happened and What We're Doing About It · OpenAI
  6. Sharma, Tong, Korbak, Duvenaud, et al. (2023). Towards Understanding Sycophancy in Language Models · Anthropic (arXiv:2310.13548, ICLR 2024)
  7. Fanous, Goldberg, et al. (2025). SycEval: Evaluating LLM Sycophancy · arXiv:2502.08177
  8. Zheng, Chiang, Sheng, et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · NeurIPS 2023 Datasets & Benchmarks (arXiv:2306.05685)

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.