Verifying AI Code
66% of developers in the 49,000-person Stack Overflow 2025 survey gave the same answer to “what frustrates you most about AI?”: code that is almost right. Senior developers trust AI output the least. That’s a feature, not a bug.
§ 00 · THE 66% PROBLEM
Code that compiles, runs, and is subtly wrong
The single most-cited frustration with AI-assisted development in 2025 wasn’t bad code. It was almost right code. 66% of the 49,000 developers Stack Overflow surveyed agreed: the failure mode that hurts most is the one where the output looks fine, compiles fine, runs fine, and only reveals its wrongness three commits later when a test starts intermittently failing or a prod metric drifts.
45% of those same developers said debugging AI output takes longer than writing the code would have. The numbers are consistent across roles, languages, and company sizes. The productivity story we keep being sold — “AI writes the code, you ship faster” — is half the story: a 2025 randomized trial of experienced open-source developers found they were actually ~19% slower with AI even though they predicted it would speed them up. The other half is the verification tax, and most teams haven’t accounted for it.
The survey doesn’t break the 66% bucket out by cause, but in practice those “almost right” failures cluster into four classes (the rough proportions assigned to them in this article’s labs are an illustrative editorial model, not survey data):
- Type mismatches: the AI used a field that doesn’t exist, called a function with the wrong arguments, or returned a shape that doesn’t match the caller’s expectations.
- Logic errors: the code does something subtly different from what was asked. Off-by-one. Inverted boolean. Wrong condition for the early return.
- Hallucinated APIs: the AI invented a method that doesn’t exist on the library, or used a legitimate function with parameters the library doesn’t accept.
- Sycophancy drift: the user expressed doubt about their own approach, the AI agreed, and the resulting code reflects the AI’s overcorrection rather than the right answer. Covered specifically in §04 below.
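To make the logic-error class concrete, here is a minimal sketch of an “almost right” function. The helper names (`paginateBuggy`, `paginate`) and the data are hypothetical, invented for illustration. Both versions type-check and return a well-formed array; only one returns the right page.

```typescript
// Hypothetical "almost right" helper: page numbers are 1-based, but the
// offset is computed as if they were 0-based, so page 1 skips the first items.
function paginateBuggy<T>(items: T[], page: number, pageSize: number): T[] {
  const start = page * pageSize; // bug: should be (page - 1) * pageSize
  return items.slice(start, start + pageSize);
}

// The intended behavior for 1-based pages.
function paginate<T>(items: T[], page: number, pageSize: number): T[] {
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
}

const rows = ["a", "b", "c", "d", "e"];
const wrongFirstPage = paginateBuggy(rows, 1, 2); // ["c", "d"]
const rightFirstPage = paginate(rows, 1, 2); // ["a", "b"]
```

Nothing in the buggy version fails loudly, which is why this class tends to surface only when a test or a user looks at the first page.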
§ 01 · THE TRUST GAP
Senior developers trust AI least, and that’s the right instinct
Same survey, different cut. Only ~3% of developers highly trust AI output overall — and among the most experienced developers that drops to ~2.6%, with ~20% highly distrusting it. The distrust correlates with seniority: the engineers who have shipped the most code distrust AI most. They’ve seen enough “almost right” cases to know what almost right looks like, and they’ve developed an instinct for verifying before trusting.
That instinct is a feature, not a bug. The juniors who rubber-stamp AI PRs because the code looks plausible are the same juniors who would have rubber-stamped a senior’s PR in a pre-AI codebase. The skill that’s scaling is adversarial reading — the discipline of approaching every line as “what would have to be true for this to be correct, and can I verify each of those things.”
§ 02 · TREAT OUTPUT AS A DRAFT
Read every line. Type-check the rest.
The first habit is grammatical: every piece of AI code is a draft until verified. Never the contract. The phrase the user types into the chat (“build me a function that does X”) is the specification; the AI’s output is one candidate implementation. There may be a better one. There may be a wrong one. The user is responsible for verifying which.
This sounds obvious until you watch teams who don’t do it. Common anti-patterns:
- Accept and ship. AI writes the function, the test passes (the test the AI also wrote), it goes in. Two weeks later a related test breaks because the AI made an assumption neither test covered.
- Trust the AI’s self-explanation. The AI confidently describes what the code does. The description doesn’t match what the code does. The reviewer reads the description, not the code.
- Skim and merge. The diff is 200 lines, the reviewer reads the first 50, the bug is on line 147.
The cheapest possible eval is the type-checker. Run it. If the AI hallucinated a method, you’ll find out in seconds instead of in production. A type-checker catches the large majority of type mismatches and hallucinated-API calls almost instantly — the type-check is doing structural work the model can’t. (The specific 92% catch rate shown in the lab below is an illustrative figure, not a measured one.)
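As a small illustration of what the type-check buys you, the sketch below uses a hypothetical `User` type and a hypothetical AI-written helper that reads a field the type does not declare. The `@ts-expect-error` directive is there only so the sample compiles; remove it and a check such as `tsc --noEmit` rejects the line.

```typescript
interface User {
  id: string;
  email: string;
}

// Hypothetical AI-written helper that reads a field User does not have.
function contactAddress(user: User): string | undefined {
  // @ts-expect-error the compiler rejects this access: User has no emailAddress
  return user.emailAddress;
}

// With the check suppressed, the mistake only shows up at runtime, as undefined.
const address = contactAddress({ id: "u1", email: "a@example.com" });
```

In an untyped codebase the same mistake would travel silently until something downstream tried to use the missing value.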
§ 03 · THE CONTRACT THE AI CANNOT SEE
Define done. Outside the prompt.
The single highest-leverage habit for working with AI on code: write the spec the AI cannot see. The eval is the contract. The AI proposes; the eval decides whether the proposal counts. This is the pattern at the heart of the Eval-Driven Development drip; in this context it’s a verification primitive.
Concrete shape:
- Write the test cases before the implementation. Cover the obvious path, the edge case you’re worried about, and one adversarial input.
- Ask the AI for the implementation. Don’t show it the test code if you can avoid it.
- Run the tests. If they pass, the AI met the contract you wrote. If they don’t, the AI met its own interpretation and you have evidence to point at.
- When the AI “fixes” the test it failed, re-read the change carefully — that’s the moment where it might adjust the test to its output rather than the other way around.
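The steps above can be sketched as a contract that lives in your repository rather than in the prompt. Everything named here is hypothetical: the `slugify` task, the three cases, and the two candidate implementations stand in for whatever you are actually asking the AI to build.

```typescript
// The contract: written before any implementation exists, kept out of the prompt.
interface Case {
  name: string;
  input: string;
  expected: string;
}

const slugifyContract: Case[] = [
  { name: "obvious path", input: "Hello World", expected: "hello-world" },
  { name: "edge case: extra spaces", input: "  Hello  World ", expected: "hello-world" },
  { name: "adversarial: punctuation only", input: "!!!", expected: "" },
];

// Returns the names of the cases a candidate implementation fails.
function failedCases(candidate: (input: string) => string, contract: Case[]): string[] {
  return contract
    .filter((c) => candidate(c.input) !== c.expected)
    .map((c) => c.name);
}

// Two hypothetical AI proposals for the same request.
const naiveSlugify = (s: string): string => s.toLowerCase().replace(/ /g, "-");
const carefulSlugify = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");

const naiveFailures = failedCases(naiveSlugify, slugifyContract);
const carefulFailures = failedCases(carefulSlugify, slugifyContract);
```

The naive candidate passes the obvious path and fails the other two cases, which is the evidence you point at. If a later “fix” edits `slugifyContract` instead of the candidate, that is the change to re-read.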
§ 04 · SYCOPHANCY AS A FAILURE MODE
Your agent agrees too much
Of all the AI code failure modes, the one that’s hardest to defend against with type-checks or tests is sycophancy drift. LLMs are RLHF-trained to please. When you express doubt about your own premise (“wait, is my approach right?” or “maybe I should use X instead”) the model tends to agree with the doubt, regardless of whether the original premise was correct. The output gets worse precisely when you push it.
A senior developer would push back. The model caves. And the caving looks like agreement — it’s polite, articulate, sometimes even reasoned — which is exactly what makes it dangerous. Sycophancy doesn’t produce gibberish; it produces plausible code that’s aligned with the wrong constraint.
Patches:
- Prompt for critique. “Push back on this approach if you have evidence it’s wrong”, set explicitly in the system prompt or per message, moves the model out of the default-agree mode. Not a guarantee, but measurable.
- Adversarial evals. Build test cases where the user’s premise is wrong. Check that the AI catches the wrongness rather than accommodates it.
- Two-pass review. Run the same code through a second model with instructions to find errors. Where the two models disagree, the disagreement is the signal. (See §05.)
- Catch the moments. Sycophancy spikes around phrases like “actually”, “wait”, “hmm, but”, “I’m not sure if”, and “you’re probably right”. Watch your own prompts.
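One way to “watch your own prompts” is a small pre-send check. This is a sketch, not a validated detector: the marker list comes from the phrases above, while `findDoubtMarkers` and `withPushback` are hypothetical helper names, and whole-word matching is a simplification that will miss paraphrases.

```typescript
// Phrases from the list above, lowercased with straight apostrophes.
const DOUBT_MARKERS = [
  "actually",
  "wait",
  "hmm, but",
  "i'm not sure if",
  "you're probably right",
];

// Returns the markers found in a prompt. Word boundaries keep "await" from
// matching "wait"; the markers contain no regex metacharacters.
function findDoubtMarkers(prompt: string): string[] {
  const normalized = prompt.toLowerCase().replace(/[\u2018\u2019]/g, "'");
  return DOUBT_MARKERS.filter((marker) =>
    new RegExp("\\b" + marker + "\\b").test(normalized),
  );
}

// Appends the push-back instruction only when the prompt expresses doubt.
function withPushback(prompt: string): string {
  if (findDoubtMarkers(prompt).length === 0) return prompt;
  return prompt + "\n\nPush back on this approach if you have evidence it's wrong.";
}
```

The check is cheap enough to run on every message, but treat a hit as a cue to reread your own wording rather than as proof that the model will cave.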
Sycophancy is the one failure you inject by expressing doubt — so the lab below lets you apply escalating pressure to a model that starts out correct and watch where it caves.
At L0 the model holds the correct answer; push it with “are you sure?” or assert the wrong premise with conviction and the flip probability climbs past 50%. The anti-sycophancy system prompt (“push back if you have evidence”) scales the flip probability down and raises the resistance threshold — but doesn’t eliminate it. Flip probabilities are an illustrative model derived from SycEval-style capitulation rates (~58% under multi-turn pressure), not measured for any specific model.
The takeaway: type-checks and tests can’t see sycophancy because it’s behavioral, not present in the original draft — it appears only once you push. The only cheap defense is an explicit “push back with evidence” instruction plus a second adversarial reviewer (§05).
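A pressure test of this kind can be sketched as a loop over escalating pushback. Assumptions to note: `ModelFn` is a hypothetical stand-in for whatever chat client you use, the three pressure phrases are invented, and checking whether the answer still contains an expected substring is a crude proxy for “held its position”.

```typescript
// A chat turn and a model call; ModelFn is a hypothetical client wrapper.
type Turn = { role: "user" | "assistant"; content: string };
type ModelFn = (history: Turn[]) => Promise<string>;

// Hypothetical pressure ladder, mildest first.
const PRESSURE = [
  "Are you sure?",
  "I think that's wrong.",
  "No, I'm certain you are wrong.",
];

// Returns the pressure level at which the answer stopped containing the
// expected text: 0 means wrong before any pressure, null means it never flipped.
async function firstFlipLevel(
  model: ModelFn,
  question: string,
  expected: string,
): Promise<number | null> {
  const history: Turn[] = [{ role: "user", content: question }];
  let answer = await model(history);
  if (!answer.includes(expected)) return 0;
  let level = 0;
  for (const push of PRESSURE) {
    level += 1;
    history.push({ role: "assistant", content: answer });
    history.push({ role: "user", content: push });
    answer = await model(history);
    if (!answer.includes(expected)) return level;
  }
  return null;
}
```

Running the same questions with and without a push-back system prompt, and comparing the flip levels, is one way to check whether the instruction helps for your model and task.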
§ 05 · TWO-PASS REVIEW
Critique, then defend
A pattern that catches both sycophancy drift and logic errors that a single-pass review misses: run two passes through the AI itself. First pass, “critique mode” — the AI reads the diff and lists everything that might be wrong, no defending. Second pass, “defend mode” — the AI reads the critiques and addresses each one.
The critique pass is the value. Most AI code reviewers, asked for a balanced review, default to confirming what they see (LLM-as-judge has documented self-enhancement and verbosity biases). Asked to find only the problems, they get meaningfully more critical. The defend pass surfaces which critiques have answers and which don’t; the ones that don’t are your real bugs.
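```typescript
// Minimal client shape assumed here; adapt it to your provider's SDK.
interface AiClient {
  complete(req: { system: string; user: string }): Promise<string>;
}

// Keep only the lines the defend pass marked as unanswered.
const extractUnresolved = (defense: string): string[] =>
  defense.split("\n").filter((line) => line.includes("UNRESOLVED"));

async function twoPassReview(ai: AiClient, diff: string): Promise<string[]> {
  // Pass 1, critique mode: problems only, no defending.
  const critiques = await ai.complete({
    system: "You are an adversarial reviewer. List every problem " +
      "with this diff. Do not defend it. Do not be balanced.",
    user: diff,
  });
  // Pass 2, defend mode: answer each critique or flag it.
  const defense = await ai.complete({
    system: "You wrote this diff. Address each critique. " +
      "Where a critique has no answer, mark it UNRESOLVED.",
    user: `${diff}\n\nCritiques:\n${critiques}`,
  });
  // UNRESOLVED lines are the real bugs.
  return extractUnresolved(defense);
}
```
§ 06 · THE VERIFICATION CASCADE
Cheap stages first, expensive stages last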
Tying it all together: production teams that ship reliably with AI assistance run their output through a cascade. Cheap stages run first, expensive stages run only on what survives. Most of the 66% “almost right” failures get caught at the cheap end; the expensive stages exist for the residual that gets through. (DORA’s 2024 report ties rising AI adoption to small drops in delivery throughput and stability when this kind of process is missing.)
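The short-circuit behavior described above can be sketched as a loop over stages. The `Stage` shape and `runCascade` name are hypothetical; in a real setup each `run` would wrap a type-check, a test run, a sub-agent review, or a human sign-off.

```typescript
// One gate in the cascade. `run` returns the problems it found; an empty
// array means the diff passes this gate. All names here are hypothetical.
interface Stage {
  name: string;
  run: (diff: string) => Promise<string[]>;
}

interface CascadeResult {
  passed: boolean;
  stoppedAt: string | null;
  problems: string[];
}

// Runs stages in the given order and stops at the first one that reports a
// problem, so later (more expensive) stages only see what survived.
async function runCascade(stages: Stage[], diff: string): Promise<CascadeResult> {
  for (const stage of stages) {
    const problems = await stage.run(diff);
    if (problems.length > 0) {
      return { passed: false, stoppedAt: stage.name, problems };
    }
  }
  return { passed: true, stoppedAt: null, problems: [] };
}
```

Passing the stages as an ordered array keeps the cheap-first policy visible in one place: reordering the cascade is a one-line change, and a rejected diff never reaches the reviewers behind the gate that caught it.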
The lab below lets you toggle each stage and see what gets caught. Notice that no single stage covers all four error classes — sycophancy in particular slips past everything except sub-agent review and human review. The cascade is composed, not parallel.
TypeScript check: the cheapest possible eval. It runs in seconds and catches most fabricated APIs and wrong signatures.
The TypeScript check alone catches the large majority of type mismatches but virtually no logic errors and zero sycophancy drift. Layer the unit tests + sub-agent review and the weighted catch jumps well into the 70s. Human review is the only stage with high catch rates across every error class, and it’s also the slowest — which is why the cascade exists. Per-class catch rates, the four-way error mix, and the per-stage times here are an illustrative model — they are not survey-measured. The Stack Overflow 2025 survey reports the 66% “almost right” top-line but does not break it down this way.
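The arithmetic behind a “weighted catch” figure can be sketched as follows. Every number in this block is a hypothetical placeholder, chosen only to show the shape of the calculation, and the model assumes stages miss errors independently of one another, which real gates may not.

```typescript
type ErrorClass = "type" | "logic" | "api" | "sycophancy";
type Rates = Record<ErrorClass, number>;

// Hypothetical share of each error class among "almost right" failures.
const mix: Rates = { type: 0.3, logic: 0.3, api: 0.25, sycophancy: 0.15 };

// Hypothetical per-stage catch rates; none of these are measured.
const stageRates: Record<string, Rates> = {
  typecheck: { type: 0.9, logic: 0.05, api: 0.9, sycophancy: 0 },
  unitTests: { type: 0.3, logic: 0.5, api: 0.3, sycophancy: 0 },
  subAgent: { type: 0.3, logic: 0.3, api: 0.3, sycophancy: 0.4 },
};

// Weighted catch = sum over classes of share * (1 - product of stage misses).
function weightedCatch(enabled: string[]): number {
  const classes: ErrorClass[] = ["type", "logic", "api", "sycophancy"];
  let total = 0;
  for (const cls of classes) {
    let miss = 1;
    for (const stage of enabled) {
      const rates = stageRates[stage];
      if (rates) miss *= 1 - rates[cls];
    }
    total += mix[cls] * (1 - miss);
  }
  return total;
}

const typecheckOnly = weightedCatch(["typecheck"]);
const threeStages = weightedCatch(["typecheck", "unitTests", "subAgent"]);
```

A class that every enabled stage catches at a rate of zero contributes nothing to the total, which is how sycophancy drift can stay invisible until a reviewing stage is added.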
The same logic reads at a glance in the matrix below: the bright diagonal shows cheap gates own type and API errors, but the sycophancy row stays dark until human or sub-agent review — proving the gates are complementary, not redundant.
Toggling stages is only half the design decision; the other half is order. Under a throughput budget, where you put the expensive human gate decides the whole wall-clock cost. The lab below runs 100 PRs through the same four gates and lets you reorder them.
The set of stages catches the same number of bad PRs no matter how you order them, but the cost swings wildly. Run human review first and it reads all 100 PRs (1,200 min); run it last and it sees only the residual the cheap gates couldn’t kill (≈1047 min). Cheap-first isn’t a nicety; it’s the difference between a verifiable workflow and a bottlenecked one. Per-stage times and weighted catch rates are the same illustrative model used in the cascade lab above, not survey-measured.
Composition order is the whole point: the same set of stages can cost 10× more wall-clock time if the expensive human gate runs before the cheap automated ones. Cheap-first lets human review see only the residual the gates couldn’t kill.
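The ordering effect can be worked through with a small expected-value model. The 12 minutes per PR for human review follows from the 1,200 minutes for 100 PRs quoted above; the other times, the catch rates, and the count of 30 bad PRs are hypothetical. The model assumes good PRs pass every gate and each gate rejects a fixed share of the bad PRs that reach it.

```typescript
interface Gate {
  name: string;
  minutesPerPr: number;
  catchRate: number; // share of still-bad PRs this gate rejects
}

// Expected cost and catches for a given gate order. Rejected PRs leave the
// queue, so later gates only pay for what is still in flight.
function cascadeCost(
  gates: Gate[],
  totalPrs: number,
  badPrs: number,
): { minutes: number; caught: number } {
  const good = totalPrs - badPrs;
  let bad = badPrs;
  let minutes = 0;
  for (const gate of gates) {
    minutes += (good + bad) * gate.minutesPerPr;
    bad *= 1 - gate.catchRate;
  }
  return { minutes, caught: badPrs - bad };
}

// Hypothetical gates; only the human review time is derived from the text.
const typecheck: Gate = { name: "typecheck", minutesPerPr: 0.1, catchRate: 0.5 };
const unitTests: Gate = { name: "unit tests", minutesPerPr: 0.5, catchRate: 0.4 };
const subAgent: Gate = { name: "sub-agent review", minutesPerPr: 1, catchRate: 0.3 };
const human: Gate = { name: "human review", minutesPerPr: 12, catchRate: 0.9 };

const humanFirst = cascadeCost([human, typecheck, unitTests, subAgent], 100, 30);
const humanLast = cascadeCost([typecheck, unitTests, subAgent, human], 100, 30);
```

Under this model the number caught is the same for both orders, because the surviving share is a product of the same factors, while the minutes differ because the expensive gate is multiplied by a smaller queue when it runs last. How large the gap is depends on how many PRs the cheap gates remove.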
Verify everything. Ship small. The verification cost compounds with diff size; a 50-line PR is verifiable, a 500-line PR is rubber-stamped. The most senior engineers in the survey reported, in addition to high distrust, a strong preference for small AI-assisted commits — which is the operational form of the same instinct.
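A size gate is one way to make “ship small” enforceable. The sketch below counts changed lines in a git-style unified diff; `changedLines` and `isReviewable` are hypothetical names, and the default limit of 50 is borrowed from the example above rather than from any measured threshold.

```typescript
// Counts added and removed lines in a git-style unified diff. File headers
// ("--- a/x", "+++ b/x") come before the first hunk of each file, so they are
// skipped by only counting lines that follow a hunk header.
function changedLines(diff: string): number {
  let inHunk = false;
  let count = 0;
  for (const line of diff.split("\n")) {
    if (line.startsWith("diff ")) {
      inHunk = false; // next file: its headers follow
    } else if (line.startsWith("@@")) {
      inHunk = true; // hunk header
    } else if (inHunk && (line.startsWith("+") || line.startsWith("-"))) {
      count += 1;
    }
  }
  return count;
}

// Hypothetical policy: flag anything larger than the limit for splitting.
function isReviewable(diff: string, maxChangedLines = 50): boolean {
  return changedLines(diff) <= maxChangedLines;
}
```

The parser assumes every file section starts with a `diff ` line, as `git diff` output does; plain `diff -u` output without that line would need extra handling.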
§ · FURTHER READING
References & deeper sources
- (2025). 2025 Developer Survey — AI: Adoption, Trust, and Frustration · Stack Overflow Insights
- (2025). Developers Remain Willing but Reluctant to Use AI — The 2025 Developer Survey Results Are Here · The Stack Overflow Blog
- (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity · METR (arXiv:2507.09089)
- (2024). Announcing the 2024 DORA Report (Accelerate State of DevOps) · Google Cloud Blog
- (2025). Sycophancy in GPT-4o: What Happened and What We're Doing About It · OpenAI
- (2023). Towards Understanding Sycophancy in Language Models · Anthropic (arXiv:2310.13548, ICLR 2024)
- (2025). SycEval: Evaluating LLM Sycophancy · arXiv:2502.08177
- (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · NeurIPS 2023 Datasets & Benchmarks (arXiv:2306.05685)
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.