daler.dev

Code Review is a bottleneck

I started this post because I was frustrated watching our review queue grow while AI made our coding faster. The research surprised me — it's worse than I thought.

The main goal of any business is to make money. And to make money, you need to be competitive. When you have both quality and delivery speed, you become competitive. A product without quality can't be saved by speed alone. And vice versa. AI promised to solve one of these — speed.

AI does increase engineering leverage. Teams can prototype faster, explore alternatives faster, and remove a lot of mechanical work. The problem is not generation itself. The problem is that verification did not accelerate at the same rate.

AI has increased the speed of writing code, but not delivery. These aren't the same thing. Between writing code and delivering a feature there is review. And this is where the trouble starts.

Two issues here:

  • Reviewers cannot scale the way code generation does.
  • AI code looks plausible and lulls reviewer attention.

This is where the crisis begins. The business sees that code is being written faster and pushes reviewers to do the same: faster, more PRs. Reviewers cut corners. Quality drops. Bugs appear. Customers leave. The speed everything was built on gets zeroed out by lost quality.

The Faros report [1] confirms this with telemetry. Median time in PR review jumped 441%. Incidents per PR rose 242.7%. And 31% of PRs now merge with no review at all. Teams ship more code, but they also ship more bugs.

The issue is deeper than you think

The problem accumulates in the background, unseen. From the inside, it looks like slow PRs and tired reviewers. From the outside, it becomes visible too late — as bugs in production, as incidents, as customers who quietly left.

Technical debt is a trap. Bad code accumulates and becomes technical debt. Refactoring doesn't make money, so the business doesn't allocate time for it. The debt grows. New features become harder to ship, which forces more corner-cutting, which adds more debt. A vicious cycle you can't escape by waiting.

The obvious methods don't help. "Review better" is a useless suggestion. Most reviewers already try. The problem isn't effort — attention has a physical ceiling.

Adding more reviewers doubles the bottleneck instead of fixing it. Two reviewers per PR means two people whose time is now the constraint, not one.

Tests are critical, but they only test what you thought to test. They don't catch the gaps in your imagination — and AI, trained on the same codebases, tends to have similar gaps.

AI reviewing AI has the same blind spots and the same surface plausibility. You're not adding a second perspective, you're doubling the first one.

Local fixes can mask the problem for a while, but they don't fix it. The bottleneck is structural, and the solution has to be structural too.

What can work

AI writes code significantly faster than developers, but reviewers read that code at the same speed as years ago. This is mathematically unsolvable head-on — you can't just "review better."

There are three ways to relieve the bottleneck:

  1. Push less work into the bottleneck.
  2. Prioritize what gets the most attention.
  3. Verify some of it after the bottleneck.

These aren't arbitrary categories. They map to three points in time relative to review: before, during, and after. That's why they're parallel and cover the main axis of solutions.

There is no silver bullet — only an integrated approach across all three.

The author's job is to prove, not produce

The work of a PR author is not to "write code", but to "prove that the code works". If a PR comes to review without proof, the reviewer does the work the author should have done. And AI aggravates this: the author generates 500 lines of code in 10 minutes, but the reviewer needs to spend one hour to review it.
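
The arithmetic behind that asymmetry can be sketched in a few lines. This is a toy model, not a measurement: the three constants come from the paragraph above, while the function name and the assumption that everyone works continuously at those rates are illustrative only.

```python
# Toy model of the generation/review asymmetry.
# The three constants are the figures quoted in the text; the rest is assumed.
GEN_LINES = 500        # lines an author generates in one burst
GEN_MINUTES = 10       # minutes that burst takes
REVIEW_MINUTES = 60    # minutes a reviewer needs for the same lines


def backlog_after(hours: float, authors: int, reviewers: int) -> float:
    """Unreviewed lines after `hours` if everyone works nonstop at these rates."""
    generated = authors * hours * 60 / GEN_MINUTES * GEN_LINES
    reviewed = reviewers * hours * 60 / REVIEW_MINUTES * GEN_LINES
    return max(0.0, generated - reviewed)


# One author outpaces one reviewer by this factor.
print(REVIEW_MINUTES / GEN_MINUTES)                        # 6.0
# After one hour, one author and one reviewer leave this many lines unreviewed.
print(backlog_after(hours=1, authors=1, reviewers=1))      # 2500.0
```

Nobody generates code nonstop, so the real gap is smaller, but under these numbers it takes six reviewers to keep up with one author.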

Small PRs must be treated as a hard constraint. The SmartBear and Cisco study found that reviewers are most effective on changes of no more than 200-400 lines of code, and that review effectiveness drops after 60-90 minutes [2].

Google's internal guidance is more aggressive: "100 lines is usually a reasonable size for a CL, and 1000 lines is usually too large… A 200-line change in one file might be okay, but spread across 50 files it would usually be too large." [3] The OCaml maintainers' rejection of a 13,000-line AI-generated PR in November 2025 is the canonical cautionary tale [4]: it was turned down not on quality grounds, but because no one had the bandwidth to review such a massive change.

Reviewers often complain about getting PRs that are too big, but rarely that they are too small. The asymmetry tells you which mistake is actually expensive.
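
A hard constraint is easiest to keep when a machine enforces it. The following is a minimal sketch of a size gate that could run in CI, assuming the changed files and their line counts are already available. The `FileChange` type, the function name, and both thresholds are hypothetical; the line limit is loosely based on the 200-400 range quoted above, and the file limit is picked only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileChange:
    """One changed file in a PR (hypothetical shape, e.g. parsed from a diff stat)."""
    path: str
    added: int
    deleted: int


# Assumed thresholds, not taken from any of the cited sources verbatim.
MAX_LINES = 400
MAX_FILES = 20


def size_problems(changes: list[FileChange],
                  max_lines: int = MAX_LINES,
                  max_files: int = MAX_FILES) -> list[str]:
    """Return human-readable reasons the PR is too large; empty list means OK."""
    total = sum(c.added + c.deleted for c in changes)
    problems = []
    if total > max_lines:
        problems.append(f"{total} changed lines exceeds the limit of {max_lines}")
    if len(changes) > max_files:
        problems.append(f"{len(changes)} files touched exceeds the limit of {max_files}")
    return problems


# A 200-line change spread across 50 files passes the line limit but not the file limit.
scattered = [FileChange(f"src/module_{i}.py", added=3, deleted=1) for i in range(50)]
print(size_problems(scattered))
```

Checking files as well as lines reflects the point in Google's guidance: the same number of lines is harder to review when it is scattered.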

Proof of work. Simon Willison [5] articulates this most cleanly:

We need to deliver code that works — and we need to include proof that it works as well. Not doing that directly shifts the burden of the actual work to whoever is expected to review our code.

His two non-optional steps:

  1. Manual testing. You must have observed the code do the right thing yourself. Willison is blunt about this: "If you haven't seen the code do the right thing yourself, that code doesn't work. If it does turn out to work, that's honestly just pure chance." For PRs, this means a sequence of terminal commands with their output, or a screen capture for visual changes — pasted into the PR itself.
  2. Automated testing. A test that fails if you revert the implementation. This isn't coverage for coverage's sake — it's a test that actually exercises the behaviour the change introduces.

The reason both are non-optional becomes sharper with AI-generated code. AI is very good at producing code that looks like it works. Manual testing is the cheapest defense against plausible-but-wrong output: you run it, you see it, you know. Automated testing is the defense against the next person — including future you — silently breaking the behaviour six months from now.
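
What "a test that fails if you revert the implementation" means can be shown with a small sketch. Everything here is hypothetical: the discount function, its before and after versions, and the check itself are invented for illustration and are not from Willison's post.

```python
def apply_discount_before(price: float, percent: float) -> float:
    """Hypothetical implementation before the PR: percent is not bounded."""
    return price * (1 - percent / 100)


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical implementation after the PR: percent is clamped to 0..100."""
    percent = min(max(percent, 0), 100)
    return price * (1 - percent / 100)


def check_discount_never_goes_negative(impl) -> bool:
    """The test the PR should ship: it targets the behaviour the change introduces."""
    return impl(50.0, 150) == 0.0


# Passes on the new code, fails on the reverted code. That is the property to aim for.
print(check_discount_never_goes_negative(apply_discount))          # True
print(check_discount_never_goes_negative(apply_discount_before))   # False
```

A test that passed against both versions would add coverage without proving anything about the change.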

Verification that actually verifies. Willison establishes that proof is required. Addy Osmani's "PR Contract" [6] addresses the next question: how to package that proof so the reviewer can actually see it. A PR is not a request for someone else to figure out whether your code works. It's a delivery of working code with evidence attached.

In practice this means every PR should answer four questions before a human opens it:

  1. What and why — one or two sentences.
  2. Proof it works — tests that pass, manual steps with logs or screenshots for anything visual.
  3. Risk and AI role — what tier of change this is, and which parts were AI-generated.
  4. Review focus — the one or two places where human judgment actually matters.

The reviewer's job then shifts from "hunt for problems" to "verify the evidence and apply judgment where the author flagged it as needed."
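
The four questions can be checked mechanically before a reviewer is assigned. This is a rough sketch, assuming the PR description uses one markdown-style heading per question. The heading names mirror the list above, but the format, the function names, and the idea of running this as a pre-review check are assumptions, not part of Osmani's post.

```python
# Section titles mirror the four questions; the heading format is an assumption.
REQUIRED_SECTIONS = ["What and why", "Proof it works", "Risk and AI role", "Review focus"]


def parse_sections(description: str) -> dict[str, str]:
    """Map each '#'-style heading (lowercased) to the text that follows it."""
    sections: dict[str, str] = {}
    current = None
    for line in description.splitlines():
        if line.startswith("#"):
            current = line.lstrip("#").strip().lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line.strip()
    return sections


def missing_sections(description: str) -> list[str]:
    """Return the required sections that are absent or left empty."""
    sections = parse_sections(description)
    return [title for title in REQUIRED_SECTIONS if not sections.get(title.lower())]


draft = """## What and why
Clamp discount percentages to 0..100.

## Proof it works

## Review focus
The clamping logic in apply_discount.
"""
print(missing_sections(draft))  # ['Proof it works', 'Risk and AI role']
```

A check like this only proves the sections exist. Whether the evidence in them is convincing is still the reviewer's call.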

Tools that produce proof automatically. Some of this evidence can be generated before a human ever opens the PR. Linters and static analysis tools (Semgrep, SonarQube, PHPStan) catch entire classes of bugs without a reviewer's attention. Mutation testing — PIT, Infection — is the one most worth adding in the AI era: it answers the question "do these tests actually verify behaviour, or just execute lines?" High coverage with weak assertions is a common AI failure mode, and mutation testing is how you catch it.

The principle is the same as Willison's and Osmani's, just automated: proof attached to the PR before it asks for human attention.
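
The difference between executing lines and verifying behaviour is easy to see with a hand-rolled example. The sketch below does not use PIT or Infection; it imitates what such tools do with a single mutant written by hand, and every name in it is hypothetical.

```python
def is_adult(age: int) -> bool:
    """Code under test."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: '>=' replaced by '>'. Real tools generate these automatically."""
    return age > 18


def weak_test(fn) -> bool:
    """Gives 100% line coverage of fn, but asserts nothing about the result."""
    fn(30)
    return True


def strong_test(fn) -> bool:
    """Checks the boundary, which is where the behaviour is actually defined."""
    return fn(18) is True and fn(17) is False


def mutant_killed(test, original, mutant) -> bool:
    """A mutant is killed when the test passes on the original and fails on the mutant."""
    return test(original) and not test(mutant)


print(mutant_killed(weak_test, is_adult, is_adult_mutant))    # False: the mutant survives
print(mutant_killed(strong_test, is_adult, is_adult_mutant))  # True: the mutant is killed
```

Both tests cover every line of `is_adult`. Only the second one would notice if the comparison changed.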

Not every PR deserves the same review

Not every line of code deserves the same review attention. A language fix and a payment migration currently get reviewed with the same process — and both come out badly. The language fix takes too long, the payment migration gets too little.

Three questions for every change. Birgitta Böckeler [7] proposes asking three questions of every change:

  1. How likely is AI to get this wrong? (probability)
  2. Would you ship it if you were on call tonight? (impact)
  3. Will failures be obvious if missed? (detectability)

Her two extreme tiers:

  • Low probability + low impact + high detectability is fine without review at all.
  • High probability + high impact + low detectability needs full attention.

In practice this maps to concrete examples. Highest tier: authentication, payments, secrets, untrusted input, database migrations. Lowest tier: internal tooling, prototypes, styling fixes.
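
One crude way to get a first-pass tier is to look at which paths a change touches. The sketch below assumes a repository layout where risky areas are recognizable by path; all patterns and tier names are hypothetical and would need tuning per codebase. It only approximates the impact question and says nothing about probability or detectability.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns, loosely following the examples in the text.
HIGH_RISK_PATTERNS = ["*auth*", "*payment*", "*secret*", "*migrations/*"]
LOW_RISK_PATTERNS = ["tools/*", "prototypes/*", "*.css"]


def tier_for(paths: list[str]) -> str:
    """Return 'high' if any path is high-risk, 'low' if all are low-risk, else 'standard'."""
    if any(fnmatchcase(p, pat) for p in paths for pat in HIGH_RISK_PATTERNS):
        return "high"
    if paths and all(any(fnmatchcase(p, pat) for pat in LOW_RISK_PATTERNS) for p in paths):
        return "low"
    return "standard"


print(tier_for(["src/payments/charge.py", "web/button.css"]))  # high
print(tier_for(["web/styles/button.css", "tools/gen.py"]))     # low
print(tier_for(["src/cart.py"]))                               # standard
```

One high-risk file is enough to raise the whole PR, which is the safe direction to be wrong in. Patterns this loose will also produce false positives (`*auth*` matches `author.py`), so the result is a starting point for the author's label, not a replacement for it.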

This was a useful practice before AI. With AI, it's a survival mechanism: when low-risk code is generated much faster, high-risk PRs drown in the noise unless they're explicitly flagged.

Disclosure makes tiering work. This is where Osmani's PR Contract earns its place. If the PR declares its risk tier in the description, the reviewer can prioritize their attention before opening the diff — full focus on a payment migration, quick scan on a styling fix.

The cost lands on seniors. Tiering has a cost, and senior engineers pay it. Faros [1] notes that AI-generated code "looks like code written by someone who knows what they are doing", but the structural problems are buried deeper. The engineers who can see those problems are the ones with the deepest system knowledge, and they spend their best hours unraveling plausible-looking code that should never have reached them.

Amazon institutionalized exactly this in March 2026: after a series of high-impact incidents, junior and mid-level engineers must now obtain senior sign-off on any AI-assisted change [8]. This is what happens when tiering isn't done deliberately: organizations end up with a heavy-handed rule that turns senior review into the next bottleneck.

Tiering only works if PR authors honestly classify the risk. High-impact code shipped under low-tier labels is exactly the failure mode Amazon's policy was created to solve. The trade-off is real: either trust authors to label risk and accept some misclassification, or route all AI code through seniors, have them review even the low-tier code, and turn them into the next bottleneck.

Verification after the merge

The physical limit. Review has a physical capacity. Human attention is finite and cannot scale. When AI generates code far faster than people can read it, some percentage of issues will slip past review no matter how careful the team is. That calls for a different philosophy: the question isn't "how do we catch everything before merge", because that's impossible. It's "how do we make the missed ones not a catastrophe." The answer is to move part of verification past the merge, into production. The precondition is that anything that slips past review has to fail cheaply.

Honeycomb's manifesto. Charity Majors, CTO at Honeycomb, makes the case directly [9]:

Engineers learn about production the way people learn about plumbing: only when there's a problem that needs fixing.

By the time problems become visible, the distance from cause to effect has grown too large to debug. The person who wrote the code has moved on. Fast feedback loops — observability — are the only thing that scales when AI generates code faster than humans can review it.

The stack that makes this work. Moving verification past merge requires the right tooling underneath. Instrument code with rich, high-cardinality structured logs so AI-generated changes are visible in production traces. Deploy continuously behind feature flags (LaunchDarkly, Statsig) so new code ships disabled by default. Canary to a small percentage of users before full rollout. Monitor SLO burn rates so regressions trigger alerts before customers notice. And critically — automate rollback. If burn rate spikes, the system reverts without waiting for a human to decide.
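
The burn-rate trigger is the piece of that stack that is easiest to get subtly wrong, so here is a minimal sketch of the calculation. It assumes a request-based SLO and compares a short and a long window before acting. The function names, the 99.9% target, and the threshold of 10 are assumptions chosen for illustration; the rollback call itself is left out because it depends on the deploy tooling.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent; about 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_roll_back(short_window: tuple[int, int],
                     long_window: tuple[int, int],
                     slo_target: float = 0.999,
                     threshold: float = 10.0) -> bool:
    """Roll back only if both windows burn fast, so one brief spike does not trigger it.

    Each window is an (errors, requests) pair. The threshold is an assumed value.
    """
    return (burn_rate(*short_window, slo_target) > threshold
            and burn_rate(*long_window, slo_target) > threshold)


# Sustained regression after a flag flip: both windows are far over budget.
print(should_roll_back(short_window=(50, 1000), long_window=(300, 10000)))  # True
# Short blip: the long window is still healthy, so nothing is reverted.
print(should_roll_back(short_window=(50, 1000), long_window=(5, 10000)))    # False
```

Requiring agreement between two windows trades a little detection speed for far fewer false rollbacks, which matters once the decision is automated.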

The faster you ship, the better your brakes need to be.

Compliance. There's one real limit to leaning on observability: compliance. Frameworks like SOC 2 are typically implemented with controls that require documented human review of every production change, which doesn't sit cleanly with "we caught it in canary." Honeycomb's Liz Fong-Jones surfaces the open question directly [10]: "Is it OK for a human to stamp a PR based on a separate instance of Claude Opus having done a review of the code?" The industry hasn't answered this. For compliance-bound teams, observability can take some of the weight, but not the whole load.

Three layers of defense. These three directions aren't alternatives. They're layers of the same defense. Push verification upstream, so less work reaches review. Tier the work so review attention goes where it matters. Move part of verification past the merge, so the missed ones don't become incidents. The teams that pick one and ignore the others are exactly the ones in the Faros telemetry: +441% time in review, +242.7% incidents per PR. That's what it looks like when only the middle layer is asked to do all the work.

The bottom line

Review broke not because of teams or people. It broke because of an asymmetry: AI generates code far faster than before, while review capacity scales only linearly with human attention. This is math, not culture. Blaming the team is pointless.

The data on this question diverges. DORA surveys say that mature practices protect against quality decay; Faros telemetry shows they don't. When the question is "does review actually work?", look at the telemetry. Surveys measure how developers feel about their work; telemetry measures what actually happens in production.

None of these three directions is new. These are practices that existed long before AI. I can't say which of them will turn out to matter most. But I do know one thing: AI makes ignoring them more expensive.

Written by me with AI assistance.

  1. Faros Research, "Ten takeaways from the AI Engineering Report 2026: The Acceleration Whiplash," April 12, 2026 — https://www.faros.ai/blog/ai-acceleration-whiplash-takeaways

  2. SmartBear & Cisco, "Best Kept Secrets of Peer Code Review" (Cisco case study) — https://static0.smartbear.co/support/media/resources/cc/book/code-review-cisco-case-study.pdf

  3. Google Engineering Practices, "Small CLs" — https://google.github.io/eng-practices/review/developer/small-cls.html

  4. Tim Anderson, "OCaml maintainers reject massive AI-generated pull request," DevClass, November 27, 2025 — https://www.devclass.com/ai-ml/2025/11/27/ocaml-maintainers-reject-massive-ai-generated-pull-request/1728083

  5. Simon Willison, "Your job is to deliver code you have proven to work," December 18, 2025 — https://simonwillison.net/2025/Dec/18/code-proven-to-work/

  6. Addy Osmani, "AI writes code faster. Your job is still to prove it works," January 7, 2026 — https://addyosmani.com/blog/code-review-ai/

  7. Birgitta Böckeler, "To vibe or not to vibe," martinfowler.com, September 23, 2025 — https://martinfowler.com/articles/exploring-gen-ai/to-vibe-or-not-vibe.html

  8. "Amazon now requires senior engineers to sign off on AI code," The Decoder — https://the-decoder.com/amazon-makes-senior-engineers-the-human-filter-for-ai-generated-code-after-a-series-of-outages/

  9. Charity Majors, "Honeycomb 10 Year Manifesto: Observability in a World of AI-Generated Code," February 11, 2026 — https://www.honeycomb.io/blog/honeycomb-10-year-manifesto-part-1

  10. Charles Humble, "Shipping faster, thinking less? The AI code verification trap," LeadDev, April 9, 2026 — https://leaddev.com/ai/shipping-faster-thinking-less-the-ai-code-verification-trap