
Don't trust, verify

Human review is excellent at finding things that don't matter and not so strong at finding things that do

Two kinds of design-review feedback. The first: "your power rails should all point up and your grounds all point down." Schematic tidiness. Convention. Nobody's circuit has ever stopped working because the supply symbols were drawn the "wrong" way. Pull requests attract the same first instinct: the variable name doesn't match the style guide, the arguments are in the "wrong" order, there's a comma where the reviewer would put a semicolon. The second kind of feedback, from someone who'd actually run the analysis: "that opamp is going to resonate." Phase margin at the unity-gain crossover was under thirty degrees -- not visible anywhere in the schematic, invisible to any amount of reading, only reachable by sitting down with the transfer function and doing the frequency-domain maths. The first kind of feedback is what design review reliably produces. The second is what it reliably misses.

That asymmetry is the whole story. Review -- someone reading your work and telling you what they see -- is structurally suited to the surface and structurally blind to the consequential. And most engineering teams treat review as the primary quality gate, sometimes the only one. This is a mistake that follows directly from review's single great advantage: it is nearly free to set up. And "nearly free to set up" is a more powerful attractor than most teams admit.

The cheap half

Review needs almost nothing to start. A pair of eyes, a diff, an hour. No infrastructure, no tooling, no test rigs, no procurement lead time. You can run a review the day the team is formed, in a coffee shop, before you've written a single line of build configuration. That accessibility is genuine and worth having. It is also exactly the problem.

Verification costs. In hardware, the costs are physical and hard to ignore. You need test hardware to actually exist -- which means parts, lead times, and someone having to order them, usually someone whose job it isn't. You need hardware-in-the-loop rigs that stay dedicated, whose components don't disappear overnight because someone needed a power supply for an unrelated thing that morning. You need oscilloscopes, logic analysers, power supplies, maybe a climate chamber to verify over temperature. You need people who maintain the rigs and understand what they're actually checking.

In software the costs look different but they're equally real: automated test harnesses, CI infrastructure, integration environments that mirror production closely enough to be trustworthy. For embedded and distributed systems, a great deal of that verification can run under emulation or simulation -- Renode handles embedded targets and multi-node setups, QEMU handles full machine emulation. Because these environments are software-defined, anyone can spin one up locally and CI can run thirty at once. Emulation is less realistic than real hardware; the advantage is availability and low friction. The biggest payoff: it works before the board exists, so software verification proceeds concurrently with the hardware design.

At the deep end, where only real hardware proves the point, the CI pipeline reaches into a rack of test and measurement hardware, each instrument connected to the device under test for a different purpose: software-scriptable power relays to cycle and control power, logic analysers and picoscopes to capture signals, serial diagnostic ports and JTAG to control and inspect the device state directly. Together they give the verification process the measurement and control over real hardware that emulation cannot provide -- this is the measurement axis made physical, and you cannot verify what you cannot measure. These runs take an hour; they need babysitting when something environmental flakes.

Review is cheap by comparison. So teams over-invest in it -- not because they reasoned their way to review being the right tool, but because paths of least resistance are where organisations arrive without actively steering. The word for it is lazy. Precise sense, not pejorative: lazy as in the natural state of a system seeking its lowest-energy configuration. There is always institutional pressure in the direction of review over verification because review imposes costs that don't show up anywhere -- no budget line, no capital request, no procurement approval -- while verification's costs are visible and must be justified.

The lopsidedness is self-defeating. Review is cheap precisely because it is a weak proxy for reality. Verification is expensive precisely because it reproduces reality. The gap in cost tracks the gap in fidelity, and fidelity is the whole point.

There is a second layer of cost in verification that shows up after you have acquired the tooling: the ergonomics of actually running it. Verification requires that a developer can spin up the relevant components and wire them together -- or deploy to hardware and interface with it -- without ceremony. Every extra step between a developer and a test is friction, and friction is a tax paid by not verifying. When those extra steps are also unwritten, known only to one or two people, the tax becomes a wall: only those people can pay it, which means only those people verify, which means quality assurance becomes a bottleneck rather than a habit -- or, more often, a pipe dream that rarely gets done at all.

Every unwritten step between a developer and a test is a reason the test won't get run.

A backend team I worked on had excellent local tooling: a developer could spin up multiple services with pre-populated test data and run scripted scenarios entirely on their own laptop, start to finish, without needing access to shared infrastructure. A data team I worked with eventually built one-click ephemeral test environments for their data-pipeline and orchestration stack -- the whole thing, provisioned and torn down on demand. In both cases, verifying became the default because it was nearly free to do. The contrasting picture -- which appears at many companies -- is "deploy for testing" as a fragile, entirely manual process that lives in one or two people's heads. It rarely happens. Quality leaks accordingly.

What eyeballs catch (and don't)

Review optimises for what the reviewer can personally assess. And what a reviewer can personally assess, without additional tooling or domain analysis, is the surface: naming, formatting, structure, style, whether the code matches the reviewer's mental model of what the code should look like. These things have value. They are rarely the things that cause production incidents.

A reviewer I worked with once blocked a merge for three days to insist on a rename -- "semantically identical," they called the new name. The rename was in fact semantically different, introduced downstream confusion, and had to be reverted six weeks later when the confusion materialised into a real bug. The same reviewer approved the accompanying algorithmic change without a comment. The algorithm had a fault -- not in the naming, in the logic. The reviewer saw the naming because naming is on the surface. They didn't see the logic error because logic errors require simulation or proof to find, not reading. Review can confirm that the code is plausible. It cannot confirm that the code is correct.

A related pattern, more insidious: three different names for the same physical constant living in the same hundred lines of code, all undocumented, one of them wrong by a factor of a thousand. A reviewer flagged one variable -- non-standard abbreviation, fair point. The factor-of-a-thousand error sat in a different variable with an unremarkable name, and nobody caught it because everyone assumed someone else had verified the actual number. Nobody had, because verification was not anyone's explicit job; review was.
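A check of this kind is cheap to encode once someone owns it. The sketch below is hypothetical: the constant (the speed of light), the three alias names, and the tolerance are all invented for illustration and are not taken from the code in question. It compares every alias against one reference value and reports any that disagree.

```python
import math

# Reference value with explicit units. Exact by SI definition.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

# Hypothetical aliases, as they might appear scattered through one module.
ALIASES = {
    "C_LIGHT": 299_792_458.0,
    "LIGHTSPEED": 2.998e8,   # rounded, but the right order of magnitude
    "c0": 299_792.458,       # km/s pasted into code that expects m/s
}


def inconsistent_aliases(aliases, reference, rel_tol=1e-3):
    """Return the names whose value differs from the reference by more than rel_tol."""
    return sorted(
        name
        for name, value in aliases.items()
        if not math.isclose(value, reference, rel_tol=rel_tol)
    )


if __name__ == "__main__":
    print(inconsistent_aliases(ALIASES, SPEED_OF_LIGHT_M_PER_S))
```

Run in CI, a check like this does not care which of the names looks unusual to a reader; it only cares which number is wrong.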

There is also a version of review that collapses entirely into taste -- and the tell is when taste is presented as correctness. "This is wrong" where "wrong" means "not how I would have written it." A well-known retail maxim, in a popular extended form, runs: the customer is always right, in matters of taste. The code-review mutation is: the reviewer is always right, in matters of taste. That's structural, not just social: a taste nit must be resolved before approval, and approval is the merge gate, so the reviewer always wins on style regardless of how low the actual risk. The same dynamic appears when a reviewer rejects a change for being "too long" -- a judgement about what the reviewer can comfortably hold in their own head, not an assessment of whether the change is correct. The codebase has conventions, the team has style, and a reviewer is the right customer of those preferences. But correctness is not a matter of preference and cannot be settled by approval. Whether the opamp oscillates is not a question anyone's opinion resolves. Approval means the diff was read; it does not mean the behaviour was checked. Only verification settles that.
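The opamp question is a useful reminder of what "checked" means in practice. The sketch below computes phase margin for a simplified loop: unity feedback, an all-pole open-loop response with real poles, and invented numbers (DC gain of 1e5, a dominant pole at 10 Hz, a second pole at 200 kHz). None of these values come from the circuit described earlier, and the 45-degree threshold is an assumed design rule.

```python
import math


def loop_gain_magnitude(f_hz, dc_gain, poles_hz):
    """|L(j*2*pi*f)| for L(s) = dc_gain / prod(1 + s / (2*pi*p))."""
    mag = dc_gain
    for p in poles_hz:
        mag /= math.hypot(1.0, f_hz / p)
    return mag


def loop_phase_deg(f_hz, poles_hz):
    """Phase of the same loop gain: each real pole contributes -atan(f / p)."""
    return -sum(math.degrees(math.atan(f_hz / p)) for p in poles_hz)


def phase_margin_deg(dc_gain, poles_hz, f_lo=1.0, f_hi=1e9):
    """Bisect in log-frequency for |L| = 1, then return 180 degrees plus the phase there."""
    above = loop_gain_magnitude(f_lo, dc_gain, poles_hz) > 1.0
    below = loop_gain_magnitude(f_hi, dc_gain, poles_hz) < 1.0
    if not (above and below):
        raise ValueError("unity-gain crossover is not bracketed by f_lo and f_hi")
    for _ in range(200):
        f_mid = math.sqrt(f_lo * f_hi)
        # The magnitude of an all-pole response falls monotonically with frequency.
        if loop_gain_magnitude(f_mid, dc_gain, poles_hz) > 1.0:
            f_lo = f_mid
        else:
            f_hi = f_mid
    return 180.0 + loop_phase_deg(math.sqrt(f_lo * f_hi), poles_hz)


MIN_PHASE_MARGIN_DEG = 45.0  # assumed design rule


def stability_check(dc_gain, poles_hz):
    """Return (passes, phase_margin_deg) against the assumed minimum margin."""
    pm = phase_margin_deg(dc_gain, poles_hz)
    return pm >= MIN_PHASE_MARGIN_DEG, pm


if __name__ == "__main__":
    print(stability_check(1e5, [10.0, 200e3]))  # second pole too close to crossover
    print(stability_check(1e5, [10.0, 5e6]))    # second pole well beyond crossover
```

With the first set of values the margin comes out under thirty degrees and the check fails; moving the second pole out to 5 MHz passes comfortably. Nothing about the schematic symbol changes between the two cases.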

The faults that don't show up in a diff

The bugs that matter live in behaviour, not text -- and you can't read behaviour.

Some bugs are simply invisible to reading. Reading is the wrong instrument for what these bugs are made of -- and no amount of reviewer skill changes that.

On a team I worked on, a hand-rolled event loop had a concurrency fault that appeared only under a specific scheduling interleaving: one thread advancing a state machine past a checkpoint while another thread was still acting on the pre-checkpoint state. The code looked completely reasonable. The logic was locally correct in every function. The problem was in the timing, which lived in the dynamic interaction between the code and the scheduler -- not in the code itself. A reviewer could read those files a hundred times and find nothing. A stress harness running for five minutes against a real scheduler would have found it on the first run.
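A stress harness finds that window by brute force. Once the window is known, a test can also pin it open so the fault reproduces every time. The sketch below is a minimal illustration of that second idea; the `Machine` class, its `pause` hook, and the state names are hypothetical stand-ins, not the event loop from the anecdote.

```python
import threading


class Machine:
    """Hypothetical stand-in for the event loop's state machine."""

    def __init__(self):
        self.state = "pre-checkpoint"
        self.lock = threading.Lock()
        self.stale_actions = 0

    def advance(self):
        with self.lock:
            self.state = "post-checkpoint"

    def act(self, pause=lambda: None):
        observed = self.state  # read the state ...
        pause()                # ... window in which another thread may advance it
        with self.lock:
            if observed != self.state:
                # Real code would act on the stale view here; the sketch counts it.
                self.stale_actions += 1


def run_forced_interleaving(timeout_s=5.0):
    """Force advance() to run between act()'s read and its use of that read."""
    machine = Machine()
    state_read = threading.Event()
    advanced = threading.Event()

    def pause():
        state_read.set()          # tell the test the old state has been read
        advanced.wait(timeout_s)  # hold the window open until advance() has run

    worker = threading.Thread(target=machine.act, args=(pause,))
    worker.start()
    state_read.wait(timeout_s)
    machine.advance()             # cross the checkpoint inside the window
    advanced.set()
    worker.join(timeout_s)
    return machine.stale_actions


if __name__ == "__main__":
    print("stale actions:", run_forced_interleaving())
```

Each method is locally reasonable, as in the original fault. Only the ordering between threads produces the stale action, and only running the code shows it.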

There is a related category: documented operating-system behaviour that the author treated as unexpected. A signal arriving during a blocking system call can make the call fail with EINTR; POSIX specifies this. Every mature systems library handles it; it is nonetheless completely surprising to an engineer who hand-rolled an event loop without close attention to the relevant man pages. The fix is a retry loop. The debugging, without knowing to look for EINTR, runs to weeks. No reviewer caught it because nothing in the code signals the omission -- the bug is in the contract between the code and the operating system, and you only see the violation if you know the contract.
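For reference, the shape of that retry loop is small. The sketch below shows the pattern in Python with a hypothetical `FlakyRead` fake standing in for an interrupted system call; the helper name and the attempt limit are assumptions. Note that CPython's own I/O wrappers have retried on EINTR automatically since Python 3.5 (PEP 475), so this is mainly an illustration of what hand-written C code around `read()` or `accept()` has to do for itself.

```python
import errno


def retry_on_eintr(call, *args, max_attempts=100):
    """Call `call(*args)`, retrying while it is interrupted by a signal (EINTR)."""
    for _ in range(max_attempts):
        try:
            return call(*args)
        except InterruptedError:
            # EINTR: the call did not complete; it is safe to issue it again.
            continue
    raise OSError(errno.EINTR, "call still interrupted after retries")


class FlakyRead:
    """Hypothetical fake: fails with EINTR a fixed number of times, then succeeds."""

    def __init__(self, interruptions, payload):
        self.interruptions = interruptions
        self.payload = payload
        self.calls = 0

    def __call__(self, n):
        self.calls += 1
        if self.interruptions > 0:
            self.interruptions -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return self.payload[:n]


if __name__ == "__main__":
    read = FlakyRead(interruptions=2, payload=b"hello")
    print(retry_on_eintr(read, 5), "after", read.calls, "calls")
```

The fake matters as much as the helper: a test that injects the interruption is what turns knowledge of the contract into a check that runs.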

The culture that produces these bugs is one that trusts cleverness over field evidence. A team with that culture dismisses the industry-standard event loop or socket library as "rubbish" -- but the industry-standard implementation has been verified at scale by years of production deployments, and the home-grown replacement has been tested by three engineers for a week. That difference is enormous, and review cannot see it. Review reads implementations; it cannot run them. A single socket code path trying to be both blocking and non-blocking, a hand-rolled scheduler that reinvents the problems the OS already solves -- these look like ingenuity in a diff. In production, under load, they look like incidents.

Hardware produces the same class of fault. More than once I have seen a magnetometer assembled ninety degrees from its intended orientation -- schematic correct, layout clean, fault invisible to both. The defect only surfaces when a real board is in hand. Without a verification harness to localise it, software and hardware teams spent three days trading blame while the schedule slipped. Review produces an opinion about blame; verification produces the answer.

The complementary case: a magnetometer placed next to PWM-switched high-current traces, whose switching fields corrupt the reading. A reviewer could have caught it in the layout -- but only by stopping to calculate, not by admiring the routing. Once a check requires working through numbers, the reviewer has crossed from reading into verifying.
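The calculation in question can be a few lines. A rough sketch, assuming the trace behaves like a long straight wire and ignoring the return current (which in a real layout partially cancels the field); the 5 A and 5 mm figures, the 50 µT reference for Earth's field, and the 1 % budget are all invented for illustration.

```python
import math

MU_0 = 4e-7 * math.pi          # vacuum permeability in T*m/A (to within rounding)
EARTH_FIELD_TYPICAL_T = 50e-6  # order of magnitude of Earth's field at the surface


def field_near_trace_t(current_a, distance_m):
    """Field magnitude around a long straight conductor: B = mu0 * I / (2 * pi * r)."""
    return MU_0 * current_a / (2.0 * math.pi * distance_m)


def swamps_magnetometer(current_a, distance_m, budget_fraction=0.01):
    """True if the trace field exceeds the allowed fraction of Earth's field."""
    budget_t = budget_fraction * EARTH_FIELD_TYPICAL_T
    return field_near_trace_t(current_a, distance_m) > budget_t


if __name__ == "__main__":
    b = field_near_trace_t(current_a=5.0, distance_m=0.005)
    print(f"{b * 1e6:.0f} uT at 5 mm from a 5 A trace")
```

With those numbers the trace contributes about 200 µT, several times the field the sensor is trying to measure. The routing can look tidy and still fail this check.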

The three-axes article I wrote recently treats the test axis in full; the point here is the mechanism: a culture that defers to reading rather than running accumulates these faults invisibly, because the faults are invisible to reading.

Tests are requirements you can run

Writing a test for a requirement is, structurally, the same thing as writing the requirement -- except the test can be executed. Documentation states what the system should do. A test verifies whether it does. Documentation can be believed or doubted. A test passes or fails. That's not a small difference; it's the entire gap between review-think and verification-think.

Every project has requirements, whether or not anyone has written them down. Writing them down doesn't create the requirements -- it creates a single visible source of truth that lives outside one person's head, so anyone on the team can answer "does this work or not?" without tracking down the person who holds the spec in memory. The more consequential question is what you do with them once they're visible. Verification directly tests whether the requirements are met -- the check runs, the system produces an output, it either matches the expected behaviour or it doesn't. Review tests someone's opinion of whether they're met -- and that opinion-testing typically travels together with requirements that haven't been written down in the first place. Unwritten requirements assessed by review are doubly weak: a vague target, judged by taste. Written requirements checked by verification are the opposite: a clear target, tested against reality.
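As a small illustration of a requirement written so that it can run, here is a sketch with an invented requirement ID and an invented function under test. Neither comes from a real project.

```python
def clamp_duty_cycle(commanded):
    """Hypothetical implementation under test."""
    return min(1.0, max(0.0, commanded))


# REQ-PWM-01 (hypothetical): the applied duty cycle shall always lie in [0.0, 1.0],
# and shall equal the commanded value whenever that value is already in range.
def test_req_pwm_01_duty_cycle_is_clamped():
    for commanded in (-5.0, -0.0001, 0.0, 0.37, 1.0, 1.0001, 42.0):
        applied = clamp_duty_cycle(commanded)
        assert 0.0 <= applied <= 1.0
        if 0.0 <= commanded <= 1.0:
            assert applied == commanded


if __name__ == "__main__":
    test_req_pwm_01_duty_cycle_is_clamped()
    print("REQ-PWM-01 holds for the sampled inputs")
```

The comment and the test say the same thing. Only one of them fails the build when the behaviour changes.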

Program testing can be used to show the presence of bugs, but never to show their absence.

-- Edsger W. Dijkstra, Notes on Structured Programming (1970)

The Dijkstra caveat is important and honest: verification is necessary, not sufficient. A passing test suite does not prove correctness; it proves that the specific scenarios the tests encode are currently passing. That's genuinely less than proof.

Counting tests is review-think. "You only have twenty tests -- that's a negative mark" is the same move as "that variable name doesn't match my conventions" -- a surface metric, optimisable without improving the thing the metric represents. Twenty tests that each verify a failure mode observed in production are worth more than two hundred tests that verify happy paths everyone already knew worked. The question isn't the count; it's whether the tests reach the failure modes that actually occur. An integration test with retry logic running against the real transport stack, reproducing the actual flake discovered last month, is doing something no unit test can do: it's reaching behaviour. Behaviour is what fails in the field.

Tests that reach behaviour are also the only documentation that goes stale visibly. Docs that describe system behaviour nobody validates against reality can drift indefinitely -- and they do. A test that encodes expected behaviour runs every build, and the first time behaviour diverges, the build tells you. The test suite is the living specification.

A fully disciplined project makes the chain explicit. Requirements are the single source of truth: code implements them, tests verify them. In V-model development this correspondence is maintained at every level -- each requirement tier has a verification tier, and the mapping is a first-class artefact rather than an assumption. Forward completeness means every requirement is realised in code and covered by at least one test. A requirement with no implementation is unfinished. A requirement with no covering test is a claim nobody has checked.

The discipline runs in the other direction too. Every code path and every test should map back to a requirement. An orphan -- code with no requirement ancestry, or a test that claims to verify something not in the requirement set -- is a defect signal. Either a real requirement exists in someone's head but hasn't been written down, or something got built that nobody consciously decided the system should do. Orphan tests are particularly dangerous: a test that passes but verifies the wrong thing trains engineers to dismiss failures without reading them carefully. No orphans in either direction is the standard.

Most of this is machine-checkable. Scripts can confirm there are no code orphans, no test orphans, and no requirements lacking an implementing path and covering test. A coverage tool -- lcov or equivalent -- confirms the code is actually exercised by the tests that claim to reach it. These are checks that run, not a committee that reads and agrees. The one piece that genuinely requires judgement is whether the tests faithfully represent the requirements: a test can pass and still be verifying the wrong behaviour. That narrow band -- the faithfulness question -- is exactly where human review, and increasingly agentic review, earns its place.
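A sketch of what such a script might look like, under the assumption that the requirement IDs and the links from code units and tests to requirements have already been extracted (from annotations, a spreadsheet, or a requirements tool). The data shapes, IDs, and file names here are hypothetical.

```python
def traceability_report(requirements, code_links, test_links):
    """Check forward completeness and orphans in both directions.

    requirements: set of requirement IDs.
    code_links / test_links: dict mapping a code unit or test name to the set of
    requirement IDs it claims to implement or verify.
    """
    implemented = set().union(*code_links.values()) & requirements
    tested = set().union(*test_links.values()) & requirements

    def orphans(links):
        # An orphan traces to no known requirement at all.
        return sorted(name for name, reqs in links.items() if not (reqs & requirements))

    return {
        "unimplemented": sorted(requirements - implemented),
        "untested": sorted(requirements - tested),
        "code_orphans": orphans(code_links),
        "test_orphans": orphans(test_links),
    }


if __name__ == "__main__":
    report = traceability_report(
        requirements={"REQ-1", "REQ-2", "REQ-3"},
        code_links={"uart.c": {"REQ-1"}, "pid.c": {"REQ-2"}, "legacy_filter.c": set()},
        test_links={"test_uart": {"REQ-1"}, "test_boot": {"REQ-9"}},
    )
    for finding, items in report.items():
        print(f"{finding}: {items}")
```

A CI job can fail the build whenever any of the four lists is non-empty. What the script cannot judge is whether `test_uart` faithfully tests REQ-1, which is the faithfulness question left for review.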

Review the requirements. Verify the implementation.

The verification feedback loop

Traditional aerospace programmes spend years on intensive design review and analysis before a launch, and the vehicle can still fail to reach orbit. SpaceX blew up dozens of rockets -- sometimes on the pad -- and now launches more mass to orbit than the rest of the world put together. Each flight on a fast iteration cycle is real-world measurement at a rate that years of meeting-room review cannot match. Failure is data.

SpaceX kept its documentation, process, and review. What it rebuilt was the loop: the whole organisation oriented around a tight development feedback loop driven by real-world verification. That let it accumulate domain knowledge at a pace that made no economic sense on traditional timescales -- because the cost per experiment was low enough to run many of them. The cadence and the knowledge compounded together.

Verification is the sensor and the error signal that closes that loop. The test axis is where this happens at the level of a software or hardware programme. Review, on its own, runs open-loop -- plausible output, no convergence guarantee.

The expensive half is getting cheap

The cost of verification has always had two components: the cost of the tooling, and the cost of the judgement required to specify the checks correctly. The first is dropping fast -- hardware is cheaper, CI infrastructure is commoditised, the whole stack from test harness to results aggregation is substantially better than it was a decade ago. The second component -- knowing what to check and how -- is where agentic AI is genuinely useful, in a way that ordinary review-assistance is not.

Agentic review is a different thing from human review, not a cheaper substitute. It doesn't get bored, doesn't bikeshed naming conventions, and can glue tools together to verify gaps that classical automation can't reach: cross-referencing a schematic against a datasheet and flagging mismatches, tracing a signal path through a design and computing the stability margin automatically, running a corpus of field failures against a new design to check which failure modes are still reachable. It can also be wrong in ways that are hard to notice -- which is itself a verification problem. The right question for any AI-generated output is "does it actually work?" Not "does it match my taste?" Dismissing output because it doesn't match an unwritten house style is review-think. Checking whether it produces the right answer is verification-think.

This is the specific problem we're building for in electronics with fastest.ee: a verification engine that encodes industry standards and physics into checks that run -- electrical rule checks, power-integrity analysis, part-compliance, interface and architecture validation -- plus agentic checks for paying users that fill the gaps classical rules can't automate. The tagline is "be the fastest EE." The argument for it is exactly the argument above.

A reviewer sharp enough to catch the opamp's resonance in a diff -- someone who has internalised enough control theory to compute phase margin while reading a schematic -- is valuable, rare, and spending their time on the wrong job if reading schematics is most of what they do. Even when the schematic is correct, the physical layout can introduce feedback parasitics that exist nowhere in the diff -- a trace too close to a sensitive node, a ground return deflected by a pour cut -- and these defects surface only when the boards arrive, after the money has been spent and the lead time waited. Review alone is unlikely to catch them; finding them through review requires a design engineer who documents every requirement from the application notes as a checklist while working, and then a reviewer who works through each item systematically. Systematic checking against documented requirements is verification by another name. A system where the instability is caught automatically, in the tooling, before the board is spun -- because the stability criterion was encoded into a check that runs every time -- means the opamp cannot get past you regardless of who reviews the schematic or how carefully. The reviewer's attention is then free for what only human judgement can reach: the architectural trade-off that no rule covers, the requirement nobody has written down yet, the systemic risk that only appears when you look across the whole design rather than at any one part. That is a narrower band than most teams reserve for review. It is also the band where review is genuinely irreplaceable, because those questions don't have right answers you can encode, only better and worse judgements. The way to reach that narrower, more valuable band is to push everything else out of it, into verification, into checks that run. Rail conventions belong there. The resonance belongs there. What's left -- what only a person can do -- is worth having a person do.