Verification-First Engineering

May 25, 2026 • Alex Serban

7 / 9 • Verification

Intent

Make verification the dominant control loop around each agent step through executable checks, diff review, and scrutiny of unnecessary complexity.

Problem

Agent output is often convincing before it is correct. Human intuition alone is not a sufficient filter, especially when defects are semantic rather than syntactic.

Pattern

Wrap every agent step in verification: executable checks, review of the actual diff, and scrutiny of unnecessary complexity. The key question is not whether the agent claims the task is complete, but whether independent checks support that claim.

Correct Use

Run the cheapest discriminating validation immediately after each edit, review the resulting changes, and reject outputs that cannot be explained clearly.

Failure Mode If Skipped

Plausible but wrong code enters the codebase because no stage in the workflow was designed to falsify it.

Rationale

Verification should be driven by explicit success criteria rather than trust in the initial output. The value lies in making the agent iterate against independent checks.

Common verification targets include:

Syntactically correct but semantically wrong code
Plausible-looking but incorrect solutions
Code that works in isolation but breaks integration
Hallucinated APIs or non-existent functions

The main issue is not only that agents can be wrong. It is that they are often wrong in ways that look locally plausible. Verification therefore needs to be designed as a mechanism for falsification rather than as a final sanity check.
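As one illustration of a falsifying check, the sketch below looks for hallucinated attributes on imported modules. It is a minimal example under stated assumptions: the function name `find_missing_attributes` and the sample `agent_output` are hypothetical, only the plain `import module` followed by `module.attr` form is handled, and the referenced modules are imported in the current environment in order to inspect them.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Return 'module.attr' references in `source` that do not resolve.

    Hypothetical helper: covers only `import x` / `import x as y` followed by
    `x.attr`. Anything more dynamic still needs a real test run.
    """
    tree = ast.parse(source)

    # Map each local alias to the real module name.
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported[alias.asname or alias.name] = alias.name

    missing = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in imported
        ):
            module_name = imported[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                # The module itself cannot be imported in this environment.
                missing.append(module_name)
                continue
            if not hasattr(module, node.attr):
                missing.append(f"{module_name}.{node.attr}")
    return missing


# Hypothetical agent output: json.parse does not exist, json.dumps does.
agent_output = "import json\ndata = json.parse('{}')\ntext = json.dumps(data)\n"
print(find_missing_attributes(agent_output))  # ['json.parse']
```

A check like this is cheap and narrow. It can disconfirm one class of plausible-looking output, but it says nothing about whether the remaining calls are used correctly.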

How to Apply the Pattern

Verification should be specified before trust is granted.

Define objective success criteria for the task.
Choose the cheapest checks that could disconfirm the current output.
Run those checks immediately after each edit or step.
Review the actual diff and resulting complexity, not only the command output.
Continue only when the current slice is independently supported.

In practice, this means verification is not a single late-stage activity. It is the main control loop around execution.
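The steps above can be sketched as a small runner that orders checks by cost and stops at the first one that disconfirms the current output. The `Check` type, the relative `cost` values, and the `record` helper are assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    cost: int                # relative cost, e.g. rough seconds to run (assumed)
    run: Callable[[], bool]  # returns True when the check passes


def first_failure(checks: list[Check]) -> Optional[str]:
    """Run checks cheapest-first; return the name of the first failing check."""
    for check in sorted(checks, key=lambda c: c.cost):
        if not check.run():
            return check.name
    return None


# Stand-in checks that record the order in which they were called.
calls = []


def record(name: str, result: bool) -> Callable[[], bool]:
    def run() -> bool:
        calls.append(name)
        return result
    return run


checks = [
    Check("tests", cost=60, run=record("tests", True)),
    Check("parse", cost=1, run=record("parse", True)),
    Check("types", cost=10, run=record("types", False)),
]
print(first_failure(checks))  # types
print(calls)                  # ['parse', 'types']
```

The expensive test run is never started here because a cheaper check already failed, which is the point of choosing the cheapest discriminating validation first.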

Supporting Practices

Use Layered Verification

Different checks catch different classes of failure.

Layer	Mechanism	When	Catches
Syntax	Parse or compile	Every edit	Syntax errors
Static	Linting, type checking	Every edit	Type errors, style issues
Unit	Automated tests	After implementation	Logic errors
Integration	System tests	After integration	Interface errors
Complexity	Code review, metrics	After implementation	Abstraction bloat, over-engineering
Cleanup	Diff review, dead code analysis	After implementation	Leftover code, unintended changes
Human	Code review	Before merge	Design issues, edge cases
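To make the first row of the table concrete, here is a minimal sketch of a syntax-layer check for Python sources using the built-in `compile`. The function name `syntax_check` and the `<agent-output>` label are hypothetical.

```python
from typing import Optional


def syntax_check(source: str, filename: str = "<agent-output>") -> Optional[str]:
    """Return an error description if `source` does not compile, else None."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


# Missing parameter name: the syntax layer rejects this.
print(syntax_check("def add(a, :\n    return a\n") is not None)  # True

# Compiles fine but subtracts instead of adding: the syntax layer cannot see this.
print(syntax_check("def add(a, b):\n    return a - b\n"))  # None
```

The second call shows why the layers are combined: a passing syntax check only rules out one class of failure and leaves logic errors to the unit and integration layers.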

Design Verification to Catch Common Failure Modes

Failure Mode	Description	Why It Happens
Assumption Propagation	Model misunderstands the task early and builds the entire feature on faulty premises	The error goes unnoticed until several PRs in, when the architecture is already cemented
Abstraction Bloat	1,000 lines where 100 would suffice	Agents optimise for looking comprehensive, not maintainability
Dead Code Accumulation	Old implementations linger, unrelated code altered	Agents don’t clean up after themselves
Sycophantic Agreement	No pushback on incomplete requirements	Enthusiastic execution of whatever you described

Keep the Verification Loop Tight

Agent Output
      │
      ▼
┌────────────┐   fail   ┌──────────────┐
│   Parse    │─────────▶│  Fix Syntax  │
└─────┬──────┘          └──────┬───────┘
      │ pass                   │ retry
      ▼                        └──────────↺
┌────────────┐   fail   ┌──────────────┐
│    Lint    │─────────▶│  Fix Issues  │
└─────┬──────┘          └──────┬───────┘
      │ pass                   │ retry
      ▼                        └──────────↺
┌────────────┐   fail   ┌──────────────┐
│    Test    │─────────▶│  Fix Logic   │
└─────┬──────┘          └──────┬───────┘
      │ pass                   │ retry
      ▼                        └──────────↺
   Verified

Define Explicit Success Criteria

Use objective conditions to determine whether a task is complete.

## Success Criteria for This Task

- [ ] All existing tests pass
- [ ] New tests cover the added functionality  
- [ ] No linting errors
- [ ] No type errors
- [ ] Diff shows only intended changes (no dead code)
- [ ] Solution uses ≤ N lines / ≤ M files (complexity gate)

The task is complete only when all criteria are satisfied.
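A checklist like this can be evaluated mechanically. The sketch below counts changed files and added lines in a unified diff and treats the task as complete only when no criterion is unmet. The function names, the sample diff, and the budget values are hypothetical. The diff parsing is deliberately naive: an added line whose own content begins with `++ ` would be miscounted as a file header.

```python
def diff_stats(unified_diff: str) -> tuple[int, int]:
    """Return (files_changed, lines_added) for a unified diff (naive parse)."""
    files = 0
    added = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++ "):
            files += 1   # one '+++' header per changed file
        elif line.startswith("+"):
            added += 1   # an added line
    return files, added


def unmet_criteria(results: dict[str, bool]) -> list[str]:
    """Return the names of criteria that are not satisfied."""
    return [name for name, passed in results.items() if not passed]


diff = """\
--- a/pricing.py
+++ b/pricing.py
@@ -1,2 +1,3 @@
 def total(items):
-    return sum(items)
+    subtotal = sum(items)
+    return round(subtotal, 2)
"""

files, added = diff_stats(diff)
MAX_FILES, MAX_ADDED = 1, 1  # hypothetical budget for this task

results = {
    "existing tests pass": True,  # would come from the test runner
    "no lint errors": True,       # would come from the linter
    "complexity gate": files <= MAX_FILES and added <= MAX_ADDED,
}
print(unmet_criteria(results))  # ['complexity gate']
```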

Guard Against Requirement Drift and Sycophantic Testing

AI agents often produce an initial result quickly, but the remaining work still includes edge cases, subtle bugs, security considerations, and architectural coherence. In some cases, the final portion of the task takes longer than expected because the developer must debug code they did not write and do not fully understand.

A related failure mode is sycophantic testing: agents write tests that match the implementation they just produced rather than tests that validate the requirements. The result may be a test suite with high coverage but low validation value.

Practical implications for verification:

Write tests before asking the agent to implement (test-driven development prevents sycophantic tests)
Build test suites that verify behaviour, not implementation (integration tests, property-based tests, scenario tests)
When reviewing AI-generated tests, ask: does this test validate the requirement, or does it just confirm what the code currently does?
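The difference between the two kinds of test can be shown with a small example. Everything here is hypothetical: `apply_discount` stands in for agent-written code, and the assumed requirement is that a discounted price always stays between zero and the original price. The property check uses only the standard library's `random` module as a lightweight stand-in for a property-based testing library.

```python
import random
from typing import Optional


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical agent-written code; percent is never clamped to 0..100."""
    return price_cents - price_cents * percent // 100


def test_mirrors_implementation() -> None:
    # Sycophantic: restates the formula, so it passes whatever the formula does.
    assert apply_discount(1000, 10) == 1000 - 1000 * 10 // 100


def find_counterexample(
    trials: int = 1000, seed: int = 0
) -> Optional[tuple[int, int, int]]:
    """Search random inputs for a violation of the assumed requirement."""
    rng = random.Random(seed)
    for _ in range(trials):
        price = rng.randint(0, 100_000)
        percent = rng.randint(0, 150)
        result = apply_discount(price, percent)
        if not 0 <= result <= price:
            return price, percent, result
    return None


test_mirrors_implementation()          # passes despite the bug
print(find_counterexample() is not None)  # True
```

The first test would keep passing through any change to the formula that it copies. The second is written from the requirement, so it fails as soon as the implementation returns a negative price.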

Review Failure Modes Systematically

Failure Mode	Likely Cause	Fix
Agent writes code that ignores project conventions	CLAUDE.md or other config doesn’t capture the conventions, or the agent didn’t read it	Make the config more explicit; start the session by having the agent confirm the conventions
Agent “solves” the problem unexpectedly	Underspecified task; agent optimised for a proxy metric	Add explicit constraints and negative requirements to the spec
Agent gets confused mid-session	Context window degradation; early context loses influence	Audit current context, start a fresh session, or re-inject key constraints
Agent breaks something unrelated	Insufficient test coverage; scope too broad	Run full test suite after every task; narrow task scope

Keep a Human Verification Checklist

Automated checks catch syntax errors and many logic errors, but human review remains the main safeguard against design issues, unintended scope changes, and the subtle failure modes listed above. Apply this checklist before accepting any non-trivial agent output:

Check	How	Why
Read the code	Line by line, not just skim	Agents produce plausible-looking but semantically wrong code
Run it yourself	Execute, test edge cases	“It should work” ≠ it works
Review the diff	`git diff` — what actually changed?	Catch unintended modifications and dead code accumulation
Understand the approach	Can you explain it to someone else?	If you can’t explain it, you can’t maintain it
Question complexity	Could this be simpler?	Push back on abstraction bloat proactively
Verify requirements met	Check against the original specification	Agents solve what they understood, not necessarily what you meant

Code that cannot be explained clearly should not be accepted without further review. Accepting code that is not understood creates a maintenance liability.
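Part of the diff review can be supported by a small script that flags files changed outside the intended scope. The sketch below assumes the changed paths have already been collected, for example from `git diff --name-only`. The function name, the sample paths, and the allowed directories are hypothetical.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_paths: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return changed paths that fall outside every allowed directory."""
    unexpected = []
    for path in changed_paths:
        candidate = PurePosixPath(path)
        # is_relative_to requires Python 3.9 or later.
        if not any(candidate.is_relative_to(d) for d in allowed_dirs):
            unexpected.append(path)
    return unexpected


changed = [
    "src/billing/invoice.py",
    "src/auth/session.py",  # unrelated to a billing task
    "tests/billing/test_invoice.py",
]
print(out_of_scope(changed, ["src/billing", "tests/billing"]))
# ['src/auth/session.py']
```

A flagged path is a prompt for a closer look rather than an automatic rejection, since some tasks legitimately touch shared code.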

Practices

Run static checks automatically after each agent edit
Write tests before asking agents to implement
Ask agents to run checks on their own work (“check if this compiles”), then confirm the result independently
Use multiple verification methods in combination
Automate verification in CI/CD
Verify early so assumption propagation is caught before it compounds
Check whether the solution could be simpler
Review cleanup explicitly by checking the diff for unintended changes and dead code
Use objective criteria such as tests and linters rather than subjective impressions

Anti-patterns

Accepting agent code without running it
Skipping tests because the code appears plausible
Verifying only at the end of a large change
Trusting claims that code “should work” without independent checks
Not reviewing diffs for unintended changes
Accepting complex solutions without questioning whether they can be simplified
Relying on agent self-assessment

Indicators

Share of tasks with explicit success criteria before implementation begins
Time between an edit and its first independent verification step
Frequency of defects caught by diff review, complexity review, or human review after tests pass
Rate of regressions caused by changes that were never independently checked
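Two of these indicators reduce to simple arithmetic once the events are logged. The sketch below assumes a hypothetical log format: pairs of timestamps for each edit and its first independent check, and a per-task flag recording whether success criteria existed before implementation began.

```python
from datetime import datetime
from statistics import median


def median_verification_latency(events: list[tuple[datetime, datetime]]) -> float:
    """Median seconds between an edit and its first independent check."""
    return median((checked - edited).total_seconds() for edited, checked in events)


def criteria_share(tasks: list[dict]) -> float:
    """Fraction of tasks that had explicit success criteria up front."""
    return sum(1 for task in tasks if task["has_criteria"]) / len(tasks)


# Hypothetical log entries: (edit time, first check time).
events = [
    (datetime(2026, 5, 25, 10, 0, 0), datetime(2026, 5, 25, 10, 0, 30)),
    (datetime(2026, 5, 25, 11, 0, 0), datetime(2026, 5, 25, 11, 1, 30)),
    (datetime(2026, 5, 25, 12, 0, 0), datetime(2026, 5, 25, 12, 10, 0)),
]
tasks = [
    {"has_criteria": True},
    {"has_criteria": True},
    {"has_criteria": True},
    {"has_criteria": False},
]

print(median_verification_latency(events))  # 90.0
print(criteria_share(tasks))                # 0.75
```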
