AI FIELD TEST №04 · March 23, 2026 · 6 min read · AI · Engineering-Leadership · Quality

Six Quality Gates for AI-Assisted Engineering

Before AI: same input, same output, every time. After AI: same input, different output, every time. Same quality gates. After a year of daily Claude, Codex, and Gemini CLI use, these are the six gates I run on every AI-assisted task, and the five numbers I measure to know whether they work.

Setup: A year of daily Claude, Codex, and Gemini CLI use; six gates on every AI-assisted task

Measured: Output divergence, first-pass acceptance, rework hours, error escape rate

Verdict: MIXED

Before AI: same input, same output, every time.

After AI: same input, different output, every time.

Same quality gates.

That mismatch is the quiet failure mode in most AI-assisted engineering today. It took me a while to see it in my own workflow.

How I Got Here

I have been using Claude, Codex, and Gemini CLI daily for over a year. Early on, I trusted the first output because it looked right. Syntactically clean, plausibly structured, confidently written. The result was bloated code that was neither testable nor understandable, and I only discovered that downstream, where fixing it is expensive.

Most teams made the same mistake at a larger scale. They bolted AI into workflows built for deterministic toolchains. A compiler gives you the same output for the same input every time, so the entire quality apparatus around it assumes one run equals one answer. LLMs break that assumption. Every output is a sample from a probability distribution.

The tool changed. The execution system did not. The gates still assume determinism while the generator underneath them is probabilistic.

The Six Gates

These are the gates I now run on every AI-assisted task.

1. Redundant Generation

Run the same task twice. Compare the outputs. If they diverge on anything structural, that is your signal to inspect before going further. I call this the Rule of 2. It is the cheapest insurance you can buy: one extra generation costs tokens, but a structural disagreement you never saw costs a sprint.
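What counts as "structural" depends on the codebase. As a minimal sketch, assuming the generated output is Python source, the comparison could fingerprint each run by its function and class definitions and report whatever appears in only one of them. The names `structural_signature` and `rule_of_two` are illustrative, not part of any tool mentioned here.

```python
import ast


def structural_signature(source: str) -> set[str]:
    """Coarse fingerprint: every function (with positional-arg count) and class."""
    signature = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature.add(f"def {node.name}/{len(node.args.args)}")
        elif isinstance(node, ast.ClassDef):
            signature.add(f"class {node.name}")
    return signature


def rule_of_two(run_a: str, run_b: str) -> set[str]:
    """Structural elements that appear in exactly one of the two runs."""
    return structural_signature(run_a) ^ structural_signature(run_b)


# Two hypothetical generations for the same task.
run_a = "def parse(text):\n    return text.split(',')\n"
run_b = "class Parser:\n    def parse(self, text):\n        return text.split(',')\n"

diff = rule_of_two(run_a, run_b)
if diff:
    print("Inspect before going further:", sorted(diff))
```

An empty result does not prove the two runs are equivalent. It only means this particular fingerprint found no disagreement, so the fingerprint should cover whatever "structural" means for the task at hand.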

2. Cross-Model Adversarial Review

Run the task through two different models, then have each critique the other's output. You will be surprised how often one model catches what the other missed. The models have different training distributions and different blind spots, and the disagreement surface is exactly where the risk concentrates.
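The exchange can be sketched without committing to any vendor SDK. The example below assumes each model is wrapped in a plain callable that takes a prompt and returns text; that `Model` type, the `cross_review` function, and the critique wording are all hypothetical.

```python
from typing import Callable

# Assumption: each model is wrapped as "prompt in, text out".
Model = Callable[[str], str]

CRITIQUE_TEMPLATE = (
    "Task:\n{task}\n\n"
    "Candidate solution:\n{draft}\n\n"
    "List concrete defects, missing cases, and risky assumptions."
)


def cross_review(task: str, model_a: Model, model_b: Model) -> dict[str, str]:
    """Generate one draft per model, then have each model critique the other's draft."""
    draft_a = model_a(task)
    draft_b = model_b(task)
    return {
        "draft_a": draft_a,
        "draft_b": draft_b,
        "b_on_a": model_b(CRITIQUE_TEMPLATE.format(task=task, draft=draft_a)),
        "a_on_b": model_a(CRITIQUE_TEMPLATE.format(task=task, draft=draft_b)),
    }
```

The two critiques are the useful output: points raised by one model and absent from the other's draft mark the disagreement surface a human should read first.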

3. Output Stability Testing

Run the same prompt ten times. If the outputs diverge structurally by more than 15%, your prompt is underspecified. This reframes a lot of perceived "model unreliability" as input debt: fix the specification before blaming the generator.
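The 15% figure needs a metric behind it, and the text does not prescribe one. One possible reading, shown here purely as an assumption, is the mean pairwise Jaccard distance between the sets of function and class names each run defines. The sketch assumes the outputs are Python source; `fingerprint`, `divergence`, and the sample outputs are hypothetical.

```python
import ast
from itertools import combinations

THRESHOLD = 0.15  # the 15% from the text; mapping it onto this metric is an assumption


def fingerprint(source: str) -> frozenset[str]:
    """Kinds and names of every function and class defined in the source."""
    return frozenset(
        f"{type(node).__name__}:{node.name}"
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    )


def jaccard_distance(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return 0.0 if not union else 1 - len(a & b) / len(union)


def divergence(outputs: list[str]) -> float:
    """Mean pairwise Jaccard distance across runs (0.0 means identical structure)."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    pairs = list(combinations([fingerprint(o) for o in outputs], 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Ten hypothetical runs: eight produce a function, two wrap it in a class.
stable = ["def load(path):\n    pass\n"] * 10
mixed = stable[:8] + ["class Loader:\n    def load(self, path):\n        pass\n"] * 2

for label, runs in (("stable", stable), ("mixed", mixed)):
    score = divergence(runs)
    print(label, round(score, 3), "underspecified" if score > THRESHOLD else "ok")
```

With this metric, two outliers in ten runs are already enough to cross the threshold, which is a property of the chosen distance rather than of the 15% rule itself.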

4. Confidence Thresholds Before Merge

No AI-generated code ships on a single pass. Two independent generations must agree on architecture and logic before a human even reviews it. If they disagree, the prompt needs work before anyone spends review time on it.

5. Scope-Bounded Generation

The larger the task, the wider the variance. A 20-line function is reviewable. A 500-line module is a lottery ticket. Decompose until each generation lands in territory a human can actually verify.
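One rough proxy for "territory a human can actually verify" is a size budget on what a generation returns. The sketch below assumes Python output and uses the 20-line figure from the text as the default budget; `oversized_functions` is a hypothetical helper, and line count is only a stand-in for reviewability.

```python
import ast


def oversized_functions(source: str, max_lines: int = 20) -> list[tuple[str, int]]:
    """Return (name, line_count) for every function longer than max_lines."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                found.append((node.name, length))
    return found


# Hypothetical generation: one short function and one 27-line function.
short_fn = "def a():\n    return 1\n"
long_fn = "def b():\n" + "".join(f"    x{i} = {i}\n" for i in range(25)) + "    return 0\n"

for name, length in oversized_functions(short_fn + long_fn):
    print(f"{name} is {length} lines: decompose the task and regenerate")
```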

6. Domain-Specific Validation

AI does not know your system's invariants. Type safety, API contract compliance, integration test coverage: these are your floor, not your ceiling. Everything the previous five gates pass still has to clear the checks your system would demand of any code, from any author.
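In practice this gate is usually just the project's existing checks, run unconditionally. A minimal sketch, assuming each check is a command-line tool that signals failure with a non-zero exit code: `run_gate` is a hypothetical helper, and the demo commands are stand-ins so the example runs anywhere.

```python
import subprocess
import sys


def run_gate(commands: dict[str, list[str]]) -> dict[str, bool]:
    """Run each command; a check passes only if its exit code is 0."""
    results = {}
    for name, argv in commands.items():
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results


# Stand-in commands. A real project would map these names to its own
# type checker, contract tests, and integration suite.
demo = {
    "types": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "contracts": [sys.executable, "-c", "import sys; sys.exit(1)"],
}

results = run_gate(demo)
failed = [name for name, ok in results.items() if not ok]
print("blocked by:", failed if failed else "nothing")
```

The point of the design is that the runner has no idea whether a human or a model wrote the code, so it applies the same floor to both.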

What to Measure

Gates without measurement are theater. Five numbers tell you whether yours actually work:

Output divergence rate. Take 50 tasks and run each prompt twice. Above 15% divergence means you have prompt-engineering work to do before you blame the model.
First-pass acceptance rate. What percentage of AI output passes review without rework? Below 60% means you are missing a gate.
Rework hours per AI task. If rework exceeds 40% of the effort the task would have taken by hand, you added a step. You did not save time.
Error escape rate. AI-introduced defects that reach staging or production. This number tells you whether your gates work or just feel rigorous.
Cost of redundancy vs. failure. Running a task twice costs 2x the tokens. Shipping a hallucinated architecture costs 2x the sprint. The asymmetry is the whole argument.
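The first four numbers fall out of a simple per-task log. The sketch below is one way to compute them, under stated assumptions: the `TaskRecord` fields are hypothetical, divergence is counted per task pair, rework is compared against an estimate of manual effort, and escapes are reported as defects per task. The sample records are invented for illustration and are not measurements.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    diverged: bool                # did the two runs disagree structurally?
    accepted_first_pass: bool     # passed review without rework?
    rework_hours: float
    manual_estimate_hours: float  # estimated effort to do the task by hand
    escaped_defects: int          # AI-introduced defects found in staging or production


def summarize(records: list[TaskRecord]) -> dict[str, float]:
    """Aggregate a task log into the rates discussed above."""
    n = len(records)
    manual_total = sum(r.manual_estimate_hours for r in records)
    return {
        "divergence_rate": sum(r.diverged for r in records) / n,
        "first_pass_acceptance": sum(r.accepted_first_pass for r in records) / n,
        "rework_ratio": sum(r.rework_hours for r in records) / manual_total,
        "escapes_per_task": sum(r.escaped_defects for r in records) / n,
    }


# Illustrative records only.
records = [
    TaskRecord(False, True, 0.0, 4.0, 0),
    TaskRecord(True, False, 2.0, 4.0, 1),
    TaskRecord(False, True, 0.0, 2.0, 0),
    TaskRecord(False, False, 1.0, 2.0, 0),
]

for metric, value in summarize(records).items():
    print(f"{metric}: {value:.2f}")
```

The fifth number, cost of redundancy versus failure, is deliberately left out: it compares token spend with the cost of a failed sprint, and both depend on figures this log does not capture.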

The Real Lesson

The teams getting reliable output from AI are not the ones with better models. Everyone has access to the same frontier. They are the ones who rebuilt their execution system around how the tool actually behaves: probabilistic, variance-prone, and reliable only inside well-specified bounds.

If your quality gates still assume one run equals one answer, the gates are the legacy system.

Originally shared on LinkedIn.


Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
