Before AI: same input, same output, every time. After AI: same input, different output, every time. Same quality gates. After a year of daily Claude, Codex, and Gemini CLI use, these are the six gates I run on every AI-assisted task, and the five numbers I measure to know whether they work.
Setup
A year of daily Claude, Codex, and Gemini CLI use; six gates on every AI-assisted task
Measured
Output divergence, first-pass acceptance, rework hours, error escape rate
Verdict
VERDICT: MIXED
Before AI: same input, same output, every time.
After AI: same input, different output, every time.
Same quality gates.
That mismatch is the quiet failure mode in most AI-assisted engineering today. It took me a while to see it in my own workflow.
I have been using Claude, Codex, and Gemini CLI daily for over a year. Early on, I trusted the first output because it looked right. Syntactically clean, plausibly structured, confidently written. The result was bloated code that was neither testable nor understandable, and I only discovered that downstream, where fixing it is expensive.
Most teams made the same mistake at a larger scale. They bolted AI into workflows built for deterministic toolchains. A compiler gives you the same output for the same input every time, so the entire quality apparatus around it assumes one run equals one answer. LLMs break that assumption. Every output is a sample from a probability distribution.
The tool changed. The execution system did not. The gates still assume determinism while the generator underneath them is probabilistic.
These are the gates I now run on every AI-assisted task.
Run the same task twice. Compare the outputs. If they diverge on anything structural, that is your signal to inspect before going further. I call this the Rule of 2. It is the cheapest insurance you can buy: one extra generation costs tokens, but a structural disagreement you never saw costs a sprint.
Run the task through two different models, then have each critique the other's output. You will be surprised how often one model catches what the other missed. The models have different training distributions and different blind spots, and the disagreement surface is exactly where the risk concentrates.
Run the same prompt ten times. If outputs diverge more than 15% structurally, your prompt is underspecified. This reframes a lot of perceived "model unreliability" as input debt: fix the specification before blaming the generator.
No AI-generated code ships on a single pass. Two independent generations must agree on architecture and logic before a human even reviews it. If they disagree, the prompt needs work before anyone spends review time on it.
The larger the task, the wider the variance. A 20-line function is reviewable. A 500-line module is a lottery ticket. Decompose until each generation lands in territory a human can actually verify.
AI does not know your system's invariants. Type safety, API contract compliance, integration test coverage: these are your floor, not your ceiling. Everything the previous five gates pass still has to clear the checks your system would demand of any code, from any author.
Gates without measurement are theater. Five numbers tell you whether yours actually work:
The teams getting reliable output from AI are not the ones with better models. Everyone has access to the same frontier. They are the ones who rebuilt their execution system around how the tool actually behaves: probabilistic, variance-prone, and reliable only inside well-specified bounds.
If your quality gates still assume one run equals one answer, the gates are the legacy system.
Originally shared on LinkedIn.
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Keep reading
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
I ran the same `ms` bug audit five times. The bug counts came back 7, 4, 5, 3, then 9. Nine distinct bugs surfaced across the runs, but only one showed up every single time. The other eight were a coin flip.
I gave Claude Code one goal: audit the `ms` duration parser for bugs. It orchestrated about 34 hunt, verify, and report agents that took 22 candidates down to 11 verified and 8 confirmed real bugs. Twelve minutes, around 0.8M tokens, roughly $15. The failing inputs reproduce locally.