Published
March 23, 2026
AI
Before AI: same input, same output, every time. After AI: same input, different output, every time. Same quality gates. After a year of daily Claude, Codex, and Gemini CLI use, these are the six gates I run on every AI-assisted task, and the five numbers I measure to know whether they work.
Reading time
6 min read
Author
Fadi Labib

Before AI: same input, same output, every time.
After AI: same input, different output, every time.
Same quality gates.
That mismatch is the quiet failure mode in most AI-assisted engineering today, and it took me a while to see it in my own workflow.
I have been using Claude, Codex, and Gemini CLI daily for over a year. Early on, I trusted the first output because it looked right. Syntactically clean, plausibly structured, confidently written. The result was bloated code that was neither testable nor understandable, and I only discovered that downstream, where fixing it is expensive.
Most teams made the same mistake at a larger scale. They bolted AI into workflows built for deterministic toolchains. A compiler gives you the same output for the same input, every time, so the entire quality apparatus around it assumes one run equals one answer. LLMs break that assumption. Every output is a sample from a probability distribution.
The tool changed. The execution system did not. The gates still assume determinism while the generator underneath them is probabilistic.
These are the gates I now run on every AI-assisted task.
Run the same task twice. Compare the outputs. If they diverge on anything structural, that is your signal to inspect before going further. I call this the Rule of 2. It is the cheapest insurance you can buy: one extra generation costs tokens, while a structural disagreement you never saw costs a sprint.
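A minimal sketch of what "diverge on anything structural" could mean in practice, assuming both generations are Python source that parses. The helper names `structural_signature` and `structural_diff` are hypothetical, and comparing only function and class definitions is one possible reading of "structural", not the only one.

```python
import ast


def structural_signature(source: str) -> set:
    """Collect (kind, name, count) for every function and class definition.

    For functions the count is the number of positional parameters;
    for classes it is the number of base classes.
    """
    signature = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature.add(("function", node.name, len(node.args.args)))
        elif isinstance(node, ast.ClassDef):
            signature.add(("class", node.name, len(node.bases)))
    return signature


def structural_diff(first: str, second: str) -> set:
    """Return definitions present in exactly one of the two generations."""
    return structural_signature(first) ^ structural_signature(second)


# Two illustrative generations of the same task (made-up examples).
run_a = "def load(path):\n    return open(path).read()\n"
run_b = (
    "class Loader:\n"
    "    def load(self, path):\n"
    "        return open(path).read()\n"
)

if structural_diff(run_a, run_b):
    print("Structural disagreement: inspect before going further.")
```

An empty diff does not prove the two runs are equivalent. It only says they agree on the shape this particular signature captures.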
Run the task through two different models, then have each critique the other's output. You will be surprised how often one model catches what the other missed. The models have different training distributions and different blind spots, and the disagreement surface is exactly where the risk concentrates.
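The cross-critique loop can be sketched as below. This assumes each model is wrapped in a plain callable that takes a prompt and returns text. `Model`, `cross_review`, and the critique wording are all hypothetical and do not correspond to any vendor's SDK.

```python
from typing import Callable, Dict

# Hypothetical adapter type: prompt in, completion text out.
Model = Callable[[str], str]

CRITIQUE_TEMPLATE = (
    "Task:\n{task}\n\n"
    "Proposed solution:\n{draft}\n\n"
    "List concrete defects, or reply NO ISSUES."
)


def cross_review(task: str, model_a: Model, model_b: Model) -> Dict[str, str]:
    """Generate one draft per model, then have each model critique the other's."""
    draft_a = model_a(task)
    draft_b = model_b(task)
    return {
        "draft_a": draft_a,
        "draft_b": draft_b,
        # Each critique is written by the model that did NOT produce the draft.
        "b_on_a": model_b(CRITIQUE_TEMPLATE.format(task=task, draft=draft_a)),
        "a_on_b": model_a(CRITIQUE_TEMPLATE.format(task=task, draft=draft_b)),
    }
```

Keeping the models behind a plain callable means the gate itself stays independent of whichever CLI or API produces the text.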
Run the same prompt ten times. If outputs diverge more than 15% structurally, your prompt is underspecified. This reframes a lot of perceived "model unreliability" as input debt: fix the specification before blaming the generator.
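The text does not say how structural divergence is measured, so the sketch below picks one possible definition and labels it as an assumption: the fraction of runs whose structural fingerprint differs from the most common fingerprint. The names `definition_names`, `divergence_rate`, and `prompt_is_underspecified` are hypothetical, and the outputs are assumed to be parseable Python.

```python
import ast
from collections import Counter


def definition_names(source: str) -> frozenset:
    """Hypothetical fingerprint: names of all functions and classes defined."""
    return frozenset(
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    )


def divergence_rate(outputs, fingerprint=definition_names) -> float:
    """Fraction of runs whose fingerprint differs from the most common one."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(fingerprint(output) for output in outputs)
    modal_count = counts.most_common(1)[0][1]
    return 1 - modal_count / len(outputs)


def prompt_is_underspecified(outputs, threshold=0.15, fingerprint=definition_names) -> bool:
    """Apply the 15% threshold from the text to the assumed divergence measure."""
    return divergence_rate(outputs, fingerprint) > threshold
```

Under this definition, one outlier in ten runs stays below the threshold and two outliers cross it. A pairwise similarity measure would draw the line differently, so the metric is worth fixing before comparing numbers across prompts.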
No AI-generated code ships on a single pass. Two independent generations must agree on architecture and logic before a human even reviews it. If they disagree, the prompt needs work, not the reviewer's patience.
The larger the task, the wider the variance. A 20-line function is reviewable. A 500-line module is a lottery ticket. Decompose until each generation lands in territory a human can actually verify.
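One way to make the size limit mechanical is a check that flags any generated function above a line budget. This is a rough sketch, assuming the output is Python source. The name `oversized_functions` is hypothetical, and the default budget of 20 lines simply borrows the "reviewable" figure from the paragraph above rather than any established limit.

```python
import ast
from typing import List, Tuple


def oversized_functions(source: str, max_lines: int = 20) -> List[Tuple[str, int]]:
    """Return (name, line_count) for each function longer than max_lines."""
    oversized = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is populated by ast.parse on Python 3.8 and later.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                oversized.append((node.name, length))
    return oversized
```

A non-empty result is a cue to split the task and regenerate the pieces separately, not a verdict on the code itself.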
AI does not know your system's invariants. Type safety, API contract compliance, integration test coverage: these are your floor, not your ceiling. Everything the previous five gates pass still has to clear the checks your system would demand of any code, from any author.
Gates without measurement are theater. Five numbers tell you whether yours actually work:
The teams getting reliable output from AI are not the ones with better models. Everyone has access to the same frontier. They are the ones who rebuilt their execution system around how the tool actually behaves: probabilistic, variance-prone, brilliant inside well-specified bounds and dangerous outside them.
If your quality gates still assume one run equals one answer, the gates are the legacy system.
Originally shared on LinkedIn.