Six Quality Gates for AI-Assisted Engineering

Before AI: same input, same output, every time.

After AI: same input, different output, every time.

Same quality gates.

That mismatch is the quiet failure mode in most AI-assisted engineering today, and it took me a while to see it in my own workflow.

How I Got Here

I have been using Claude, Codex, and Gemini CLI daily for over a year. Early on, I trusted the first output because it looked right. Syntactically clean, plausibly structured, confidently written. The result was bloated code that was neither testable nor understandable, and I only discovered that downstream, where fixing it is expensive.

Most teams made the same mistake at a larger scale. They bolted AI into workflows built for deterministic toolchains. A compiler gives you the same output for the same input, every time, so the entire quality apparatus around it assumes one run equals one answer. LLMs break that assumption. Every output is a sample from a probability distribution.

The tool changed. The execution system did not. The gates still assume determinism while the generator underneath them is probabilistic.

The Six Gates

These are the gates I now run on every AI-assisted task.

1. Redundant Generation

Run the same task twice. Compare the outputs. If they diverge on anything structural, that is your signal to inspect before going further. I call this the Rule of 2. It is the cheapest insurance you can buy: one extra generation costs tokens, while a structural disagreement you never saw costs a sprint.
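
A minimal sketch of the Rule of 2, assuming the generated artifact is Python source. "Structural" is approximated here as the set of function and class definitions; that definition, and the names structural_signature and structural_diff, are illustrative choices rather than a standard.

```python
import ast


def structural_signature(source: str) -> set[str]:
    """Collect function and class definitions as a rough structural fingerprint."""
    signature = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ",".join(a.arg for a in node.args.args)
            signature.add(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            signature.add(f"class {node.name}")
    return signature


def structural_diff(first: str, second: str) -> set[str]:
    """Return the definitions that appear in only one of the two generations."""
    return structural_signature(first) ^ structural_signature(second)


# Two hypothetical generations for the same task.
run_a = "def load(path):\n    return open(path).read()\n"
run_b = "class Loader:\n    def load(self, path):\n        return open(path).read()\n"

diff = structural_diff(run_a, run_b)
if diff:
    print("Inspect before going further:", sorted(diff))
```

An empty diff does not prove the two runs are equivalent; it only says they chose the same shape. A non-empty diff is the cue to stop and look.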

2. Cross-Model Adversarial Review

Run the task through two different models, then have each critique the other's output. You will be surprised how often one model catches what the other missed. The models have different training distributions and different blind spots, and the disagreement surface is exactly where the risk concentrates.
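
The loop can be sketched as follows. The Model type is a hypothetical stand-in for whatever client you use to call each model (prompt in, text out), and the critique wording is an assumption, not a tested prompt.

```python
from typing import Callable

# Hypothetical interface: any callable that takes a prompt and returns text.
Model = Callable[[str], str]

CRITIQUE_TEMPLATE = (
    "Review this solution to the task. List bugs, missing cases, and "
    "unjustified design choices.\n\nTask:\n{task}\n\nSolution:\n{draft}"
)


def cross_review(task: str, model_a: Model, model_b: Model) -> dict[str, str]:
    """Generate a draft with each model, then have each critique the other's draft."""
    draft_a = model_a(task)
    draft_b = model_b(task)
    return {
        "draft_a": draft_a,
        "draft_b": draft_b,
        "b_on_a": model_b(CRITIQUE_TEMPLATE.format(task=task, draft=draft_a)),
        "a_on_b": model_a(CRITIQUE_TEMPLATE.format(task=task, draft=draft_b)),
    }


# Stub models so the sketch runs without any API; swap in real clients.
result = cross_review(
    "Write a slugify function.",
    lambda prompt: "A saw: " + prompt,
    lambda prompt: "B saw: " + prompt,
)
```

The two critiques are the output worth reading first: points where the models disagree are where a human reviewer's time is best spent.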

3. Output Stability Testing

Run the same prompt ten times. If the outputs diverge structurally by more than 15%, on whatever structural measure you have chosen to track, your prompt is underspecified. This reframes a lot of perceived "model unreliability" as input debt: fix the specification before blaming the generator.
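
One way to put a number on this, as a sketch: the text above does not say how divergence is measured, so this assumes mean pairwise line-level dissimilarity from Python's difflib. A real structural measure (AST shape, module layout) would be stricter; the function names are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def divergence(a: str, b: str) -> float:
    """Dissimilarity of two outputs in [0, 1], compared line by line."""
    lines_a = [ln.strip() for ln in a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in b.splitlines() if ln.strip()]
    return 1.0 - SequenceMatcher(None, lines_a, lines_b).ratio()


def mean_pairwise_divergence(outputs: list[str]) -> float:
    """Average divergence over every pair of runs."""
    pairs = list(combinations(outputs, 2))
    return mean(divergence(a, b) for a, b in pairs) if pairs else 0.0


def is_underspecified(outputs: list[str], threshold: float = 0.15) -> bool:
    """Apply the 15% threshold to the assumed divergence measure."""
    return mean_pairwise_divergence(outputs) > threshold


# Ten identical runs are stable; two unrelated runs are not.
stable_runs = ["def f(x):\n    return x + 1"] * 10
unstable_runs = ["def f(x):\n    return x + 1", "class F:\n    pass"]
```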

4. Confidence Thresholds Before Merge

No AI-generated code ships on a single pass. Two independent generations must agree on architecture and logic before a human even reviews it. If they disagree, the prompt needs work, not the reviewer's patience.
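
Agreement on logic can be checked mechanically before a human looks, for example by running both generations on the same inputs. A sketch, assuming each generation exposes a callable with the same signature; gen_a, gen_b, and the sample inputs are made up for illustration.

```python
from typing import Any, Callable, Iterable


def _outcome(fn: Callable[..., Any], args: tuple) -> tuple:
    """Run fn and capture either its result or the type of exception raised."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def disagreements(
    impl_a: Callable[..., Any],
    impl_b: Callable[..., Any],
    inputs: Iterable[tuple],
) -> list[tuple]:
    """Return the inputs on which the two implementations behave differently."""
    return [args for args in inputs if _outcome(impl_a, args) != _outcome(impl_b, args)]


def merge_gate(found: list[tuple]) -> str:
    """Route to human review only when the two generations agree."""
    return "human_review" if not found else "revise_prompt"


# Two hypothetical generations of a slugify helper.
def gen_a(text: str) -> str:
    return text.strip().lower().replace(" ", "-")


def gen_b(text: str) -> str:
    return "-".join(text.lower().split())


found = disagreements(gen_a, gen_b, [("Hello World",), ("  padded  ",), ("a  b",)])
print(merge_gate(found), found)
```

Here the two versions split on repeated spaces, which is exactly the kind of case the prompt never specified.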

5. Scope-Bounded Generation

The larger the task, the wider the variance. A 20-line function is reviewable. A 500-line module is a lottery ticket. Decompose until each generation lands in territory a human can actually verify.
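
A scope check can be automated. The sketch below flags any generated Python function longer than a line budget; the default of 20 is borrowed from the example above and is an assumption, not a rule, and oversized_functions is an illustrative name.

```python
import ast


def oversized_functions(source: str, max_lines: int = 20) -> list[tuple[str, int]]:
    """Return (name, line_count) for every function longer than max_lines."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is populated by ast.parse on Python 3.8 and later.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                flagged.append((node.name, length))
    return flagged


small = "def f():\n    return 1\n"
big = "def g():\n" + "    x = 1\n" * 25 + "    return x\n"
```

Anything flagged goes back for decomposition rather than forward to review.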

6. Domain-Specific Validation

AI does not know your system's invariants. Type safety, API contract compliance, integration test coverage: these are your floor, not your ceiling. Everything the previous five gates pass still has to clear the checks your system would demand of any code, from any author.
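
As one illustration of an invariant the model cannot know, here is a sketch of a contract check on a payload. ORDER_CONTRACT and its field names are hypothetical; in practice the contract would come from your own schema or API specification.

```python
# Hypothetical contract: field name mapped to required type.
ORDER_CONTRACT = {"id": str, "quantity": int, "unit_price_cents": int}


def contract_violations(payload: dict, contract: dict[str, type]) -> list[str]:
    """List every way the payload breaks the contract, in contract order."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


good = {"id": "o1", "quantity": 2, "unit_price_cents": 500}
bad = {"id": "o1", "quantity": "2"}
```

The check applies to any code, whoever or whatever wrote it, which is the point of this gate.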

What to Measure

Gates without measurement are theater. Five numbers tell you whether yours actually work:

Output divergence rate. Take 50 tasks and run each one twice with the same prompt. If more than 20% of the pairs diverge, your prompts need engineering, not your model.
First-pass acceptance rate. What percentage passes review without rework? Below 60% means you are missing a gate.
Rework hours per AI task. If rework exceeds 40% of the manual effort, you added a step. You did not save time.
Error escape rate. AI-introduced defects that reach staging or production. This number tells you whether your gates work or just feel rigorous.
Cost of redundancy vs. failure. Running a task twice costs 2x the tokens. Shipping a hallucinated architecture costs 2x the sprint. The asymmetry is the whole argument.
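
The first four numbers can be computed from a simple per-task log. A sketch, with assumptions: the TaskRecord fields are hypothetical, rework is expressed as a share of the estimated manual effort, and escaped defects are counted per task because the text above does not fix a denominator.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    diverged: bool                # the two runs disagreed structurally
    accepted_first_pass: bool     # passed review without rework
    rework_hours: float
    manual_estimate_hours: float  # estimated effort without AI
    escaped_defects: int          # defects that reached staging or production


def gate_metrics(records: list[TaskRecord]) -> dict[str, float]:
    """Aggregate the per-task log into the rates discussed above."""
    n = len(records)
    manual = sum(r.manual_estimate_hours for r in records)
    if n == 0 or manual == 0:
        raise ValueError("need at least one task with a non-zero manual estimate")
    return {
        "divergence_rate": sum(r.diverged for r in records) / n,
        "first_pass_acceptance_rate": sum(r.accepted_first_pass for r in records) / n,
        "rework_ratio": sum(r.rework_hours for r in records) / manual,
        "escaped_defects_per_task": sum(r.escaped_defects for r in records) / n,
    }


def threshold_breaches(metrics: dict[str, float]) -> list[str]:
    """Name the metrics that cross the thresholds given in the text."""
    breaches = []
    if metrics["divergence_rate"] > 0.20:
        breaches.append("divergence_rate")
    if metrics["first_pass_acceptance_rate"] < 0.60:
        breaches.append("first_pass_acceptance_rate")
    if metrics["rework_ratio"] > 0.40:
        breaches.append("rework_ratio")
    return breaches


records = [
    TaskRecord(True, False, 3.0, 5.0, 1),
    TaskRecord(False, True, 0.0, 5.0, 0),
    TaskRecord(False, True, 1.0, 5.0, 0),
    TaskRecord(False, True, 0.0, 5.0, 0),
]
metrics = gate_metrics(records)
print(metrics, threshold_breaches(metrics))
```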

The Real Lesson

The teams getting reliable output from AI are not the ones with better models. Everyone has access to the same frontier. They are the ones who rebuilt their execution system around how the tool actually behaves: probabilistic, variance-prone, brilliant inside well-specified bounds and dangerous outside them.

If your quality gates still assume one run equals one answer, the gates are the legacy system.

Originally shared on LinkedIn.