AI FIELD TEST №10 · June 4, 2026 · 4 min read · AI · Agents · Developer-Workflow

Same Prompt, Five Runs: 7, 4, 5, 3, 9

I ran the same `ms` bug audit five times. The bug counts came back 7, 4, 5, 3, then 9. Nine distinct bugs surfaced across the runs, but only one showed up every single time. The other eight were a coin flip.

Setup

The same multi-agent `ms` audit, with the same prompt, run five times

Measured

Bug counts of 7, 4, 5, 3, 9; nine distinct bugs total; only 1 stable across all five runs

Verdict

VERDICT: MIXED

Same AI. Same prompt. Five runs.

Bug count: 7, 4, 5, 3, then 9.

So which number do you trust?

A swarm of LLM agents is stochastic. One run is not a measurement. It is a single roll of the dice.

So instead of reporting a headline number, I ran five identical audits of `ms`, the tiny npm duration parser used by millions, and reported the distribution.

The Distribution

Nine distinct bugs surfaced across the five runs.

1 was found in all 5 runs. Stable. A real signal.
8 showed up in only some runs. Flaky. A coin flip you happened to win.
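The stable/flaky split above can be computed mechanically once each run's findings are reduced to a set of bug IDs. The sketch below is a minimal illustration, not the tooling used for this test: the IDs `B1` to `B9` are placeholders, and which bug landed in which run is invented. Only the per-run counts (7, 4, 5, 3, 9) and the totals follow the post.

```python
from collections import Counter

# Hypothetical bug IDs. The membership of each run is invented for
# illustration; only the per-run sizes (7, 4, 5, 3, 9) follow the post.
runs = [
    {"B1", "B2", "B3", "B4", "B5", "B6", "B7"},
    {"B1", "B2", "B8", "B9"},
    {"B1", "B3", "B4", "B5", "B6"},
    {"B1", "B7", "B8"},
    {"B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B9"},
]


def tally(runs):
    """Count, for each finding, how many runs reported it."""
    return Counter(bug for run in runs for bug in run)


def split_stable_flaky(runs):
    """Stable = reported in every run; flaky = reported in only some."""
    counts = tally(runs)
    stable = {bug for bug, n in counts.items() if n == len(runs)}
    flaky = set(counts) - stable
    return stable, flaky


stable, flaky = split_stable_flaky(runs)
print("per-run counts:", [len(run) for run in runs])
print("distinct:", len(stable | flaky), "stable:", len(stable), "flaky:", len(flaky))
```

Reporting the full tally rather than a single run's count is the design choice here: a finding seen in 4 of 5 runs deserves different treatment from one seen once.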

The one that held up every single time was a genuine off-by-one: `ms` accepts a 100-character input while its own error message swears the limit is 99.
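For readers who want to see the shape of that mismatch, here is a hypothetical Python re-creation of the pattern. It is not the `ms` source (which is JavaScript); the function name, constant, and message wording are assumptions chosen only to show how a length guard and its error message can disagree by one.

```python
MAX_LEN = 100  # hypothetical constant, not taken from the ms source


def check_length(text: str) -> str:
    """Reject empty or over-long input; guard and message disagree by one."""
    if len(text) == 0 or len(text) > MAX_LEN:
        # The message promises a limit of 99, but the guard above
        # only rejects input longer than 100 characters.
        raise ValueError("input must be between 1 and 99 characters long")
    return text


# A 100-character input passes, even though the message says 99 is the limit.
accepted = check_length("x" * 100)
print(len(accepted))
```

Either the comparison or the message could be the intended behavior; the bug is that they describe different limits.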

What This Is Not

I am not saying AI is unreliable. The point is simpler than that:

One AI run is a sample, not an answer.

If you are trusting an agent's verdict on your work's quality, security, or correctness, you need to know you are trusting a single roll of the dice.

What To Do Instead

The signal worth keeping is what survives repetition.
Never trust the results blindly. Verify them.
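A cheap first check is to run the same audit twice and measure how much the two reports overlap. The sketch below assumes each report has already been reduced to a set of finding IDs (the IDs are placeholders) and uses Jaccard overlap as the agreement score. That metric is my choice for illustration, not something the original test reported.

```python
def agreement(run_a: set, run_b: set) -> float:
    """Jaccard overlap: shared findings divided by all findings seen."""
    union = run_a | run_b
    if not union:
        return 1.0  # two empty reports agree trivially
    return len(run_a & run_b) / len(union)


def survivors(runs: list) -> set:
    """Findings reported by every run, i.e. what survives repetition."""
    if not runs:
        return set()
    return set.intersection(*runs)


# Placeholder findings from two hypothetical runs of the same audit.
first = {"B1", "B2", "B3"}
second = {"B1", "B4"}

print(agreement(first, second))    # 1 shared out of 4 seen -> 0.25
print(survivors([first, second]))  # {'B1'}
```

A low score does not say which run is right. It only says one run was not enough, and the surviving findings are the ones to verify by hand first.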

How many teams trusting AI to review their work have ever run it twice, just to see if it agrees with itself?


Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
