I ran the same `ms` bug audit five times. The bug counts came back 7, 4, 5, 3, then 9. Nine distinct bugs surfaced across the runs, but only one showed up every single time. The other eight were a coin flip.
Setup
The same multi-agent `ms` audit, same prompt, run five identical times
Measured
Bug counts of 7, 4, 5, 3, 9; nine distinct bugs total; only 1 stable across all five runs
Verdict
VERDICT: MIXED
Same AI. Same prompt. Five runs.
Bug count: 7, 4, 5, 3, then 9.
So which number do you trust?
A swarm of LLM agents is stochastic. One run is not a measurement. It is a single roll of the dice.
So instead of reporting a headline number, I ran five identical audits of ms, the tiny library used by millions, and reported the distribution.
Nine distinct bugs surfaced across the five runs.
The one that held up every single time was a genuine off-by-one: ms accepts a 100-character input while its own error message swears the limit is 99.
I am not saying AI is unreliable. The point is simpler than that:
One AI run is a sample, not an answer.
If you are trusting an agent's verdict on your work's quality, security, or correctness, you need to know you are trusting a single roll of the dice.
How many teams trusting AI to review their work have ever run it twice, just to see if it agrees with itself
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Keep reading
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
I gave Claude Code one goal: audit the `ms` duration parser for bugs. It orchestrated about 34 hunt, verify, and report agents that took 22 candidates down to 11 verified and 8 confirmed real bugs. Twelve minutes, around 0.8M tokens, roughly $15. The failing inputs reproduce locally.
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.