I gave Claude Code one goal: audit the `ms` duration parser for bugs. It orchestrated about 34 hunt, verify, and report agents that took 22 candidates down to 11 verified and 8 confirmed real bugs. Twelve minutes, around 0.8M tokens, roughly $15. The failing inputs reproduce locally.
Setup
Claude Code, one prompt: "audit this library for bugs," pointed at the `ms` npm package
Measured
22 candidates to 11 verified to 8 confirmed bugs, in ~12 min for ~0.8M tokens (~$15)
Verdict
VERDICT: PASSED
The target was ms, the tiny duration parser behind calls like ms('2 days'), buried in the dependency tree of countless Node projects. The kind of code you would assume is bug-free.
I gave Claude Code one goal:
"audit this library for bugs."
Then I watched it orchestrate a multi-agent workflow.
Hunt. Twelve specialist agents launched in parallel, each assigned exactly one class of bug: off-by-one, rounding, overflow, regex traps, parsing edge cases. Not one generalist skimming everything. Twelve narrow lenses.
Verify. Every candidate went to a fresh skeptic agent, told to assume the bug was bogus until it could produce a real, reproducible failure. This is the part that matters. It throws out the false findings a single confident AI would happily report.
Report. Only the survivors, each with an exact repro and file:line.
22 candidates -> 11 verified -> 8 distinct confirmed bugs
A few of them:
-1 days12mo instead of 1yI did not trust the agents blindly. I ran the failing inputs locally, and they reproduce.
Then I had it cross-check against GitHub issues and PRs: one bug already has open fix PRs upstream, and two I could not find reported anywhere. Filing those is the next step.
The whole sweep was:
34 agents -> ~12 minutes -> one prompt
The cost of "AI finds bugs" is remarkably cheap next to a human doing the same sweep: roughly 0.8M tokens, around $15.
So how much would you trust AI on your code's quality, and is that enough? It clearly works here. But the only reason I trust the output is the structure behind it. The verify step is what took a confident guess and turned it into a bug I can reproduce on my own machine
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Keep reading
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
I ran the same `ms` bug audit five times. The bug counts came back 7, 4, 5, 3, then 9. Nine distinct bugs surfaced across the runs, but only one showed up every single time. The other eight were a coin flip.
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.