AI FIELD TEST №09June 3, 2026 · 5 min readAI Agents Developer-Workflow

I Pointed an AI Swarm at an npm Package Used by Millions

I gave Claude Code one goal: audit the `ms` duration parser for bugs. It orchestrated about 34 hunt, verify, and report agents that took 22 candidates down to 11 verified and 8 confirmed real bugs. Twelve minutes, around 0.8M tokens, roughly $15. The failing inputs reproduce locally.

Setup

Claude Code, one prompt: "audit this library for bugs," pointed at the `ms` npm package

Measured

22 candidates to 11 verified to 8 confirmed bugs, in ~12 min for ~0.8M tokens (~$15)

Verdict

VERDICT: PASSED

The target was ms, the tiny duration parser behind calls like ms('2 days'), buried in the dependency tree of countless Node projects. The kind of code you would assume is bug-free.

I gave Claude Code one goal:

"audit this library for bugs."

Then I watched it orchestrate a multi-agent workflow.

Hunt, Verify, Report

Hunt. Twelve specialist agents launched in parallel, each assigned exactly one class of bug: off-by-one, rounding, overflow, regex traps, parsing edge cases. Not one generalist skimming everything. Twelve narrow lenses.

Verify. Every candidate went to a fresh skeptic agent, told to assume the bug was bogus until it could produce a real, reproducible failure. This is the part that matters. It throws out the false findings a single confident AI would happily report.

Report. Only the survivors, each with an exact repro and file:line.

The Result

22 candidates -> 11 verified -> 8 distinct confirmed bugs

A few of them:

Negative durations that round the wrong way
Plural labels that contradict their own number, like -1 days
Unit boundaries that print 12mo instead of 1y

I did not trust the agents blindly. I ran the failing inputs locally, and they reproduce.

Then I had it cross-check against GitHub issues and PRs: one bug already has open fix PRs upstream, and two I could not find reported anywhere. Filing those is the next step.

The Economics

The whole sweep was:

34 agents -> ~12 minutes -> one prompt

The cost of "AI finds bugs" is remarkably cheap next to a human doing the same sweep: roughly 0.8M tokens, around $15.

So how much would you trust AI on your code's quality, and is that enough? It clearly works here. But the only reason I trust the output is the structure behind it. The verify step is what took a confident guess and turned it into a bug I can reproduce on my own machine

Read the original on LinkedIn →

VERDICT: PASSED

Follow the next AI field test on LinkedIn

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →

Related essays that extend the same thread

Browse the archive

Jun 8, 2026•4 min read

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

AI · Agents · Developer-Workflow · +1 more

Jun 4, 2026•4 min read

Same Prompt, Five Runs: 7, 4, 5, 3, 9

I ran the same `ms` bug audit five times. The bug counts came back 7, 4, 5, 3, then 9. Nine distinct bugs surfaced across the runs, but only one showed up every single time. The other eight were a coin flip.

AI · Agents · Developer-Workflow

Jun 21, 2026•7 min read

The '94% Fewer Tokens' Screenshot Is Wrong. The README It Came From Is Honest

A viral screenshot says a coding ruleset cuts your tokens 94%. The README it was lifted from says something quieter and more honest: ~54% less code, ~20% cheaper, and the 94% is a per-task ceiling on a date picker. I read the README, then re-ran the benchmark myself. My numbers landed right next to theirs. The hype wasn't the tool. It was the feed stripping the units.

AI · Agents · Evaluation · +1 more