Does throwing more AI agents at a problem make it better?

It makes you faster at problems that resemble the training data, like common web services. It does not invent solutions to novel or unsolved problems, no matter the budget. Parallelism scales what already works.

What was OpenAI's $1.3M Codex experiment?

Peter Steinberger ran 100 Codex agents around the clock on an open-source project for a month: 603 billion tokens, 7.6 million requests, about $1.3M in compute that OpenAI covered as research, to see what happens when token cost stops being a constraint.

← All essays

ESSAYMay 22, 2026 · 5 min readAI Agents Engineering-Leadership

The $1.3M Monkey: What OpenAI's Codex Experiment Actually Exposes

OpenAI spent $1.3M running 100 Codex agents around the clock for a month. Most coverage stopped at the number. The number isn't the story. Parallelism scales what already works, it doesn't invent what doesn't exist, and the harder bill comes due in coordination, not tokens.

In 1913, Émile Borel sat a monkey at a typewriter and gave it infinite time. Pure probability guarantees that the monkey eventually types all of Shakespeare, every comma, every stage direction in Hamlet. The monkey understands nothing. Probability and time do the work.

A century later we built the monkey. The typewriter became a GPU cluster, infinite time became electricity, and random keystrokes became a statistical memory of nearly every text humans have written.

Then someone ran the experiment for real. Peter Steinberger, who joined OpenAI in February, pointed 100 Codex agents at his open-source project around the clock: 603 billion tokens, 7.6 million requests, about $1.3 million in one month, with OpenAI picking up the tab as research. The question on the table was simple: what happens when token cost stops being a constraint?

Most coverage stopped at the number. The number isn't the story.

Parallelism scales what already works. It doesn't invent.

Throw 100 agents at a Next.js codebase and you ship terrifyingly fast. The training data is dense; pattern completion does the rest. Throw the same 100 agents at the next Hamlet of computer science, a new language, a novel architecture, an unsolved class of problem, and watch what happens. The monkey types furiously. Nothing survives contact with the problem.

That distinction is the whole CFO conversation about whether to "invest in AI," and it has to be asked correctly:

Does your product compete on speed? If your team writes web services, probably yes. Increase the budget, run more agents, ship faster.
Does it compete on correctness or invention? Lives, money, compliance, real-time guarantees, novel architecture. Then more compute won't conjure training data that doesn't exist, the kind locked behind NDAs, proprietary repos, and problems nobody has solved yet.

The companies treating AI velocity as a budget question are about to learn the difference between buying speed and buying competence. The bill is the proof, not the breakthrough.

The harder bill isn't the tokens. It's coordination.

To be clear: experiments like Steinberger's are worth the money. They teach us what AI actually does to a team. And what it does is not mainly about cost.

If 100 agents ship code 100x faster, what happens to the humans who integrate it?

Reviews collapse.
Architecture decisions that used to take a week now change silently.
Senior engineers who mentored through pull requests now do diff archaeology.
Onboarding a newcomer gets harder, not easier.
And what happens when you change the model, or cut the budget, or have to maintain this code a year from now?

The $1.3M didn't just buy tokens. It stress-tested what a team does once the agents have generated more code than the team can read. Individual velocity triples while team coordination stays exactly where it was, and that gap is where delivery quietly breaks.

So the question worth more than $1.3 million isn't "should we spend on AI." It's two sharper ones: where in your stack does AI accelerate correct work, and where does it just accelerate broken work, faster than anyone can review it?

FAQ

Does throwing more AI agents at a problem make it better?: It makes you faster at problems that resemble the training data, like common web services. It does not invent solutions to novel or unsolved problems, no matter the budget. Parallelism scales what already works.
What was OpenAI's $1.3M Codex experiment?: Peter Steinberger ran 100 Codex agents around the clock on an open-source project for a month: 603 billion tokens, 7.6 million requests, about $1.3M in compute that OpenAI covered as research, to see what happens when token cost stops being a constraint.

Follow the next AI field test on LinkedIn

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →

Related essays that extend the same thread

Browse the archive

Jun 21, 2026•7 min read

The '94% Fewer Tokens' Screenshot Is Wrong. The README It Came From Is Honest

A viral screenshot says a coding ruleset cuts your tokens 94%. The README it was lifted from says something quieter and more honest: ~54% less code, ~20% cheaper, and the 94% is a per-task ceiling on a date picker. I read the README, then re-ran the benchmark myself. My numbers landed right next to theirs. The hype wasn't the tool. It was the feed stripping the units.

AI · Agents · Evaluation · +1 more

Jun 11, 2026•4 min read

My AI Committed 'Impossible' to Git. Seven Hours Later: 8/8

Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.

AI · Agents · Engineering-Judgment · +1 more

Jun 8, 2026•4 min read

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

AI · Agents · Developer-Workflow · +1 more