OpenAI spent $1.3M running 100 Codex agents around the clock for a month. Most coverage stopped at the number. The number isn't the story. Parallelism scales what already works; it doesn't invent what doesn't exist. And the harder bill comes due in coordination, not tokens.
In 1913, Émile Borel sat a monkey at a typewriter and gave it infinite time. Pure probability guarantees that the monkey eventually types all of Shakespeare, every comma, every stage direction in Hamlet. The monkey understands nothing. Probability and time do the work.
A century later we built the monkey. The typewriter became a GPU cluster, infinite time became electricity, and random keystrokes became a statistical memory of nearly every text humans have written.
Then someone ran the experiment for real. Peter Steinberger, who joined OpenAI in February, pointed 100 Codex agents at his open-source project around the clock: 603 billion tokens, 7.6 million requests, about $1.3 million in one month, with OpenAI picking up the tab as research. The question on the table was simple: what happens when token cost stops being a constraint?
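Before moving past the number, it helps to break it down. The sketch below only divides the figures quoted above by each other. The 30-day month is my assumption (the source says only "one month"), and because the cost is "about" $1.3 million, every result is a rough blended average, not a price list.

```python
# Figures as quoted in the post. The cost is approximate, so the
# derived numbers are rough blended averages, not official pricing.
TOKENS = 603_000_000_000
REQUESTS = 7_600_000
COST_USD = 1_300_000
AGENTS = 100
DAYS = 30  # assumption: the post only says "one month"

tokens_per_request = TOKENS / REQUESTS
usd_per_million_tokens = COST_USD / (TOKENS / 1_000_000)
usd_per_request = COST_USD / REQUESTS
usd_per_agent = COST_USD / AGENTS
requests_per_agent_per_day = REQUESTS / AGENTS / DAYS

print(f"tokens per request:         {tokens_per_request:,.0f}")
print(f"USD per million tokens:     {usd_per_million_tokens:.2f}")
print(f"USD per request:            {usd_per_request:.3f}")
print(f"USD per agent per month:    {usd_per_agent:,.0f}")
print(f"requests per agent per day: {requests_per_agent_per_day:,.0f}")
```

Under those assumptions that is roughly 79,000 tokens per request, a little over $2 per million tokens, about 17 cents per request, and $13,000 per agent for the month. The blended rate says nothing about the split between input and output tokens.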
Most coverage stopped at the number. The number isn't the story.
Throw 100 agents at a Next.js codebase and you ship terrifyingly fast. The training data is dense; pattern completion does the rest. Throw the same 100 agents at the next Hamlet of computer science, a new language, a novel architecture, an unsolved class of problem, and watch what happens. The monkey types furiously. Nothing survives contact with the problem.
That distinction is the whole CFO conversation about whether to "invest in AI," and the question has to be asked correctly.
The companies treating AI velocity as a budget question are about to learn the difference between buying speed and buying competence. The bill is the proof, not the breakthrough.
To be clear: experiments like Steinberger's are worth the money. They teach us what AI actually does to a team. And what it does is not mainly about cost.
If 100 agents ship code 100x faster, what happens to the humans who integrate it?
The $1.3M didn't just buy tokens. It stress-tested what a team does once the agents have generated more code than the team can read. Individual velocity triples while team coordination stays exactly where it was, and that gap is where delivery quietly breaks.
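The gap is easy to see in a toy model. This is a minimal sketch, not a measurement: the function name `review_backlog` and every rate in it are hypothetical numbers I picked for illustration. It assumes a team that reviews a fixed number of changes per day while the number of generated changes triples.

```python
def review_backlog(generated_per_day, reviewed_per_day, days):
    """Return the unreviewed-change backlog at the end of each day.

    Toy model with hypothetical rates: generation and review run at
    constant speed, and the backlog can never drop below zero.
    """
    backlog = 0
    history = []
    for _ in range(days):
        backlog = max(0, backlog + generated_per_day - reviewed_per_day)
        history.append(backlog)
    return history

# Hypothetical team: reviews 10 changes a day in both scenarios.
baseline = review_backlog(generated_per_day=10, reviewed_per_day=10, days=20)
tripled = review_backlog(generated_per_day=30, reviewed_per_day=10, days=20)

print("backlog after 20 days, baseline:", baseline[-1])
print("backlog after 20 days, tripled: ", tripled[-1])
```

With review capacity held flat, the baseline team ends each day with nothing waiting, while the tripled team adds 20 unread changes a day and has 400 waiting after 20 working days. Real teams are messier, but the direction is the point: faster generation without more review capacity produces a growing queue, not faster delivery.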
So the question worth more than $1.3 million isn't "should we spend on AI." It's two sharper ones: where in your stack does AI accelerate correct work, and where does it just accelerate broken work, faster than anyone can review it?
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.