The Opus 4.8 feature worth caring about isn't a smarter model. It's that the plan moves out of the AI's working memory and into a script it runs on its own. I gave it a real scraping job. It scaled to a self-organizing agent team in ~90 minutes, and the price was control and tokens.
Setup
Opus 4.8 'dynamic workflows' (preview) pointed at one real job: find every operating business in a city and verify each one is real
Measured
Run time, where the plan lives, result confidence on independent review, and what control you trade away
Verdict
VERDICT: MIXED
The marketing for Opus 4.8 is about a smarter, more honest model. The feature I actually care about is quieter, and it isn't even really new: dynamic workflows, where the plan moves out of the AI's working memory and into a script the AI writes and runs on its own. It took me a full night to understand what it changes.
The cleanest way to show it is the job I used it for: pulling real data off the web. Specifically, find every operating business in a city and verify each one is real.
Until now the answer was Claude itself: it ran the show, holding the plan and every agent's partial result in its own working memory. Fine for a handful of agents. On a big job the memory fills and it loses the thread.
Dynamic workflows move the plan out. You describe the goal; Claude writes the plan as an actual script; a separate engine runs that script in the background. The half-finished results live in the script, not in the model's memory. Only the final answer comes back.
For verifying thousands of businesses: worth it. ~90 minutes, a small army of agents running at once, my session free the whole time, and a high-enough confidence level (~70% on my own independent review under strict rules). Three things genuinely landed:
The trade is real, and it's the whole point:
So for verifying thousands of businesses, I'd hand over the wheel and pay the bill. For a five-minute task it's overkill, and the tool will happily push it on you anyway.
The shift worth watching isn't that the model got smarter. Models are already capable at plenty of tasks. It's that the orchestration moved into code the AI runs itself, which is where the scale and the savings come from, and where you give up the ability to babysit. The judgment call is no longer "is the output right?" It's "how big does a job have to be before handing over the wheel is worth what it costs?"
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
OpenAI spent $1.3M running 100 Codex agents around the clock for a month. Most coverage stopped at the number. The number isn't the story. Parallelism scales what already works, it doesn't invent what doesn't exist, and the harder bill comes due in coordination, not tokens.