AI FIELD TEST №16May 30, 2026 · 5 min readAI Agents Workflows

I Let Opus 4.8 Hold Its Own Plan. It Scaled, and I Lost the Wheel

Name: I Let Opus 4.8 Hold Its Own Plan. It Scaled, and I Lost the Wheel
Item: Claude Code (Opus 4.8 dynamic workflows)
Rating: 3
Author: Fadi Labib

The Opus 4.8 feature worth caring about isn't a smarter model. It's that the plan moves out of the AI's working memory and into a script it runs on its own. I gave it a real scraping job. It scaled to a self-organizing agent team in ~90 minutes, and the price was control and tokens.

Setup

Opus 4.8 'dynamic workflows' (preview) pointed at one real job: find every operating business in a city and verify each one is real

Measured

Run time, where the plan lives, result confidence on independent review, and what control you trade away

Verdict

VERDICT: MIXED

The marketing for Opus 4.8 is about a smarter, more honest model. The feature I actually care about is quieter, and it isn't even really new: dynamic workflows, where the plan moves out of the AI's working memory and into a script the AI writes and runs on its own. It took me a full night to understand what it changes.

The test

The cleanest way to show it is the job I used it for: pulling real data off the web. Specifically, find every operating business in a city and verify each one is real.

The old script way: grab fixed pieces of a page; collapse the moment the structure changes.
One agent: reads the page the way a person would, doesn't care how it's built, can judge what's on it. But it does one thing at a time.
The question that actually matters at city scale: who holds the plan?

Until now the answer was Claude itself: it ran the show, holding the plan and every agent's partial result in its own working memory. Fine for a handful of agents. On a big job the memory fills and it loses the thread.

Dynamic workflows move the plan out. You describe the goal; Claude writes the plan as an actual script; a separate engine runs that script in the background. The half-finished results live in the script, not in the model's memory. Only the final answer comes back.

What it earned

For verifying thousands of businesses: worth it. ~90 minutes, a small army of agents running at once, my session free the whole time, and a high-enough confidence level (~70% on my own independent review under strict rules). Three things genuinely landed:

Scale — it runs many agents on one job instead of a handful.
No drift — the plan lives in code, so it doesn't get lost the way a long conversation does. A resumable run saved the job more than once when the work got disrupted.
Built-in fact-checking — independent agents review each other's work. Because they run separately, their mistakes don't line up, so disagreement is how you catch errors. (I've argued for adversarial review of LLM output before; this bakes it in.)

Why the verdict is MIXED, not PASS

The trade is real, and it's the whole point:

Tokens — a single run burns far more than before. That's the AI vendors' business model, not an accident.
Control — once it starts, you can't step in or intervene.
Maturity — still a preview; resumable only inside the same session.

So for verifying thousands of businesses, I'd hand over the wheel and pay the bill. For a five-minute task it's overkill, and the tool will happily push it on you anyway.

The shift worth watching isn't that the model got smarter. Models are already capable at plenty of tasks. It's that the orchestration moved into code the AI runs itself, which is where the scale and the savings come from, and where you give up the ability to babysit. The judgment call is no longer "is the output right?" It's "how big does a job have to be before handing over the wheel is worth what it costs?"

VERDICT: MIXED

Follow the next AI field test on LinkedIn

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →

Related essays that extend the same thread

Browse the archive

Jun 21, 2026•7 min read

The '94% Fewer Tokens' Screenshot Is Wrong. The README It Came From Is Honest

A viral screenshot says a coding ruleset cuts your tokens 94%. The README it was lifted from says something quieter and more honest: ~54% less code, ~20% cheaper, and the 94% is a per-task ceiling on a date picker. I read the README, then re-ran the benchmark myself. My numbers landed right next to theirs. The hype wasn't the tool. It was the feed stripping the units.

AI · Agents · Evaluation · +1 more

Jun 11, 2026•4 min read

My AI Committed 'Impossible' to Git. Seven Hours Later: 8/8

Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.

AI · Agents · Engineering-Judgment · +1 more

Jun 8, 2026•4 min read

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

AI · Agents · Developer-Workflow · +1 more