AI FIELD TEST №17June 21, 2026 · 6 min readAI Agents Evaluation

A Ruleset Promised Up to 95% Fewer Tokens. The Savings Were Noise

A 'write minimal code' ruleset claimed it cut token cost by 60-95%. I built a fair, isolated harness to confirm it on 10 real coding tasks. The headline savings turned out to be a broken baseline and an agent that sometimes booted Docker. Once I controlled the environment, the effect vanished into noise.

Setup

The same 10 coding tasks run through two arms, a plain agent (baseline) and the agent plus Ponytail (a 'write minimal code' ruleset), in fresh git worktrees at a pinned SHA on a pinned Haiku 4.5 snapshot, scoring tokens AND correctness

Measured

Apparent per-task savings ran from 91% fewer tokens to 14x MORE. Once the baseline was environment-controlled, the same arm on the same task swung 1.6-1.8x run to run, noise as large as the treatment itself

Verdict

VERDICT: FAILED

You have seen the pattern. An influencer screenshots a GitHub utility, lifts the headline claim straight from its README, and posts it. It goes viral. People save it out of FOMO. The repo climbs from a few hundred stars to thousands in a week. The screenshot gets the reward. The test never happens.

So I run my own tests, and I write down what actually happens. This one is about the tools that promise to cut your token bill. On paper the premise is plausible. But if you understand how these models work, it is hard to believe you cut tokens that hard while keeping the same output quality. The most recent one I put through the harness is Ponytail.

Ponytail makes a clean promise: tell the agent to write minimal code, and you spend 60-95% fewer tokens. Its own honest revision later put the figure at 54%. Either way, that is a big number, and big numbers are exactly the ones worth checking.

My first uncontrolled pass agreed: Ponytail used as little as 9% of baseline tokens. Then I controlled the experiment, and the saving disappeared.

So I built a harness to settle it properly.

The Setup

Same 10 coding tasks, from a one-line string fix up to a multi-file audit, against a real repo (fastapi/full-stack-fastapi-template) pinned to one commit. Each run got a fresh git worktree at that exact SHA, so every arm started from byte-identical bytes. One pinned model snapshot (Haiku 4.5). Every instruction side-channel stripped before each run, so the baseline could not be secretly contaminated, then Ponytail re-added only its own ruleset for its arm.

Two arms: the plain agent, and the agent plus Ponytail. Crucially, I scored two things, not one. Tokens, and correctness, graded by a fixed test oracle through a real Postgres. Because tokens alone are a trap. An agent that quietly does less, or gives up, also emits fewer tokens. That looks like a saving. It is not one.

I Validated the Judge First

Before the oracle graded a single agent's run, I tested the oracle itself. I pointed it at the clean code (it must pass) and at the deliberately bugged code (it must fail), and confirmed it returned exactly those verdicts. Green on correct, red on broken.

This sounds pedantic. It is the one step that separates a measurement from a vibe. A grader that has never been shown to fail on bad input is not measuring correctness, it is measuring its own assumptions. Validate the judge before you trust its verdicts.

The Numbers That Did Not Survive

The first pass looked spectacular for Ponytail. On the ungraded tasks it ran at 9% to 48% of baseline tokens. A 91% reduction is right in the promised range.

Then I looked at why.

The baseline agent had no working database, so it kept trying to run a test suite that could not run, and burned millions of tokens thrashing against it. Ponytail's "be minimal" rule made it give up sooner. That is not efficiency. That is one agent quitting earlier than the other.

A spectacular efficiency gap can be manufactured two ways. You can make the treatment brilliant, or you can make the baseline suffer. At that point I had no proof which one I was looking at.

Worse, some runs spun up the template's entire Docker stack (Postgres, backend, frontend, Playwright, the lot) and built images. One Ponytail run hit 4.18 million tokens, 14x the baseline for the same task. That has nothing to do with writing minimal code. It is a coin flip about whether the agent decides to boot Docker.

So I removed both confounds. I gave both arms a working database so the baseline could not thrash, and I shimmed out Docker so neither could boot the stack. A fair fight at last.

The fair fight had no winner. Run the same arm on the same task twice and the token count swung 1.6 to 1.8x on its own. In the controlled run, Ponytail actually used more tokens than baseline on one task. The run-to-run noise was as large as the entire effect I was trying to measure.

What Actually Failed

The verdict is FAILED, and it is aimed precisely.

It is not aimed at the tool's output. Every single time I could grade correctness, both arms were correct. Ponytail never dropped a validation, never broke the fix, never failed a task. As a "write less code" ruleset it is free and harmless, and on two graded tasks it was genuinely a bit cheaper.

What failed is the headline number. "Up to 95% fewer tokens" did not survive contact with a controlled baseline. It shrank to "indistinguishable from noise" the moment the workspace was fair, exactly as Ponytail's own public claim had already shrunk from 94% to 54% when its baseline was fixed.

This is the same trap, twice. A token delta with no correctness signal, measured against a broken baseline, will always flatter whichever tool quits first.

What To Do Instead

A "saves X% tokens" claim is meaningless without two things: a working, environment-controlled baseline, and a correctness check held fixed. Demand both before you believe the number
The biggest token swings I saw came from a broken test environment and an agent randomly booting Docker, not from any prompt. Fix the agent's workspace before you reach for a ruleset. It saves more
If you want a real answer, you need many repetitions to average the noise down, which means a metered API account and small bounded tasks, not a rate-limited subscription running 1.5-million-token sessions. I tried twenty graded runs in one batch to get error bars. Eighteen came back instantly throttled and empty. The thing that makes these agents expensive to run is the thing that makes them expensive to measure

The Honest Sentence

Eight dollars of quota and a great deal of stubbornness bought one honest sentence: no correctness cost, and a token saving that is unproven and mostly an artifact of how the question was asked. That is thinner than "94% fewer tokens", and it is the part that is true. None of the inflated numbers need bad faith to appear. You compare against a baseline that happens to be struggling, you report the good run, you never check whether the work was correct. The path of least resistance points straight at a number too good to be true.

The tool did no harm. The claim did not hold. Those are two different sentences, and almost every token-savings headline you have read collapses them into one.

VERDICT: FAILED

Follow the next AI field test on LinkedIn

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →

Related essays that extend the same thread

Browse the archive

Jun 17, 2026•7 min read

Gemma 4 Won 73% of My AI Debates. It Still Wouldn't Say 'I Don't Know'

Two models dropped in one week: Gemma 4, the 12B I run locally, and Fable 5, a frontier model that was officially pulled days later. I spent that short window using Fable as a blind judge for 120 debates and reasoning rounds between five local models. Gemma 4 won 73% as the slowest model on the board, the fastest model came near the bottom, and the one with 'reasoning' in its name finished dead last. The shared failure was calibration: fluent, confident, and unwilling to admit doubt, even from the winner.

AI · LLM · Evaluation · +1 more

Jun 11, 2026•4 min read

My AI Committed 'Impossible' to Git. Seven Hours Later: 8/8

Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.

AI · Agents · Engineering-Judgment · +1 more

Jun 8, 2026•4 min read

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

AI · Agents · Developer-Workflow · +1 more