AI FIELD TEST №15
June 17, 2026 · 7 min read
AI · LLM · Evaluation

Gemma 4 Won 73% of My AI Debates. It Still Wouldn't Say 'I Don't Know'

Two models dropped in one week: Gemma 4, the 12B I run locally, and Fable 5, a frontier model that was officially pulled days later. I spent that short window using Fable as a blind judge for 120 debates and reasoning rounds between five local models. Gemma 4 won 73% of its decisive rounds despite being the slowest model on the board, the fastest model won only 32%, and the one branded around reasoning finished dead last. The shared failure was calibration: every model, the winner included, was fluent, confident, and unwilling to admit doubt.

Setup

Five small local models ran 120 blind rounds (64 debates, 56 reasoning puzzles) on one RTX 4080 SUPER, judged by Fable 5 with model names stripped from the text, debate sides swapped, and A/B order counterbalanced per rep.

Measured

Gemma 4 won 73% of its rounds as the slowest model (66.7 tok/s); the fastest model won 32%; the reasoning-branded model finished last at 7%; calibration was the lowest rubric score for every model (6.99 versus 8.1 to 8.6 on everything else)

Verdict

VERDICT: PASSED

Last week two models I cared about shipped within days of each other. One was Gemma 4, the 12B I run on my own GPU and the most interesting small model I have tried for local work. The other was Fable 5, a frontier model that was live for barely a week before it was officially pulled.

I had been planning to test Gemma 4 against its rivals. When Fable 5 dropped, I changed the plan and made it the judge instead. I wanted to see what the newest, strongest model would say about five much smaller ones, while I still had access to it.

For months I have had a habit that sounds stranger than it is. I let small local models debate each other and I watch. Tabs versus spaces, monolith versus microservices, then science, politics, sport, art, whatever I am curious about that day. The ones that show their reasoning are the addictive ones, because you get to see how they think, not only what they conclude.

This time I made it a real contest, and I built it to be hard to fool.

The Setup, Built Like a Clinical Trial

Five local models. 120 rounds: 64 debates and 56 reasoning puzzles, run on a single RTX 4080 SUPER. Gemma 4 was the challenger and faced all four of the others, so its win rate is a real cross-field number. The other four only ever faced Gemma, so their scores are records against Gemma, not a free-for-all.

The first version of the pipeline gave me a clean, exciting, worthless result. One model won eighteen rounds in a row, because the judge could see which model wrote which answer and which side of the page it sat on. So I rebuilt it the way you would run a clinical trial.

First, strip the names. Not just the labels, the content too. One model opened its second debate round like this:

Based on the analysis of the debate between gemma4:12b and deepseek-r1:14b...

It named itself and its opponent in the middle of its own argument. The redaction pass had to scan the prose, not only the metadata, so Fable ended up reading "the debate between [redacted] and [redacted]".
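A minimal sketch of a prose-level redaction pass, assuming the model tags are known in advance. Only `gemma4:12b` and `deepseek-r1:14b` appear in the quoted text; the other three tags, the `MODEL_NAMES` list, and the `redact` helper are illustrative assumptions, not the pipeline used here.

```python
import re

# The first two tags come from the quoted debate text; the rest are assumed spellings.
MODEL_NAMES = ["gemma4:12b", "deepseek-r1:14b", "qwen3:14b", "mistral-nemo", "phi4:14b"]

def build_pattern(names):
    """Match either the full tag or the bare family name (the part before ':')."""
    variants = set()
    for name in names:
        variants.add(name)
        variants.add(name.split(":")[0])
    # Longest first, so the full tag is tried before its shorter family name.
    ordered = sorted(variants, key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in ordered), re.IGNORECASE)

def redact(text, names=MODEL_NAMES):
    """Replace every model mention in free text with a neutral placeholder."""
    return build_pattern(names).sub("[redacted]", text)

sample = "Based on the analysis of the debate between gemma4:12b and deepseek-r1:14b..."
redacted = redact(sample)
print(redacted)
# prints: Based on the analysis of the debate between [redacted] and [redacted]...
```

The pattern has no word boundaries, so a short family name could match inside an unrelated word. A stricter version would need to handle that case.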

Second, counterbalance everything. Every matchup ran twice with the sides swapped, and the A/B presentation order rotated on each run.
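One way to generate such a schedule is sketched below. The exact rotation rule is not spelled out above, so the `(topic_index + rep) % 2` rule, the `build_schedule` helper, and the placeholder model names are assumptions. The goal is for each model to argue each side once per topic and to sit in slot A half the time, without side and slot always moving together.

```python
def build_schedule(challenger, rival, topics):
    """Return one run per (topic, rep) with debate sides and A/B slots assigned."""
    runs = []
    for topic_index, topic in enumerate(topics):
        for rep in range(2):
            # Rep 0: challenger argues PRO. Rep 1: sides swapped.
            pro, con = (challenger, rival) if rep == 0 else (rival, challenger)
            # Offset the slot rotation by topic so side and slot are not tied together.
            challenger_is_a = (topic_index + rep) % 2 == 0
            slot_a, slot_b = (challenger, rival) if challenger_is_a else (rival, challenger)
            runs.append({
                "topic": topic, "rep": rep,
                "pro": pro, "con": con,
                "slot_a": slot_a, "slot_b": slot_b,
            })
    return runs

topics = ["tabs vs spaces", "monolith vs microservices"]
runs = build_schedule("model_x", "model_y", topics)
for run in runs:
    print(run)
```

With only two reps per pairing, a single topic cannot cover all four side-and-slot combinations. Spreading them across topics is the compromise this sketch makes.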

After blinding, Fable picked position A 58 times and position B 54 times, with 8 ties. Among the 112 decisive verdicts that is a 52% share for A, statistically indistinguishable from a coin flip. The position bias was gone, and the results that survived were ones I could defend.
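The coin-flip reading can be checked with an exact two-sided binomial test on the decisive verdicts. This is a sketch using only the counts reported above; the `two_sided_binomial_p` helper is an illustrative name, and the test itself is an added check, not part of the original pipeline.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k heads in n flips of a fair coin."""
    expected = n / 2
    deviation = abs(k - expected)
    # Count every outcome at least as far from n/2 as the observed one.
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - expected) >= deviation)
    return tail / 2 ** n

picks_a, picks_b, ties = 58, 54, 8
decisive = picks_a + picks_b          # ties are set aside
share_a = picks_a / decisive
p_value = two_sided_binomial_p(picks_a, decisive)
print(f"A share: {share_a:.1%}, two-sided p = {p_value:.2f}")
```

A p-value far above the usual 0.05 threshold means a 58 to 54 split is what a judge with no positional preference would often produce. It does not prove the bias is zero, only that this sample shows no sign of it.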

The Standings

Ranked by win rate, with the catch already hiding in the speed column:

Model	Win rate	Speed (tok/s)	Won / Lost / Tied
Gemma 4 (12B)	73% (82/112)	66.7 (slowest)	82 / 30 / 8 (of 120)
qwen3 (14B)	42% (11/26)	68.3	11 / 15 / 4 (of 30)
mistral-nemo	32% (9/28)	84.5 (fastest)	9 / 19 / 2 (of 30)
phi4 (14B)	29% (8/28)	69.6	8 / 20 / 2 (of 30)
deepseek-r1 (14B)	7% (2/30)	67.3	2 / 28 / 0 (of 30)

Win rate counts decisive rounds only: the fraction in brackets is wins over wins plus losses, with ties set aside. Gemma faced all four rivals across 120 rounds, while each rival met Gemma 30 times (16 debates and 14 reasoning puzzles). That is why the rivals' counts are out of 30, and why their records are records against Gemma alone. Gemma won 63% of its debates and 86% of its reasoning rounds, where its habit of re-deriving and checking its own work paid off. It won as the slowest model on the board.
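The win-rate column can be reproduced from the Won / Lost / Tied counts in the table. The sketch below also runs a bookkeeping check that follows from the challenger format: every Gemma loss is a rival's win and vice versa. The `records` dictionary simply restates the table; the helper name is illustrative.

```python
# (won, lost, tied) per model, copied from the standings table.
records = {
    "Gemma 4 (12B)": (82, 30, 8),
    "qwen3 (14B)": (11, 15, 4),
    "mistral-nemo": (9, 19, 2),
    "phi4 (14B)": (8, 20, 2),
    "deepseek-r1 (14B)": (2, 28, 0),
}

def win_rate(won, lost, tied):
    """Wins over decisive rounds; ties are excluded from the denominator."""
    return won / (won + lost)

for model, (won, lost, tied) in records.items():
    print(f"{model:<20} {win_rate(won, lost, tied):.0%} ({won}/{won + lost})")

# In a challenger format the rivals' totals must mirror the challenger's.
gemma_won, gemma_lost, gemma_tied = records["Gemma 4 (12B)"]
rivals = [r for name, r in records.items() if name != "Gemma 4 (12B)"]
mirrors = (
    sum(r[0] for r in rivals) == gemma_lost
    and sum(r[1] for r in rivals) == gemma_won
    and sum(r[2] for r in rivals) == gemma_tied
)
print("Rival totals mirror Gemma's record:", mirrors)
```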

Speed Bought Nothing

Speed did not track quality. The fastest model, mistral-nemo at 84.5 tokens per second, won 32% of its rounds. The slowest, Gemma at 66.7, won 73%. The three models in between ran within about 2.3 tok/s of one another (67.3 to 69.6) and ranged from 42% down to 7%. Across this field, tokens per second predicted nothing good about output quality.
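One way to put a number on the relationship between the two columns is a Spearman rank correlation, where -1 would be an exact inverse and 0 no relationship. This is a sketch on the five rows of the table; with so few points, and four of the five speeds within 3 tok/s of each other, the figure is descriptive only.

```python
def ranks(values):
    """Rank 1 = smallest value. Assumes no ties, which holds for this table."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Order: Gemma 4, qwen3, mistral-nemo, phi4, deepseek-r1
speed = [66.7, 68.3, 84.5, 69.6, 67.3]
win_rate = [82 / 112, 11 / 26, 9 / 28, 8 / 28, 2 / 30]

rho = spearman(speed, win_rate)
print(f"Spearman rho = {rho:.2f}")
```

On these numbers the correlation comes out weakly negative (about -0.3), driven mostly by the two extremes. That supports "speed bought nothing" more than it supports a strict inverse ranking.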

mistral's failures all had the same shape: fluent, confident text with the verification step quietly missing. On the classic water-jug puzzle, its solution included this:

Empty both jugs. Now, if you pour the remaining 1 litre from the 5-litre jug...

It pours water out of a jug it emptied one sentence earlier. The judge called the procedure "internally incoherent". On the question of what happens if the Earth suddenly stops rotating, it claimed the winds "would gradually slow down and stop" with "no immediate catastrophic events", which is backwards: the atmosphere keeps its roughly 1,670 km/h velocity while the ground halts. Gemma derived the inertial cataclysm correctly, supersonic winds and all, in every one of its runs.
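The 1,670 km/h figure and the "supersonic" label can be checked with back-of-envelope arithmetic. The constants below are standard approximate values (equatorial circumference, sidereal day, speed of sound in air near 20 °C) and are not taken from the test itself.

```python
EQUATORIAL_CIRCUMFERENCE_KM = 40_075   # approximate
SIDEREAL_DAY_HOURS = 23.934            # one full rotation, approximate
SPEED_OF_SOUND_KMH = 1_235             # about 343 m/s at sea level, 20 °C

# Ground speed at the equator due to rotation.
surface_speed_kmh = EQUATORIAL_CIRCUMFERENCE_KM / SIDEREAL_DAY_HOURS
mach = surface_speed_kmh / SPEED_OF_SOUND_KMH

print(f"Equatorial surface speed: {surface_speed_kmh:.0f} km/h (Mach {mach:.2f})")
```

The speed falls off with the cosine of latitude, so the supersonic claim applies near the equator rather than everywhere on the globe.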

The Reasoning Model Forgot Its Job

deepseek-r1 was the only model in the field branded around chain-of-thought reasoning, and it finished dead last at 7%, winning 2 of its 30 rounds against Gemma. Its signature failure was structural. In the second round of debates it kept abandoning its own side and adjudicating instead, summarizing both positions neutrally as if it were the judge. Assigned to argue one side of a quality-of-life debate, it produced a balanced both-sides summary, and Fable noted:

B's Round 2 abandoned advocacy entirely, summarizing both sides and conceding the conclusion is subjective.

The model most marketed for reasoning lost most often because it reasoned its way out of having a position.

A Third of the Debates Were Luck

Every debate matchup ran twice with the sides swapped. Of the 31 pairings where Fable reached a decisive verdict in both reps, 21 gave the same winner both ways and 10 flipped. Roughly one in three outcomes was decided by which side or opening slot a model happened to draw, not by which model was better. The "which offers the better quality of life" topic flipped in 4 of its 5 matchups: the topic itself was a coin, and any single run of it would have told me a confident story that was actually noise.

This is the cheapest lesson in evaluation. Run it twice with the conditions swapped. If the answer changes, you were measuring the setup, not the system.
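The swap check reduces to a few lines once each verdict is recorded by winning model, not by side or slot. The sketch below uses placeholder data constructed to match the counts reported above (21 stable, 10 flipped, one pairing without two decisive verdicts). The `consistency` helper and the labels are illustrative.

```python
def consistency(pairings):
    """Count stable and flipped pairings among those decisive in both reps.

    Each pairing is (winner_rep1, winner_rep2), where a winner is a model
    name or the string "tie".
    """
    decisive = [(a, b) for a, b in pairings if a != "tie" and b != "tie"]
    same = sum(1 for a, b in decisive if a == b)
    flipped = len(decisive) - same
    return same, flipped, flipped / len(decisive)

# Placeholder verdicts shaped to match the reported totals, not the real logs.
pairings = (
    [("gemma", "gemma")] * 21      # same winner after the side swap
    + [("gemma", "rival")] * 10    # winner changed after the side swap
    + [("gemma", "tie")]           # excluded: not decisive in both reps
)

same, flipped, flip_rate = consistency(pairings)
print(f"stable: {same}, flipped: {flipped}, flip rate: {flip_rate:.0%}")
```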

Even the Winner Hallucinated

The strongest model in the contest was not immune. Arguing for the best film ever made, Gemma wrote:

The supporting cast (James Dalton, Robert Wagner, Diane Keaton) provides a textured world...

The film was The Godfather, whose supporting cast included James Caan and Robert Duvall. "James Dalton" is not a real actor. Fable's ruling:

A's case is rhetorically superior but fabricates cast names ('James Dalton, Robert Wagner' for Caan/Duvall); B's Citizen Kane case is factually accurate throughout, so accuracy decides it.

Gemma lost that round. The reverse also happened in a tennis debate, where Gemma fact-checked its opponent in real time after a false claim about the Grand Slam record, and the judge gave it the round on "intellectual honesty and evidence". Both behaviors live in the same model. Which one you get depends on the round.

The most human failure came from mistral on the Cognitive Reflection Test. A notebook and a pen cost 2.20 in total, the notebook costs 2.00 more than the pen, so the pen is 0.10. mistral's attempt included this line, verbatim:

Notebook = Pen + 2.00 So, 2.00 + P = 2.00 + P

A literal tautology. It dropped a term, papered over it with algebra-shaped text, and confidently answered 0.20, the same intuitive wrong answer most humans give. The judge caught the exact line and scored it 2 out of 10 on accuracy. On the second run, the same model got it right with a clean verification step and scored 10 out of 10. Same model, same prompt, different answer: run-to-run variance, which is bad news for anyone who evaluates on a single run.
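The verification step that was missing is short. Writing the pen as P, the two conditions give P + (P + 2.00) = 2.20, so 2P = 0.20 and P = 0.10. The sketch below solves it and then substitutes both the correct and the intuitive answer back into the total. Exact fractions are used so the comparison is not blurred by floating-point rounding, and the helper name is illustrative.

```python
from fractions import Fraction

TOTAL = Fraction("2.20")       # notebook + pen
DIFFERENCE = Fraction("2.00")  # notebook - pen

def is_consistent(pen):
    """Substitute a candidate pen price back into both conditions."""
    notebook = pen + DIFFERENCE
    return pen + notebook == TOTAL

# Solve: pen + (pen + DIFFERENCE) = TOTAL  =>  pen = (TOTAL - DIFFERENCE) / 2
pen = (TOTAL - DIFFERENCE) / 2
notebook = pen + DIFFERENCE

print(f"pen = {float(pen):.2f}, notebook = {float(notebook):.2f}")
print("0.10 checks out:", is_consistent(Fraction("0.10")))
print("0.20 checks out:", is_consistent(Fraction("0.20")))
```

With a pen at 0.20 the notebook would cost 2.20 and the pair 2.40, which is the contradiction a substitution check exposes immediately.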

The One Thing Everyone Failed: Calibration

The reasoning rounds were scored on five dimensions from 1 to 10. Here are the field-wide averages:

Clarity 8.63
Accuracy 8.38
Completeness 8.17
Reasoning 8.13
Calibration 6.99

Calibration was the lowest score for every single model. It measures whether a model's confidence matches its correctness, whether it says "I am not sure" when it is actually shaky instead of asserting a wrong answer with full conviction. Even Gemma, the winner, scored around 7.4 on calibration while clearing 8.9 on accuracy. The best arguer in the room still overclaimed certainty.

That is the part that should worry anyone shipping these. They explain clearly, they reason adequately, and they almost never tell you when to doubt them.
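The calibration scores above are 1-to-10 rubric marks assigned by the judge. A different and common way to quantify the same idea is the Brier score, sketched here as a swapped-in technique that was not used in this test. It assumes a model can be made to state a numeric confidence per answer, and the two confidence profiles below are invented for illustration.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and correctness (0 is best)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# Hypothetical run: five answers, the last two wrong (1 = correct, 0 = wrong).
outcomes = [1, 1, 1, 0, 0]

# Same answers, two confidence styles (invented numbers).
overconfident = [0.95, 0.95, 0.95, 0.95, 0.95]  # full conviction everywhere
hedged = [0.90, 0.90, 0.90, 0.40, 0.40]         # doubt expressed where it is shaky

score_overconfident = brier_score(overconfident, outcomes)
score_hedged = brier_score(hedged, outcomes)
print(f"overconfident: {score_overconfident:.4f}, hedged: {score_hedged:.4f}")
```

Both profiles have identical accuracy (3 of 5), yet the hedged one scores far better. That is the gap between the accuracy and calibration columns: being right and knowing when you are right are separate measurements.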

What I Took From This

Blind your judges before you trust them. Teams are wiring LLM-as-judge into eval suites and CI pipelines right now, and most have never checked whether their judge carries a position bias, an order bias, or a verbosity bias. An eval you have not tried to break is not evidence, it is decoration.

Never pick a production model off a single run. A third of the debates flipped on a side swap, and a model that failed a reasoning puzzle passed the same puzzle on its second attempt. One run tells you almost nothing about a probabilistic system.

The thing the AI could not supply was the judgment that made the numbers mean anything. Fable was a strong judge only because I spent more time attacking it than running the contest. The blinding, the redaction of self-naming prose, the counterbalancing, the two reps per pairing: the human did all of that. The models argued beautifully. The rigor came from the setup.

One honest caveat: this is two reps per pairing, a single judge, and a challenger format where four models only faced Gemma. Treat it as a rigorous small study, not a leaderboard. The methodology is the part I would defend, more than any single win rate.


Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
