AI FIELD TEST №06 · April 30, 2026 · 5 min read · AI · Engineering-Judgment · Security

Nine AI-Generated Bugs in One Repo's CI. None in the Production Code

I audited one repo's AI-generated CI in a single session and surfaced nine distinct bugs. The tests pass, the build compiles, the docs render. Every single bug lived in the metadata layer, including a tee pipe that hid six failing tests behind a green badge for weeks.

Setup

An audit of one repo's AI-generated GitHub Actions CI in a single session

Measured

9 distinct bugs, all in the metadata layer: tee masking 6 failing tests, 3 hallucinated SHA pins, colliding Pages deploys, and more

Verdict

VERDICT: FAILED

Nine LLM-generated bugs in one repo's CI. None were in the production code.

I audited a single repo's CI this week and surfaced nine distinct AI-generated bugs in one session. Tests pass. Build compiles. Docs render. Every bug lived in the metadata layer, not the work layer.

The One That Set the Tone

A test step in a workflow looked like this:

run: ctest --preset gcc 2>&1 | tee test-results.txt

Reasonable, right? Run the tests, save the output, done.

Except a bash pipeline returns the exit code of its last command unless pipefail is set, and tee exits 0 whenever it can write its output. So when ctest reported "6 tests failed," the pipeline still exited 0, the step reported success, and the CI badge stayed green.

Six failing tests had been silently masked on every push for weeks. The job summary said "Failed 6," visible to any human reading it. The exit code lied. The gate stayed green.

The fix is one line: set -o pipefail before the pipe.
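The masking is easy to reproduce outside CI. The following is a minimal sketch, not the repo's actual step: it assumes bash, uses `false` as a stand-in for a failing ctest run, and writes to a throwaway temp file instead of test-results.txt.

```shell
#!/usr/bin/env bash
# `false` always exits 1, standing in for a ctest run with failures.
log="$(mktemp)"

# Without pipefail, the pipeline's status is tee's status.
set +o pipefail
masked_status=0
false 2>&1 | tee "$log" >/dev/null || masked_status=$?

# With pipefail, the pipeline reports the rightmost non-zero status.
set -o pipefail
real_status=0
false 2>&1 | tee "$log" >/dev/null || real_status=$?
set +o pipefail

echo "without pipefail: $masked_status, with pipefail: $real_status"
rm -f "$log"
```

This prints "without pipefail: 0, with pipefail: 1". As far as the GitHub Actions workflow syntax documentation describes it, a run step on a Linux runner with no explicit shell executes as bash -e {0}, which does not include pipefail, while declaring shell: bash switches the step to bash --noprofile --norc -eo pipefail {0}. Either that declaration or an explicit set -o pipefail inside the step closes the gap.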

This Was One of Nine

The shape repeated every time.

Three hallucinated SHA pins that looked real but pointed at commits that did not exist in the upstream repo's history. The version comment was real. The hex string was invented.
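A pin can be checked mechanically rather than by eye. This is a rough sketch under stated assumptions: a full local clone of the upstream action repo already exists, and the function names `pin_exists` and `pin_matches_tag` are hypothetical helpers, not part of git or of the audit described here.

```shell
#!/usr/bin/env bash
# pin_exists CLONE_DIR SHA
# Succeeds only if SHA is a full 40-character hex string and names a
# commit object that is present in the local clone at CLONE_DIR.
pin_exists() {
  local clone_dir="$1" sha="$2"
  [[ "$sha" =~ ^[0-9a-f]{40}$ ]] || return 1
  git -C "$clone_dir" cat-file -e "${sha}^{commit}" 2>/dev/null
}

# pin_matches_tag CLONE_DIR SHA TAG
# Succeeds only if TAG exists in the clone and resolves to exactly SHA,
# which is what a "# vX.Y.Z" comment next to a pin implicitly claims.
pin_matches_tag() {
  local clone_dir="$1" sha="$2" tag="$3" tag_sha
  tag_sha="$(git -C "$clone_dir" rev-parse --verify --quiet \
    "refs/tags/${tag}^{commit}")" || return 1
  [ "$tag_sha" = "$sha" ]
}

# Example (hypothetical path and variables):
#   pin_exists ./upstream-clone "$pinned_sha" || echo "commit not found"
#   pin_matches_tag ./upstream-clone "$pinned_sha" "$claimed_tag" || echo "tag mismatch"
```

The first check catches an invented hex string. The second catches a pin that exists but does not match its version comment.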

Two workflows independently deploying to the same GitHub Pages target. Whichever ran last won. The other simply vanished.
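A collision like this only shows up when the workflows are read together. A small sketch, assuming both workflows deploy through actions/deploy-pages (the source does not say which deploy mechanism was used) and with `list_pages_deployers` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# list_pages_deployers WORKFLOW_DIR
# Prints every YAML file under WORKFLOW_DIR that references
# actions/deploy-pages. More than one line of output means more than one
# workflow can publish to the repo's Pages site.
list_pages_deployers() {
  local dir="$1"
  grep -rlE --include='*.yml' --include='*.yaml' \
    'uses:[[:space:]]*actions/deploy-pages@' "$dir" | sort
}

# Example:
#   list_pages_deployers .github/workflows
```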

The same action pinned to three different versions across six workflow files, because the LLM generated each file independently. Invisible to anyone reading any single file.
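Drift of this kind becomes visible once every `uses:` reference is collected in one place. A sketch of that idea, assuming unquoted `uses: owner/repo@ref` lines; `list_pin_drift` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# list_pin_drift DIR...
# Prints "action: ref1 ref2 ..." for every action that is referenced with
# more than one distinct ref across all YAML files under the given
# directories. Actions pinned consistently produce no output.
list_pin_drift() {
  grep -rhoE --include='*.yml' --include='*.yaml' \
    'uses:[[:space:]]*[^[:space:]@#]+@[^[:space:]#]+' "$@" |
    sed -E 's/^uses:[[:space:]]*//' |
    sort -u |
    awk -F'@' '
      { refs[$1] = refs[$1] " " $2; count[$1]++ }
      END { for (a in count) if (count[a] > 1) print a ":" refs[a] }
    ' |
    sort
}

# Example:
#   list_pin_drift .github/workflows .github/actions
```

Empty output means every action uses a single ref everywhere. Any line of output names an action worth reconciling.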

actions/upload-pages-artifact silently strips dotfiles. MkDocs' .nojekyll got dropped, so Pages Jekyll-processed the site and broke it.

Workflows using features gated behind GitHub Advanced Security (GHAS) on a repo with no GHAS license. The LLM had no concept of licensing prerequisites.

My own audit script walked .github/workflows/ but missed .github/actions/. Even the verification had AI-shaped blind spots.
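The scope fix amounts to enumerating both directories before any check runs. A minimal sketch, with `list_ci_yaml` as a hypothetical helper name. It covers only the two conventional locations, so local actions stored elsewhere in the repo would still be missed.

```shell
#!/usr/bin/env bash
# list_ci_yaml REPO_ROOT
# Prints every YAML file under .github/workflows and .github/actions,
# skipping either directory if it does not exist.
list_ci_yaml() {
  local root="$1" dir
  for dir in "$root/.github/workflows" "$root/.github/actions"; do
    [ -d "$dir" ] || continue
    find "$dir" -type f \( -name '*.yml' -o -name '*.yaml' \)
  done | sort
}

# Example:
#   list_ci_yaml . | while IFS= read -r file; do echo "auditing $file"; done
```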

A summary step counted test failures by grepping for "Failed" across the full log. Three test names contained "Failed" as a substring, producing 6 phantom failures on every push despite every test passing. The authoritative ctest summary sat in the same file, ignored.
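The overcount can be reproduced on a small log. This is a hypothetical reconstruction, not the repo's summary step: the three test names are invented, the log follows ctest's usual output layout, and the naive count is assumed to be a plain `grep -c`.

```shell
#!/usr/bin/env bash
# Sample log in ctest's output layout. The test names are hypothetical.
log="$(mktemp)"
cat > "$log" <<'EOF'
    Start 1: RetryAfterFailedConnect
1/3 Test #1: RetryAfterFailedConnect ..........   Passed    0.01 sec
    Start 2: ReportsFailedChecksum
2/3 Test #2: ReportsFailedChecksum ............   Passed    0.01 sec
    Start 3: ClearsFailedState
3/3 Test #3: ClearsFailedState ................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 3
EOF

# Naive count: every line that mentions "Failed", test names included.
naive_count="$(grep -c 'Failed' "$log")"

# Count read from ctest's own summary line instead.
summary_count="$(sed -nE \
  's/^[0-9]+% tests passed, ([0-9]+) tests failed out of [0-9]+$/\1/p' "$log")"

echo "naive: $naive_count, from summary: $summary_count"
rm -f "$log"
```

This prints "naive: 6, from summary: 0". Each of the three names appears on a Start line and a result line, which is how three passing tests turn into six phantom failures.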

The Pattern Across All Nine

Every bug was in the metadata layer, not the work layer.

The C++ tests pass. CodeQL builds the database. MkDocs renders the docs. The work itself is correct. What was wrong was the layer that reports the work, gates it, references it, or coordinates it: exit codes, version pins, badges, summary scripts, deployment routing.

AI gets the happy-path semantics right and gets the metadata subtly wrong. The broken metadata then either hides real bugs or invents fake ones.

LLMs accelerate generation enormously. They do not accelerate the "is this actually doing what it claims" check by the same factor. That gap is where every bug in this list lived.

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
