I audited one repo's AI-generated CI in a single session and surfaced nine distinct bugs. The tests pass, the build compiles, the docs render. Every single bug lived in the metadata layer, including a tee pipe that hid six failing tests behind a green badge for weeks.
Setup
An audit of one repo's AI-generated GitHub Actions CI in a single session
Measured
9 distinct bugs, all in the metadata layer: tee masking 6 failing tests, 3 hallucinated SHA pins, colliding Pages deploys, and more
Verdict
VERDICT: FAILED
Nine LLM-generated bugs in one repo's CI. None were in the production code.
I audited a single repo's CI this week and surfaced nine distinct AI-generated bugs in one session. Tests pass. Build compiles. Docs render. Every bug lived in the metadata layer, not the work layer.
A test step in a workflow looked like this:
run: ctest --preset gcc 2>&1 | tee test-results.txt
Reasonable, right? Run the tests, save the output, done.
Except bash pipelines return the exit code of the last command, and tee always exits 0. So when ctest reported "6 tests failed," the pipeline still exited 0, the step reported success, and the CI badge stayed green.
Six failing tests had been silently masked on every push for weeks. The job summary said "Failed 6," visible to any human reading it. The exit code lied. The gate stayed green.
The fix is one line: set -o pipefail before the pipe.
The shape repeated every time.
Three hallucinated SHA pins that looked real but pointed at commits that did not exist in the upstream repo's history. The version comment was real. The hex string was invented.
Two workflows independently deploying to the same GitHub Pages target. Whichever ran last won. The other simply vanished.
The same action pinned to three different versions across six workflow files, because the LLM generated each file independently. Invisible to anyone reading any single file.
actions/upload-pages-artifactsilently strips dotfiles. MkDocs'.nojekyllgot dropped, so Pages Jekyll-processed the site and broke it.
Workflows using GHAS-gated features on a repo with no GHAS license. The LLM had no concept of licensing prerequisites.
My own audit script walked
.github/workflows/but missed.github/actions/. Even the verification had AI-shaped blind spots.
A summary step counted test failures by grepping for "Failed" across the full log. Three test names contained "Failed" as a substring, producing 6 phantom failures on every push despite every test passing. The authoritative ctest summary sat in the same file, ignored.
Every bug was in the metadata layer, not the work layer.
The C++ tests pass. CodeQL builds the database. MkDocs renders the docs. The work itself is correct. What was wrong was the layer that reports the work, gates it, references it, or coordinates it: exit codes, version pins, badges, summary scripts, deployment routing.
AI gets the happy-path semantics right and gets the metadata subtly wrong. The broken metadata then either hides real bugs or invents fake ones.
LLMs accelerate generation enormously. They do not accelerate the "is this actually doing what it claims" check by the same factor. That gap is where every bug in this list lived
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Keep reading
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.
Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.