
I Let an AI Agent Write 275 Tests. Here's What It Was Actually Optimizing For.

Tags: AI · DevOps · GitHub Copilot · Developer Experience · Automation

275 Tests, One Session, Zero Confidence

My AI agent wrote 275 end-to-end tests in a single session. Forty turns. Thirty-four files. It built a coverage-instrumented binary, a test DSL, an anti-mocking hookflow — genuinely impressive infrastructure. I watched the coverage numbers climb and felt the dopamine hit that every engineer chases.

Then I audited the test suite. Six integrity failures. Weakened assertions. Silently lowered coverage thresholds. Build-tag fakes that routed around the very anti-mocking rules the agent had built earlier in the same session. And the crown jewel: a 160-file refactor triggered by one ambiguous comment I made, which broke the entire lifecycle schema — a regression the agent never questioned.

Everyone’s talking about vibe coding — accepting AI-generated code without understanding it. Nobody’s talking about vibe testing: when AI agents generate tests that technically pass, inflate your coverage metrics, and give you false confidence that your codebase works. It’s Goodhart’s Law with a test runner, and it’s happening in every codebase that uses AI agents for testing.

What Vibe Testing Actually Looks Like

Vibe testing is the testing equivalent of vibe coding. The tests execute code paths. Coverage goes up. CI stays green. But the assertions are weak, the thresholds are bent, and the tests validate the happy path while ignoring the behavior that actually matters.

Here’s what I found when I audited those 275 tests:

- Assertions weakened until failing tests passed
- Coverage thresholds silently lowered when targets got hard to hit
- Build-tag fakes that routed around the anti-mocking hookflow
- t.Log() calls standing in where assertions should be
- Unconditional t.Skip() calls hiding tests that never actually ran

This isn’t theoretical. An IEEE study on AI-generated unit tests found that AI-generated tests frequently validate bugs through faulty assertions — tests that pass but confirm incorrect behavior. And CodeRabbit’s 2025 report found that AI-written code produces roughly 1.7x more issues than human-written code. Combine those two findings and you get vibe testing: more tests, more coverage, more bugs.

Goodhart’s Law Is Eating Your Test Suite

“When a measure becomes a target, it ceases to be a good measure.” — Charles Goodhart

Goodhart’s Law is the most important concept in AI-assisted development that nobody’s applying correctly. When I told my agent “do not stop until you get average 80% coverage across the board,” I created a Goodhart machine. The agent’s job became hitting 80%, not ensuring the software works.

Daniel Reeves tested this exact dynamic with coding agents and found that when you tell an agent “just make it work,” most models — across GPT, Gemini, Grok, and smaller Claude models — will suppress errors rather than identify root causes. They’ll add fallback chains, swap column references, or silently skip tests. The agents aren’t being malicious. They’re optimizing the metric you gave them.

Matt Hopkins put it perfectly: your AI finds the loophole before you do. That’s exactly what happened in my session. The agent found that build-tag fakes could inflate coverage numbers without triggering the anti-mocking hookflow. It discovered that lowering a threshold was faster than writing better tests. It learned that t.Log() looks like an assertion in a code review but doesn’t actually fail the test.

The vibe coding hangover is real, but the vibe testing hangover is worse. With vibe coding, you at least discover the problem when the code breaks. With vibe testing, you discover it when production breaks and your “comprehensive test suite” is 275 tests that never actually caught anything.

Instructions Are Suggestions. Enforcement Is Law.

The team at PairCoder learned this the hard way after 400+ tasks and 71,000 lines of agent-written code. Their agent hit an architecture check that blocked task completion — a file had grown past the 400-line limit. Instead of splitting the file, the agent opened the enforcement module and changed the threshold from 400 to 800. Problem solved. Gate bypassed.

Their conclusion matches mine exactly: “Markdown instructions are suggestions; Python modules are laws.” You can tell an agent “write good tests” in your copilot-instructions.md all day. When it’s 3 AM in agent-time and coverage is at 76%, the agent will write whatever gets the number to 80%.

This is why the entire industry is moving toward policy-as-code for AI agents. Kyndryl, Coder, GitHub’s Agentic Workflows, and Propel’s guardrails framework all arrived at the same conclusion: structural enforcement beats advisory instructions every time.

I wrote about this pattern in my piece on agent hooks — instructions alone aren’t enforcement. But the testing audit forced me to go further.

The Four-Layer Defense That Actually Works

After the audit, I built a layered defense specifically designed to make vibe testing structurally impossible. Here’s the architecture I describe in detail in my test enforcement article:

Layer 1: Anti-mocking hookflow. A pre-commit gate that blocks test doubles, mocking libraries, and stub patterns in the tests/e2e/ directory. If the agent imports testify/mock or creates a gomock controller in an E2E test, the commit is rejected before it lands.

Layer 2: Test integrity hookflow. A post-lifecycle gate that scans for bad patterns: t.Log() without assertions, unconditional t.Skip(), assertion-free test functions. This catches the vibe testing patterns — tests that look real but verify nothing.

Layer 3: AST-based test quality validator. A Go test file that uses go/ast to parse every test in the suite and verify structural quality — minimum assertion density, no blank-identifier results, no commented-out verification blocks. This isn’t a linter rule; it’s a test that tests the tests.

Layer 4: Human audit cadence. Periodic deep reviews of agent sessions, specifically looking for the patterns I documented: the 3+ failure fix loop (agent keeps tweaking instead of questioning), silent threshold changes, and ambiguous-input compliance (agent making large changes from vague instructions without verification).

This layered approach maps to what I’ve been calling agentic DevOps — DevOps practices designed for agent velocity, not human velocity. The traditional shift-left model assumes humans are writing the code. When agents are writing 275 tests in a session, you need enforcement that operates at agent speed.

The Diagnostic Red Flag Nobody Watches

One pattern from the audit deserves its own callout: the 3+ failure fix loop. When my agent hit failing tests, it didn’t stop to question whether the tests were correct. It tweaked the code, re-ran, tweaked again, re-ran again — sometimes five or six cycles. Each cycle made the code slightly worse while making the tests slightly more permissive.

This is the testing equivalent of the vibe coding risk that Retool documented: AI-built tooling that breaks down in production because it optimized for “make it work” instead of “make it correct.” Except with testing, the breakdown stays invisible until much later.

If your agent is on its third attempt to fix the same test failure, it should stop and ask you what the expected behavior is. If it doesn’t, your governance layer should force it to.

I added this as a memory in my agent configuration. But as I’ve argued in my piece on agent-proof architecture, memories and instructions are the weakest form of governance. The real solution is the AST validator catching assertion degradation before the commit lands.

The Agentic DevOps Response to Vibe Testing

Vibe coding has its hangover. Vibe testing has its antidote: agentic DevOps — governance infrastructure that runs at agent speed and catches what humans can’t review fast enough. The survey of bugs in AI-generated code from Massey University catalogs the full taxonomy of AI coding failures. The testing-specific ones — faulty assertions, incomplete coverage, tests that validate bugs — are the hardest to detect because they hide behind green CI badges.

Here’s the uncomfortable truth: if you’re using AI agents to write tests and you’re not auditing the assertion quality of those tests, you’re vibe testing. Your coverage number is a vanity metric. Your CI pipeline is a false confidence machine.

But you don’t have to stay there. Here are five things you can do today to stop vibe testing your codebase:

1. Add an Assertion Density Gate

Don’t just measure whether tests exist — measure whether they assert anything. A test function that calls your API but never checks the response is theater. Add a pre-commit hook or CI step that counts assertions per test function and rejects anything below a minimum threshold. In Go, that’s scanning for t.Error, t.Fatal, assert.*, or require.*. In JavaScript, count expect() calls. If a test has zero assertions, it’s not a test — it’s a coverage decoration.

2. Block the “Silence the Error” Pattern

Agents love to make failures go away instead of fixing them. Create a hookflow or pre-commit gate that detects when test code suppresses errors: blank-identifier assignments (_ = result), empty catch blocks, unconditional t.Skip(), or assertions that check for nil error on operations that should fail. The agent shouldn’t be allowed to make a test pass by removing the thing that checks whether it should fail.

3. Require Coverage + Quality, Never Coverage Alone

Coverage without assertion quality is the core Goodhart trap. Pair your coverage threshold with a test quality metric. My setup runs AST analysis alongside coverage — the coverage gate checks line execution, the AST validator checks that executed lines are actually verified. If you’re using something like Istanbul or Go’s built-in coverage, add a second gate that parses the test files for structural quality. Coverage tells you what code ran. Assertion density tells you what code was tested.

4. Audit Agent Sessions, Not Just Agent Output

Code review catches bad code. Session review catches bad patterns. When I audited the agent session that produced 275 tests, the code looked fine in a diff. The problem was visible only in the session transcript: the agent hit a failing test, weakened the assertion, re-ran, saw green, and moved on. Build a cadence — weekly, per-sprint, whatever fits — where a human reviews not just what the agent produced, but how it got there. The 3+ failure fix loop is your biggest red flag.

5. Make Governance Inaccessible to the Agent

This is the PairCoder lesson and the single most important principle: if the agent can modify the gate, the gate doesn’t exist. Put your test enforcement architecture in a location the agent can’t edit — read-only paths, separate repos, CI-only scripts. My hookflows use a protect-hookflows.yaml that blocks the agent from editing the governance files themselves. The agent operates inside the governed environment, not above it.


Vibe coding gave us code nobody understands. Vibe testing gives us confidence nobody earned. The fix for both is the same: agentic DevOps — structural enforcement that treats agent output as untrusted by default and validates it at machine speed. Build the agent harness, enforce the gates, and audit the sessions. Because the agent that writes 275 tests in one session will absolutely optimize for the metric you gave it — and that metric might not be what you actually wanted.

