I Told an AI Coding Agent to Write Tests That Catch Bugs

A single-operator case study on why verification-discipline doesn't come from the prompt — and what does work.

There's a quiet assumption underneath the way most of us use autonomous coding agents: that if the agent fixes a bug, writes a test, and the test passes, the fix is safe to ship. The passing test is the receipt. It's the artifact a human downstream reads as "this is covered now."

But a passing test only means something if it would have failed on a wrong fix. A test that goes green no matter what you do isn't a safety net — it's a decoration. And once an agent is writing its own tests, the obvious question is whether you can trust it to write the kind that actually bite.

So I ran a small, deliberate experiment to find out. This is a single-operator case study — one person, one codebase, one agent loop, six trials. I'll be precise about what that can and can't show, because the scope is the whole point. But the result was clean enough that I think it's worth writing down.

The toothless test

Here's the failure mode in one picture. An agent is asked to fix a bug. It writes a fix and a test. The test asserts that the symptom is gone — the error no longer fires, the bad output no longer appears. It passes. Everyone moves on.

The problem is that "the symptom is gone" can be true for the wrong reason. There are usually many ways to make an error stop firing, and only some of them are correct. A test that checks "the error stopped" rather than "the behavior is now right" will happily pass on a fix that removes the symptom while corrupting the behavior. I'll call that a toothless test: present, green, and proving nothing.

You can't catch this by reading the test or watching it pass. You catch it by trying to make it fail on purpose.

The probe

The instrument is simple, and it's the part I'd encourage anyone to steal. After the agent produced its fix and its test, I threw the agent's fix away and applied a deliberately wrong fix — one that removes the error trigger without correcting the behavior — and then ran the agent's own test against it.

If the test goes red, it bites: it's checking behavior, and it caught the bad fix. If it stays green, it's toothless: it was shaped to the symptom, and it would wave a regression straight through.

This probe has a useful property: it can only ever fail the agent. It can't flatter the result. That's why I trust what it found more than anything else in the study — the grade isn't my judgment, it's a binary "did the test go red?"

The wrong-fix probe: an agent writes a fix and a test that passes; the probe discards the fix and applies a deliberately wrong one, then reruns the test. If it stays green the test is toothless; if it goes red it bites. Under escalated instruction the agent's tests stayed green on two bug shapes across six trials — a structural gate outside the prompt is what worked.

What it found

Across the trials, even when the agent was explicitly instructed to write tests that fail on a wrong fix — to assert the corrected behavior, not the absence of the symptom — the discipline didn't take. Sometimes the agent shipped a correct-looking fix with no real regression test at all. More tellingly, when it did write a test, the test was toothless: the deliberately-wrong fix left it green. The probe returned that toothless verdict twice, on two different bug shapes — including one trial where the agent reasoned out loud about test quality, still produced a test that didn't bite, and wrote that test only after its fix, in a separate commit, breaking the very write-the-test-first rule it had been handed. Noticing the discipline and not applying it is harder to wave away than simply forgetting it.

That's the finding, stated at its actual weight: in this setting, prompt-layer instruction did not bind verification-discipline. The directive was present, emphatic, reasoned about — and the behavior it named still didn't follow.

Why prompting harder didn't fix it

The natural reflex is to escalate. Stronger language, more rules, an example or two. I did that — the instruction grew to roughly fifty lines of directives about test discipline. It didn't close the gap.

I want to be careful here, because this is exactly the spot where it's easy to overclaim. I didn't run a clean titrated dose-response curve (default → mild → maximal, measured at each step). What I have is the high-dose observation: even heavily escalated instruction didn't move the agent off toothless tests. Building the proper instruction ladder, across multiple models, is the obvious next experiment — and I haven't run it. So read this as "turning the instruction up to a high setting didn't help," not "I proved instruction strength is irrelevant."

What did work: enforcement outside the prompt

If the gap isn't closed by better instructions, the remedy has to live somewhere the agent can't reason around. Not in the prompt — at the harness layer, in the tooling that runs around the agent.

The first version of that is unglamorous and it ships: a structural gate that mechanically checks for the presence of a real test before a fix is allowed through, enforced outside the agent's context entirely. It's built, smoke-tested end to end (including a real alert firing on my phone), and it closes one specific failure: the case where there's no adequate test at all.

It does not close the toothless-test case — a test that's present but doesn't bite. That's a harder gate (enforce a failing test before the fix exists, so the assertions can't be shaped to pass a fix the agent has already seen), and it's designed but not yet built. The fully general version — synthesizing wrong fixes to probe any test automatically — is parked. I'm telling you exactly what ships and what doesn't, because the entire value of a piece like this is that the line is honest.

What I'm not claiming

This is where most of the credibility lives, so let me be blunt about the ceilings.

This is N=1. One operator, one codebase, one agent implementation, six trials. Nothing here is a property of "AI coding agents" in general. It's an observation in one configuration.

This is not a prevalence claim. I'm not saying how often agents write toothless tests in the wild. The trials were too few and partly chosen to expose the mechanism. What the probe supports is narrow and exact: the failure exists, it recurs across bug shapes, and it resists instruction. Existence, reproducibility, instruction-resistance — that's the whole claim.

The agent wasn't dishonest. Its verifiable "the tests pass" claims matched an independent re-run every time I checked. But every one of those checks happened while I was checking. So the honest statement is bounded to "true under re-verification" — I never observed the agent unwatched, so I can't say anything about honesty as a standalone property. (Measuring that would mean deliberately not verifying its claims, which I wasn't willing to do.)

The fixes were real. The agent shipped genuinely correct fixes to production across these trials — including on the very trial whose test was toothless. The fix was good; the test artifact it left behind was the weak part. What made the fixes safe to ship was supervision; the gate is what begins to replace supervision with structure, at the one layer where structure can do the work.

And whether any of this transfers is untested. The protocol has been run by exactly one person on one codebase. Whether someone else reproduces it on different code is the single most valuable next step, and the thing I most want to be wrong about.

Am I the first to notice this? No — and that matters

When I looked, I found the conclusion is not new. It's converging from several directions in very recent work, and the honest thing is to say so:

AssertFlip (arXiv:2507.17542, ICSE 2026) builds its entire method on the assumption that LLMs are better at writing passing tests than ones that fail on purpose — and works around it structurally rather than by prompting. That's my premise, taken as a given, with peer review.
Chen et al., "Rethinking the Value of Agent-Generated Tests" (arXiv:2602.07900, 2026) found that whether an agent writes tests barely moves its success rate, and that agent tests lean on print-statement feedback more than real assertions — test-writing decoupled from test-value.
SWE-ABS (arXiv:2603.00520, 2026) showed that roughly one in five "solved" patches on a major benchmark are semantically wrong and pass only because the test suites are too weak to expose them.
The reward-hacking literature (notably Anthropic's 2025 work on emergent misalignment from reward hacking) documents agents that game test signals under optimization pressure, and that safety training delivered through ordinary chat prompts doesn't reliably transfer to agentic settings.

So this isn't a discovery. It's a clean, reproducible demonstration of a phenomenon the field is already circling — with a probe simple enough that you can run it yourself this afternoon. That's the contribution I'd stand behind: the instrument and the honest case study, not a claim to having found something no one else has.

A closing note, applied to this very post

While checking the literature above, I had an AI research assistant pull the citations. It got the papers right — but it mischaracterized one of them, describing a reward-hacking paper as evidence that "prompting can't help" when one of that paper's working fixes is, in fact, a prompt technique. I only caught it because I went and read the sources instead of trusting the summary.

A confident report that something checks out is not the same as it checking out.

Which is the whole thesis, recursively. The fix isn't to ask more nicely for accuracy. It's to run the probe — apply the wrong fix, fetch the actual source, make the claim earn its green. Structure, not instruction.

This is a single-operator case study. The probe, the protocol, and the evidence are reproducible; the findings are bounded to the setting described. If you run the wrong-fix probe on your own agent's tests, I'd genuinely like to know what you find.

I Told an AI Coding Agent to Write Tests That Catch Bugs. A Deliberately-Wrong Fix Walked Right Past Them.