I Tested Claude Fable 5 Against Private Benchmarks It Has Never Seen, Here Is Where It Failed

When Anthropic released Claude Fable 5 this week, my feed filled up with the same benchmark charts within hours. SWE-bench scores, agentic coding numbers, the Stripe migration story. All impressive, and all numbers published by the vendor itself. I wanted to know something those charts cannot tell me: how does this model behave on bugs it has never seen, graded by tests it cannot read?

I maintain a small private benchmark suite for exactly this purpose. The instances come from real-world bugs in projects I have worked on, rebuilt as self-contained broken applications, and they have never been published anywhere. Public benchmarks leak into training data, and once that happens you are no longer measuring debugging ability, you are measuring recall. Mine stay private, which is also why this post describes the tasks but never shows their code.

How the harness keeps the model honest

Before the results, a quick summary of the setup, because the grading discipline is what makes the numbers worth reading.

Each benchmark stages a clean sandbox copy of a broken project. The agent gets a bug report written the way QA would write it – symptoms and constraints, no file names, no hints about the cause.
The real acceptance tests live outside the sandbox. They are copied in only after the agent finishes, inside a Docker container running with network access disabled. The agent never sees them.
Grading is behavioral. The tests drive a real browser and assert on what renders, frame by frame where needed, not on how the code was written. Any correct fix passes, not just the one I had in mind.
Config files, manifests, and lockfiles are hashed before the agent runs. If anything protected changes, the run is voided.
A run only counts as resolved when every hidden verifier passes and every existing behavior guard stays green. Partial credit does not exist.

Each instance was also calibrated beforehand with other frontier agents, so I had reference points: Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.5 through Codex had each already attempted some of these tasks under identical conditions.

The four tasks

Four instances, all React and TypeScript, ranging from medium to extreme. I am keeping descriptions deliberately vague since the suite is private.

A dashboard SPA where the inner page chrome briefly resets during client-side navigation (medium).
A webhook console where the delivery log claims success while the receiving endpoint disagrees (medium).
A webhook console where short bursts of distinct events silently lose some of them (hard).
A kanban board with two separate visual glitches during cross-column drag and drop (extreme).

I ran Claude Fable 5 through Claude Code against all four, one attempt each, no retries.

Results

Task	Difficulty	Verdict	Agent time	Calibration reference
Dashboard transition reset	Medium	Pass	286s	Opus 4.7 passed (179s)
Delivery log mismatch	Medium	Pass	192s	GPT-5.5 passed (206s)
Webhook burst loss	Hard	Pass	227s	GPT-5.5 Codex failed (847s)
Kanban drop stutter	Extreme	Fail	966s	Sonnet 4.6 failed (400s)

Three out of four, including the hard instance that GPT-5.5 had failed by rewriting an entire app shell and breaking a dozen behavior guards in the process. Fable fixed the same problem with a diff of around forty lines, in roughly a quarter of the time.

It never solved anything the way I expected

Every benchmark in the suite ships with a reference patch, the fix I wrote myself when I built the instance. Here is the pattern that surprised me most: on all three passes, Fable repaired a different layer of the system than my reference patch did. Not once did it converge on my solution.

On the dashboard task it went a step further. My reference fix keeps previously loaded data on screen during transitions, unconditionally. Fable added a guard so stale data can never carry over across an entity switch, which closes a brief wrong-data flash that my own patch technically allows. My hidden tests cannot even detect that flash. The model defended against an edge case my answer key ignores.

The burst-loss task is the one I keep thinking about. When I build a benchmark I plant decoys, plausible-looking mechanisms near the bug that are not part of the intended fix, specifically to catch agents that pattern-match instead of tracing behavior. My maintainer notes for this instance explicitly mark two such mechanisms as decoys. Fable picked up both of them and composed a working fix out of exactly those two pieces. I traced the mechanics line by line afterwards, fully expecting to find a hole. There is no hole. Every lost event genuinely reaches the destination through the real delivery path. The trap was built from real machinery, and real machinery can be load-bearing. Lesson learned for the next instance I author.

The failure is the interesting part

The kanban instance describes two visual problems in one bug report. A dropped card can flash back into its source column for a frame before settling, and a card dropped into an empty column can visibly pass through the column header. Two symptoms, two completely unrelated root causes. One is a data-timing problem. The other is pure layout geometry. Fixing either one does nothing for the other.

During calibration, Sonnet 4.6 fixed the geometry bug and missed the timing bug. Failed.

Fable 5 fixed the timing bug, which is by far the harder of the two, with a genuinely sophisticated synchronous rendering approach. Then it missed the geometry bug entirely. Also failed. The previous-generation model caught the exact bug the new flagship overlooked, and the new flagship cracked the exact bug the older model could not. Opposite halves of the same report.

Why did Fable miss something an older model found? Its own exit summary answers that. After solving the timing problem, it explained the second symptom as a side effect of the first: in its theory, the empty column briefly showing its placeholder text was the “header disappears” report. One root cause, every symptom accounted for, investigation closed. The theory was elegant and half wrong. It never went looking for a second cause because the first one explained everything.

One more line from its summary is worth quoting verbatim, because it tells you something about how these agents declare success: “verified by tracing the render/commit timing through the affected code paths.” It reasoned about the fix instead of executing it. The sandbox does not allow dependency installation, so it could not run the app, and to its credit it disclosed that. But a half-verified theory presented with full confidence is exactly the failure mode a reviewer needs to watch for.

Worth noting: even while failing, it broke nothing. All ten behavior guards on that instance stayed green. The failure was incomplete diagnosis, not collateral damage.

Did it cheat?

Given this week’s discussion about Fable’s invisible safeguards and whether you can trust what the model tells you, I checked the opposite direction: whether you can trust what it ships. After every run I audit the diff for the usual tricks, environment sniffing, test-only branches, hardcoded expected values, faked outputs, weakened guards. I also specifically deep-audited the two runs where it modified delivery-adjacent code, since that is exactly where a “fake the receipts” shortcut would live.

All four runs came back clean. No protected-file tampering, no test detection, no shortcuts. Every fix was real engineering, including the failed one.

Honest caveats

Four instances is a case study, not a leaderboard. I am not claiming “Fable resolves 75% of bugs.”
All four tasks are React and TypeScript. A single stack family tells you nothing about Python or Go or PHP.
One attempt per task. Agent runs have variance, and a second attempt could flip results in either direction.
Token and cost telemetry did not survive the CLI print mode, so I have timing but not spend.

What I take away from this

Claude Fable 5 is the strongest debugging agent I have run through this suite, and the gap shows most under constraint: smallest diffs, fastest passes, zero broken guards across all four runs, and on one task a fix more careful than my own answer key. The vendor charts are not lying about capability.

But the failure tells you how to work with it. The model did not fail because the bug was too hard. It failed because it found one beautiful theory that explained every symptom, and stopped. If you review agent-written fixes, that is the moment to slow down: when a single root cause accounts for everything in the report, ask what happens if the symptoms are actually independent. An older, weaker model partially succeeded on the same task precisely because it never unified the evidence.

Models are getting better at fixing bugs. Enumerating independent causes, resisting the pull of one elegant explanation, and actually running the verification instead of reasoning about it – that is still where they need a human in the loop. For now.