Does Any of This Win Games?

Every play is real, converged, and platform-blessed — and not one is proven to ship better code. The score at the break, not a verdict.

EP 6/6 June 26, 2026 ~8 min read

Listen:

Synthesized from a 16-document research corpus across five cascaded domains (15 lenses). The closer's method is accounting: the project-defining limitation was the one finding every domain returned independently — no held-out agentic-coding benchmark validates any play or any standard — and it is stated here without softening. The single controlled measurement that exists, and the two scoped exceptions (a vendor's agentic-search figures and a pair of function-level coding benchmarks), were existence-checked against their primary pages, with the finding direction and the scope confirmed verbatim. Source-reviewed, fact-reviewed, and gap-reviewed; the cross-domain synthesis passed an independent fidelity review before publication.

Every play we called tonight is real. Each one converged — separate teams reaching the same move, with no single source to explain it — and most are already shipping in the platform as buttons. And not one of them has been shown to get better code out of the agent than a careful, direct prompt already does. That’s not a knock. It’s the score at the break.

One last word on where the booth is sitting, because the closing call depends on it. The whole season has been a read on one field — the people building tools to take a feature spec and build it into the prompt sequence that pulls the best work out of a coding agent — and the home team I’ve graded hardest all night is Superpowers, the one I run myself. The disclosure matters most right here, at the sign-off, for one reason: the home team’s own numbers are part of what I’m about to tell you we can’t fully trust.

A worn broadcast microphone on a weighted desk stand under a broad warm booth light, its coiled cable trailing across a dark surface, a faint chalk play-diagram just visible beside it — the same booth mic the season opened on, returning to close it.

The stat that was never on the board

Go back to the studio open. The league keeps four real adoption signals — discovery, copies, acquisitions, declarations — and not one of them measures whether a tool’s output is any good. In the first segment that looked like a hole in the standings. By the end of the season it’s the whole finding. Ask all five research domains the same question — does any converged move, or any blessed standard, ship measurably better code? — and they come back independently with the same answer: there is no held-out agentic-coding benchmark that says so. Not for the plan-on-paper move, not for the executed gate, not for the ledger, not for the standard everyone adopted. The plays are the field’s best-reasoned bets. They are not proven results.

Read the title question literally, because the honest answer is sharper than “nobody’s keeping score.” The league measures constantly — there are benchmarks everywhere. What’s missing is narrower and more specific: a held-out, outcome-valid result showing that any of these moves gets better repo-level code out of the agent than a careful, direct prompt without them. The numbers that exist are on the wrong task, or they run against the story. That’s not an empty scoreboard. It’s a scoreboard pointed at everything except the one thing that would settle the bet.

The single controlled measurement that lands closest to the question cuts the wrong way for the plays. A study evaluating AGENTS.md — the config file the whole field just standardized on — held the task fixed and tested whether handing the agent a repository context file actually helped. In the authors’ own words:

Surprisingly, we find that providing context files does not generally improve task success rates, while increasing inference cost by over 20% on average.

Surprisingly — and against the recommendation of every model provider that ships the convention. The nuance sharpens it rather than softening it: explicit instructions in the file “are well followed by coding agents,” but “repository overviews, although popular and recommended by model providers, are not helpful.” Tell the agent what to do and it listens; describe the repo at it and you’ve spent twenty percent more for nothing. It’s a preprint, one controlled result, not the last word — but it’s the closest thing in the league to a held-out test of an adopted standard, and it says standardization and quality are separate axes. The field agreed on a convention. No one showed the convention was the one that wins.

The numbers that exist measure a different game

There are gaudier figures than that, and every one comes with scope stapled on once you read it. Take the best outcome number anyone has published for context discipline — the externalize-the-state move the centerpiece graded as physics. It’s Anthropic’s, a vendor reporting on its own platform:

combining the memory tool with context editing improved performance by 39% over baseline

— with context editing alone worth 29%, and an 84% cut in token consumption on a 100-turn run. Real movement. But read the eval it was measured on: Anthropic’s “internal evaluation set for agentic search.” Agentic search — read-heavy, tool-call-heavy lookup, the exact task shape Anthropic itself flagged as transferring poorly to coding. The number is verified, vendor-published, unreplicated, and on the wrong game. It’s evidence the move does something. It is not evidence it ships better code.

There’s one place dividing the work across agents genuinely lifts a coding score, and I held it for the close because it’s the honest exception. On function-level puzzles — single-file, test-checkable problems — role specialization measurably helps: MapCoder reports 93.9% pass@1 on HumanEval and 83.1% on MBPP; MetaGPT reports 85.9% on HumanEval and 87.7% on MBPP, all preprint figures (MapCoder; MetaGPT). Take the exception seriously — and bound it just as seriously. Those are toy functions, not the densely-coupled repo work the play is sold for, and the lift traces to the executed test-debug loop running inside each role, not to the agents working in parallel. It’s the verification gate winning again, and the swarm taking the credit. Real evidence, on a board the size of one function. No one has reproduced it at the size of a codebase.

That’s the pattern under every number tonight. The outside literature that frames the plays — least-to-most, Plan-and-Solve, ReAct, the self-correction null — was measured on single-shot reasoning and short tool loops; essentially none of it studies long agentic coding at the scale this work cares about. Every transfer to a live repository is by analogy. And the richest primary record in the project — the home team’s commit log, where a maintainer reverses his own moves on his own eval data — proves what was decided and when, never that the measurement is true. I trust that record completely as a record of judgment. It is one team’s recalled sessions, not a field result, and I’m not laundering it into one at the buzzer.

What a senior steals anyway

So if none of it is proven, why spend a season on it? Because unproven is not worthless, and a developer who waits for a held-out benchmark before adopting a sound move will be waiting in a field that rebalances monthly. Here’s what I take off this film tomorrow, proof or no proof, because the mechanism is sound and the downside is cheap.

Decompose the problem and gate code behind an approved plan — steal the structure, skip the heavy notation. Make the verdict something a machine decides: watch the test fail, read the exit code, treat any model’s “looks good” as a suggestion to confirm. Push durable state into a file and run the window as scratch paper. Keep a single writer; never let two agents touch the same tree. Four moves, every one of them old discipline re-aimed — design-doc-before-code, test-first, a ledger, version-control hygiene — and every one now backed into the platform as a primitive. The judgment that’s left, the part no button ships, is knowing when each one bites.

And the thing that would actually end this argument is nameable, which is the most useful object the booth can hand you on the way out. A held-out, outcome-valid benchmark: real repo-level tasks, a fixed token budget, blind grading of shipped code, the converged plays run against careful direct prompting. Until someone builds it, every “the field converged on X” stays a claim about technique, never about outcome — and a senior reading the record is the instrument that separates the necessary move from the merely popular one, precisely because no scoreboard will do it for you.

So: does any of this win games? At the break, the honest answer is that the league isn’t keeping the stat that would tell you — not because no one is measuring, but because everyone is measuring a different game. The plays are real, converged, and platform-blessed. Not one is proven to beat what a good agent already gives you from a plain, careful prompt. The whole season asked you to hold two facts that don’t want to sit together — a scoreboard you can’t take at face value and a field that was never the fraud the fakery makes it look like — and never flinch at either. That isn’t the disappointment it sounds like; it’s the reason there’s a veteran in the booth at all. When the scoreboard can’t settle it, the call falls to whoever can read the record and tell the craft from the highlight — and that judgment outlives every play on the board tonight, including the one I run on Tuesday. You don’t trust the number on the screen. You pull up the record, and you make the call yourself. That’s the score at the break. Not a knock — a starting line.

This is the sign-off for Color Code — the middle of a three-part arc on the agentic pipeline. Source Code covered the front, getting truth out of a person and into the machine; Object Code covered the back, signing for the work the agent hands you, where its thesis is that agent output is legacy code at birth. Color Code sat between them, on how the prompt sequence gets engineered before any of that output exists. It opened on a number you could look up in two seconds — A Star Is a Bookmark — and it closes where it started: on a scoreboard that still can’t tell you whether any of it ships better code.