The Plays the Game Wrote

The moves the work itself forced — getting what the model can't be trusted to hold out of its head and onto durable ground.

EP 2/6 June 26, 2026 ~11 min read

Synthesized from a 16-document research corpus across five cascaded domains (15 lenses). The centerpiece method is archival: every play below was checked against the team's own record — git objects with dates and SHAs, verbatim reads of the installed skill files and the Rust verifier source, and the peer-reviewed papers behind each claim. The convergence-isn't-copying argument was re-derived against the GitHub creation-date backbone. Source-reviewed, fact-reviewed, and gap-reviewed; the cross-domain synthesis passed an independent fidelity review before publication.

When four teams reach for the same move and the dated record rules out a single source they all copied, you stop calling it style and start asking what forced their hand.

That’s the centerpiece of the whole broadcast, so let me set the vantage in a line. The field I’m covering builds tools for a single hard job: take a feature spec and engineer it into the prompt sequence that gets the best work out of a coding agent. The home team in that field is a methodology framework called Superpowers — my daily driver, which is exactly why it gets the hardest calls in this booth and never the friendly ones. Last segment was about the standings and why you don’t trust them. This one is about the one pattern under the noise that earned its place the hard way.

Here it is, stated once and then I’ll vary it so it doesn’t go numb on you: the plays that survive scrutiny all do a single thing — they get what the model can’t be trusted to keep in its head out of its head and onto something durable. The plan onto paper. The verdict onto an executed test. The state into a ledger file. The boundary into a worktree. The plays that disappoint do the opposite: they ask the model, or a swarm of models, to just hold those things. That’s the call. Four camera angles on it, same fundamentals each time.

A worn chalkboard play diagram of looping arrows and abstract marks, lit by a warm overhead glow against a deep navy night.

Angle one: the plan goes on paper

Decompose first, write it down, hand it off. The plan goes on paper.

The first thing the strong teams refuse to do is let the model improvise the architecture in its own head while it types. They make it decompose the problem first, write the decomposition down, and hand that artifact to whatever does the building.

The research desk had this one before the field did. The load-bearing prompt-engineering result for decomposition is least-to-most — break a hard problem into ordered subproblems where each is “facilitated by the answers to previously solved subproblems,” and the gain on the SCAN compositional benchmark is not subtle: least-to-most hit, in the paper’s words,

at least 99% … compared to only 16% accuracy with chain-of-thought prompting.

Read that the right way. The 99-versus-16 is a 2022-model benchmark on a synthetic task — the number is an artifact of its generation and won’t survive to your repo. What survives is the structure: ordered subproblems, each leaning on the last. The marquee openings mirror it exactly — a non-code spec, then a plan pass that emits a plan artifact, then dependency-ordered tasks, only then code. And critically, they separate writing the plan from executing it, so the plan can be handed to a fresh worker with, as Superpowers puts it, “zero context for our codebase.” That’s not novel; it’s design-doc-before-code, a discipline older than anyone reading this. The agents just made skipping it expensive again.
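For readers who want the shape on the page, here is a minimal Python sketch of the opening described above: dependencies declared first, tasks ordered so each leans on the last, and the result written out as an artifact a fresh worker could be handed. The task names and the JSON layout are hypothetical, invented for illustration; nothing here is Superpowers' actual format.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "write_spec": set(),
    "design_schema": {"write_spec"},
    "implement_api": {"design_schema"},
    "write_api_tests": {"design_schema"},
    "wire_ui": {"implement_api"},
}

def ordered_tasks(dependencies):
    """Return tasks so every task appears after all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())

def plan_artifact(dependencies):
    """Serialize the ordered plan as text that can be handed to a fresh worker."""
    order = ordered_tasks(dependencies)
    steps = [
        {"step": i + 1, "task": task, "needs": sorted(dependencies[task])}
        for i, task in enumerate(order)
    ]
    return json.dumps(steps, indent=2)

artifact = plan_artifact(plan)
print(artifact)
```

The design choice worth noticing is that ordering and execution are separate steps: the artifact exists before any code does, so whoever builds from it needs the file and nothing else.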

One honest seam, because the home team gets graded hardest: nobody in this field has published a controlled result showing plan-then-execute beats careful one-shot prompting for coding. The closest outside evidence is on reasoning tasks. The structure is convergent and defensible; the proof that it ships better code is, flatly, not on the board. Hold that — it’s where this whole season lands.

A worn brass-and-iron lever mechanism thrown hard to one stop, a pass-or-fail gate lit warm against a navy field.

Angle two: the verdict goes to a machine

An executed test decides, not the model's word. The verdict goes to a machine.

The second move is about who gets to say the work is done. The strong teams took that decision away from the model.

The desk is unambiguous on why. Intrinsic self-correction doesn’t work — and the cleanest statement of it is DeepMind’s own, which is to say a finding published against the platform-builder’s interest:

LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.

So you don’t ask the model to grade itself; you give it an external check it doesn’t control. ReAct showed the shape years ago: hand the loop a real information source and it “overcomes issues of hallucination and error propagation” that pure internal reasoning falls into, and on the interactive benchmarks ALFWorld and WebShop it beat its imitation- and reinforcement-learning baselines “by an absolute success rate of 34% and 10%,” respectively. An outside signal breaks the error loop; more thinking-in-place doesn’t.

The field converged on exactly that, and it gates correctness on something a machine decides. Superpowers’ test-first rule is the borrowed master here — Test-Driven Development, a convention with decades on it — re-aimed at an agent. Its gate is an executed observation, not a vibe:

If you didn’t watch the test fail, you don’t know if it tests the right thing.

Work that line, because it’s doing real epistemics in one sentence. A test written after the code passes immediately, and passing immediately proves nothing: you never saw it catch anything. You have to watch it fail for the green to mean the gate has discriminating power. That’s why the Iron Law is “NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST,” and why the rule reads “Agent said success → Verify independently”: the model’s report of done is treated as a suggestion, confirmed by running the command and reading the exit code. The same instinct, taken to its mechanical extreme, is agent-spec, a compiler written in Rust that shells out to cargo test and derives the verdict straight from the process status: output.status.success() becomes Verdict::Pass, anything else Verdict::Fail, with each spec scenario bound to a named test. No model in the loop at the moment of judgment.
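The exit-code gate can be sketched in a few lines. This is a Python analog written for illustration, not agent-spec's Rust source: the `Verdict` enum and both function names are assumptions, and the two stand-in commands simply exit nonzero and zero where a real setup would run the project's test runner.

```python
import subprocess
import sys
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"

def verdict_from_command(cmd):
    """Run a check and derive the verdict only from the process exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return Verdict.PASS if result.returncode == 0 else Verdict.FAIL

def gate_discriminates(before_impl, after_impl):
    """Red-green rule: the test must fail before the code exists and pass after."""
    return before_impl is Verdict.FAIL and after_impl is Verdict.PASS

# Stand-in commands (assumption): a real gate would invoke the test runner here.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "raise SystemExit(0)"]

red = verdict_from_command(failing)
green = verdict_from_command(passing)
print(red, green, gate_discriminates(red, green))
```

Note what the second function refuses to accept: two greens in a row. A test that never went red has shown nothing about what it can catch.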

The honest caveat rides along: an executed test is only as good as what it asserts, and in agentic coding that test is often written by the same model that wrote the code — a green suite then certifies “the code matches the model’s understanding,” not “the code is correct.” External in mechanism, only partly external in ground truth. The play is real; it isn’t magic.

An open stat-keeper's ledger, dog-eared pages ruled into columns of tally strokes and check marks, lit warm against a deep navy night.

Angle three: the state goes in a ledger

Track progress in a file, not in the window. The state goes in a ledger.

The third move is the one the field added — the ceremony that didn’t exist in the old design-doc playbook and earned its way in. Push durable state — progress, decisions, what’s done — into a file the model re-reads, and treat the live context window as scratch paper.

This is the collection’s strongest single piece of tape, so I’m going to give it the mic in full. It’s a Superpowers commit, landed 2026-06-10, and it reads:

Conversation memory does not survive compaction. In real sessions, controllers that lost their place have re-dispatched entire completed task sequences — the single most expensive failure observed. Track progress in a ledger file, not only in todos.

Sit with what that’s confessing. Conversation memory does not survive compaction: when the window fills and the agent summarizes itself to keep going, the summary is lossy, and the thing it loses is its place. Re-dispatched entire completed task sequences: the agent forgot it had already done the work and did it again, on the clock, burning tokens to rebuild what it already had. The fix isn’t a smarter summary; it’s to stop trusting memory at all and write the state to a file, “not only in todos.” Put cleanest: the play is to externalize the state because the window will rot and the model cannot feel it rotting.

Now the discipline this booth owes you. That phrase, “the single most expensive failure observed,” is one team’s recalled sessions, the home team’s, and it is the sole witness for that characterization. I’m not going to launder it into a field measurement. What corroborates it is the mechanism, arrived at independently: Anthropic’s own long-running-agent harness reaches for the identical move, a claude-progress.txt the agent writes so a fresh session “can quickly understand the state of work,” and concedes plainly that “Compaction isn’t sufficient.” Two unrelated teams, same conclusion: don’t trust the summary, keep a durable record. That convergence is the evidence. The “most expensive” ranking is just one team’s scar.
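A minimal sketch of the ledger move, assuming a JSON file and a flat list of task names. The file name, the schema, and the helper names are all hypothetical; neither Superpowers' ledger nor Anthropic's claude-progress.txt is claimed to look like this.

```python
import json
import tempfile
from pathlib import Path

def load_ledger(path):
    """Read completed task names from disk; an absent file means nothing is done."""
    if not path.exists():
        return []
    return json.loads(path.read_text())["completed"]

def record_done(path, task):
    """Add a finished task and rewrite the ledger before moving on."""
    completed = load_ledger(path)
    if task not in completed:
        completed.append(task)
    path.write_text(json.dumps({"completed": completed}, indent=2))

def remaining(path, plan):
    """Tasks still to dispatch, decided from the file rather than from memory."""
    done = set(load_ledger(path))
    return [task for task in plan if task not in done]

plan = ["task-1", "task-2", "task-3"]
ledger = Path(tempfile.mkdtemp()) / "progress-ledger.json"  # hypothetical file name

record_done(ledger, "task-1")
record_done(ledger, "task-2")
# Simulated compaction: nothing in memory is consulted below, only the file.
print(remaining(ledger, plan))
```

The controller that resumes after compaction asks the file what is left, so the two finished tasks are not dispatched again.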

A single crisp white chalk line dividing two empty lanes of worn field turf, lit warm down the center against navy dark.

Angle four: the boundary goes into isolation

One writer per tree, never two on the same file. The boundary goes into isolation.

The fourth move is about never letting two writers touch the same tree at once. It’s the quietest of the four and the one hardening fastest, because it’s a hard truth inherited straight from version control: concurrent writes to shared state produce merge conflicts. The agents didn’t invent that. They re-derived it.

Superpowers encodes it as a flat prohibition. From the initial public release, in the NEVER list of its delegation play, sits the line:

Dispatch multiple implementation subagents in parallel (conflicts)

— the parenthetical is the whole argument. Parallelism itself isn’t banned; the framework happily fans agents out across independent failures. What’s forbidden is two implementers writing the same code at once. wshobson’s agent library reaches the same rule through a different mechanism — a file-ownership ladder that ends “Never: Let both modify the same file simultaneously.” And ccpm reaches it through a third: it slices an epic into bounded tasks and runs up to about five agents inside one shared worktree, keeping them off each other not by physical separation but by partitioning the files and tasks up front, so no two agents are ever assigned the same file. (Worth correcting a tempting misread: ccpm’s isolation is the epic-from-main worktree, not a separate tree per agent — the conflict-safety comes from the decomposition, which only reinforces the point.)
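The partition-up-front version of the rule reduces to a check that can run before any agent is dispatched. A hedged Python sketch, with hypothetical agent names and file paths; it illustrates the constraint, not ccpm's or wshobson's actual implementation.

```python
from collections import defaultdict

def ownership_conflicts(assignments):
    """Map each file claimed by more than one agent to the agents claiming it."""
    owners = defaultdict(list)
    for agent, files in assignments.items():
        for path in files:
            owners[path].append(agent)
    return {path: sorted(agents) for path, agents in owners.items() if len(agents) > 1}

# Hypothetical assignments, for illustration only.
clean = {
    "agent-a": ["src/api.py", "tests/test_api.py"],
    "agent-b": ["src/ui.py"],
}
overlapping = {
    "agent-a": ["src/api.py", "src/models.py"],
    "agent-b": ["src/ui.py", "src/models.py"],
}

print(ownership_conflicts(clean))
print(ownership_conflicts(overlapping))
```

An empty result is the precondition for fanning out; any shared file means the decomposition is not finished yet.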

So how do we know this convergence is the game forcing their hand, and not four teams copying one blog post? That’s the necessity test the film room runs, and it’s worth being precise about what it does and doesn’t show. The independence backbone is dated: ccpm’s file-partition instance (its repo dates to 2025-08-18) and Anthropic’s multi-agent guidance (2025-06-13) both predate Superpowers (2025-10-09), so Superpowers couldn’t have seeded them. And the mechanisms diverge: forbid-parallel-implementers, avoid-shared-context, partition-the-files. Straight copying tends to reproduce one mechanism; independent arrival reproduces the shared constraint. One correction the home team has to wear: wshobson’s file-ownership rule was actually committed 2026-02-05, months after Superpowers, so that pair can’t claim independence on timeline and survives only on mechanism. The verdict holds on its other legs, but read it for exactly what it is: passing this test means a play is most plausibly environment-imposed and durable, not that it’s optimal or proven best. It bounds one direction of copying. It doesn’t prove anyone worked in a clean room: everyone here reads the same vendor docs and runs the same harness, and that confound never fully clears.
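The timeline leg of the test is plain date arithmetic, and it can be re-run from the dates cited above. A small Python sketch; the dictionary labels and the `predates` helper are illustrative names, and the check says nothing about the mechanism leg.

```python
from datetime import date

SUPERPOWERS = date(2025, 10, 9)

# Dates as cited in the paragraph above.
others = {
    "ccpm repo": date(2025, 8, 18),
    "Anthropic multi-agent guidance": date(2025, 6, 13),
    "wshobson file-ownership rule": date(2026, 2, 5),
}

def predates(candidate, reference):
    """True when the candidate is dated strictly before the reference."""
    return candidate < reference

independent_on_timeline = {
    name: predates(when, SUPERPOWERS) for name, when in others.items()
}

for name, ok in independent_on_timeline.items():
    print(f"{name}: {'predates' if ok else 'postdates'} Superpowers")
```

Two of the three clear the timeline; the third has to stand on its mechanism alone, which is exactly the correction the paragraph makes.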

There’s a sharper line inside this angle, too. The principle — externalize durable state — passes as necessity. The specific form the home team and Anthropic share, a near-identical progress text file, does not: Anthropic published that form months before Superpowers landed its ledger, and Superpowers eval-tunes on Anthropic’s platform. Most likely convergence-by-common-platform. The principle is physics; the file format is fashion. Keep them separate.

The house is already shipping it

Here’s the tell that ties the four angles together, and it’s the throw that hands forward to the back half of the season. If you want to know which community plays were real, watch which ones the platform vendor turned into buttons.

Both of the constraint-forced plays, the ones that passed the necessity test, are now native. Write-isolation shipped as a Claude Code flag; the docs describe it in the plainest terms: “Running each Claude Code session in its own worktree means edits in one session never touch files in another.” Externalized state shipped as a memory tool in the same model generation that brought the million-token window, which is the vendor quietly conceding that a bigger window is not a substitute for writing state to a file. The hand-built play becomes the field itself.

That’s the bridge to where this is going, and the connection upstream to the rest of the pipeline. Once the platform absorbs the plan, the verdict, the state, and the boundary as primitives, the question stops being “did the team invent the mechanism” and becomes “what judgment do you encode on top of it” — which is the whole game of engineering the prompt sequence before any agent output exists. Get that wrong and you’ve shipped legacy code at birth. Get it right and you’ve at least loaded the dice.

So that’s the play that wins games, called straight: externalize the four things the model can’t be trusted to hold, because the work itself forces it, through finite context, lossy compaction, unreliable self-grading, and conflicting writes. Real, convergent, environment-forced, and now platform-blessed. What none of it is, yet, is proven to ship better code than competent direct prompting, and that absence is exactly why judgment in the booth still matters: when no scoreboard can tell the necessary play from the merely popular one, reading the record is the instrument that can. That’s the grading rubric for the rest of the night. The next two plays both look like winners. Run them against this one, and only one survives.

This is the second call in Color Code, the middle of a three-part arc on the agentic pipeline. It looks upstream of Object Code, which covered accepting the output the agent hands back; here we’re on how the prompt sequence gets engineered before that output exists. Previously: A Star Is a Bookmark, on why you don’t trust the standings. Next, the most popular play in the league, and the thin case that the agents actually need it: The Plays the League Borrowed.