The Plays the League Borrowed
The most popular convention, and the thin case that the agents actually need it.
Synthesized from a 16-document research corpus across five cascaded domains (15 lenses). The method here is archival: the convergence was re-derived against each team's own record — repository creation dates from the GitHub API, verbatim reads of the spec-kit and OpenSpec READMEs, the installed Superpowers v6.0.3 skill files, and the live issue thread behind the over-specification fight. The peer-reviewed decomposition anchors were existence-checked with matching authors, venues, and years. Source-reviewed, fact-reviewed, and gap-reviewed; the cross-domain synthesis passed an independent fidelity review before publication.
Run the tape on the league’s signature play and every team’s film shows the same opening: a non-code spec, then a planning pass, then code — never code first. Five teams, one shape. Then you pull the paper trail, and it turns out they didn’t each arrive at it. They got it from each other.
A line on where the booth is sitting, because this is the middle call of the season. Run your eye down the roster — spec-kit, OpenSpec, the rest of the spec-driven crowd, and the home team standing right in it — and you’re looking at the field that builds tooling for one problem: how a feature spec becomes the sequence of prompts that gets good code out of an agent. The home team there is Superpowers, the framework I run daily and file bugs against, and it runs tonight’s play too — which means it gets graded exactly as hard as the rest, not softer. Last segment was the play that wins games: get what the model can’t be trusted to hold out of its head and onto durable ground. This one is the play everybody runs — and the thinnest case that the agents specifically need it.

The most popular play fails the test the cheapest
The teams didn't each re-derive the play. They inherited it — every binder a copy of the one before.
The last call left a grading instrument on the table, so let me use it. A move earns the word necessity when unrelated teams reach it through different mechanisms with no shared lineage — when the environment forced their hand rather than a post everyone read. The plan-on-paper and write-isolation plays passed that test. Run spec→plan→implement against it and it fails more cleanly than anything else in the league, and it fails for the most ordinary reason: the teams aren’t independent. They’re reading each other.
Spec-kit says so in its own README, against the usual instinct to claim you invented your method:
This project is heavily influenced by and based on the work and research of John Lam.
Heavily influenced by and based on. That’s not a team that re-derived a spec-first pipeline from the physics of working with an agent; that’s a team naming the upstream it inherited from, on the record, in the first file you’d read. OpenSpec gives it away from the other direction — it positions itself by contrast, sizing up the same neighbors it’s competing with: “vs. Spec Kit (GitHub): Thorough but heavyweight. Rigid phase gates, lots of Markdown, Python setup.” You don’t benchmark yourself against a rival you arrived at independently. You do it when you’re all working the same known shape and selling a different cut of it.
And the shape has a name they all use — “spec-driven development” — which is the tell that finishes the case. It’s a re-skin of design-doc-before-code, a discipline with decades on it, agile practice with a new sticker. The teams shipped within a sixteen-day window in the late summer of 2025, which proves they didn’t copy each other’s released products. It proves nothing about the shared upstream they all drank from. Real convergence, genuinely — but here, ubiquity is weak evidence that the agents demand the play. Mostly it’s evidence the play was already lying around.
Steal the structure, not the notation
Failing that test does not make the play wrong. This is the trap, and the booth’s job is to not fall in it: “everyone does it” being fashion-grade signal is not the same as “everyone is wrong.” The opening is probably good practice. The honest move is to separate the part of it that earns its keep from the part that’s just worn smooth by handling.
The part that earns its keep is the structure underneath the ceremony — decompose the problem into ordered subtasks, and split writing the plan from running it. That part has outside support, and the last segment already cashed the strongest chip on it: least-to-most decomposition, the ≥99%-versus-16% result whose number is a 2022-model artifact but whose shape — ordered subproblems, each leaning on the last — transfers. The plan/execute split has its own thinner anchor, again from reasoning rather than coding. Plan-and-Solve prompting — “first, devising a plan to divide the entire task into smaller subtasks, and then carrying out the subtasks according to the plan” — “consistently outperforms Zero-shot-CoT across all datasets by a large margin.” Read that with its scope stapled on: the paper runs “ten datasets across three reasoning problems” and never touches software engineering. The plan-then-execute idea has real legs on math and symbolic tasks; carrying it onto a multi-file coding agent is extrapolation, not measurement.
So the structure is defensible beyond folklore. What’s borrowed — inherited, unforced, fashion-grade — is the heavy notation stacked on top of it: the four-command pipeline, the phase gates, the markdown ceremony OpenSpec itself calls heavyweight. Steal the decomposition and the plan/execute boundary. Don’t assume the industrialized pipeline comes with them.
The notation is a bet nobody has scored
There’s a second, louder claim hiding inside the popular play, and it’s worth saying plainly that it’s a bet. The field’s spec artifacts sit on a real, visible spectrum — disciplined prose at one end (spec-kit’s spec.md, “testable” requirements and “measurable” success criteria), structured Given-When-Then scenarios in the middle (OpenSpec), and a machine-verifiable contract at the far end (agent-spec binds each acceptance scenario to a named test function). The implied story is a ladder: the more verifiable the spec form, the better the output.
No team has published a controlled comparison showing it. Whether prose, scenarios, or an executed contract ships better code is, flatly, data not found. The progression looks like a ladder; it’s an unmeasured directional bet. And the camp can’t even keep its own story straight — spec-kit, a leading spec-driven tool, makes test tasks optional, generating them “only … if explicitly requested … or if user requests TDD approach.” A field that believed machine-verifiability was the point wouldn’t make it opt-in. The far end of that spectrum — agent-spec’s executed-test contract — is the most interesting thing on the board and the one with the least adoption behind it, which is a back-half story for another segment.

How a young craft mints a convention
Folklore hardening into convention: the move gets pressed into a repeatable mark, then copied off the next repo.
None of this is a scandal. It’s just what a craft looks like before it has any held-out results — which is to say, what this one looks like right now. The sociology of practice has the vocabulary for it: a community of practice develops, in Wenger’s words, “a shared repertoire of resources: experiences, stories, tools, ways of addressing recurring problems—in short a shared practice.” That last clause is the precise name for folklore hardening into convention. Newcomers don’t read a spec; they copy a CLAUDE.md pattern off a popular repo and absorb the move by imitation. Comparable tool ecosystems ran the same arc from folklore to convention to infrastructure — though the analogy illuminates more than it proves, since a linter has a deterministic output and an agent skill is a probabilistic heuristic, and they won’t necessarily codify the same way.
The home team is standing right in the middle of this, and it gets the same grade. Superpowers gates its opening behind a literal hard rule that admits no exceptions:
This applies to EVERY project regardless of perceived simplicity.
Universal spec-first ceremony — the full convention, run on every job. And the fairest check on whether the convention is doctrine or proven necessity comes from the team’s own issue tracker, where its power users are arguing about exactly this. In issue #895, a user reports plans so detailed they contained full function bodies and exact shell commands, with a complaint that lands on the seam:
The executor had no decisions to make — it was copy-pasting code from the plan rather than engineering a solution.
The maintainer’s answer is not a dodge — it’s an engineer defending a design with data:
this is very much how writing-plans was designed to operate. I’ve been running evals on a possible evolution of the planning process, but so far, the outcomes are all worse. The intent is that you should be able to use haiku (or even less capable models) as the implementers.
Sit on that for a second, because both sides are right and that’s the whole point. The user is right that an over-specified plan strips the implementer’s judgment. The maintainer is right that loosening it — running every alternative he can build through his own evals — has so far produced worse output, because the detail is what lets a cheap model implement safely. The full-detail plan is a bet the maintainer is actively defending against his own users’ lived experience, with private eval runs on one side and felt overhead on the other and no settled answer between them. That is not a community that has proven its signature play. It’s one still arguing about it in public — which is exactly where you’d expect a craft this young to be.
The call on the most popular play in the league: it isn’t a fraud, and it isn’t necessity. Steal the structure under it — the decomposition and the plan/execute split earned their place in the last segment — and hold the heavyweight notation as an open question, not a solved one, because no one has scored it and at least one team that ships it can’t agree with itself about it. “Everyone runs this” is the standings telling you it caught on. It is not the tape telling you the agents forced it. The booth’s whole reason to be here is keeping those two apart, and on this play they’re further apart than they look.
Which sets up the next call, and a harder one — because the next play doesn’t just look popular. It looks like winning. Ten agents working at once, the flashiest thing in the league, the move the box score loves most. Run it against the same rubric and watch what the budget-controlled tape does to the lead.
This is the third call in Color Code — the middle of a three-part arc on the agentic pipeline, looking upstream of Object Code at how the prompt sequence gets engineered before any output exists. Previously: The Plays the Game Wrote, on the move the game itself forced. Next, the play that looks most like winning and isn’t: The Plays the Box Score Loves.