The Plays the Box Score Loves

Why a swarm of agents buys speed, not quality.

EP 4/6 June 26, 2026 ~10 min read

Listen:

Synthesized from a 16-document research corpus across five cascaded domains (15 lenses). The method here is archival: the live argument was read straight from the home team's own record — the GitHub issue threads and the commit that reversed a play, retrieved as API JSON with every quote verbatim from the payload. The one controlled result behind the call was existence-checked against its primary page (real authors, exact title, finding direction), and the vendor's confound was read against its own interest. Source-reviewed, fact-reviewed, and gap-reviewed; the cross-domain synthesis passed an independent fidelity review before publication.

Ten agents working at once is the most exciting thing in this league. It photographs like progress — a swarm of workers chewing through a feature in parallel, the box score lighting up, the wall clock falling. Then you hold the token budget fixed, ask whether all that parallelism bought quality rather than just speed, and watch the lead evaporate.

A line on the vantage, because any call here should survive as your first. All season the booth has been grading one craft: how a developer gets a feature spec into the prompt sequence a coding agent does its best work from. The home team in the booth is Superpowers, the tool I actually run — which on this play lands on the right side of the call and still gets graded exactly as hard as anyone. Last segment graded the most popular play and found fashion where it claimed necessity. This one grades the flashiest play, and the trouble with it is sharper: it doesn’t just look popular. It looks like winning.

So let me separate the two things this move is actually selling, because the whole call turns on keeping them apart. Fanning agents out sells speed — throughput on work you can genuinely run side by side — and it sells quality, the idea that dividing the cognitive labor across specialists produces better output than one agent doing the thinking. The first is real engineering. The second is the one the box score can’t distinguish from the first, and the booth’s entire job tonight is to.

A reel of spooled game film standing beside a fixed upright measuring gauge, the reel held against one mark on the gauge under a warm booth light.

Run the budget-controlled tape

Hold the spend fixed and read the tape against the gauge. Normalize the compute and the lead evaporates.

Start with the strongest number anyone has put on the board for the many-agents move, because it comes from a vendor with every reason to want it to win. Anthropic reports that a multi-agent system “outperformed single-agent Claude Opus 4 by 90.2%” on its internal research eval — and then, in the same write-up, hands you the asterisk against its own interest. The win is confounded by spend: “token usage by itself explains 80% of the variance,” and the multi-agent setup burns “about 15× more tokens than chats.” A 90% lead bought with fifteen times the compute is not evidence the architecture is smarter. It’s evidence you spent more. And the vendor goes one step further and tells you where the play doesn’t fit:

Most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time.

Read that for what it concedes. The result is from research — read-heavy, embarrassingly parallel — and the same engineers who measured the win are flagging that coding is the wrong shape for it: write-heavy, densely coupled, “fewer truly parallelizable tasks.” The headline travels with its scope stapled on.

The clean test holds the budget fixed and asks the question directly. A controlled study — Tran and Kiela, a preprint, so weight it as one not-yet-replicated result — normalizes the compute and lands on the null:

When computation is normalized, single-agent systems (SAS) can match or outperform MAS.

Work that hinge. When computation is normalized — when the swarm doesn’t get to spend more — its edge disappears: across three model families, single agents “consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant,” and the reported multi-agent advantages are “better explained by unaccounted computation and context effects rather than inherent architectural benefits.” Spend, not structure. Carry the one caveat that keeps this honest: the experiment is on multi-hop reasoning, and the paper is silent on coding — it transfers to a repo by analogy, not by measurement. So the null isn’t a single proof; it’s a composite. One independent control pointing at the null, a builder shipping a rival product saying the same — Cognition, who sell a single-agent coder, call running “multiple agents in collaboration only results in fragile systems” because “the decision-making ends up being too dispersed” — and a 1,600-trace failure taxonomy (MAST) that exists at all only because coordination breakage was common enough to need cataloguing. Three sources, opposite incentives, same direction.

What none of them is, is a measurement that fanning out worsens shipped code. The precise call is narrower and it matters: dividing labor across agents to buy quality is unproven, not established — and counter-indicated for the coupled, repo-level work the play is sold for. Not debunked. Not established. The flashy version is spending your way to a lead the budget-controlled tape erases.

What the swarm was actually doing

Here’s the tell that the field already knows this, and it’s the home team’s own rulebook — which is why it gets the hard grade, not a pass. Superpowers is the most enthusiastic sub-agent shop in the league, and it still won’t fan out coders. It dispatches in parallel only when the work is provably separate:

Dispatch one agent per independent problem domain. Let them work concurrently.

The trigger is narrow on purpose — “3+ test files failing with different root causes,” no shared state — and the flagship delegation play bans the other thing outright, listing “Dispatch multiple implementation subagents in parallel” under what it will never do, with the one-word reason in the parenthetical: conflicts. That’s the speed/quality line drawn inside a single team’s config. Fan out for throughput on slices that can’t collide; keep a single writer everywhere the work is coupled. The team that loves sub-agents most reserves them for context hygiene and independent failures, and treats many-coders-on-one-feature as a hazard, not a multiplier.

Which raises the obvious question: where do the apparent wins come from, then? When a multi-agent build does come out better, what improved it? Trace it and the credit keeps landing somewhere other than the parallelism. The quality comes from the decomposition done up front — slicing the problem so the slices are independent is itself the engineering — and from the executed verification gates each slice runs. ccpm, an orchestration framework, even attributes its own headline bug-rate drop to “detailed task breakdown before coding,” not to the agents running at once. The parallelism was wearing the decomposition’s jersey. (There is one narrow exception — function-level coding puzzles, where dividing roles does lift the score — and the closer takes that up, because the lift there traces to the test-debug loop, not the fan-out.)

A worn coach's headset with scuffed leather earcups and a dulled metal band resting on a dog-eared clipboard on a weathered bench, no one wearing it.

The sideline reports

Grievances on a channel that selects for complaints — one reporter each, the bill logged in their own words.

The cleanest evidence that the flashy version costs more than it returns isn’t in a paper. It’s on the home team’s own issue tracker, where independent users log the bill in their own words. Weight these as what they are — grievances on a channel that selects for complaints, one reporter each, not a defect rate — but they are the demand-side reality a README never shows.

From issue #743, a user on a trivial job: “A simple task of creating a trivial model + CRUD endpoint is becoming a 35 minute session. With plain claude code it’s done in 2 minutes.” From #1120, the same failure quantified: a five-line file generating “6–7 agent invocations” at “roughly 10–15x what direct file creation would have cost,” because “each agent in this chain starts cold — zero context from previous agents.” From #1194, a small app reporting “68M tokens, ~3300 lines of code,” with “22M tokens just for planning.” Every one localizes the cost to the same place — the cold-start, multi-agent review stage, not the building.

And the counterweight belongs in the same breath, or this tips into the takedown it isn’t. The tracker that carries those complaints carries roughly 616 closed issues against its open ones — a tool being used hard, not a tool failing — and one of the same users filing grievances also writes that “these skills are amazing and I rely on them every day.” A busy tracker is engaged adoption arguing with itself. The cost is real; so is the value. Both are on the tape.

A worn brass balance scale with two empty pans hanging dead level on a horizontal beam, perfectly balanced, under a warm booth light fading to navy.

The argument nobody’s won

An empty scale held dead level. The booth refuses to call a winner because the data won't name one.

Now the part that makes this the most alive play in the league, and the reason the booth refuses to call a winner. The fight is over a review loop — and the first thing to get right is that there are two of them, because conflating them invents a contradiction that isn’t there.

The maintainer killed one himself. The plan-review loop — a fresh agent dispatched to review the plan before any code got written — was removed in a commit whose message is its own verdict:

The subagent review loop (dispatching a fresh agent to review plans/specs) doubled execution time (~25 min overhead) without measurably improving plan quality. Regression testing across 5 versions (v3.6.0 through v5.0.4) with 5 trials each showed identical plan sizes, task counts, and quality scores regardless of whether the review loop ran.

That’s an engineer cutting his own ceremony on his own data: doubled execution time without measurably improving quality, across five versions and five trials each, so it went. Hold that, because it means the maintainer is not a man who keeps review loops out of dogma.

Which is what makes the second loop the genuinely hard one. The implementation review — an implementer, then a spec-review, then a quality-review, per task — he keeps, and defends, against users reporting the same emptiness on mechanical work. One of them, in #1120, ran it across a batch of near-identical migrations and reported the receipt: “running implementer + spec-review + quality-review per file cost roughly 3x the actual edit time and surfaced zero new issues the first round hadn’t already covered.” Three times the cost, nothing caught. The maintainer’s answer is not a brush-off — it’s an engineer defending a design with his own evals:

These reviews are a big part of why superpowers actually generates good outcomes and I don’t yet see any way of making them go away.

Work both sides, because both are right. The user is right that on eight more files of a now-approved three-line pattern, the review wave is pure overhead. The maintainer is right that when he tried to build the obvious fix — let the model decide when a task is simple enough to skip the checks — “the model used that as permission to skip steps when they really needed to be mandatory.” The cheap gate that would satisfy the user is the exact gate his data says the model abuses. There is no complexity threshold on the board that gives the power user his two minutes back without handing the model an excuse to cut a corner that matters. That absence — not a bug, not a winner — is the finding. It’s also the seam the next segment runs through: the most-demanded thing in this whole field is a gate that decides when the ceremony is load-bearing and when it’s theater, and nobody has shipped it.

So here’s the call on the play that looks most like winning. Fanning out buys you wall-clock on work you have already proven independent, and that is real engineering, worth doing exactly that far. It does not buy you quality by dividing the thinking, and at a fixed budget a single careful agent is the number to beat, not the fallback you reach for. The swarm photographs like progress because motion photographs like progress — but the budget-controlled tape shows the progress was in the decomposition and the gates the whole time, with the parallelism borrowing their jersey. And the one thing no scoreboard will settle for you — when a review is load-bearing and when it’s ceremony — is precisely the call a senior makes by reading the argument instead of the standings. The league hasn’t built the gate that decides it. Until it does, that judgment is the gate.

This is the fourth call in Color Code — the middle of a three-part arc on the agentic pipeline, looking upstream of Object Code at how the prompt sequence gets engineered before any output exists. Previously: The Plays the League Borrowed, on the popular play with the thin case. Next, the back half of the season — where the platform quietly ships the community’s best plays as built-ins, and the unbuilt gate this argument keeps circling: The House Always Catches Up.