
The Evaluator's Dilemma

The pipeline's evaluation system said the story was alive. Then it scored homogeneity resistance at 3 out of 5. Both findings came from a judge the research says is wrong 29% of the time.


Synthesized from 41 research documents across three domains, a corpus study of nine MacDonald works (408,000+ words), and the pipeline's own evaluation output. All evaluation scores are primary evidence from a production pipeline run.

The evaluator said: “This story is alive. It is, without qualification, the most convincing sustained MacDonald pastiche I have encountered.” Then it scored homogeneity resistance --- the dimension that measures whether the prose sounds like a machine producing variations on a theme --- at 3 out of 5. The overall faithfulness score was 4.55 out of 5. Worldview integration, the dimension the Creative Intelligence synthesis identifies as hardest to automate and most important to readers, scored a perfect 5.

Those numbers came from the same LLM judge that peer-reviewed research says agrees with expert human evaluators only 71% of the time on creative quality assessment. And the person who designed the evaluation rubric is the same person who designed the pipeline it evaluates.

Every layer of this assessment has a structural flaw. All of them produced numbers anyway. This piece is about what those numbers mean, where they fall apart, and why I built the system knowing it would be imperfect --- because the alternative was building nothing and measuring nothing, which is what most agent engineering projects do when the evaluation problem gets hard.


The Evaluation Problem Has No Clean Solution

Evaluation is the hardest unsolved problem in agent systems. That claim sounds dramatic. It is also the single finding that all three research domains --- Agent Engineering, Creative Intelligence, and Implementation Strategy --- independently confirmed.

Agent Engineering established it from the production side: in a vendor-conducted survey of 1,340 practitioners by LangChain, 89% had implemented observability but only 52.4% had implemented systematic evaluation. The figures are directional, not definitive --- this is a vendor surveying its own ecosystem --- but the pattern is consistent with what the project found independently. Organizations can see their agents operating. They cannot systematically measure whether outputs are improving or degrading. The gap between watching and measuring is where most production agent systems live --- instrumented but unevaluated.

Creative Intelligence established it from the quality side: the dimensions of creative output that matter most to human readers are the dimensions that automated evaluation handles worst. LLM-as-judge achieves roughly 80% agreement with human experts on objective tasks --- factual accuracy, logical consistency, format compliance. Zheng et al. (2023) demonstrated this with MT-Bench, and the finding has replicated across multiple systems. But push the evaluation into subjective territory and the agreement drops to 60-68% across multiple studies in expert domains. For the specific task of distinguishing human writing from AI-generated text, Kirilloff et al. (2025) found that expert human judges achieve 94% accuracy while LLM judges manage 71%.

The gap is 23 percentage points. For context: if the evaluation system were a medical test, a 23-point accuracy gap between the automated screen and the expert diagnosis would be a crisis. In creative evaluation, it is the state of the art.

The pipeline had to build an evaluation system knowing its automated layer would be structurally miscalibrated for the thing it needed to measure most: whether the prose captures something that feels alive.


Why LLM Judges Get Creative Quality Wrong

The miscalibration is not random. It has a mechanism, and the mechanism matters for pipeline design.

Self-enhancement bias. LLM judges systematically prefer LLM-generated text. The bias operates through a perplexity-driven mechanism: text generated by language models has lower perplexity to other language models, because the generating model and the evaluating model share training distributions. The evaluator literally finds AI-generated text more “natural” than human text --- not because it is, but because it matches the evaluator’s own statistical expectations. When the same model family generates and evaluates, the result is circular validation. The pipeline is grading its own homework.

This is not a calibration problem that more data solves. It is structural. The evaluator’s preference for machine-generated text is a feature of how language models process probability distributions, not a bug in any particular prompt or rubric. Kamoi et al. (TACL 2024) and Huang et al. (ICLR 2024) established that LLMs cannot reliably self-correct without external grounding. The creative evaluation domain makes this finding worse: in reasoning tasks, “correct” is verifiable. In creative tasks, “good” is contested. The external ground truth that could calibrate the evaluator does not exist in any computationally accessible form.

The homogeneity blind spot. LLM judges are least reliable on the dimension that matters most for voice reproduction. The homogeneity problem --- LLM outputs converging toward the statistical center of training distributions, producing text that is 19-38% less semantically diverse than human text according to Wenger and Kenett (2025) --- is precisely the dimension where LLM evaluation is least trustworthy. An LLM judge evaluating whether AI prose has sufficient variation is an instrument miscalibrated for the thing it is measuring. The judge’s own training produces the same convergence it is trying to detect.

The complexity preference. LLM judges favor complexity over genuine creativity. They prefer longer responses, more elaborate constructions, more hedged and qualified prose --- precisely the features that LLMs themselves produce. When evaluating MacDonald pastiche, this creates a specific risk: MacDonald’s narrator is genuinely digressive, genuinely qualified, genuinely complex. An LLM judge may score highly on these dimensions not because the prose captures MacDonald’s voice but because it matches the judge’s own production tendencies. The overlap between “what MacDonald does” and “what LLMs do by default” is a confound the evaluation system cannot eliminate.


The Four-Layer Architecture

The cross-domain synthesis converged on a single conclusion: no individual evaluation method is reliable for creative output. The only viable approach is layered --- multiple methods addressing different dimensions, with human judgment as the final authority.

The pipeline implements four layers. Each addresses a different tier of the systematizability taxonomy that structures the entire project.

[Figure] The four-layer evaluation architecture: Stylometric Baseline, LLM-as-Judge, Expert Human Review, and Regression Tracking. Each layer addresses different tiers of the systematizability taxonomy, from mechanical measurement at the top to human judgment and trend tracking at the bottom.

Layer 1: Stylometric Baseline

What it measures: Surface features --- sentence length, vocabulary profile, punctuation frequency, function word ratios, register distribution. The measurable, mechanical dimensions of MacDonald’s prose.

How it works: Before the pipeline could evaluate anything, it needed a calibration target. No published stylometric analysis of George MacDonald existed. The pipeline built one. A corpus study analyzed nine MacDonald fantasy works totaling over 408,000 words from Project Gutenberg, extracting quantitative profiles across fifteen dimensions. MacDonald’s mean sentence length: 17.7 words, standard deviation 15.2. Function word ratio: 50.3%. Semicolon frequency: 8.3 per thousand words. Register split: approximately 75% Anglo-Saxon, 25% Latinate. These numbers became the baseline.
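The surface features the baseline records can be extracted mechanically. Here is a minimal sketch of that kind of profiling; the function-word list is a tiny illustrative subset (a real profile would use a standard closed-class inventory of a few hundred entries), and the pipeline's actual extraction code and its fifteen dimensions are not published here.

```python
import re
import statistics

# Illustrative subset of English function words; an assumption, not the
# pipeline's actual word list.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "but", "or", "of", "to", "in", "on", "at",
    "for", "with", "by", "is", "was", "it", "he", "she", "i", "you",
    "that", "this", "as", "not", "be", "had", "have", "his", "her",
}

def stylometric_profile(text: str) -> dict:
    """Extract the surface features named in the baseline."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "mean_sentence_len": statistics.mean(lengths),
        "sentence_len_sd": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / len(words),
        "semicolons_per_1000": 1000 * text.count(";") / len(words),
    }
```

Run over the nine-work corpus, numbers like the 17.7-word mean and 8.3 semicolons per thousand words fall out of exactly this kind of computation.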

The voice specification that emerged from this study is the pipeline’s most iterated artifact. It encodes Tier 1 constraints --- the measurable features an agent can be instructed to reproduce --- as hard boundaries. Target a mean sentence length of 17-18 words. Maintain a function word proportion of approximately 50%. Use semicolons at roughly 8 per thousand words.

What it can tell you: Whether the generated prose lives in the right statistical neighborhood. If sentence length drifts to a mean of 25 or function words drop to 40%, something has gone wrong at a level the stylometric layer can detect and the pipeline can correct automatically.
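The automatic detection described here amounts to a tolerance check against the baseline. A minimal sketch, with the caveat that the tolerance bands below are assumptions for illustration, not the pipeline's actual thresholds:

```python
# (corpus value, allowed drift) per feature -- the drift bands are assumed.
BASELINE = {
    "mean_sentence_len": (17.7, 4.0),
    "function_word_ratio": (0.503, 0.05),
    "semicolons_per_1000": (8.3, 4.0),
}

def gate_check(measured: dict) -> list[str]:
    """Return a flag for each measured feature outside its tolerance band."""
    flags = []
    for feature, (target, tolerance) in BASELINE.items():
        value = measured.get(feature)
        if value is not None and abs(value - target) > tolerance:
            flags.append(f"{feature}: {value:.2f} vs baseline {target:.2f}")
    return flags

# A mean sentence length of 25 and a 40% function word ratio both drift
# outside the band and get flagged.
flags = gate_check({"mean_sentence_len": 25.0, "function_word_ratio": 0.40})
```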

What it cannot tell you: Whether the prose is any good. MacDonald’s sentence length distribution is a fact about his writing. Matching it is necessary but tells you nothing about whether the resulting text captures what makes MacDonald’s prose worth reading. The stylometric layer is the floor, not the ceiling.

One honest problem: The corpus study’s punctuation profile reflects Project Gutenberg’s digitized texts, which undergo editorial normalization. The voice specification instructs “never em dashes” based on this analysis --- but the story uses em dashes throughout. The evaluation’s Surface Features dimension flagged the inconsistency and scored it 4 out of 5. Whether this represents a pipeline failure to enforce the constraint or evidence that the corpus study’s punctuation norms reflect the digitizer’s conventions rather than MacDonald’s authorial intent is an open question. It is the kind of gap that emerges when you build a stylometric baseline from digitized Victorian texts and try to use it as a production constraint. The data is real; whether it measures what you think it measures is another matter.

Layer 2: LLM-as-Judge

What it measures: Six prose-level dimensions (Worldview, Narrative Logic, Thematic Consistency, Emotional Register, Voice, Surface Features) and three story-level dimensions (Narrative Completeness, Coherence Over Distance, Homogeneity Resistance). Each scored 1-5 with detailed reasoning.

How it works: The evaluation rubric was designed around the faithfulness hierarchy that emerged from the Creative Intelligence research: worldview > narrative logic > thematic consistency > emotional register > voice > surface features. This hierarchy inverts the naive approach. Most AI writing evaluation starts with surface features --- does the text sound right? --- and treats deeper dimensions as nice-to-haves. The Psychology of Mimicry lens found the opposite: readers of pastiche forgive surface deviations when the narrative world feels authentic but reject stylistically perfect surfaces that feel hollow. The depth-over-surface principle --- established through literary history in the Taxonomy piece --- is encoded directly in the rubric's weights: Worldview carries a weight of 6; Surface Features carries a weight of 1.

The nine dimensions are not arbitrary. They decompose voice along the three-tier systematizability taxonomy. Surface Features and Voice address Tier 1 (automatable). Emotional Register and Thematic Consistency address Tier 2 (partially systematizable). Worldview and Narrative Logic address Tier 3 (resistant to automation). Homogeneity Resistance is a meta-dimension --- it measures whether the pipeline’s output exhibits the convergence problem the homogeneity studies predict. It is the evaluation system checking for the failure mode that Wenger and Kenett and Kirilloff et al. document as near-universal in LLM output.

What it can tell you: Where the generated prose sits on each dimension, with reasoning that identifies specific passages and patterns. The evaluation output is not just a number --- it is a 2,000-word critical assessment that quotes the text, identifies strengths and weaknesses, and provides four specific homogeneity flags.

What it cannot tell you: Whether its own assessments are correct. The 71% figure looms over every score. When the evaluator says “This story is alive,” that judgment carries the full weight of the self-enhancement bias. The evaluator may be responding to the prose’s genuine qualities. It may be responding to its own statistical familiarity with the text’s patterns. The evaluation cannot distinguish between these two possibilities, and neither can I.

Layer 3: Expert Human Review

What it measures: Everything the first two layers cannot --- aesthetic response, reading experience, the ineffable question of whether the prose feels alive.

How it works: A human reads the output. Not skimming for errors. Reading. The kind of reading where you notice whether a passage makes you slow down, whether the narrator’s voice feels like a voice or a performance, whether the garden in the story is a place you can imagine being in or a description you are processing.

This is where the three-tier taxonomy reaches its practical limit. Tier 3 dimensions --- worldview integration, narrative timing, purposeful variation --- cannot be measured computationally or reliably assessed by an LLM judge. They can only be assessed by a reader who knows MacDonald’s work well enough to feel when the pastiche lands and when it misses. The human layer is not a nice-to-have quality gate. It is the only evaluation mechanism the cross-domain synthesis identifies as reliable for the dimensions that matter most.

What it can tell you: Whether the prose works as prose. Whether the narrator’s voice sustains across eleven thousand words. Whether the theological vision feels inhabited or applied. Whether the story makes you want to keep reading.

What it cannot tell you: Whether the human evaluator’s judgment is distorted by the same biases the self-enhancement literature identifies --- investment in the pipeline’s success, familiarity with the output, the difficulty of maintaining critical distance when you have designed every component of the system that produced the text. The solo-practitioner limitation is structural, and this piece will not resolve it --- Article 5 addresses it directly.

Layer 4: Regression Tracking

What it measures: Whether the pipeline’s output quality is stable, improving, or degrading across runs.

How it works: Each pipeline run produces evaluation scores. Across runs, the scores form a trend. If Worldview drops from 5 to 3 between runs, something in the pipeline changed --- a prompt modification, a context window adjustment, a voice specification edit --- and the regression tracker flags it.

What it can tell you: Whether changes to the pipeline improve or degrade output quality, as measured by Layers 1-3.

What it cannot tell you: Whether the baseline scores were right to begin with. Regression tracking measures change from a starting point. If the starting point was miscalibrated by self-enhancement bias, the regression tracker will faithfully track changes relative to a flawed baseline.
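The trend check itself is simple. A sketch, assuming scores are stored per run as a dimension-to-score mapping; the drop threshold here is an assumption, not the pipeline's configured value:

```python
def regression_flags(runs: list[dict], threshold: float = 1.0) -> list[str]:
    """Compare each run's scores to the previous run and flag large drops."""
    flags = []
    for prev, curr in zip(runs, runs[1:]):
        for dim, score in curr.items():
            if dim in prev and prev[dim] - score > threshold:
                flags.append(f"{dim} dropped {prev[dim]} -> {score}")
    return flags

history = [
    {"Worldview": 5, "Voice": 4},
    {"Worldview": 3, "Voice": 4},  # Worldview regressed after a pipeline change
]
```

The Worldview drop from 5 to 3 in the example is exactly the case the layer exists to catch; a stable Voice score passes silently.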


What the Evaluation Found

The story --- “The Walled Garden,” approximately 11,000 words --- was evaluated against the full rubric. Here are the scores, and more importantly, what the evaluator said about each one.

The Top of the Rubric

Worldview: 5/5. The evaluator’s reasoning: “Grace is structurally embedded throughout, never episodic. The garden does not reward good behavior --- it simply operates by its own laws.” It identified the story’s key theological move as deeply MacDonald: “pride as self-diminishment (the garden literally contracts around Emma’s arguments), transformation as subtraction rather than addition.”

Read the passage the evaluator is responding to:

She had not tended this bed. She knelt. I cannot tell you why she wept. She could not have told you herself; and I will not pretend to explain what she could not, for that would be the worst sort of dishonesty --- the sort that makes a tidy thing of what ought to remain untidy.

And later:

And the tears fell into the soil; and the soil received them as it had received the water from the can --- without distinction, without preference. Tears and well-water and rain are all one to a garden.

Whether this captures MacDonald’s worldview is a question the evaluation system asks but cannot definitively answer. The evaluator says yes with high confidence. Kirilloff et al. say the evaluator is wrong 29% of the time on creative quality assessment. The prose exists. You can read it and judge.

Narrative Logic: 5/5. The evaluator noted: “Emma does not decide to change; exhaustion empties her of strategies, and her hands learn what her mind refuses.” It identified the story’s structural principle as MacDonald’s “finding-without-seeking” --- transformation that arrives through surrender, not conquest.

Thematic Consistency: 5/5. The water-as-life thread was the evaluator’s primary evidence: “water carried resentfully (sits on the surface, refused by soil), water spilled accidentally (the fall that opens space for presence), water poured freely (the soil ‘drank at once, as a thirsty child drinks’), tears falling into the deepest bed, and finally rain on the window at home --- the same water, the same gift, circulating through the whole cosmos.”

The Middle

Emotional Register: 4/5. The evaluator found the trajectory “largely earned and bodily” but noted: “the emotional register occasionally tips toward a warmth that is slightly more performative than MacDonald’s own --- the narrator’s affection for Emma sometimes becomes a kind of display.” MacDonald’s wonder, the evaluator argued, “is typically more austere, more bewildered, less rhetorically assured.”

Voice: 4/5. This score cuts close. The evaluator called the intrusive narrator “the story’s most striking achievement and its most persistent risk.” It identified something precise: “MacDonald digresses, but he also writes long stretches of clean narrative without commentary. This story’s narrator almost never lets a moment pass without glossing it.” The voice was “recognizably in MacDonald’s tradition but slightly over-concentrated --- it has absorbed his most distinctive mannerism and applied it more uniformly than he would.”

Read the passage and consider the evaluator’s claim:

But I must tell you something about questions before we go on, and I am not sure that I can tell it rightly, but I will try. A question that you have polished and carried about with you and said over to yourself is not, properly speaking, a question at all. It is a little performance.

That is the narrator at full expansion --- the self-interrupting, qualifying, circling voice that the evaluator says the story applies too uniformly. Whether the evaluator is right that MacDonald would vary the pattern more --- letting some moments pass without commentary, delivering a single-sentence aside and moving on --- is exactly the kind of assessment where the 71% figure matters. An expert MacDonald reader might agree, or might argue that this level of digression is exactly what MacDonald does in his most characteristic passages. The evaluation cannot settle the question. It can only raise it.

Surface Features: 4/5. The evaluation flagged two issues: inconsistent use of hyphens and em dashes in early sections, and one instance of italicized internal thought (“a snail”) that is a modern convention MacDonald would not have used. These are the kinds of findings the stylometric layer is built to catch --- and in this case did catch, though the pipeline did not fully correct them.

The Bottom

Narrative Completeness: 5/5. “The story feels whole. The ending transforms the beginning’s promise with remarkable precision.” The evaluator tracked the water motif from resentful carrying through the spill, through the tears, to the glass of water on the kitchen table throwing gold light. “The rain that begins at the close completes the water cycle that structured the entire day.”

Coherence Over Distance: 4/5. A seam was identified: “The transition from the garden’s deepest bed to the walk home is the most visible join --- there is a slight gear-change as the prose shifts from the garden’s enclosed intensity to the more expansive register of the homeward walk.” The evaluator also flagged the hedgerow-thinning passage as introducing a mode of vision from a different MacDonald tradition --- “more Lilith than Princess books” --- that does not quite match the garden’s logic.

The Score That Matters Most

Homogeneity Resistance: 3/5. This is the lowest score in the evaluation. It is also the score the three-tier taxonomy predicts would be lowest, because homogeneity is the documented, consistent, measurable failure mode of LLM-generated text.

The evaluator identified four specific flags:

Flag 1: Predictable macro-rhythm. “Three consecutive passages follow the same pattern: Emma performs an action, the narrator interrupts with ‘Now I must say’ or ‘But I must tell,’ extended philosophical reflection, return to action.” MacDonald would vary this by allowing some actions to pass without commentary.

Flag 2: Recurring qualifier constructions. The phrases “and I think,” “but I must,” “for there is a kind of,” “I do not know,” “I cannot tell you” recur at high frequency across all sections. “In the opening, middle, and closing, the narrator sounds the same pitch of humble uncertainty. MacDonald’s own narrators modulate --- sometimes confident, sometimes bewildered, sometimes playful, sometimes grave.”

Flag 3: Repetitive analogical structure. “The story relies heavily on a specific analogical form: ‘X is like Y, which is to say Z’ --- or its variant, ‘not X but Y, and Y is nearer to Z than we suppose.’” The construction appears in nearly every paragraph of the final third. “The analogies are individually strong but structurally repetitive.”

Flag 4: Uniform emotional temperature in closing. “The return-home section maintains a consistent register of tender wonder that does not vary enough internally. The snail passage, the door passage, the cats passage, and the kitchen passage all operate at the same emotional temperature.” MacDonald’s domestic closings typically include a sharper note --- comedy, practical domesticity, mild irritation --- that prevents the warmth from becoming uniform.
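At least one of these flags can be approximated mechanically. This sketch counts the recurring qualifier constructions the evaluator named, per section, as a crude homogeneity signal; comparing densities across opening, middle, and closing is my illustrative heuristic, not the evaluator's actual method:

```python
# The five constructions quoted in Flag 2.
QUALIFIERS = ["and i think", "but i must", "for there is a kind of",
              "i do not know", "i cannot tell you"]

def qualifier_density(section: str) -> float:
    """Qualifier occurrences per 1,000 words of a section."""
    text = section.lower()
    words = len(text.split())
    hits = sum(text.count(q) for q in QUALIFIERS)
    return 1000 * hits / words if words else 0.0
```

If the opening, middle, and closing sections all return similar densities, the narrator is sounding "the same pitch of humble uncertainty" throughout --- the uniformity Flag 2 describes.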

These four flags describe the homogeneity problem made specific. The pipeline’s output has absorbed MacDonald’s most distinctive mannerisms and applied them with a regularity that MacDonald himself avoids. It is convergence toward the statistical center of MacDonald’s voice --- the most frequent patterns over-represented, the less frequent variations under-represented. This is precisely the variability deficit Wenger and Kenett documented --- the measurable narrowing of semantic range that distinguishes machine-generated text from human text. The pipeline does it with MacDonald’s own patterns rather than generic LLM patterns, which is an achievement of the voice specification and the hierarchical generation architecture. But the convergence is there.


The Gap Between Scores and Reading

Here is the tension the evaluation system cannot resolve.

The evaluator wrote, in its narrative satisfaction assessment: “What works most deeply is the structural embedding of worldview. Emma’s transformation is not narrated as moral progress; it is enacted physically --- through blisters, through the weight of water, through hands that learn before the mind consents.”

Read the passage it references:

She had nothing left with which to rise. That was the simple thing. Her body had come to the end of what it would do for her pride, and it stopped, as a horse stops, without consulting her.

And later, after the turn:

I do not think she saw all this with her understanding; she saw it with her hands and her knees and the small of her back, which is a truer kind of seeing, though the philosophers have not much to say about it.

The evaluator scored Worldview at 5. It scored Homogeneity Resistance at 3. Those scores describe the same story. The worldview is inhabited with genuine depth. The inhabiting is done with insufficient variation. The prose reaches MacDonald’s most important dimension --- the theological grammar, the grace-as-ambient-condition --- and fails at the dimension Wenger and Kenett’s variability research says LLMs fail at by default: producing enough variation to sound like a living voice rather than a statistical composite.

The honest question: is the evaluator right about both? Can a story be deeply faithful to an author’s worldview while simultaneously exhibiting the homogeneity that marks machine-generated text? Or is the 3 on homogeneity a miscalibration --- the evaluator detecting a pattern that is present but not as damaging as the score suggests?

I do not know. The evaluation system cannot answer this question about itself. The evaluator’s assessment of homogeneity is subject to the same 71% reliability ceiling as every other creative judgment it makes. It may be correctly identifying a real limitation. It may be over-weighting the pattern because its own training makes it hypersensitive to repetition in ways that human readers are not.

What I can say is this: the four homogeneity flags are specific, identifiable, and verifiable. You can open the story and find the recurring “and I think” constructions. You can trace the scene-digression-scene-digression macro-rhythm through the middle sections. The flags are not hallucinated. Whether they constitute a score of 3 rather than 4, whether a human reader would weight them the same way --- that is the 29% gap in action.


The Evaluator Evaluating Itself

The narrative satisfaction assessment contains a passage that captures the evaluation system’s own limitation more precisely than I could state it:

The effect is still moving, but it is moving in the way that a very good sermon is moving, rather than in the way that a walked-through garden is moving --- and the story knows the difference between those two things, which makes the failure to fully enact the difference all the more pointed.

The evaluator is describing the gap between what the pipeline achieved and what MacDonald achieves. The story understands that silence is more powerful than commentary --- the narrator says so explicitly --- and then fails to practice the silence it preaches. The evaluator can see this. The evaluator can articulate it with considerable precision. But the evaluator is the same class of system that generated the prose, and its ability to articulate the problem is not evidence that it can correctly weight the problem’s severity.

This is the evaluator’s dilemma in its most concentrated form. The automated layer is not useless. It produces detailed, specific, often insightful critical analysis. The four homogeneity flags are better literary criticism than most automated evaluation systems produce for any creative task. But the question of whether the evaluation is calibrated --- whether a 3 means what a 3 should mean, whether the overall 4.55 reflects genuine quality or self-enhancement bias inflating the score --- cannot be answered from inside the system.

WritingBench’s dynamic evaluation approach achieves 83% alignment with human judgment, which may approach ceiling performance given that human-human inter-rater reliability for creative writing sits somewhere in the 70-85% range. The 85% ceiling figure is itself unverified --- it comes from general inter-rater reliability research, not from MacDonald-specific evaluation. But it suggests something important: even perfect automated evaluation would disagree with any given human expert 15-30% of the time, because human experts disagree with each other at that rate.

The evaluation system is not broken. It is operating near the boundary of what evaluation can do for creative output, and that boundary is lower than most engineering projects acknowledge.


Why Build It Anyway

A reasonable question: if the automated evaluation layer is structurally miscalibrated for the dimensions that matter most, why build it?

Three reasons.

First, because the alternative is nothing. Most agent engineering projects that tackle creative output skip systematic evaluation entirely. They generate text, a human reads it, the human says “looks good” or “try again,” and the process is undocumented. There is no rubric, no scoring, no specific identification of what works and what does not. The absence of evaluation is not humility about measurement’s limits. It is avoidance of the hardest problem in the system.

The pipeline’s evaluation may be miscalibrated. It is also specific, documented, reproducible, and honest about its limitations. That is a different kind of artifact than “I read it and it seemed fine.”

Second, because the evaluation problem is the portfolio problem. The research identified evaluation design as the strongest portfolio differentiator in agent engineering --- precisely because most practitioners skip it. Building a four-layer evaluation architecture that honestly documents where each layer succeeds and fails demonstrates something that a working pipeline alone does not: the ability to design measurement systems for subjective output, to identify the boundaries of automated assessment, and to integrate human judgment as a first-class architectural component rather than an afterthought.

Third, because the evaluation output is itself evidence. The scores, the reasoning, the homogeneity flags --- these are not just metrics. They are a detailed critical assessment of the pipeline’s output that future pipeline iterations can target. The 3 on Homogeneity Resistance is not just a score. It is accompanied by four specific flags that identify four specific patterns the next iteration can attempt to address: varying the macro-rhythm, reducing qualifier construction density, diversifying analogical structure, introducing tonal variation in domestic scenes. The evaluation system’s imperfect measurement still produces actionable engineering feedback.


The Rubric as Architecture

The evaluation rubric is not a grading sheet. It is the three-tier systematizability taxonomy made operational.

The six prose dimensions decompose along the taxonomy’s gradient. Surface Features sits at Tier 1: measurable, automatable, checkable by the stylometric baseline. Voice spans Tier 1 and Tier 2: partially measurable through stylometric comparison, partially requiring LLM judgment about whether the narrator’s patterns match MacDonald’s patterns. Emotional Register and Thematic Consistency sit at Tier 2: the LLM judge can identify patterns and flag inconsistencies, but whether those patterns work as literature requires human assessment. Worldview and Narrative Logic sit at Tier 3: the LLM judge can describe what it sees, but whether it is seeing correctly is the 71% question.

The three story-level dimensions address concerns that only emerge at scale. Narrative Completeness asks whether the ending transforms the beginning --- a question that requires holding the entire 11,000-word arc in view. Coherence Over Distance asks whether the voice and logic hold across that distance without seams. Homogeneity Resistance asks the meta-question: does the output exhibit the convergence problem the homogeneity studies predict?

The weighting encodes the faithfulness hierarchy. Worldview at 6x. Surface Features at 1x. This means a story that scores 5 on Worldview and 3 on Surface Features grades higher than one that scores 5 on Surface Features and 3 on Worldview. The weighting is a design decision, not an empirical finding --- no study has tested whether readers weight these dimensions in this order. But the literary evidence is consistent: readers forgive surface deviations when the world feels right. They do not forgive hollow worlds dressed in perfect surfaces.
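The grading claim above is arithmetic. Only the Worldview (6) and Surface Features (1) weights are stated in the text, so this sketch uses just those two to illustrate the ordering; the remaining dimension weights are not given here:

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of dimension scores."""
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in weights) / total_weight

WEIGHTS = {"Worldview": 6, "Surface Features": 1}

# Deep world, flawed surface vs. perfect surface, hollow world.
deep = weighted_score({"Worldview": 5, "Surface Features": 3}, WEIGHTS)
shallow = weighted_score({"Worldview": 3, "Surface Features": 5}, WEIGHTS)
```

The 5/3 story grades at roughly 4.71, the 3/5 story at roughly 3.29: the weighting makes the depth-over-surface principle a property of the arithmetic, not a reviewer's discretion.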


What the Corpus Study Made Possible

The stylometric baseline --- Layer 1 of the evaluation architecture --- required a corpus study that did not exist. No published computational analysis of MacDonald’s prose had been conducted. The pipeline needed one before it could build evaluation gates, so the project conducted one, analyzing 408,000 words across nine MacDonald fantasy works.

The study produced fifteen measurable dimensions. Sentence length distributions. Vocabulary profiles. Function word ratios. Punctuation frequencies. Register analysis. Paragraph length distributions. Each dimension became a constraint in the voice specification and a checkpoint in the evaluation.

Some of what the study found was expected: MacDonald’s elevated register, his heavy semicolon use, his long compound sentences built through additive coordination. Some was not: the 50.3% function word ratio is higher than that of most Victorian prose, reflecting MacDonald’s narrator-heavy style with its constant first-person qualifications (“I think,” “I do not know,” “I cannot tell you”). The hapax rate of 50.9% --- the percentage of distinct word forms that appear only once in the corpus --- is high, indicating that MacDonald varies his word choice more than his reputation for repetitive moral instruction might suggest.
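Both of those metrics are simple enough to sketch. This is not the study's code; the function word list below is a small illustrative subset, not the list the corpus study used, and the tokenizer is deliberately crude.

```python
# Sketch of two corpus-study metrics: function word ratio (share of all
# tokens that are function words) and hapax rate (share of distinct word
# forms occurring exactly once). FUNCTION_WORDS is an illustrative subset.
from collections import Counter
import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "but", "or", "of", "to", "in", "that",
    "it", "i", "he", "she", "not", "do", "was", "is", "you", "me",
}

def stylometric_profile(text: str) -> tuple[float, float]:
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    func_ratio = sum(c for w, c in counts.items() if w in FUNCTION_WORDS) / len(tokens)
    hapax_rate = sum(1 for c in counts.values() if c == 1) / len(counts)
    return func_ratio, hapax_rate

sample = "I think I cannot tell you what the shadow was, and I do not know."
ratio, hapax = stylometric_profile(sample)  # → (0.6, ~0.92)
```

Even this toy sentence shows the mechanism behind the finding: MacDonald's qualification-heavy narration pushes the function word share up, while his varied content vocabulary keeps the hapax rate high.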

The corpus study is the foundation the entire evaluation architecture rests on. It is also the component most vulnerable to methodological critique. Project Gutenberg texts are editorially normalized. The punctuation profiles may reflect digitization conventions as much as authorial intent. The study analyzed MacDonald’s fantasy corpus specifically, not his realistic novels or sermons, which means the voice specification captures a subset of his range. These are honest limitations. They are also limitations that no subsequent study has corrected, because no subsequent study exists.


The Honest Gap

The evaluation system achieves something real. It produces detailed, specific, multi-dimensional assessment of creative output against an author-specific rubric grounded in corpus analysis and peer-reviewed research on voice reproduction. It identifies the pipeline’s strengths (worldview integration, narrative logic, thematic consistency) and its weaknesses (homogeneity, narrator over-concentration) with a precision that most creative AI evaluation does not attempt.

It also has a structural flaw at its center. The automated layer --- the LLM-as-judge that produces the detailed scores and reasoning --- is the layer that Kirilloff et al. and the self-enhancement bias literature identify as least reliable for the dimensions that matter most. The 80% agreement on objective tasks drops to 60-68% in expert domains. The 94% human accuracy versus 71% LLM accuracy on AI text detection. The self-enhancement bias that makes the evaluator prefer text generated by its own model family. These are not theoretical concerns. They are documented, peer-reviewed, replicated findings that apply directly to the evaluation system’s automated layer.

The four-layer architecture is the cross-domain synthesis’s answer to this problem: do not rely on any single layer. Use the stylometric baseline for what it can measure. Use the LLM judge for what it can articulate. Use the human for what only a human can assess. Use regression tracking to detect when things change. The layers are not redundant. They address different tiers of the taxonomy with different levels of reliability.
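The layering can be sketched as a data shape plus the one layer that is purely mechanical, regression tracking. The layer names come from the text; the `EvaluationReport` container, the function signatures, and the 0.5 tolerance are hypothetical scaffolding, not the pipeline's actual API.

```python
# Sketch of the four-layer evaluation described above. All names and
# thresholds here are illustrative assumptions, not the pipeline's API.
from dataclasses import dataclass, field

@dataclass
class EvaluationReport:
    stylometric: dict[str, float] = field(default_factory=dict)   # Layer 1: measured
    judge_scores: dict[str, float] = field(default_factory=dict)  # Layer 2: articulated
    human_notes: list[str] = field(default_factory=list)          # Layer 3: assessed
    regressions: list[str] = field(default_factory=list)          # Layer 4: tracked

def regression_flags(current: dict[str, float], baseline: dict[str, float],
                     tolerance: float = 0.5) -> list[str]:
    """Layer 4: flag any dimension that dropped more than `tolerance`
    against the previous run's scores."""
    return [d for d, s in current.items()
            if d in baseline and baseline[d] - s > tolerance]

previous = {"worldview": 5.0, "homogeneity_resistance": 3.0}
latest = {"worldview": 5.0, "homogeneity_resistance": 2.0}

report = EvaluationReport(
    stylometric={"function_word_ratio": 0.503},
    judge_scores=latest,
    human_notes=["narrator over-concentration in domestic scenes"],
    regressions=regression_flags(latest, previous),  # → ["homogeneity_resistance"]
)
```

Note what the shape makes explicit: three of the four fields hold judgments of varying reliability, and only the regression check is a pure comparison --- it cannot say whether a score was ever right, only that it moved.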

But the architecture is theoretically sound and empirically untested for literary voice evaluation specifically. No study has validated whether stylometric baselines correlate with perceived voice fidelity for AI-generated pastiche. No study has tested whether the faithfulness hierarchy’s weighting matches how readers actually weight these dimensions. The evaluation system is the best architecture the peer-reviewed evidence supports. Whether it measures what it claims to measure for MacDonald pastiche specifically is a question the project itself is generating evidence for --- and the evidence is mixed.

The story scored 4.55 out of 5. The evaluator called it alive. The homogeneity resistance scored 3. The evaluator identified four specific patterns where the prose converges rather than varies. Both findings are real. Both are produced by a system whose reliability Kirilloff et al. place at 71% for this class of judgment.

That is the evaluator’s dilemma. Not that the evaluation system fails --- it does not fail. It produces the most detailed creative evaluation I have seen from any agent pipeline. The dilemma is that the system’s own automated layer is structurally miscalibrated for the thing it needs to measure most, and the only layer that can correct for that miscalibration is the human layer, which in this project is also the builder. The evaluation is honest. Whether it is accurate is a question that lives in the 29% gap between what LLM judges see and what expert humans see --- and the only way to close that gap is to put the story in front of readers who know MacDonald’s work and let them judge.

The story is published alongside these process articles. The evaluation scores are public. The homogeneity flags are specific. The reader has everything the evaluation system has, plus the one thing the evaluation system lacks: a human response to the prose that is not produced by the same class of system that generated it.