← the golden key

The Voice in the Machine

How the pipeline studied 408,000 words of MacDonald's prose and built a voice specification from measurements that had never been taken.

Listen:

Researched across 41 documents spanning computational stylistics, cognitive science, and literary scholarship. Primary evidence drawn from the pipeline's corpus study of nine MacDonald works and the voice specification it produced.

No published stylometric analysis of George MacDonald existed. Not in the computational humanities literature, not in the MacDonald scholarly tradition, not in any corpus linguistics database I could find. The pipeline needed quantitative baselines before it could build evaluation gates --- measurements against which generated prose could be checked, dimensions that could be scored and tracked across revision cycles. So it studied 408,000 words of MacDonald’s fantasy corpus, extracted from nine works via Project Gutenberg, and produced a voice specification that decomposes his prose into fifteen measurable dimensions, from sentence-length distributions (mean 17.7, SD 15.2) to function word ratios (50.3%). Whether those numbers capture anything that matters about MacDonald’s voice is the question the story has to answer.

This piece is about what that study found, how it was built, and what it revealed about the distance between measurement and voice.


The Corpus

The study began with acquisition. Project Gutenberg hosts MacDonald’s fantasy works in plain text, public domain, freely downloadable. Nine works entered the corpus:

Phantastes (1858). At the Back of the North Wind (1871). The Princess and the Goblin (1872). The Princess and Curdie (1883). The Light Princess (1864). Cross Purposes (1862). The Golden Key (1867). The Giant’s Heart (1863). Lilith (1895).

Four novels, five shorter fairy tales. The selection was deliberate: only fantasy prose, not the Scottish realist novels, not the sermons, not the poetry. The pipeline was going to produce a fairy tale, so the baseline needed to be MacDonald’s fairy-tale voice --- not his voice in aggregate, which would dilute the very features the pipeline needed to reproduce.

Total word count: 408,000+. Enough for statistical reliability on the surface features. Enough to establish distribution shapes rather than point estimates. Not enough, probably, to capture everything that matters about how MacDonald writes --- but I will come to that.

The Gutenberg texts are not facsimiles. This matters more than it might seem, and I will return to it later in this piece, at the point where it matters most. The texts undergo editorial normalization during digitization: punctuation conventions are standardized, formatting is regularized, and the resulting plain text reflects the digitizer’s house style as much as the author’s manuscript. For surface-level features like vocabulary and sentence length, this is tolerable. For punctuation profiles --- the precise frequency and type of dashes, the use of colons before quotations, the spacing around semicolons --- the Gutenberg text is a filtered signal. You are measuring the corpus as digitized, not necessarily the prose as written.

I proceeded anyway. There was no alternative. MacDonald’s original manuscripts are scattered across university archives, and no digitized critical edition of his fantasy works exists with original punctuation preserved. The pipeline needed a baseline, and the baseline available was Gutenberg. The honest thing to do was to use it and note the limitation.


Three-Tier Extraction

The corpus study was designed around the three-tier systematizability taxonomy that structures the entire pipeline. The taxonomy says that voice features decompose into three categories: reliably automatable surface features, partially systematizable structural patterns, and resistant depth features that require human direction. The extraction process followed the same gradient.

Tier 1: Mechanical Measurement

Tier 1 features are the ones you can count. Run the corpus through standard NLP tooling and you get numbers: sentence lengths, vocabulary distributions, punctuation frequencies, function word ratios, register profiles. These are the features that computational stylistics has been measuring since Burrows’ Delta established the field --- the statistical fingerprint of a writer’s prose at the surface level.

Here is what the study found for MacDonald’s fantasy corpus:

Sentence length. Mean 17.7 words, standard deviation 15.2. The high SD tells you more than the mean does: MacDonald’s sentences range wildly. A third of them are ten words or fewer. Another third exceed twenty. The distribution is right-skewed --- lots of short declarative sentences punctuated by occasional long compound constructions that sprawl past fifty words, building through additive coordination rather than deep subordination.

The voice spec translates this into an agent instruction: “Follow clusters of 2-3 short sentences (5-10 words) with one long sentence (30-50 words), then return to medium length (14-20 words).” That instruction is a reasonable approximation. But read the actual prose it approximates:

“I looked, and saw a little flower.” --- Lilith, chapter 5

Five words. Then, from the same novel, within pages:

“I was in a world, or call it a state of things, an economy of conditions, an idea of existence, so little correspondent with the ways and modes of this world” --- Lilith, chapter 3

That sentence gropes through four attempts at naming what it describes --- “a world, or call it a state of things, an economy of conditions, an idea of existence” --- each phrase reaching for precision and falling short, each revision embedded in the syntax rather than replacing what came before. The mean and standard deviation capture the range of sentence lengths. They do not capture the relationship between the short sentences and the long ones --- the fact that the short ones are perceptually grounded (“I looked, and saw”) while the long ones are epistemically uncertain, the syntax itself enacting the narrator’s struggle to articulate.

Register. 25% Latinate vocabulary, 75% Anglo-Saxon. MacDonald’s deepest truths use the simplest words. The Latinate vocabulary enters through the narrator’s scholarly habits --- intellectual groping, analogy, self-correction. “An adumbration.” “An economy of conditions.” “The polarisation.” These are not ornamental. They are a narrator reaching for vocabulary adequate to describe something the plain words cannot hold, and then --- characteristically --- falling back to the plain words anyway.

Archaic diction. 0.15% of total words. Vanishingly small. MacDonald is less archaic than his reputation suggests. The archaisms cluster in speech rather than narration, assigned primarily to supernatural or spiritually authoritative characters. “Ere” over “before” in prophetic speech. “Thou” only when addressing the divine. A single archaic word embedded in paragraphs of plain modern diction, so it feels organic rather than affected.

Punctuation. Semicolons at 8.3 per thousand words --- heavy, used to create balanced parallel structures in reflective passages. Dashes at 5.3 per thousand words for parenthetical interjections. The colon-dash combination :-- appearing occasionally to introduce dramatic revelations. Exclamation marks at roughly 8 per thousand words, used in both dialogue and narration to convey wonder.

Function words. 50.3% of the text. “I” at roughly 20 per thousand words, leaning hard into first-person retrospective constructions. “But” at approximately 9 per thousand, opening nearly every third or fourth sentence to create a dialectical rhythm --- statement, pivot, restatement. The function word profile is what gives MacDonald’s narration its characteristic texture of continuous self-qualification: the narrator who cannot leave a claim alone, who must immediately turn it over, add a caveat, circle back.

Paragraph length. Mean 46 words, median 21. The gap between mean and median reveals the distribution: lots of short dialogue paragraphs (5-20 words) interleaved with longer expository or philosophical passages (80-150 words). The rhythm MacDonald achieves is breathing --- long inhale of exposition, short exhale of dialogue, repeated.

These numbers are precise. They are replicable. Anyone with the same corpus and the same NLP tools would get the same results. They are also, by themselves, profoundly insufficient.

A passage of Victorian prose on aged parchment with precise measurement annotations radiating outward — Sentence Length labeled on the left, Register Markers on the upper right, and Punctuation Profile on the lower right — resembling a scientific anatomical specimen plate
The voice specification as anatomical plate. Tier 1 measurements — sentence length, register markers, punctuation profile — annotated on a specimen passage. The measurements are precise. Whether they capture what matters about MacDonald’s voice is another question.

Tier 2: Agent-Extracted Patterns

Tier 2 is where the extraction shifts from counting to interpretation. These are the patterns that an LLM can identify in the corpus but that resist reduction to a single number: imagery systems, thematic preoccupations, narrative stance, paragraph rhythm, tonal register shifts. The pipeline used agent-based analysis --- Claude reading the corpus passages and extracting structural patterns --- validated against my own reading and the secondary MacDonald scholarship.

Imagery clusters. The study identified eight recurring imagery systems in MacDonald’s fantasy corpus. Each operates as more than metaphor --- MacDonald treats these as the physics of his fictional worlds.

Light is a living entity. When MacDonald describes any light source, he gives it volition --- it hovers, waits, lingers, offers, withdraws. The sun in “The Walled Garden” is not a backdrop. He “had been at his work there long before Emma came, and had laid his hands on every crack and roughness as if he knew them each by name.” He grieves. He loves. The specification instructs: “Use personified pronouns for sun and moon.” But the instruction cannot capture what the personification means in MacDonald’s world --- that light which is grasped or possessed should die, that light which is received or followed should guide. The rule is Tier 1. The theology behind the rule is Tier 3.

Water maps spiritual states. Absence of water equals spiritual drought. Tears are functional, not sentimental --- the irrigation of moral growth. In “The Walled Garden,” water carried resentfully sits on the surface and refuses to sink. Water poured freely is received “as a thirsty child drinks.” The same substance, behaving differently depending on the quality of attention behind the hands that pour it. You can instruct an agent to use water imagery. You cannot instruct it to understand that MacDonald’s water is not symbolic of life --- it is life, presented as the literal mechanism of spiritual vitality.

Shadow is invasive anti-self: not the absence of light but an autonomous entity that displaces identity. Threshold architecture: nested spaces connected by varied passages, the deeper you go the more spiritually charged. The moon as ambiguous feminine watcher --- conscious, cold, illuminating without guiding. The earth as living body, with geological features described in physiological vocabulary. Scale-collapse: the domestic and the cosmic occupying the same sentence.

Each of these patterns was extracted with examples, annotated with source passages, and written into the voice spec as a Tier 2 guideline: “Follow them with judgment --- mechanical repetition is worse than omission.”

Paragraph rhythm. The study identified six recurring paragraph shapes. The most characteristic is what the spec calls “Short Declarative Anchor, Then Expanding Spiral” --- a paragraph that opens with a grounded statement of five to twelve words and then spirals outward through progressively longer sentences that add subordinate clauses, sensory detail, or metaphysical reflection. Read any page of MacDonald and you will find it:

She stood before the gate in the wall.

It was old stone - warm already with the early sun, who had been at his work there long before Emma came, and had laid his hands on every crack and roughness as if he knew them each by name. The wall was taller than it had looked from the lane. The gate was wood, grey with years and soft with them; and it opened at her touch so lightly that she felt almost offended, as though it had not thought her worth the trouble of resisting.

Five words, then a sentence that unfolds across sixty. The spec names the shape. The agent can reproduce it. But the decisions about when to spiral and when to stay short --- when the scene earns expansion and when it demands compression --- are narrative timing decisions that resist scheduling.

Tonal register shifts. MacDonald’s prose moves between registers with a fluency the spec attempts to codify: sublime-to-domestic deflation via dialogue, cosmic personification erupting from plain narration, didactic-philosophical intrusion yielding to childlike wonder, the parenthetical narrator breaking the fictional frame. Each is documented with examples and instructions. The most important --- the parenthetical narrator intimacy --- is the voice’s signature:

But I must tell you something about questions before we go on, and I am not sure that I can tell it rightly, but I will try. A question that you have polished and carried about with you and said over to yourself is not, properly speaking, a question at all.

The narrator interrupts the story to speak directly to the reader, admitting inadequacy, circling a thought, resuming with a slight syntactic restart. The spec instructs: “Use dashes and colons rather than formal subordinate clauses. Resume the narrative with a slight syntactic restart.” The instruction is correct. It is also not the same thing as knowing when the narrator should interrupt --- which moments in the story earn the aside, which images are strong enough to carry their own weight without gloss.

Tier 3: Human-Curated Reference

Tier 3 was not extracted by the pipeline. It was written by a human reading MacDonald with the taxonomy in hand, identifying the features that resist reduction to rules or metrics. The voice spec’s Tier 3 section is the shortest and the most important:

MacDonald’s theological orientation --- his rejection of Calvinist predestination, his embrace of universal reconciliation, his insistence that grace operates through nature and imagination rather than juridical decree --- is not a theme layered onto stories. It is the lens through which every scene is constructed. A MacDonald pastiche that gets the theology wrong has failed at the highest level of the faithfulness hierarchy, regardless of how precisely it matches the sentence-length distribution.

The value of MacDonald’s “flaws.” His didacticism, his prolix style, his intrusive narrator, his recursive digressions --- these are integral to his voice, not defects to correct. A pastiche that smooths away the didacticism or tightens the pacing has lost the voice while preserving the surface. The spec states this bluntly: the patterns established in Tier 1 and Tier 2 create rhythms. MacDonald’s power comes partly from knowing when to violate those rhythms. Those moments of disruption distinguish voice from statistics. They cannot be scheduled.

Pattern-breaking for effect. Narrative timing. The relationship between form and meaning --- the way MacDonald’s nested clause structures enact the interpenetration of material and spiritual worlds rather than merely describing it. These are named, annotated with examples, and flagged as director-reference material: the spec does not deliver them to agents as instructions, because instructions cannot capture what they describe.


The Voice Spec as Artifact

The voice specification is the pipeline’s most iterated file. It went through more revision cycles than any other artifact --- rewritten after generation runs revealed gaps, tightened after evaluation scores exposed where the measurements weren’t translating into prose, restructured as the three-tier taxonomy took clearer shape.

In its final form, it opens with a framing statement: “You are composing original prose in the voice and world of George MacDonald. You are creating pastiche --- faithful imitation that honors the source --- not parody or forgery.” Then the three tiers, organized from hard constraints to guidelines to context.

The spec is an operational document. It sits in the writer agent’s context window as a cached prefix --- positioned at the start of the context, where the lost-in-the-middle effect (Liu et al., TACL 2023) will not degrade attention to it. The Tier 1 measurements serve as the floor: violate these and the pastiche breaks immediately. The Tier 2 patterns serve as the middle layer: follow with judgment. The Tier 3 context serves as the ceiling: the worldview that everything else must serve.

The spec also includes standing creative guidance that applies to all generated prose: original scenes only, thematic content integrated naturally rather than stated as moral, narrator intrusion where the scene permits, and a priority instruction that is the taxonomy made operational: “Prioritize the higher dimensions of the faithfulness hierarchy (worldview, narrative logic, thematic consistency) over the lower ones (punctuation, word frequency). Getting the surface right while missing the worldview produces a statistical imitation, not a pastiche.”

Whether this layered specification --- measurement at the base, pattern in the middle, worldview at the top --- actually produces prose that feels like MacDonald is the question that the story answers. The spec can be evaluated on its own terms: are the measurements accurate? Are the patterns correctly identified? Are the Tier 3 descriptions faithful to MacDonald’s theological and narrative vision? But the spec’s real test is generative. Does prose written under its guidance sound like MacDonald to a reader who knows MacDonald?

The evaluation scored the story’s voice at 4 out of 5. The evaluator noted that the intrusive narrator --- the spec’s most distinctive instruction --- was “the story’s most striking achievement and its most persistent risk.” The voice was “recognizably in MacDonald’s tradition but slightly over-concentrated --- it has absorbed his most distinctive mannerism and applied it more uniformly than he would.” MacDonald himself varies his approach: sometimes he lets narrative carry meaning without commentary, sometimes he delivers a single-sentence aside, sometimes he expands into a full paragraph. The story almost always chooses the full expansion.

This is a Tier 2 failure, not a Tier 1 failure. The surface features are correct. The pattern is correctly identified. But the variation within the pattern --- the judgment about when to use it and when to withhold --- is exactly the kind of partially systematizable feature that the taxonomy predicts will be hardest to get right. The spec describes the pattern. It cannot describe the silence that should interrupt it.


The Lore Bible

The voice spec tells the writer agent how MacDonald writes. The lore bible tells it what MacDonald’s fictional worlds contain --- organized, like everything else, by the faithfulness hierarchy.

The lore bible’s structure reveals the hierarchy’s practical consequences. Level 1 (Worldview) gets 133 words. Level 6 (Surface Features) gets 868. But the weight is inverted: those 133 words of worldview are the most consequential content in the document. They constrain everything below them. If the worldview is wrong --- if grace becomes transactional, if death becomes permanent ending, if evil becomes a rival power rather than a privation --- then no amount of correct semicolon frequency will save the result.

The middle levels carry the bulk: Thematic Consistency at 3,028 words, Emotional Register at 2,806, Voice at 4,025. These are the Tier 2 features --- the imagery systems, the arc shapes, the tonal registers --- laid out with source examples and failure descriptions. What does a thematic consistency failure look like? A story where the water imagery is present but disconnected from spiritual states. Where the light personification is decorative rather than theological. Where the garden is a garden and not a cosmos.

The lore bible works alongside the voice spec in the pipeline’s context architecture. The voice spec enters the writer agent’s context window as cached instruction. The lore bible is accessible through MCP tool calls --- the agent can look up specific entries during generation rather than carrying the full 11,000-word document in context. This is a context engineering decision: the voice spec is stable guidance that belongs in the prefix. The lore bible is reference material that belongs in retrieval.


The Em-Dash Problem

Here is where honest accounting requires uncomfortable specificity.

The voice spec’s punctuation profile states: “Use regular dashes (hyphens or en dashes, never em dashes) at ~5 per 1000 words.” This instruction derives from the Tier 1 corpus analysis. The Gutenberg texts of MacDonald’s works use hyphens and spaced hyphens, not em dashes. The corpus measurement faithfully recorded what the corpus contained.

The story uses 128 em dashes across its 11,000 words.

The story evaluation’s Surface Features dimension flagged this: “The story uses both hyphens and em-dashes inconsistently in the early sections (the opening uses ’ - ’ where later passages use the em-dash), which is a minor but noticeable seam.” The dimension scored 4 out of 5, with the inconsistency noted as preventing a perfect score.

This is worth sitting with, because it surfaces three distinct problems layered on top of each other.

First: a corpus problem. Project Gutenberg’s editorial normalization standardizes punctuation to the digitizer’s house conventions. MacDonald wrote in the 1850s through the 1890s. His manuscripts, his first editions, and the Gutenberg plain-text versions do not necessarily agree on punctuation. The “never em dashes” instruction may accurately describe the Gutenberg corpus while inaccurately describing MacDonald’s authorial practice. Victorian typesetting conventions were different from modern ones; what MacDonald’s compositors set as a long dash may appear in Gutenberg as a hyphen or a spaced hyphen. The corpus measured the signal it was given. Whether that signal faithfully represents the author’s punctuation intent is a question the corpus study cannot answer without access to first editions or manuscripts --- which the project did not have.

Second: a spec enforcement problem. Even taking the spec at face value, the pipeline did not enforce the constraint. The writer agent received the instruction “never em dashes” and produced 128 of them. This is a straightforward failure of constraint adherence --- the kind of surface-level compliance that Tier 1 features are supposed to guarantee. The evaluation agent flagged it. The pipeline did not catch and correct it during generation. Whether a revision pass would have caught it is an open question; the story went through evaluation but the em-dash inconsistency survived into the final output.

Third: a deeper question about what measurements mean. The em-dash is a small thing. It is also the clearest illustration of the distance between Tier 1 measurement and Tier 3 voice. The voice spec’s punctuation profile measures what is in the corpus. But the corpus has been filtered through 150 years of republication, editorial standardization, and format conversion. The measurement is precise. The thing it measures is uncertain. And the uncertainty cascades: if the punctuation profile reflects Gutenberg’s conventions rather than MacDonald’s, then the “never em dashes” instruction is not a MacDonald constraint but a digitization artifact, and the writer agent’s violation of it may be closer to MacDonald’s actual practice than the spec’s instruction.

I do not know which of these is true. I suspect the answer involves all three: the Gutenberg texts do normalize punctuation, the pipeline did fail to enforce its own constraint, and the resulting inconsistency is less significant than the Surface Features score might suggest because the constraint itself rests on filtered data. But I cannot prove this without access to MacDonald’s first editions, and I did not have that access.

This is exactly the kind of gap the project must foreground. The voice spec is the pipeline’s most rigorous artifact. It is built on measurements that had never been taken before. But the measurements are only as reliable as the corpus they come from, and the corpus has been processed by hands other than the author’s. The em-dash contradiction is a small instance of a structural condition: every Tier 1 measurement in the voice spec is a measurement of MacDonald-as-digitized, not necessarily MacDonald-as-written.


What Machines Can Learn from an Author’s Work

The cognitive science literature offers a frame for understanding what the corpus study accomplished and where it falls short.

Saffran et al. (1996) demonstrated that human infants segment continuous speech through implicit statistical learning --- detecting transitional probabilities between syllables after only minutes of exposure. The same mechanism operates when readers absorb an author’s patterns through sustained reading. You do not consciously catalogue MacDonald’s semicolon frequency. You develop, over hundreds of pages, an internalized model of how his prose moves --- its rhythm, its tendency to pivot on “but,” its habit of interrupting itself to qualify a claim. The model lives below conscious access. It is what allows you to open a MacDonald novel at a random page and know within a paragraph that it is MacDonald, even if you cannot articulate why.

LLM training through next-token prediction is structurally analogous to this statistical learning. The model absorbs distributional patterns from the training corpus --- word co-occurrences, syntactic structures, transition probabilities. What the voice spec’s Tier 1 measurements capture is the surface layer of this distributional knowledge: the frequencies, the ratios, the profile shapes that a statistical learner would internalize first.

But the cognitive science distinction between hexis (deep internalization of underlying principles) and surface “aping” maps onto the three-tier taxonomy with uncomfortable precision. Saffran’s infants could segment speech. They could not compose poetry. The transitional probabilities are the floor, not the ceiling. What MacDonald scholars describe when they describe his voice --- the worldview integration, the theology enacted through syntax, the narrative timing that disrupts its own patterns for effect --- is not captured by the statistical profile. It is captured, imperfectly and partially, by the Tier 2 patterns and the Tier 3 context. But those tiers were not extracted by statistical analysis. They were extracted by reading.

The voice spec embodies this layered reality. Its Tier 1 measurements are the most precise and the least important. Its Tier 3 descriptions are the least precise and the most important. The pipeline needs both: the measurements to set the floor, the descriptions to aim at the ceiling. Whether the gap between floor and ceiling --- between what can be counted and what can only be described --- narrows over time as models improve is one of the genuinely open questions in the field. The three-tier taxonomy reflects 2025-2026 capabilities. It is a practical engineering guide, not a permanent claim about the limits of computation.

What I can say, having built the spec and watched the pipeline use it, is that the measurements matter less than I expected and the descriptions matter more. The story’s sentence-length distribution is close to MacDonald’s profile. Its function word ratio is in range. Its semicolon frequency is approximately correct. These things contribute to the feeling that the prose inhabits the right century, the right register, the right syntactic neighborhood. But what makes a reader pause and think this sounds like MacDonald is not the semicolons. It is the narrator’s habit of circling a truth he cannot quite articulate, the water that behaves differently depending on who pours it, the garden that contracts around a girl’s pride and expands when she stops trying to possess it. Those features are in the spec. They are not in the measurements.


The Honest Gap

No published stylometric analysis of MacDonald existed before this project built one. This means the voice spec’s measurements are uncalibrated against any prior computational work. The sentence-length distribution, the register profile, the function word ratio --- these are first observations, not confirmations of established findings. They are precise and I believe they are correct. But they have not been independently replicated, and the MacDonald scholarly community has not validated them.

This is both the project’s contribution and its limitation. The corpus study produced a quantitative baseline where none existed. It enables a kind of evaluation --- is the pipeline’s output within the measured range? --- that was previously impossible for MacDonald-specific work. But the baseline itself is untested. Whether a sentence-length mean of 17.7 with an SD of 15.2 captures something meaningful about MacDonald’s rhythm, or whether it is a statistical artifact of corpus composition (the particular nine works selected, the particular Gutenberg editions used), is a question that would require a broader corpus, alternative editions, and ideally independent computational analysis to answer.

The project proceeded on the baseline it had. The alternative was to proceed without one, which would have meant evaluation gates with nothing to evaluate against. An imperfect baseline is better than no baseline. But the imperfection should be named.

The voice spec is the pipeline’s most careful artifact. It represents more iteration, more revision, more testing against actual output than any other file in the system. It is also, necessarily, a first draft of a kind of analysis that MacDonald scholarship has never attempted. The measurements are original. The patterns are original. The Tier 3 descriptions synthesize secondary scholarship into a form designed for computational use. Whether they are sufficient --- whether 408,000 words, fifteen dimensions, and three tiers of specification can capture enough of MacDonald’s voice to produce prose that a MacDonald reader recognizes as belonging to his tradition --- is not a question the spec can answer about itself.

The story answers it. Imperfectly, with limitations documented in evaluation scores that are themselves produced by an LLM judge with known calibration problems. But it answers it. And the answer, for what it is worth, is that the spec captured enough to produce something that reads as MacDonald to someone who has spent months inside his prose --- while missing just enough to remind you that a voice specification, however carefully built, is a map. The country is always more than the map. The gap between them is the space where the work lives.