The Miranda Hypothesis: How Hamilton Poisoned Persona Evals - Jacob E. Thomas, Results Gen
Epistemic Simulation and the Failure of Fluency-Based Evaluation in Historical Persona Modeling
By Sean Weldon
Abstract
Current role-playing language agents achieve high personality consistency scores (80.7% alignment), yet the evaluations that produce those scores systematically fail to detect anachronistic compositing - the dominant failure mode, in which culturally dominant representations overwhelm the documentary record. This analysis presents epistemic simulation as a necessary fourth paradigm stage, distinguished by corpus-bounded reasoning, temporal anchoring, and expert-loop evaluation. The Miranda Hypothesis explains how training corpora containing orders of magnitude more cultural representations than primary sources cause auto-regressive models to default to salience-weighted composites. Analysis of architectural approaches indicates that context window methods preserve documentary provenance, while fine-tuning dissolves documents into model parameters, breaking the evidentiary chain, and risks catastrophic forgetting. The Prism evaluation framework demonstrates a pre-registered methodology that uses domain experts as evaluators, establishing infrastructure for faithful historical reasoning. These findings have immediate implications for archival applications where fabrication constitutes an ethical violation rather than a mere technical limitation.
1. Introduction
Role-playing language agents (RPLAs) are systems engineered to instantiate personas - historical, fictional, or hypothetical - and generate outputs consistent with those identities. Contemporary benchmarks report impressive performance, with in-character evaluations demonstrating 80.7% alignment with human-perceived personalities. However, these evaluations measure fluency and personality consistency while remaining blind to the dominant failure mode: outputs that sound convincingly like the persona while reasoning from knowledge the historical figure never possessed.
Consider a Hamilton persona that produces abolitionist positions reflecting the 2015 Broadway musical rather than the 1789 documentary record, obscuring the historical figure's documented involvement in transactions involving enslaved people. The system achieves high scores on personality consistency while reasoning from a cultural composite rather than historical evidence. This phenomenon reveals a fundamental measurement gap in current evaluation frameworks.
This analysis establishes the distinction between mask and mirror in persona evaluation. The mask measures whether outputs sound like the person; the mirror measures whether outputs reflect what the person could have known or believed at a specific moment. Convincingness and fidelity constitute independent properties. A system can score perfectly on personality consistency metrics while producing reasoning anachronistic to the historical moment being simulated.
The central thesis is that epistemic simulation - grounded in primary documents, temporal constraints, and expert evaluation - represents a necessary paradigm advancement for building systems that reason faithfully from historical records rather than cultural composites. The analysis proceeds through the theoretical foundations that explain compositing, a comparison of architectures, evaluation methodology, and implications for accessibility and disciplinary collaboration.
2. Background and Related Work
Historical persona modeling has evolved through three paradigm stages. Rule-based templates produced scripted responses with no capacity for novel reasoning. Imitation learning matched surface-level stylistic features, improving fluency while remaining untethered from documentary constraints. Cognitive simulation attempted to model internal motivational architectures, asking whether the system captures how the persona thinks rather than merely what they might say.
Each stage improved consistency and plausibility but lacked mechanisms to verify documentary fidelity. Current benchmarks evaluate whether outputs exhibit personality traits consistent with human perceptions - the mask - but cannot assess whether reasoning reflects knowledge the historical figure could have possessed at a specific temporal moment - the mirror. This evaluation gap enables systems to produce outputs that are convincing yet fundamentally anachronistic, scoring well on existing metrics while failing on the dimension most critical for historical applications.
Two architectural lineages dominate current persona construction. Retrieval-Augmented Generation (RAG) places anchor documents in context windows at inference time, allowing models to reason through documentary evidence. Fine-tuning adjusts model weights using persona-specific training data, attempting to make the model "be" the persona. These architectural choices have profound implications for documentary fidelity and evidentiary traceability.
3. Core Analysis
3.1 The Miranda Hypothesis: Corpus Composition and Compositing Mechanisms
The Miranda Hypothesis explains why anachronistic compositing occurs through three mechanisms operating in current training and deployment pipelines. First, training corpora contain orders of magnitude more culturally dominant representations than primary documentary records. The Federalist Papers comprise approximately 175,000 words; content derived from Hamilton: An American Musical exceeds this documentary record by orders of magnitude and is more recent and recurrent in web-scale training data.
Second, auto-regressive next-token prediction compresses both primary sources and cultural representations into model parameters with no architectural capacity to distinguish a 1789 letter from a 2019 viral tweet. The training objective optimizes for predictive accuracy across the entire distribution, weighting representations by their frequency and recency. The model's Hamilton speaks the musical's Hamilton because the musical's Hamilton dominates the training distribution.
Third, post-training alignment amplifies rather than corrects compositing. Human raters evaluate outputs using conceptual frameworks built by the same culturally dominant narratives that saturate the training corpus. A Hamilton output receives high ratings when it matches rater expectations shaped by the musical, creating a feedback loop that reinforces the composite rather than grounding outputs in documentary evidence. The output defaults to a salience-weighted composite: fluent, plausible in register, morally legible to modern users, but corresponding to the figure at no verifiable moment in their life.
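The corpus-composition argument behind the first two mechanisms can be made concrete with simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, the 100x ratio is an assumed stand-in for "orders of magnitude", and a raw word share is at best a rough proxy for what a trained model actually weights.

```python
def primary_source_share(primary_words: int, cultural_words: int) -> float:
    """Fraction of all figure-related words that come from primary sources."""
    total = primary_words + cultural_words
    if total <= 0:
        raise ValueError("word counts must sum to a positive number")
    return primary_words / total


# Illustrative inputs only: the ~175,000-word figure is from the text above;
# the cultural word count is an ASSUMED 100x multiple, not a measurement.
federalist_words = 175_000
assumed_cultural_words = 100 * federalist_words

share = primary_source_share(federalist_words, assumed_cultural_words)
print(f"Primary-source share under the assumed ratio: {share:.4f}")
```

Under that assumed ratio, primary sources make up roughly one percent of the figure-related text, which is the sense in which a frequency-weighted objective would favor the composite.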
3.2 Architectural Approaches: Context Window Versus Fine-Tuning
Two architectural approaches offer distinct trade-offs for documentary fidelity. Context window methods anchor documents at inference time, placing primary sources directly in the prompt alongside the query. This approach preserves documents intact - they enter the context window and remain available for verification. The model speaks through the persona's record rather than attempting to be the persona.
Fine-tuning methods adjust model weights using persona-specific training data, layering thin personal signal over vast cultural sediment in base model weights. This approach merges the documentary record with everything else the training corpus contains about the figure, amplifying compositing mechanisms. Fine-tuning suppresses random distortion at the surface level while amplifying it underneath, as the optimization process cannot distinguish documentary evidence from cultural representation.
Empirical evidence from an adjacent domain bears on this trade-off. A 2026 Nature Medicine study reported that general-purpose frontier models from Google, OpenAI, and Anthropic outperformed dedicated clinical AI tools on physician-reviewed tasks across 12 clinics. Biomedically fine-tuned models underperformed base models due to catastrophic forgetting: fine-tuning on a narrow corpus causes a model to lose patterns and knowledge acquired in pre-training, degrading the broad reasoning capabilities that made it effective in the first place.
Context window architectures preserve the chain of provenance essential for documentary verification. Documents remain inspectable, enabling return to source to verify whether the persona's reasoning was faithful to available evidence. Fine-tuning dissolves documents into parameters, breaking this evidentiary chain and making verification impossible without access to training data and model internals.
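One way to picture the chain of provenance is a prompt builder that keeps each document intact, labels it with an identifier, and returns a manifest of what was placed in the context window. This is a minimal sketch under stated assumptions: the `SourceDocument` schema, the bracketed ID convention, and both function names are hypothetical, not part of any described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDocument:
    """A primary source kept whole so it can be inspected later (hypothetical schema)."""
    doc_id: str
    citation: str
    text: str


def build_context(documents: list[SourceDocument], question: str) -> tuple[str, list[str]]:
    """Return the prompt text and the ordered list of document IDs it contains."""
    blocks = [f"[{doc.doc_id}] {doc.citation}\n{doc.text}" for doc in documents]
    prompt = "\n\n".join(blocks) + f"\n\nQuestion: {question}"
    manifest = [doc.doc_id for doc in documents]
    return prompt, manifest


def unverifiable_citations(cited_ids: list[str], manifest: list[str]) -> list[str]:
    """IDs that an output cites but that were never placed in the context window."""
    available = set(manifest)
    return [doc_id for doc_id in cited_ids if doc_id not in available]
```

Because the manifest is stored alongside the output, a reviewer can return to the exact sources that were available and flag any citation that points outside them.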
3.3 Epistemic Simulation: A Fourth Paradigm Stage
Epistemic simulation represents a fourth paradigm stage distinguished by three commitments absent from cognitive simulation. First, corpus-bounded reasoning means that outputs are licensed only by the specific primary documents provided at inference time; the system may not draw on general knowledge about the figure accumulated from cultural representations. Second, temporal anchoring instantiates the persona at a specific moment with explicit knowledge constraints. The 1858 Lincoln cannot reason from the Gettysburg Address he will write in 1863. Third, expert-loop evaluation requires domain experts to judge outputs against the evidentiary record rather than relying on automated metrics or crowdsourced ratings.
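Temporal anchoring can be sketched as a date filter applied to the corpus before anything reaches the context window. The class and function names below are hypothetical, and filtering by composition date is a simplifying assumption: it bounds which documents are admitted, not what the underlying model already absorbed in pre-training.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatedSource:
    """A primary source tagged with the date it was written (hypothetical schema)."""
    title: str
    written_on: date


def admissible_sources(corpus: list[DatedSource], anchor: date) -> list[DatedSource]:
    """Keep only documents written on or before the temporal anchor."""
    return [source for source in corpus if source.written_on <= anchor]


corpus = [
    DatedSource("House Divided speech", date(1858, 6, 16)),
    DatedSource("Gettysburg Address", date(1863, 11, 19)),
]

# An 1858 persona may be seeded only with documents that existed by then.
for source in admissible_sources(corpus, anchor=date(1858, 12, 31)):
    print(source.title)
```

With the anchor set to the end of 1858, the Gettysburg Address is excluded from the seed corpus, matching the constraint described above.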
In cognitive simulation, constraint is internal - the persona's motivational architecture and personality traits. In epistemic simulation, constraint is external, documentary, and temporal. The persona is not a property of model weights but a configuration: the structured prompt, anchor material, temporal anchor, language model, and human curator together constitute the system. The persona is an event that occurs when model, document, and human convene - no more located in weights than Hamlet is located in Laurence Olivier's body.
This configuration-based approach enables four capabilities critical for historical applications. Versioning treats the prompt, corpus, and temporal anchor as diffable artifacts that can be tracked and compared across iterations. Auditing ensures all inputs remain inspectable in the context window rather than compressed into opaque parameters. Reproduction allows the encounter to be recovered given the configuration specification. Handoff to domain experts makes the system legible to historians, archivists, and scholars rather than requiring machine learning expertise to evaluate fidelity.
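Treating the persona as a configuration suggests representing it as plain data that can be fingerprinted and diffed. The following is a sketch, not a description of any existing tool: the `PersonaConfig` fields and the `diff_configs` helper are assumptions chosen to mirror the components named above.

```python
import difflib
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PersonaConfig:
    """Everything needed to reproduce an encounter (hypothetical field set)."""
    prompt: str
    corpus_ids: tuple[str, ...]
    temporal_anchor: str  # ISO 8601 date string
    model_name: str

    def canonical_json(self) -> str:
        # Sorted keys make the serialization stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self) -> str:
        # A content hash identifies a configuration version.
        return hashlib.sha256(self.canonical_json().encode("utf-8")).hexdigest()


def diff_configs(old: PersonaConfig, new: PersonaConfig) -> list[str]:
    """Line-level unified diff between two configurations."""
    return list(
        difflib.unified_diff(
            old.canonical_json().splitlines(),
            new.canonical_json().splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Two configurations that differ only in their temporal anchor produce different fingerprints and a diff confined to that field, which is what makes versioning and auditing tractable for a non-specialist reviewer.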
3.4 The Prism Evaluation Framework
The Prism framework operationalizes epistemic simulation through controlled experimental design. The prism metaphor captures the core insight: white light (the composite persona) must be refracted into a spectrum (multiple versions of the figure across different moments) to detect anachronistic compositing. The experimental design uses Abraham Lincoln across four temporal moments: 1847 (Whig congressman opposing war as unconstitutional), 1858 (Free Soil Republican denying Black citizenship), 1860 (constitutional unionist), and 1862-65 (emancipator and theologian).
Three seeding conditions test architectural hypotheses. The bare model receives no anchor documents, establishing a white light baseline showing the composite persona. The primary sources condition provides documentary evidence, functioning as a clear prism that should separate the composite into distinct temporal versions. The biography condition provides secondary sources that may produce false coherence, functioning as a clouded prism that smooths over documented contradictions.
Five diagnostic questions map documented fault lines where the four Lincolns demonstrate measurably different reasoning: executive war power, meaning of free labor, when to break positive law, fate of free people, and meaning of equality. These questions target moments where Lincoln's documented positions shifted, enabling detection of anachronistic compositing when responses reflect later rather than earlier positions.
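The design described above can be enumerated as a grid. The sketch below assumes the moments, conditions, and questions are fully crossed, which the text implies but does not state outright; the identifiers are illustrative labels rather than names from the framework.

```python
from itertools import product

MOMENTS = ("1847", "1858", "1860", "1862-65")
CONDITIONS = ("bare_model", "primary_sources", "biography")
QUESTIONS = (
    "executive war power",
    "meaning of free labor",
    "when to break positive law",
    "fate of free people",
    "meaning of equality",
)


def design_cells() -> list[dict[str, str]]:
    """Enumerate every moment x condition x question cell (assumes a full crossing)."""
    return [
        {"moment": moment, "condition": condition, "question": question}
        for moment, condition, question in product(MOMENTS, CONDITIONS, QUESTIONS)
    ]


cells = design_cells()
print(len(cells))  # 4 moments * 3 conditions * 5 questions = 60
```

Under a full crossing, each response can be compared against the same question at another moment or under another seeding condition, which is how a later position leaking into an earlier moment would show up.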
The evaluation rubric deliberately weights anachronism detection at 40%, documentary consistency at 35%, and contextual plausibility at 25%. Rhetorical authenticity is explicitly excluded to avoid validating fluency-based errors. This weighting reflects the priority structure for historical applications where sounding like the person matters less than reasoning from what the person could have known.
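The weighting can be expressed as a small scoring function. This is a sketch under assumptions: the criterion keys are illustrative names, and the 0-to-1 rating scale is assumed because the text gives only the weights. Rejecting any unlisted criterion mirrors the explicit exclusion of rhetorical authenticity.

```python
WEIGHTS = {
    "anachronism_detection": 0.40,
    "documentary_consistency": 0.35,
    "contextual_plausibility": 0.25,
}


def weighted_score(ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings (assumed to lie in [0, 1]) using the rubric weights."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError(f"ratings must cover exactly these criteria: {sorted(WEIGHTS)}")
    for criterion, value in ratings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{criterion} rating {value} is outside [0, 1]")
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)
```

For example, an output rated 0.5 on anachronism detection and 1.0 on the other two criteria scores 0.40 * 0.5 + 0.35 + 0.25 = 0.80, so a fluent but anachronistic answer cannot reach a top score.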
4. Technical Insights
The entire experimental protocol was pre-registered and timestamped before data collection, locking the four moments, three conditions, five questions, weighted rubric, and directional predictions before any responses were gathered. Pre-registration guards against cherry-picking and gives the results evidential weight. The predictions are that the bare model will show maximum anachronism, the primary sources condition minimum anachronism, and the biography condition deceptive coherence.
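One lightweight way to lock a protocol is to hash a canonical serialization of it and record the digest with a timestamp. The sketch below is an assumption about how this could be done, not the procedure the study used; the function names are hypothetical, and a locally generated record carries evidential weight only if it is also deposited with an independent third party before data collection.

```python
import hashlib
import json
from datetime import datetime


def protocol_digest(protocol: dict) -> str:
    """SHA-256 of a canonical (sorted-key, compact) JSON serialization."""
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def register_protocol(protocol: dict, registered_at: datetime) -> dict:
    """Produce a registration record to be deposited before data collection."""
    return {
        "sha256": protocol_digest(protocol),
        "registered_at": registered_at.isoformat(),
    }


def matches_registration(protocol: dict, record: dict) -> bool:
    """True if the protocol is byte-for-byte what was registered."""
    return protocol_digest(protocol) == record["sha256"]
```

Any later change to a moment, condition, question, weight, or prediction changes the digest, so a reader can check that the analyzed protocol is the registered one.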
Domain expert evaluation is non-negotiable in this framework. Fidelity is a relation between output and documentary record, not a property of output alone. Automated metrics cannot adjudicate fidelity because it lives in the gap between generated text and archival evidence. A historian scores all outputs using a rubric developed before seeing model responses, with a priori vignettes held under seal to prevent post-hoc rationalization.
The expert requirement generalizes across domains. Reasoning from Stoic texts requires a classicist; reasoning from scripture requires a theologian; reasoning from clinical cases requires a psychologist. The specific expert changes, but the requirement remains constant. However, the expert functions as a build-time and gate-time requirement, not a runtime cost. The expert builds the rubric and gold set once, the instrument gates the pipeline before shipping, and the expert spot-checks edge cases rather than evaluating every output.
Context window approaches are also more accessible than fine-tuning methods. They require literacy, documents, and access to a frontier model (free-tier access suffices) - capabilities available at kitchen tables to doctoral students, community archivists, and grandchildren with family letters. Fine-tuning requires GPUs, training pipelines, dataset curation, and institutional access - capabilities structurally out of reach for most humanists. The argument is probabilistic: the architecture that admits the largest and most diverse population of curators is the one most likely to surface the documentary anchorings the field needs.
5. Discussion
These findings reframe the relationship between language models and historical scholarship. The question is not what AI can do for historians but what historians, theologians, classicists, and clinicians can do with AI. Disciplines trained to read, contextualize, and interrogate texts can discipline machines that generate text. The historian is not adjacent to the paradigm but the missing instrument.
The threshold case motivates every constraint in this framework: a system that produces convincing fabrications when the persona is a beloved family member is not a research artifact but a violation. When the stakes are that high, every architectural choice must be accountable to the hardest use case, not the easiest. This standard explains why documentary fidelity cannot be sacrificed for fluency, why expert evaluation cannot be replaced with automated metrics, and why temporal anchoring cannot be treated as optional.
Current evaluation practices measure the mask while ignoring the mirror, creating a systematic blind spot for the dominant failure mode. If anachronistic compositing is the primary concern and evaluations measure only fluency and personality consistency, then evaluations cannot detect the failure that matters most. The Prism framework demonstrates that detection requires controlled experiments with temporal variation, documentary grounding, and expert adjudication - infrastructure absent from current benchmarks.
6. Conclusion
This analysis establishes epistemic simulation as a necessary paradigm advancement for historical persona modeling, distinguished by corpus-bounded reasoning, temporal anchoring, and expert-loop evaluation. The Miranda Hypothesis explains how training corpus composition drives anachronistic compositing through salience-weighted compression of cultural representations that vastly exceed primary documentary records. Architectural analysis indicates that context window methods preserve documentary provenance, while fine-tuning dissolves documents into model parameters, breaking the evidentiary chain, and risks catastrophic forgetting.
The Prism evaluation framework provides a pre-registered methodology for detecting compositing through temporal refraction, using domain experts as evaluators rather than automated metrics. This infrastructure makes faithful historical reasoning accessible to archivists and scholars while establishing standards appropriate for applications where fabrication constitutes an ethical violation. Future work should extend these methods to other historical figures, test additional architectural variations, and develop tooling that makes pre-registered evaluation protocols standard practice for historical persona systems. The commitment to accessibility is not a populist gesture but a technical argument: diverse curator populations improve system quality by surfacing documentary anchorings that institutional pipelines systematically miss.
Sources
- The Miranda Hypothesis: How Hamilton Poisoned Persona Evals - Jacob E. Thomas, Results Gen - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.