The maturity phases of running evals — Phil Hetzel, Braintrust


By Sean Weldon

The Maturity Phases of Running Evals: A Framework for Agent Evaluation Systems

Abstract

Agent evaluation represents a critical challenge in deploying production AI systems, requiring systematic approaches that balance quality assurance with practical implementation constraints. This paper examines a four-stage maturity model for agent evaluation, progressing from initial human annotation through scaled LLM-as-judge techniques, complex trace evaluation with external system integration, and automated failure mode detection. The framework addresses three primary risk dimensions—reputational, systems, and compliance—while enabling continuous improvement through iterative optimization tracking. Key contributions include the "flywheel" methodology for incorporating production traces into offline evaluation, techniques for managing external system state during complex evaluations, and hybrid approaches combining LLM judges with deterministic checks. The analysis demonstrates that effective evaluation requires domain-specific annotation, systematic validation of LLM judges themselves, and strategic focus on specific failure modes rather than exhaustive testing approaches.

1. Introduction

The deployment of autonomous agents in production environments introduces fundamental challenges in quality assurance that diverge significantly from traditional software engineering paradigms. Unlike conventional software systems where unit testing provides comprehensive coverage of defined behaviors, agent systems exhibit effectively infinite potential failure modes, necessitating fundamentally different evaluation approaches. This reality creates a tension between the desire for thorough testing and the practical need to ship functional systems within reasonable timeframes.

Agent evaluation, commonly termed "evals," serves dual strategic purposes within organizations. Defensively, evals provide risk mitigation across three critical dimensions: reputational risk (agents exhibiting unkind or unhelpful behavior), systems risk (excessive computational costs or resource consumption), and compliance risk (agents operating outside prescribed boundaries). Offensively, evals enable optimization by providing measurable feedback on improvements resulting from iterative agent modifications. This dual nature positions evaluation as both a quality gate and a development accelerator.

This synthesis examines a structured maturity model comprising four distinct stages of evaluation sophistication. The framework encompasses three core primitives: the task (agent or prompt under test), the dataset (examples initiating agent execution), and scoring functions (mechanisms for judging output quality). Critically, this analysis demonstrates that effective evaluation requires accepting directional accuracy rather than demanding perfect precision, focusing strategically on high-impact failure modes rather than attempting exhaustive coverage, and systematically validating the evaluation mechanisms themselves.

2. Background and Related Work

2.1 Evaluation Primitives and Philosophical Foundations

Agent evaluation systems comprise three fundamental components that distinguish them from traditional testing frameworks. The task component represents the specific agent configuration or prompt under examination. The dataset consists of examples that initiate agent execution, with optimal implementations drawing from production or user acceptance testing (UAT) environments rather than relying on synthetic data generation. The scoring function judges output quality through either deterministic code-based checks or LLM-as-judge techniques, with both approaches offering distinct advantages depending on evaluation objectives.
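The three primitives can be made concrete with a small sketch. Everything below is illustrative and not taken from any particular evaluation platform: the names `Example`, `task`, `exact_match`, and `run_eval` are hypothetical, and the task is a trivial stand-in for a real agent or prompt.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Example:
    """One dataset row: an input that starts the task, plus a reference answer."""
    input: str
    expected: str


def task(prompt: str) -> str:
    """Task: the agent or prompt under test (here a trivial normalizer)."""
    return prompt.strip().lower()


def exact_match(output: str, expected: str) -> float:
    """Scoring function: a deterministic, code-based check returning 0.0 or 1.0."""
    return 1.0 if output == expected else 0.0


def run_eval(
    task_fn: Callable[[str], str],
    dataset: list[Example],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every example and return the average score."""
    return mean(scorer(task_fn(ex.input), ex.expected) for ex in dataset)


# Dataset: ideally drawn from production or UAT traces; hand-written here.
dataset = [
    Example("  Hello ", "hello"),
    Example("WORLD", "world"),
    Example("Foo", "bar"),
]

average_score = run_eval(task, dataset, exact_match)
print(f"average score: {average_score:.2f}")
```

Swapping `exact_match` for an LLM-backed scorer leaves `run_eval` unchanged, which is the main reason to keep the three primitives separate.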

A critical philosophical distinction separates evals from traditional software testing methodologies. Unit tests aim for comprehensive coverage of defined system behaviors, operating under the assumption that all relevant states and transitions can be enumerated and verified. Agent systems, by contrast, present an effectively unbounded space of failure modes because of their natural language interfaces and probabilistic outputs. Attempting exhaustive testing in such environments would consume development resources disproportionately, as teams would "spend all of their time writing tests and none of their time shipping." Consequently, effective eval strategies focus on specific, high-impact failure modes identified through production experience, domain expertise, or systematic analysis of user interactions.

2.2 The Flywheel Methodology

The flywheel represents a cyclical methodology for continuous agent improvement through production-informed evaluation. This framework operates through four sequential phases: capturing production traces from actual user interactions, identifying failure patterns within those traces, bringing representative examples into offline evaluation environments, and using evaluation results to guide agent improvements. This cycle creates a feedback loop where production experience directly informs development priorities, ensuring that evaluation efforts target real-world failure modes rather than hypothetical edge cases.

The flywheel methodology addresses a fundamental challenge in agent evaluation: the difficulty of anticipating all relevant failure modes through purely synthetic means. By grounding evaluation datasets in actual production traces, teams ensure that their testing efforts reflect genuine user needs and interaction patterns. This approach contrasts sharply with traditional software testing, where synthetic test cases often suffice because system behaviors are more deterministic and bounded.
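One turn of the flywheel can be sketched as a filter from production traces into an offline dataset. This is a minimal illustration under stated assumptions: the `Trace` fields, the thumbs-down failure signal, and the deduplication by user input are all hypothetical choices, not details from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    user_input: str
    output: str
    user_feedback: int  # assumed signal: +1 thumbs up, -1 thumbs down


def is_failure(trace: Trace) -> bool:
    """Assumed failure signal; real systems may use scorers or error flags."""
    return trace.user_feedback < 0


def harvest_failures(production_traces: list[Trace], offline_dataset: list[Trace]) -> list[Trace]:
    """Copy failing traces into the offline dataset, skipping duplicate inputs."""
    seen_inputs = {t.user_input for t in offline_dataset}
    added = []
    for trace in production_traces:
        if is_failure(trace) and trace.user_input not in seen_inputs:
            offline_dataset.append(trace)
            seen_inputs.add(trace.user_input)
            added.append(trace)
    return added


production_traces = [
    Trace("t1", "cancel my order", "Sure, done.", 1),
    Trace("t2", "refund status?", "I cannot help with that.", -1),
    Trace("t3", "refund status?", "No idea.", -1),
    Trace("t4", "change address", "Error.", -1),
]
offline_dataset: list[Trace] = []
added = harvest_failures(production_traces, offline_dataset)
print([t.trace_id for t in added])
```

The harvested examples then feed the next offline eval run, and the resulting scores guide the next round of agent changes.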

3. Core Analysis

3.1 Maturity Level 1: Establishing Foundational Practices

The initial maturity stage embraces pragmatic starting points that prioritize action over perfection. Vibe checking, the informal assessment of agent outputs, is an acceptable entry point and is explicitly preferable to having no evaluation at all. This stage typically involves human annotators reviewing approximately ten example agent outputs and giving each a binary assessment (thumbs up/down) accompanied by a written justification.

The critical innovation at this stage lies not in the binary assessments themselves, but in the justifications provided by annotators. These written explanations extract domain-specific knowledge and failure mode identification from human experts, creating a foundation for subsequent scaling efforts. The framework emphasizes that subject matter experts should perform annotation rather than generic annotators, as domain expertise proves essential for identifying subtle failure modes that may not be obvious to non-specialists.
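A simple record type is enough to capture this kind of annotation. The sketch below is hypothetical: the field names and sample justifications are invented for illustration, and the point is only that the justification text is stored alongside the binary verdict so it can be mined later.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    trace_id: str
    thumbs_up: bool
    justification: str  # the written reason, the most valuable field
    annotator: str      # ideally a subject matter expert


def pass_rate(annotations: list[Annotation]) -> float:
    """Fraction of reviewed outputs that received a thumbs up."""
    return sum(a.thumbs_up for a in annotations) / len(annotations)


def failure_justifications(annotations: list[Annotation]) -> list[str]:
    """Collect the written reasons behind every thumbs down."""
    return [a.justification for a in annotations if not a.thumbs_up]


annotations = [
    Annotation("t1", True, "Correct policy cited, polite tone.", "claims-expert"),
    Annotation("t2", False, "Quoted an outdated refund window.", "claims-expert"),
    Annotation("t3", False, "Called the lookup tool seven times.", "claims-expert"),
]

print(f"pass rate: {pass_rate(annotations):.2f}")
for reason in failure_justifications(annotations):
    print("failure mode candidate:", reason)
```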

Technical implementation at this stage requires evaluation platforms providing custom annotation views tailored to specific agent traces rather than generic interfaces. This customization enables annotators to efficiently review relevant context and provide meaningful feedback without navigating inappropriate or cumbersome interfaces designed for different evaluation scenarios.

3.2 Maturity Level 2: Scaling Through Automation

The second maturity stage focuses on reducing dependency on human annotators while preserving the domain knowledge they provide. Teams achieve this transition by extracting failure modes from annotator justifications through either code analysis or LLM-based extraction techniques, then encoding these patterns into automated scoring functions.

LLM-as-judge techniques emerge as the primary scaling mechanism, enabling teams to apply human-like judgment across larger datasets without proportional increases in expert annotation time. However, this stage introduces an important caveat: LLM judges require systematic validation rather than blind trust. The framework explicitly states that "just because you put a robe and a cloak on an LLM, that doesn't make it inherently more trustworthy."

This maturity level also emphasizes hybrid evaluation approaches combining LLM judges with deterministic code-based checks. Deterministic evaluations prove particularly effective for objective metrics such as tool call counts, token usage thresholds, or structural requirements. For instance, an evaluation might programmatically verify that an agent never exceeds five tool calls per interaction, while simultaneously using an LLM judge to assess response helpfulness—a more subjective dimension.
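The tool-call example above might look like the following sketch. The names `AgentRun`, `tool_call_budget_score`, and `hybrid_scores` are assumptions for illustration, and `stub_helpfulness_judge` is a placeholder heuristic standing in for a real LLM call, which is deliberately not shown.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_TOOL_CALLS = 5  # threshold taken from the example in the text


@dataclass
class AgentRun:
    response: str
    tool_calls: list[str] = field(default_factory=list)


def tool_call_budget_score(run: AgentRun) -> float:
    """Deterministic check: 1.0 if the run stayed within the tool-call budget."""
    return 1.0 if len(run.tool_calls) <= MAX_TOOL_CALLS else 0.0


def stub_helpfulness_judge(response: str) -> float:
    """Placeholder for an LLM judge. A real judge would prompt a model with a
    rubric; this stub only checks length so the sketch runs offline."""
    return 1.0 if len(response.split()) >= 5 else 0.0


def hybrid_scores(run: AgentRun, judge: Callable[[str], float]) -> dict[str, float]:
    """Report the objective and subjective scores side by side."""
    return {
        "tool_call_budget": tool_call_budget_score(run),
        "helpfulness": judge(run.response),
    }


good_run = AgentRun(
    "Your refund was issued and should arrive in three days.",
    ["lookup_order", "issue_refund"],
)
chatty_run = AgentRun("Done.", ["lookup_order"] * 6)

print(hybrid_scores(good_run, stub_helpfulness_judge))
print(hybrid_scores(chatty_run, stub_helpfulness_judge))
```

Keeping the scores separate, rather than averaging them, makes it clear whether a regression came from the objective check or from the judge.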

Dataset composition becomes critical at this stage. Rather than relying on synthetic examples, teams should gather production or UAT-level traces, implementing the flywheel methodology to ensure evaluation efforts target real-world failure modes. This approach creates a virtuous cycle where production experience continuously informs and improves offline evaluation capabilities.

3.3 Maturity Level 3: Managing External System Complexity

The third maturity stage addresses agents that perform external system interactions, requiring evaluation of entire execution traces rather than merely final outputs. This introduces significant technical complexity, particularly when agents execute CRUD operations (create, read, update, delete) against production systems or databases.

The framework distinguishes between two categories of tool calls: context-gathering tools that inject data into agent reasoning processes, and CRUD-based tools that modify external system state. CRUD operations create two critical challenges for evaluation systems. First, evaluations must accurately represent external system state as it existed at the time of original task execution. Second, evaluation runs must avoid overwriting production data or creating unintended side effects in live systems.
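One way to honor the second constraint is to intercept state-changing tools during eval runs. The sketch below is an assumption, not a technique described in the source: `Tool`, `EvalToolRunner`, and the `mutates_state` flag are hypothetical names, and the "database" is an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    mutates_state: bool  # False for context-gathering, True for CRUD writes


class EvalToolRunner:
    """Runs read-only tools normally and records, but does not execute, writes."""

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = {t.name: t for t in tools}
        self.intercepted: list[tuple[str, dict[str, Any]]] = []

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self._tools[name]
        if tool.mutates_state:
            # Record the intended write so a scorer can inspect it later.
            self.intercepted.append((name, kwargs))
            return {"status": "dry_run"}
        return tool.fn(**kwargs)


orders = {"o1": "pending"}  # stand-in for a live system


def get_order(order_id: str) -> str:
    return orders[order_id]


def update_order(order_id: str, status: str) -> dict[str, str]:
    orders[order_id] = status
    return {"status": "ok"}


runner = EvalToolRunner([
    Tool("get_order", get_order, mutates_state=False),
    Tool("update_order", update_order, mutates_state=True),
])

current = runner.call("get_order", order_id="o1")
result = runner.call("update_order", order_id="o1", status="cancelled")
print(current, result, orders, runner.intercepted)
```

The recorded write attempts double as evaluation evidence: a scorer can check that the agent tried to make the right change without the change ever reaching the live system.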

Two technical approaches address these challenges. First, traces can encapsulate arbitrary amounts of context, including snapshots of relevant system state at execution time. This encapsulation reduces dependency on separate test infrastructure by embedding the necessary context directly within evaluation datasets. Second, timestamp-based querying allows versioned (as-of) queries against systems such as vector databases, reconstructing historical state without a separate test environment. For example, where the store retains version history, querying it with a timestamp parameter can retrieve the data as it existed when the original agent interaction occurred.
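The timestamp-based idea can be illustrated with a small in-memory store that keeps every version of a value. This is a stand-in only: `VersionedStore` and `read_as_of` are hypothetical names, not the API of any real vector database, and real systems expose versioning in their own ways.

```python
import bisect
from typing import Any, Optional


class VersionedStore:
    """Keeps every write with its timestamp so past state can be queried."""

    def __init__(self) -> None:
        self._timestamps: dict[str, list[float]] = {}
        self._values: dict[str, list[Any]] = {}

    def write(self, key: str, value: Any, timestamp: float) -> None:
        stamps = self._timestamps.setdefault(key, [])
        values = self._values.setdefault(key, [])
        # Insert in timestamp order so reads can binary-search.
        index = bisect.bisect_right(stamps, timestamp)
        stamps.insert(index, timestamp)
        values.insert(index, value)

    def read_as_of(self, key: str, timestamp: float) -> Optional[Any]:
        """Return the latest value written at or before `timestamp`, else None."""
        stamps = self._timestamps.get(key, [])
        index = bisect.bisect_right(stamps, timestamp)
        return self._values[key][index - 1] if index else None


store = VersionedStore()
store.write("order-1", "pending", timestamp=100.0)
store.write("order-1", "shipped", timestamp=200.0)

# Re-running a trace captured at t=150 should see the state from that moment.
state_at_trace_time = store.read_as_of("order-1", 150.0)
print(state_at_trace_time)
```

During an eval run, the trace's original timestamp would be passed to every read, so the agent sees the same data it saw in production.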

This approach reflects a philosophical shift in evaluation thinking. Rather than conceptualizing evals as traditional tests run against static fixtures, the framework encourages teams to "think about evals as rerunning production," emphasizing fidelity to actual operational conditions over idealized test scenarios.

3.4 Maturity Level 4: Advanced Automation and Meta-Evaluation

The fourth maturity stage introduces sophisticated automation techniques and systematic validation of evaluation mechanisms themselves. Topic modeling at scale enables automatic discovery of failure mode clusters within production data, identifying patterns that human analysts might overlook or that emerge only at sufficient scale.

Automated eval execution becomes feasible through cloud-based code execution and command-line interfaces provided by evaluation platforms. This automation enables continuous evaluation workflows where agent modifications trigger comprehensive eval runs without manual intervention, providing rapid feedback on changes.

Critically, this stage emphasizes evaluating the evaluators—systematic validation of LLM judge outputs against human ground truth. Because LLM judge outputs are discrete (typically categorical assessments or numerical scores), teams can create ground truth datasets by having human experts label a subset of examples, then measuring alignment between LLM judge outputs and human judgments. This meta-evaluation ensures that automated scoring functions remain aligned with human values and domain-specific quality criteria as agents evolve.
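Measuring alignment between a judge and human labels can be as simple as the sketch below. The choice of metrics is an assumption: the source does not name any, so raw agreement and Cohen's kappa (which corrects for chance agreement) are used here as common options, and the labels are invented.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

print(f"agreement: {agreement_rate(judge_labels, human_labels):.2f}")
print(f"kappa:     {cohens_kappa(judge_labels, human_labels):.2f}")
```

Re-running this check whenever the judge prompt or underlying model changes gives an early warning that the judge has drifted from expert opinion.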

4. Technical Insights

Several actionable technical findings emerge from this maturity framework. First, evaluation datasets should prioritize production or UAT-level traces over synthetic examples, as real-world data better represents actual failure modes and user interaction patterns. This principle applies across all maturity stages but becomes particularly critical as evaluation sophistication increases.

Second, trace-based evaluation offers significant advantages over output-only assessment for agents performing external system interactions. Traces can encapsulate arbitrary context including system state snapshots, reducing infrastructure complexity while improving evaluation fidelity. Timestamp-based querying techniques provide an elegant solution for reconstructing historical system state without maintaining separate test environments.

Third, hybrid evaluation approaches combining LLM judges with deterministic code-based checks prove more robust than either technique alone. Deterministic checks excel at objective metrics (tool call counts, token usage, structural requirements) while LLM judges handle subjective dimensions (helpfulness, tone, appropriateness). This combination leverages the strengths of each approach while mitigating their respective limitations.

Fourth, LLM-as-judge implementations require systematic validation rather than blind acceptance. The discrete nature of LLM judge outputs enables creation of ground truth datasets for meta-evaluation, ensuring that automated scoring remains aligned with human judgment. Organizations should regularly audit LLM judge performance against human-labeled examples, particularly when modifying judge prompts or underlying models.

Finally, the framework acknowledges important trade-offs in evaluation design. Perfect accuracy proves neither achievable nor necessary—directional trends suffice for guiding agent improvements. The non-deterministic nature of LLM judges remains acceptable provided outputs trend correctly across multiple runs. Teams must balance evaluation thoroughness against development velocity, recognizing that shipping functional systems requires strategic focus on high-impact failure modes rather than exhaustive testing.
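Reading directional trends from a non-deterministic judge might be sketched as follows. The decision rule is a rough heuristic assumed for illustration, not a method from the source: a change counts only if the shift in mean score exceeds the run-to-run spread, and all scores are invented.

```python
from statistics import mean, stdev


def directional_change(baseline_scores: list[float], candidate_scores: list[float]) -> str:
    """Compare mean scores from repeated eval runs of two agent versions.

    Heuristic (an assumption): the difference in means must exceed the larger
    of the two sample standard deviations to count as a real change.
    """
    delta = mean(candidate_scores) - mean(baseline_scores)
    noise = max(stdev(baseline_scores), stdev(candidate_scores))
    if abs(delta) <= noise:
        return "inconclusive"
    return "improved" if delta > 0 else "regressed"


# Each list holds the aggregate score from one repeated run of the same eval.
baseline = [0.70, 0.72, 0.68, 0.71, 0.69]
clear_gain = [0.78, 0.80, 0.76, 0.79, 0.77]
noisy_change = [0.70, 0.75, 0.66, 0.72, 0.69]

print(directional_change(baseline, clear_gain))
print(directional_change(baseline, noisy_change))
```

No single run needs to be exact; what matters is that repeated runs move in the same direction by more than their own noise.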

5. Discussion

The maturity model presented here reflects broader trends in AI system development, where traditional software engineering practices require substantial adaptation for probabilistic, natural language systems. The progression from human annotation through automated evaluation mirrors similar trajectories in machine learning model development, where initial manual labeling eventually gives way to automated quality assessment mechanisms. However, the framework's emphasis on continuous validation—particularly the concept of evaluating evaluators—represents an important safeguard against over-automation.

Several areas merit further investigation. First, the optimal balance between deterministic and LLM-based evaluation likely varies across domains and use cases, yet systematic guidance for making these trade-offs remains limited. Second, the scalability limits of trace-based evaluation deserve examination, particularly for agents generating extremely large or complex execution traces. Third, the relationship between offline evaluation performance and production outcomes requires empirical validation—high eval scores should correlate with positive production metrics, but this relationship may not hold uniformly across agent types or domains.

The framework's emphasis on production trace incorporation aligns with broader industry movements toward continuous integration and deployment for AI systems. However, this approach assumes organizations possess sufficient production traffic to generate meaningful evaluation datasets. Early-stage products or agents serving specialized use cases may require alternative strategies for dataset construction, potentially including careful synthetic data generation or transfer learning from related domains.

6. Conclusion

This analysis presents a structured maturity model for agent evaluation comprising four distinct stages: initial human annotation with vibe checking, scaled automation through LLM-as-judge techniques, complex trace evaluation with external system integration, and advanced automation with meta-evaluation. The framework demonstrates that effective evaluation requires accepting directional accuracy over perfect precision, focusing strategically on specific failure modes rather than attempting exhaustive coverage, and systematically validating evaluation mechanisms themselves.

Key practical takeaways include prioritizing production traces over synthetic data for evaluation datasets, implementing hybrid approaches combining deterministic checks with LLM judges, encapsulating system state within traces to simplify infrastructure requirements, and establishing systematic validation of LLM judge outputs against human ground truth. Organizations implementing this framework should progress through maturity stages sequentially, establishing solid foundations at each level before advancing to more sophisticated techniques. The flywheel methodology—capturing production traces, identifying failures, conducting offline evaluation, and implementing improvements—provides a cyclical process for continuous agent enhancement grounded in real-world performance data.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
