Evals Are Broken, Use Them Anyway — Ara Khan, Cline
By Sean Weldon
Abstract
Current evaluation practices for artificial intelligence systems exhibit two opposing methodological failures: uncritical acceptance of benchmark metrics as absolute truth and complete dismissal of quantitative assessment in favor of subjective judgment. This analysis presents a framework for rigorous evaluation methodology that positions benchmarks as essential but imperfect instruments requiring sophisticated interpretation. The framework encompasses three stages: leveraging existing benchmarks, iterative agent optimization, and custom evaluation development. Key findings demonstrate that effective evaluation requires understanding three testing dimensions (model quality, harness architecture, and problem validity), implementing parallel execution infrastructure for isolated environments, and applying model-family-specific optimization techniques. Practical application through Terminal Bench's 89-task coding evaluation suite demonstrates systematic improvement through hill-climbing methodology while avoiding overfitting. The framework enables practitioners to achieve both quantitative performance gains and qualitative product improvements through dual validation approaches.
1. Introduction
The evaluation of artificial intelligence systems, particularly large language models (LLMs) and autonomous coding agents, represents a fundamental methodological challenge in contemporary AI research and development. As these systems advance in capability and complexity, the mechanisms for assessing their performance have become increasingly contested. Two opposing philosophical camps have emerged in evaluation practice, each exhibiting critical limitations that impede effective system development and deployment.
The first camp treats benchmark scores as definitive measures of capability, assuming that models achieving equivalent numeric performance (such as hypothetical GPT 5.4 and Gemini 3.1 Pro with matching scores) deliver identical real-world utility. This approach systematically ignores qualitative differences in model behavior, domain-specific capabilities, and failure modes that numeric aggregates fail to capture. Conversely, the second camp dismisses quantitative evaluation entirely, relying exclusively on subjective assessment—often termed "vibe checks"—without empirical grounding or reproducible methodology.
This analysis examines a comprehensive framework that positions evals (systematic evaluations) as essential instruments requiring rigorous interpretation rather than blind acceptance or wholesale rejection. The framework addresses three critical questions: how to interpret existing benchmarks given their inherent limitations, how to leverage evaluations for iterative system improvement, and how to construct meaningful assessments for complex multi-step agent behaviors. The analysis proceeds through examination of interpretation heuristics, a three-stage evaluation framework, infrastructure requirements for agent assessment, and optimization strategies that enable genuine capability improvements while avoiding benchmark overfitting.
2. Background and Related Work
2.1 The Evaluation Dichotomy in Practice
Contemporary evaluation practices in AI development exhibit two problematic extremes that impede effective system assessment. The objective metrics camp operates under the assumption that benchmark scores constitute ground truth, treating numeric performance as a complete representation of system capability. This perspective fails to account for the multidimensional nature of model performance, where identical aggregate scores may mask substantial differences in error distributions, reasoning approaches, and domain-specific competencies.
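The point about identical aggregates hiding different error distributions can be made concrete with a small sketch. The task categories and results below are invented for illustration and do not come from any real benchmark; the two result lists stand for two imaginary models.

```python
from collections import defaultdict

# Hypothetical per-task results as (task category, passed) pairs.
RESULTS_A = [("algorithms", True), ("algorithms", True),
             ("devops", False), ("devops", False)]
RESULTS_B = [("algorithms", True), ("algorithms", False),
             ("devops", True), ("devops", False)]


def pass_rate(results):
    """Aggregate pass rate across all tasks."""
    return sum(passed for _, passed in results) / len(results)


def pass_rate_by_category(results):
    """Pass rate broken down by task category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: p / n for cat, (p, n) in totals.items()}


# Both models score the same overall, but fail in different places.
print(pass_rate(RESULTS_A), pass_rate(RESULTS_B))
print(pass_rate_by_category(RESULTS_A))
print(pass_rate_by_category(RESULTS_B))
```

Both lists produce the same overall pass rate, yet the first model fails every task in one category while the second fails half of each. A team whose workload sits mostly in that one category would experience the two models very differently.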
The opposing taste camp rejects quantitative evaluation as fundamentally inadequate, instead relying on anthropomorphization and subjective preference without systematic measurement. This approach lacks reproducibility, prevents identification of specific failure modes, and provides no mechanism for tracking improvement over time. The synthesis of these positions recognizes that evaluations constitute neither absolute truth nor meaningless noise, but rather imperfect instruments requiring sophisticated interpretation, proper contextualization, and integration with qualitative assessment.
2.2 Evolution and Saturation of Evaluation Benchmarks
Traditional benchmarks such as HumanEval have experienced capability saturation as frontier models have advanced. OpenAI explicitly acknowledged that HumanEval no longer discriminates between frontier coding capabilities, as state-of-the-art systems achieve near-perfect scores on tasks that were discriminative only years prior. This saturation necessitates continuous development of novel, more challenging evaluations that accurately measure advancing capabilities. Furthermore, the transition from single-turn evaluations (simple question-answer pairs with binary correctness) to multi-step agent evaluations introduces substantial complexity in assessment methodology, requiring evaluation of file reading, documentation search, environment setup, script execution, and test validation across extended interaction sequences.
3. Core Analysis
3.1 Interpretation Heuristics for Evaluation Metrics
Three fundamental heuristics guide proper interpretation of evaluation results in practice. First, published model and application evaluation numbers should be treated as approximations rather than ground truth. The gap between benchmark performance and real-world utility necessitates direct experimentation with systems in target use cases, as aggregate metrics systematically obscure domain-specific performance variations and failure mode distributions that determine practical applicability.
Second, practitioners should stay current with frontier model releases while avoiding immediate adoption of new systems. Frontier models evolve on a timescale of several months, and new releases typically exhibit initial instability that takes weeks of post-deployment refinement to resolve. The better strategy is to monitor a new release through this stabilization period and only then evaluate it systematically and consider migration, rather than continuously chasing the latest release.
Third, effective evaluation requires seeking both very recent and very precise benchmarks. Standardized legacy evaluations fail to measure frontier capabilities due to saturation, while new evaluations designed for current systems provide discriminative power. Precision in evaluation design—ensuring tasks are non-trivial and representative of target use cases—prevents the meaningless achievement of perfect scores on inadequate assessments.
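One way to check whether a benchmark still has discriminative power is to count the tasks that separate the systems under test. The sketch below assumes per-task pass/fail results are available for each system; the system names, task ids, and results are hypothetical.

```python
def non_discriminative_tasks(results):
    """Split out tasks that every system passes and tasks that every system fails.

    `results` maps a system name to {task_id: passed}; all systems are
    assumed to share the same task ids.
    """
    systems = list(results.values())
    task_ids = sorted(systems[0])
    always_pass = [t for t in task_ids if all(s[t] for s in systems)]
    always_fail = [t for t in task_ids if not any(s[t] for s in systems)]
    return always_pass, always_fail


def discriminative_fraction(results):
    """Fraction of tasks on which at least two systems disagree."""
    total = len(next(iter(results.values())))
    always_pass, always_fail = non_discriminative_tasks(results)
    return (total - len(always_pass) - len(always_fail)) / total


# Hypothetical results for two systems on four tasks.
RESULTS = {
    "system_a": {"t1": True, "t2": True, "t3": False, "t4": False},
    "system_b": {"t1": True, "t2": False, "t3": True, "t4": False},
}

print(non_discriminative_tasks(RESULTS))
print(discriminative_fraction(RESULTS))
```

A fraction near zero with most tasks in the always-pass group suggests saturation. Tasks in the always-fail group deserve a manual look, since they may be broken rather than merely hard.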
3.2 Three-Stage Framework for Evaluation Implementation
The framework for systematic evaluation encompasses three progressive stages of increasing sophistication. Stage 1 involves leveraging existing benchmarks developed by other organizations, providing immediate access to standardized assessments without development overhead. This stage enables rapid baseline establishment and comparison against published results.
Stage 2 focuses on using evaluations to iteratively improve proprietary agents and systems. This stage represents the core of practical evaluation application, where systematic measurement enables identification of specific failure modes and targeted optimization. The iterative cycle of evaluation, failure analysis, and refinement constitutes a hill-climbing methodology that drives genuine capability improvements.
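The evaluate, analyze, refine cycle can be sketched as a simple greedy loop. This is a minimal illustration under stated assumptions, not the method used in the talk: `propose_change` and `evaluate` are hypothetical callables standing in for a human-driven refinement step and a full benchmark run, and the toy functions exist only so the sketch runs.

```python
def hill_climb(baseline_config, propose_change, evaluate, rounds=5):
    """Keep a candidate configuration only if it scores strictly higher."""
    best_config = baseline_config
    best_score = evaluate(best_config)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose_change(best_config)
        score = evaluate(candidate)
        if score > best_score:
            best_config, best_score = candidate, score
        history.append(best_score)
    return best_config, best_score, history


def toy_evaluate(config):
    # Stand-in for a benchmark run: score rises with timeout, capped at 300 s.
    return min(config["timeout_s"], 300) / 600


def toy_propose(config):
    # Stand-in for a refinement step chosen after failure analysis.
    return {**config, "timeout_s": config["timeout_s"] + 100}


config, score, history = hill_climb(
    {"timeout_s": 100}, toy_propose, toy_evaluate, rounds=4
)
print(config, score, history)
```

In the toy run the timeout stops helping once it reaches the cap, so later candidates are rejected and the best score plateaus. In practice each `evaluate` call is expensive and noisy, so repeated runs per candidate may be needed before accepting a change.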
Stage 3 involves constructing custom evaluations tailored to specific use cases and requirements. This advanced stage addresses scenarios where existing benchmarks fail to capture critical aspects of target applications, requiring investment in problem curation, infrastructure development, and validation methodology.
3.3 Infrastructure and Execution for Agent Evaluation
Agent evaluation introduces substantial infrastructure requirements distinct from simple model assessment. Each evaluation task requires an isolated environment—typically a dedicated virtual machine with proper repository setup, dependency installation, and configuration management. This isolation prevents cross-contamination between tasks and ensures reproducibility.
Parallel execution infrastructure transforms evaluation efficiency by running many tasks simultaneously rather than sequentially. Systems such as Harbor (developed by the Laude Institute) provide standardized Linux configurations with consistent CPU, RAM, and environment specifications, enabling parallel execution of evaluation suites. For Terminal Bench's 89 coding problems, and given enough capacity to run every task at once, parallel execution reduces total runtime from the sum of all task durations (roughly 45 to 60 hours at 30-40 minutes per task) to the duration of the slowest individual task (30-40 minutes).
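The runtime difference follows from simple arithmetic, sketched below. The 89-task count and the 30-40 minute task duration come from the text; the worker counts and the assumption that tasks run in fixed-size waves are illustrative simplifications.

```python
import math


def sequential_minutes(num_tasks, minutes_per_task):
    """Total wall-clock time when tasks run one after another."""
    return num_tasks * minutes_per_task


def parallel_minutes_upper_bound(num_tasks, max_minutes_per_task, workers):
    """Upper bound when tasks run in waves of `workers` tasks at a time."""
    return math.ceil(num_tasks / workers) * max_minutes_per_task


NUM_TASKS = 89

# Sequential: between 89 * 30 and 89 * 40 minutes, expressed in hours.
print(sequential_minutes(NUM_TASKS, 30) / 60, sequential_minutes(NUM_TASKS, 40) / 60)

# One worker per task: bounded by the slowest single task.
print(parallel_minutes_upper_bound(NUM_TASKS, 40, workers=89))

# A smaller, hypothetical pool of 16 workers: six waves of up to 40 minutes.
print(parallel_minutes_upper_bound(NUM_TASKS, 40, workers=16))
```

With one isolated environment per task the suite finishes in about the time of its slowest task. With a smaller pool the bound grows with the number of waves, which is why the available parallel capacity matters as much as the runner itself.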
Trace analysis of language model calls during evaluation provides granular insight into failure modes. By examining logs of all LM interactions, practitioners can identify specific failure points (such as "didn't progress past step X" or "retry mechanism malfunction") and extract actionable improvement opportunities. Attributing each failed task to a specific failure mode in this way enables targeted optimization rather than undirected system modification.
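Grouping failed tasks by where they stopped and why can be sketched as follows. The trace record fields (`last_step`, `error`) and their values are assumptions made for this example; real traces would contain full request and response logs that need to be reduced to labels like these first.

```python
from collections import Counter

# Hypothetical, already-summarized trace records, one per evaluation task.
TRACES = [
    {"task": "t1", "passed": True, "last_step": "test_validation", "error": None},
    {"task": "t2", "passed": False, "last_step": "environment_setup", "error": "timeout"},
    {"task": "t3", "passed": False, "last_step": "environment_setup", "error": "timeout"},
    {"task": "t4", "passed": False, "last_step": "script_execution", "error": "retry_exhausted"},
]


def failure_modes(traces):
    """Count failed tasks by (last step reached, error label)."""
    return Counter(
        (t["last_step"], t["error"]) for t in traces if not t["passed"]
    )


# Most frequent failure modes first, so the largest bucket gets fixed first.
for mode, count in failure_modes(TRACES).most_common():
    print(mode, count)
```

Sorting the buckets by frequency turns a pile of logs into a short priority list: the most common failure mode is usually the cheapest place to gain score.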
3.4 Multi-Dimensional Testing and Optimization Zones
Effective agent evaluation requires assessment across three distinct dimensions. Model testing evaluates the underlying language model's capabilities, recognizing that strong models can compensate for weak harness architectures and still achieve high scores. Harness testing evaluates the agent framework or coding assistant architecture, as identical models perform differently across different harnesses (for example, Anthropic models demonstrating superior performance with Claude Code compared to alternative agents). Problem quality testing ensures evaluation tasks are non-trivial and representative, as perfect scores on inadequate problems provide no discriminative information.
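Separating the model dimension from the harness dimension is easier with a score grid that varies one while holding the other fixed. The sketch below uses invented pass rates (in percent) and placeholder names for two models and two harnesses.

```python
from statistics import mean

# Hypothetical pass rates in percent, keyed by (model, harness).
SCORES = {
    ("model_x", "harness_a"): 60,
    ("model_x", "harness_b"): 45,
    ("model_y", "harness_a"): 50,
    ("model_y", "harness_b"): 35,
}


def marginal_means(scores, axis):
    """Average score per model (axis=0) or per harness (axis=1)."""
    groups = {}
    for key, value in scores.items():
        groups.setdefault(key[axis], []).append(value)
    return {name: mean(values) for name, values in groups.items()}


print(marginal_means(SCORES, axis=0))  # effect of the model
print(marginal_means(SCORES, axis=1))  # effect of the harness
```

In this made-up grid, switching harness moves the average by more than switching model does, which would point at the harness as the limiting component. A single score from one model on one harness cannot show that.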
Optimization proceeds through three zones of increasing subtlety and risk. Zone 1 addresses obvious flaws such as crashes, rate limiting failures, and critical bugs that prevent basic functionality. These improvements are unambiguous and provide immediate performance gains. Zone 2 encompasses nuanced improvements through model-family-specific prompt engineering and architectural refinement. Techniques effective for Anthropic model families often fail for Codex or Gemini families, requiring distinct optimization strategies. This zone represents the core of effective agent optimization and demands sophisticated understanding of model-specific behaviors.
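One lightweight way to keep Zone 2 tuning separate per model family is a lookup of family-specific settings. This is only a sketch: the `FamilyProfile` fields, the family keys, and every value below are placeholders chosen for illustration, not recommended settings or anything reported in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyProfile:
    """Hypothetical per-family tuning knobs for an agent harness."""
    prompt_style: str
    max_thinking_tokens: int


# Placeholder values only; real settings would come from evaluation runs.
DEFAULT_PROFILE = FamilyProfile(prompt_style="generic", max_thinking_tokens=4000)
PROFILES = {
    "anthropic": FamilyProfile(prompt_style="style_one", max_thinking_tokens=8000),
    "gemini": FamilyProfile(prompt_style="style_two", max_thinking_tokens=2000),
}


def profile_for(family):
    """Return the tuned profile for a model family, or the default if untuned."""
    return PROFILES.get(family.lower(), DEFAULT_PROFILE)


print(profile_for("Gemini"))
print(profile_for("some_new_family"))
```

Keeping the profiles in one table makes it explicit which families have been tuned and which are still running on defaults, so a weak score on an untuned family is not mistaken for a weak model.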
Zone 3 constitutes the danger zone of overfitting, where apparent score improvements result from benchmark-specific exploitation rather than genuine capability enhancement. Excessive optimization for specific evaluation tasks degrades generalization and real-world performance. Avoiding this zone requires dual validation: systems must pass both quantitative benchmarks and qualitative "vibe checks" that assess subjective product quality.
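Dual validation can be expressed as an explicit acceptance gate. The sketch below adds one assumption beyond the text: a held-out task set that is never used for tuning, a common guard against overfitting. The function name, parameters, tolerance, and scores are all hypothetical.

```python
def accept_change(bench_before, bench_after,
                  holdout_before, holdout_after,
                  qualitative_ok, tolerance=0.01):
    """Accept a change only if it helps the benchmark without hurting elsewhere.

    - the tuned benchmark score must improve,
    - the held-out score (an added assumption) must not regress beyond `tolerance`,
    - a human qualitative review ("vibe check") must pass.
    """
    improved = bench_after > bench_before
    generalizes = holdout_after >= holdout_before - tolerance
    return improved and generalizes and qualitative_ok


# Benchmark up, held-out stable, reviewers satisfied: accept.
print(accept_change(0.43, 0.50, 0.40, 0.41, qualitative_ok=True))

# Benchmark up sharply but held-out score drops: likely overfitting, reject.
print(accept_change(0.43, 0.55, 0.40, 0.30, qualitative_ok=True))
```

The second call is the Zone 3 pattern: a large gain on the tuned benchmark paired with a loss on tasks the tuning never saw.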
4. Technical Insights
Terminal Bench exemplifies modern agent evaluation design, comprising 89 coding problems representing authentic real-world scenarios including race conditions, database configuration issues, and infrastructure problems. Individual tasks require 30-40 minutes for completion due to multi-step agent operations encompassing file reading, documentation search, environment setup, script execution, and test validation. The length of these runs reflects genuinely multi-step work and indicates that the tasks cannot be solved trivially.
The development methodology for high-quality evaluation suites involves analyzing large-scale opted-in user datasets to identify authentic programming problems, followed by manual curation and cleaning to ensure task validity. This approach, employed by Cline for their internal benchmarks, balances scale with quality by leveraging real-world problem distributions while maintaining evaluation rigor through expert review.
Practical optimization demonstrates the framework's effectiveness: Cline achieved a 43% baseline score on internal benchmarks, with subsequent improvements through CPU and memory allocation adjustments, timeout increases, and thinking behavior optimization. However, optimization revealed counterintuitive findings: excessive thinking tokens can degrade performance by inducing circular reasoning loops, with models sometimes repeating phrases such as "I am a model" for thousands of tokens rather than reasoning productively.
Model-family-specific optimization represents a critical technical insight. Prompt engineering techniques, architectural decisions, and hyperparameter configurations that optimize performance for one model family often degrade performance for others. Cline's evaluation revealed strong performance on Anthropic models but weak performance on Gemini, indicating untapped user segments and the necessity of family-specific optimization strategies.
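The circular-reasoning failure described in this section, where a model repeats a short phrase for thousands of tokens, can be flagged automatically in traces. The detector below is a naive sketch that works on whitespace-separated words rather than model tokens; the phrase length `n` and the `threshold` are arbitrary choices for illustration.

```python
def longest_repeat_run(text, n=4):
    """Largest number of back-to-back repeats of any n-word phrase in `text`.

    Naive scan, quadratic in the worst case; fine for spot-checking traces.
    """
    words = text.split()
    best = 1 if len(words) >= n else 0
    for start in range(len(words) - n + 1):
        phrase = words[start:start + n]
        run, pos = 1, start + n
        # Count how many times the same phrase follows itself immediately.
        while words[pos:pos + n] == phrase:
            run += 1
            pos += n
        best = max(best, run)
    return best


def is_looping(text, n=4, threshold=10):
    """Flag output whose longest repeated run reaches the threshold."""
    return longest_repeat_run(text, n) >= threshold


print(is_looping("I am a model " * 50))
print(is_looping("the quick brown fox jumps over the lazy dog"))
```

A check like this could be run over thinking output in evaluation traces to decide when a thinking budget is too generous for a given model, instead of discovering the loop by reading logs.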
5. Discussion
The framework presented addresses a fundamental tension in AI system evaluation: the need for rigorous quantitative assessment balanced against the limitations of any finite benchmark suite. The three-stage progression from benchmark leverage through iterative improvement to custom evaluation development provides a practical pathway for organizations at varying levels of evaluation maturity. Furthermore, the emphasis on dual validation—achieving both strong quantitative scores and passing qualitative assessment—prevents the pathological optimization patterns that plague benchmark-driven development.
The multi-dimensional testing framework reveals that evaluation is not a monolithic activity but rather encompasses distinct assessments of model capability, harness architecture, and problem validity. This decomposition enables targeted improvement by identifying which component limits overall system performance. The observation that identical models perform differently across harnesses underscores that agent architecture constitutes a genuine research and engineering challenge beyond model selection.
Several areas warrant further investigation. The relationship between benchmark performance and real-world utility remains incompletely characterized, particularly for complex multi-step agent tasks where evaluation approximations may diverge substantially from deployment scenarios. Additionally, the temporal dynamics of evaluation validity—the rate at which benchmarks saturate and require replacement—demands systematic study to inform evaluation development priorities. The model-family-specific nature of optimization techniques suggests that meta-learning approaches for rapid adaptation to new model families may provide substantial value.
6. Conclusion
This analysis presents a comprehensive framework for rigorous evaluation of AI systems that avoids both uncritical acceptance of benchmark metrics and complete dismissal of quantitative assessment. The three-stage framework (leverage, improve, build) provides a practical progression for evaluation implementation, while the three-dimensional testing model (model, harness, problem quality) enables targeted optimization. Infrastructure requirements for parallel execution in isolated environments and trace-based failure analysis constitute essential technical capabilities for effective agent evaluation.
The key practical takeaway emphasizes dual validation: systems should achieve strong quantitative performance on legitimate benchmarks while simultaneously passing qualitative assessment of real-world utility. This balanced approach prevents overfitting while enabling systematic improvement through hill-climbing methodology. Organizations should implement model-family-specific optimization strategies, recognizing that techniques effective for one model family often fail for others. As frontier models continue to advance and saturate existing benchmarks, continuous development of novel, challenging evaluations remains essential for meaningful capability assessment. The framework enables practitioners to navigate the complex landscape of AI evaluation, treating benchmarks as essential but imperfect instruments requiring sophisticated interpretation and application.
Sources
- Evals Are Broken, Use Them Anyway — Ara Khan, Cline - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.