Fighting AI with AI — Lawrence Jones, Incident

Building complex AI systems requires using AI itself as a tool in internal debugging and analysis systems to manage complexity, scale evaluation, and continu...

By Sean Weldon

Abstract

Modern production AI systems exhibit architectural and operational complexity that surpasses human capacity for manual debugging and systematic improvement. This analysis examines methodologies for deploying AI agents as internal development tools to manage system complexity, scale evaluation processes, and enable continuous improvement of multi-component AI products. The approach integrates three technical components: evaluation frameworks with agent-compatible interfaces, file system-based debugging representations optimized for large language model analysis, and parallelized investigation pipelines for large-scale backtest interpretation. Reported experience from production systems indicates that agent-assisted workflows reduce a debugging cycle from roughly an hour to minutes while enabling systematic identification of failure patterns across thousands of cases. These methodologies provide a generalizable framework for organizations developing complex AI systems requiring continuous observability and iterative refinement.

1. Introduction

The deployment of artificial intelligence systems in production environments has introduced fundamental challenges in system observability, debugging, and continuous improvement. Contemporary AI products frequently comprise dozens of interconnected components—including multiple agents, prompts, and tool integrations—operating in complex hierarchical arrangements. A representative production chatbot may contain over ten distinct agents and fifty individual prompts, while investigation systems generate execution traces containing hundreds or thousands of granular operations. Traditional debugging methodologies, designed for deterministic software systems with predictable failure modes, prove inadequate when applied to stochastic AI components exhibiting emergent behaviors and subtle error propagation.

This analysis examines a paradigm shift in AI system development: the strategic deployment of AI agents as first-class tools within internal debugging, evaluation, and analysis infrastructure. Rather than treating AI solely as a product output, this approach recognizes that managing AI system complexity necessitates AI-assisted tooling throughout the development lifecycle. The central thesis posits that production AI systems—defined as multi-component architectures integrating numerous prompts, agents, and tools to accomplish complex tasks—cannot be tractably debugged, evaluated, or improved through manual human analysis alone.

The complexity manifests across multiple dimensions. A single incident investigation typically requires approximately one hour of human analysis time, involving hundreds of telemetry queries across logs, metrics, traces, and historical incident data. Behind each investigation exist hundreds or thousands of individual prompt executions, making it difficult to identify which specific component contributes to system-level failures. Furthermore, subtle errors in complex pipelines can propagate undetected through downstream components, ultimately producing completely incorrect conclusions about incidents. This synthesis examines technical solutions addressing these challenges through agent-compatible tooling and automated analysis pipelines.

2. Background and Related Work

2.1 Evaluation Frameworks for AI Systems

Evals function as unit tests for AI systems, providing structured assessment of prompt behavior. Each eval comprises three components: input data representing realistic usage scenarios, a prompt execution step, and grading criteria determining pass/fail status. In the setup described here, evals are maintained as YAML files colocated with prompt definitions in version-controlled codebases, enabling systematic regression testing as prompts evolve.
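The three-part structure of an eval can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual format: the names EvalCase, run_eval, and fake_summarize_prompt are hypothetical, and the prompt execution step is replaced by a deterministic stand-in function so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval: realistic input plus a grading criterion (hypothetical shape)."""
    name: str
    input_text: str
    grade: Callable[[str], bool]  # returns True for pass, False for fail


def run_eval(case: EvalCase, run_prompt: Callable[[str], str]) -> bool:
    """Execute the prompt on the case input, then grade the output."""
    output = run_prompt(case.input_text)
    return case.grade(output)


def fake_summarize_prompt(text: str) -> str:
    """Stand-in for a model call: returns the first sentence of the input."""
    return text.split(".")[0].strip() + "."


case = EvalCase(
    name="summary_mentions_database",
    input_text="Database latency spiked at 09:00. Users saw timeouts.",
    grade=lambda out: "database" in out.lower(),
)

result = run_eval(case, fake_summarize_prompt)
print(case.name, "PASS" if result else "FAIL")
```

In a real suite the grading function would more likely be a rubric or a second model call; a plain predicate is used here only to keep the sketch self-contained.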

However, production evals present unique maintenance challenges. Realistic test cases frequently require complete incident reports or extensive contextual data to trigger problematic behaviors, resulting in YAML files exceeding 2 MB in size. Files of this size exceed the context length limits of large language models, so coding agents cannot load them whole to modify or iterate on test suites. This limitation creates a paradox: the evaluation infrastructure intended to ensure system quality becomes itself too complex for efficient maintenance.

2.2 Observability in Multi-Component AI Systems

Production investigations generate observability data across multiple dimensions, requiring integration of logs, metrics, distributed traces, and historical records. Investigation systems unpack each operational step into granular traces containing hundreds of prompts and tool calls, creating a comprehensive but overwhelming data landscape. Traditional observability tools designed for human consumption—dashboards, query interfaces, and visualization layers—provide inadequate interfaces for systematic analysis at scale. The volume and complexity of execution traces exceed human capacity for pattern recognition and root cause identification, particularly when analyzing thousands of investigations across hundreds of customer accounts in daily backtests.

3. Core Analysis

3.1 Agent-Compatible Evaluation Infrastructure

The first technical component addresses the maintainability challenge of large-scale evaluation suites through agent-compatible interfaces. A command-line tool—the eval CLI—provides structured operations for querying, editing, replacing, and adding test cases without requiring agents to load entire evaluation files into context. This interface enables coding agents to manipulate evaluation suites programmatically, creating a foundation for automated prompt improvement workflows.
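The idea behind such a tool can be sketched as a handful of targeted operations that return small, bounded outputs. Everything below is an assumption made for illustration: the function names (list_cases, show_case, add_case, replace_expected) and the suite layout are hypothetical, and JSON stands in for YAML so the sketch needs only the standard library.

```python
import json
import tempfile
from pathlib import Path


def load(path):
    return json.loads(Path(path).read_text())


def save(path, suite):
    Path(path).write_text(json.dumps(suite, indent=2))


def list_cases(path):
    """Query: return (name, input size) pairs only, never the case bodies."""
    return [(c["name"], len(c["input"])) for c in load(path)["cases"]]


def show_case(path, name, max_chars=200):
    """Query: return one case with its input truncated to max_chars."""
    for c in load(path)["cases"]:
        if c["name"] == name:
            return {**c, "input": c["input"][:max_chars]}
    raise KeyError(name)


def add_case(path, case):
    """Add: append a new case, refusing duplicate names."""
    suite = load(path)
    if any(c["name"] == case["name"] for c in suite["cases"]):
        raise ValueError(f"case already exists: {case['name']}")
    suite["cases"].append(case)
    save(path, suite)


def replace_expected(path, name, expected):
    """Edit: change one field of one case without touching the rest."""
    suite = load(path)
    for c in suite["cases"]:
        if c["name"] == name:
            c["expected"] = expected
            save(path, suite)
            return
    raise KeyError(name)


# Demo with a deliberately oversized case (about 2 MB of input).
suite_path = Path(tempfile.mkdtemp()) / "evals.json"
save(suite_path, {"cases": [
    {"name": "big_incident", "input": "x" * 2_000_000, "expected": "pass"},
]})
add_case(suite_path, {"name": "small", "input": "db timeout", "expected": "fail"})
replace_expected(suite_path, "small", "pass")

listing = list_cases(suite_path)
preview = show_case(suite_path, "big_incident", max_chars=50)
print(listing)
```

The tool itself reads the whole file, but the agent only ever sees the listing or a truncated preview, which is what keeps the interaction within its context budget.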

Automated eval runbooks define multi-stage processes for agent-driven prompt refinement. These runbooks specify stages including: creating failing test cases that reproduce observed issues, modifying prompts to address failures, checking for regressions across the existing test suite, and consolidating prompts to reduce system bloat. Coding agents execute these runbooks iteratively, repeatedly improving prompts while maintaining test suite integrity. This pattern proves effective for single-prompt modifications but does not address multi-prompt system debugging requiring cross-component analysis.

The red-green eval cycle provides validation for prompt changes, ensuring that modifications address target failures without introducing regressions. This methodology parallels test-driven development practices in traditional software engineering, adapted for the stochastic nature of language model outputs. Agents can verify fixes systematically before deployment, reducing the risk of degraded production performance.
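A minimal sketch of the red-green check follows, assuming prompts can be modeled as functions from input text to output text. The helper names (run_suite, red_green) and the status strings are hypothetical, and deterministic stand-in functions replace model calls; with a real model each case would likely be run several times and judged on pass rate.

```python
def run_suite(prompt_fn, cases):
    """Run every case; cases maps name -> (input_text, check_fn)."""
    return {name: check(prompt_fn(inp)) for name, (inp, check) in cases.items()}


def red_green(old_prompt, new_prompt, existing, new_name, new_case):
    """Validate a prompt change against a new failing case and the old suite."""
    inp, check = new_case
    # Red: the new case must fail on the old prompt, or it reproduces nothing.
    if check(old_prompt(inp)):
        return "invalid: new case does not reproduce the failure"
    before = run_suite(old_prompt, existing)
    after = run_suite(new_prompt, {**existing, new_name: new_case})
    # Green: the new case must pass on the modified prompt.
    if not after[new_name]:
        return "red: new case still fails"
    # No regressions: nothing that passed before may fail now.
    regressions = sorted(n for n in existing if before[n] and not after[n])
    if regressions:
        return "regression: " + ", ".join(regressions)
    return "green"


# Stand-ins for prompt versions (hypothetical behavior).
old_prompt = lambda text: text.upper()
good_fix = lambda text: text.strip().upper()
bad_fix = lambda text: text.strip().upper() if text != text.strip() else text

existing = {"uppercases": ("abc", lambda out: out == "ABC")}
new_case = ("  abc ", lambda out: out == "ABC")

good_status = red_green(old_prompt, good_fix, existing, "strips_padding", new_case)
bad_status = red_green(old_prompt, bad_fix, existing, "strips_padding", new_case)
print(good_status, "|", bad_status)
```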

3.2 File System-Based Debugging Interfaces

The second technical component reconceptualizes debugging interfaces for optimal agent interaction. Rather than exposing observability data through graphical user interfaces or specialized query languages, this approach converts debugging information into downloadable file systems that agents navigate using standard tools. Execution traces, prompts, tool calls, and relevant code context are serialized into text-format files organized in intuitive directory structures.

This design leverages a critical insight: coding agents such as Claude Code demonstrate exceptional proficiency with file system navigation and bulk data analysis using standard command-line tools like grep. File systems provide inherent self-documentation through directory structure and naming conventions, allowing agents to understand system architecture without explicit documentation. Furthermore, agents can access the production codebase alongside traces and prompts, enabling identification of exact code locations requiring modification.
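The sketch below illustrates the general shape of this idea: a trace is written out as one directory per step, and a grep-like search then scans every file. The directory layout, the file names, and the helpers dump_trace and grep are hypothetical choices for this example, not the layout used in the system described.

```python
import tempfile
from pathlib import Path


def dump_trace(root, steps):
    """Serialize each step into its own numbered directory of text files."""
    for i, step in enumerate(steps, start=1):
        step_dir = root / f"{i:03d}_{step['name']}"
        step_dir.mkdir(parents=True)
        (step_dir / "prompt.txt").write_text(step["prompt"])
        (step_dir / "response.txt").write_text(step["response"])


def grep(root, needle):
    """Return (relative path, line number) for every line containing needle."""
    hits = []
    for path in sorted(root.rglob("*.txt")):
        for line_no, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), line_no))
    return hits


# Made-up trace data for illustration only.
steps = [
    {"name": "fetch_logs",
     "prompt": "Find errors in checkout logs",
     "response": "Found 3 errors\nconnection timeout to db-1"},
    {"name": "summarize",
     "prompt": "Summarize findings",
     "response": "Root cause: cache eviction"},
]

root = Path(tempfile.mkdtemp()) / "interaction"
dump_trace(root, steps)
hits = grep(root, "timeout")
print(hits)
```

Numbered directory names keep the steps in execution order under a plain directory listing, which is one way the structure can document itself.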

The workflow proceeds as follows: download an interaction's complete execution context as a file system, point a coding agent at the file system root, allow the agent to identify problems through autonomous exploration and pattern matching, receive agent-generated suggestions for code changes, and verify fixes through the automated eval runbook. This approach reduces investigation time from approximately one hour to minutes, while providing more systematic analysis than manual debugging.

3.3 Scalable Analysis Pipelines for Backtest Interpretation

The third technical component addresses interpretation of large-scale backtests through parallelized agent analysis. Daily backtests execute thousands of investigations across hundreds of customer accounts, producing aggregate metrics such as overall root cause analysis accuracy (e.g., 86%). However, aggregate metrics obscure underlying performance patterns, failing to explain why performance improved or degraded between versions or why systems perform differently across customer segments.
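A small worked example shows how an aggregate figure can hide segment-level differences. The data below is invented purely for illustration and is chosen so that the overall accuracy comes to 86%; the account names and the helper accuracy_by_account are hypothetical.

```python
from collections import defaultdict


def accuracy_by_account(results):
    """results is a list of (account, correct) pairs; returns account -> accuracy."""
    totals = defaultdict(lambda: [0, 0])  # account -> [correct, total]
    for account, correct in results:
        totals[account][0] += int(correct)
        totals[account][1] += 1
    return {account: c / n for account, (c, n) in totals.items()}


# Invented backtest results: 50 investigations per account.
results = (
    [("acct_a", True)] * 48 + [("acct_a", False)] * 2
    + [("acct_b", True)] * 38 + [("acct_b", False)] * 12
)

overall = sum(correct for _, correct in results) / len(results)
per_account = accuracy_by_account(results)

print(f"overall: {overall:.2f}")
for account, acc in sorted(per_account.items()):
    print(f"{account}: {acc:.2f}")
```

Here a single 86% headline covers one account at 96% and another at 76%. A breakdown like this shows where the gap is, but not why it exists, which is the question the agent analysis is meant to answer.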

The scrapbook analysis pipeline employs structured markdown playbooks defining investigation analysis procedures for agents. The pipeline operates in three stages. First, approximately 25 agents run in parallel, each analyzing individual investigations and storing findings in structured files. Second, a cohort clustering stage groups similar failure patterns across investigations, identifying systematic issues affecting multiple cases. Third, synthesis agents generate actionable reports identifying why systems perform well or poorly on specific customer accounts and recommending specific code modifications.
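The three stages can be sketched as follows, assuming each per-investigation analysis is independent. This is an illustration of the pattern only: the analyze function is a keyword-matching stand-in for an agent following a playbook, clustering is reduced to grouping by a label, and all names and data are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def analyze(investigation):
    """Stage 1 stand-in: an agent would read the trace and record a finding."""
    trace = investigation["trace"]
    if "timeout" in trace:
        label = "telemetry_timeout"
    elif "no deploys found" in trace:
        label = "missing_deploy_data"
    else:
        label = "ok"
    return {"id": investigation["id"], "account": investigation["account"],
            "label": label}


def cluster(findings):
    """Stage 2: group failing investigations that share a failure label."""
    cohorts = defaultdict(list)
    for finding in findings:
        if finding["label"] != "ok":
            cohorts[finding["label"]].append(finding["id"])
    return dict(cohorts)


def synthesize(cohorts):
    """Stage 3: rank cohorts by size and emit one report line per cohort."""
    ranked = sorted(cohorts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [f"{label}: count={len(ids)} ids={', '.join(ids)}"
            for label, ids in ranked]


def run_pipeline(investigations, max_workers=25):
    # pool.map preserves input order, so findings line up with investigations.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        findings = list(pool.map(analyze, investigations))
    return synthesize(cluster(findings))


# Made-up investigations for illustration.
investigations = [
    {"id": "inv-1", "account": "acme", "trace": "query logs: timeout"},
    {"id": "inv-2", "account": "acme", "trace": "root cause found"},
    {"id": "inv-3", "account": "globex", "trace": "query metrics: timeout"},
    {"id": "inv-4", "account": "globex", "trace": "no deploys found"},
]

report = run_pipeline(investigations)
print("\n".join(report))
```

The default of 25 workers mirrors the figure quoted above; threads are a reasonable fit when each worker mostly waits on a remote model call.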

This architecture provides critical advantages over manual analysis. Analysis agents store findings incrementally in files within the downloaded file system, enabling resumable analysis if processes are interrupted. Because each investigation is analyzed independently, the work divides across parallel agents, so analysis throughput can be maintained as backtest coverage expands. Most significantly, the pipeline produces specific, actionable recommendations directly informing code changes and feature modifications, rather than high-level observations requiring further investigation.
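Resumability through incremental files can be sketched as below. The assumptions are one JSON finding file per investigation and a skip-if-present check on restart; the helper analyze_all, the file naming, and the simulated interruption are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def analyze_all(ids, out_dir, analyze):
    """Analyze each investigation once; skip any that already has a finding file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for inv_id in ids:
        target = out_dir / f"{inv_id}.json"
        if target.exists():
            continue  # finished in an earlier run
        finding = analyze(inv_id)
        # Write to a temporary file, then rename, so a crash never leaves
        # a half-written finding that would be mistaken for a finished one.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(finding))
        tmp.replace(target)
        ran.append(inv_id)
    return ran


calls = []


def flaky_analyze(inv_id):
    """Stand-in analysis that fails the first time it sees inv-3."""
    first_attempt = inv_id not in calls
    calls.append(inv_id)
    if inv_id == "inv-3" and first_attempt:
        raise RuntimeError("interrupted")
    return {"id": inv_id, "label": "ok"}


ids = ["inv-1", "inv-2", "inv-3"]
out_dir = Path(tempfile.mkdtemp()) / "findings"

try:
    analyze_all(ids, out_dir, flaky_analyze)  # first run is interrupted
except RuntimeError:
    pass

resumed = analyze_all(ids, out_dir, flaky_analyze)  # second run picks up the rest
print(resumed)
```

On the second run only the unfinished investigation is analyzed, so an interruption costs at most the work in flight at that moment.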

4. Technical Insights

Several technical insights emerge from production deployment of these methodologies. First, debugging tool design must prioritize agent compatibility over human usability when building internal infrastructure for complex AI systems. In this deployment, file system interfaces proved more effective than alternatives such as Model Context Protocol (MCP) implementations or custom query languages, because bulk downloads enable efficient pattern matching and contextual analysis that query-based interfaces support less well.

Second, evaluation infrastructure requires explicit design for agent manipulation. The 2 MB evaluation file problem exemplifies a broader principle: tools designed for human use frequently create bottlenecks for agent-assisted workflows. The eval CLI tool resolves this through targeted operations that avoid context length limitations while maintaining evaluation integrity.

Third, parallelized agent analysis provides practical time savings measured in days or weeks of engineering effort. Manual analysis of thousands of investigations would require prohibitive human resources, while the scrapbook pipeline completes comprehensive analysis in hours. This efficiency gain enables continuous monitoring and improvement cycles that would otherwise be economically infeasible.

Fourth, the integration of analysis findings with development workflows through the red-green eval cycle creates a closed-loop improvement system. The complete cycle—identify problem through backtest analysis, analyze with parallelized agents, generate specific code changes, validate through evals, deploy to production, monitor through subsequent backtests—operates with minimal human intervention beyond approval gates.

Implementation considerations include computational resource allocation for parallel agent execution, storage requirements for file system downloads across thousands of investigations, and prompt engineering for analysis agents to ensure consistent output formatting compatible with downstream synthesis stages. Trade-offs exist between analysis depth and computational cost, requiring calibration based on system criticality and available resources.

5. Discussion

These methodologies represent a fundamental reconceptualization of AI system development practices. Traditional software engineering treats testing and debugging as human-centric activities supported by automated tooling. This approach inverts that relationship: agents become primary consumers of debugging infrastructure, with human engineers operating at higher levels of abstraction through agent-generated insights and recommendations.

The generalizability of these patterns across different AI systems and organizations suggests broader implications for the field. As AI products increase in complexity—a trajectory evident across the industry—the gap between system complexity and human debugging capacity will continue to widen. Organizations that develop agent-compatible internal tooling will possess significant advantages in iteration speed and system reliability compared to those relying on manual analysis.

However, several knowledge gaps remain. The optimal granularity of agent analysis tasks requires further investigation; excessively narrow tasks may miss cross-component interactions, while overly broad tasks may exceed agent reasoning capabilities. The reliability of agent-generated code modifications in production systems warrants systematic study, particularly regarding edge cases and security implications. Additionally, the interaction between human domain expertise and agent analysis capabilities deserves deeper examination—determining which decisions require human judgment versus agent autonomy remains an open question.

The integration of these methodologies with emerging practices in AI system development, including constitutional AI, chain-of-thought prompting, and multi-agent architectures, presents opportunities for future research. The scrapbook analysis pipeline itself could benefit from meta-analysis: using agents to identify patterns in how analysis agents identify patterns, creating recursive improvement cycles for the debugging infrastructure itself.

6. Conclusion

This analysis demonstrates that managing complex production AI systems requires AI-assisted tooling throughout the development lifecycle. The integration of agent-compatible evaluation frameworks, file system-based debugging interfaces, and parallelized analysis pipelines provides a practical methodology for organizations developing multi-component AI products. Empirical evidence from production deployments indicates substantial reductions in debugging time and systematic identification of failure patterns previously obscured by aggregate metrics.

The key practical takeaway is straightforward: organizations should design internal tooling for agent consumption rather than human use when building complex AI systems. File systems, command-line interfaces, and structured analysis pipelines enable agents to operate effectively at scale, while graphical interfaces and aggregate dashboards create bottlenecks in the improvement cycle. The automated eval runbook pattern and scrapbook analysis pipeline provide concrete implementations applicable across different AI architectures and problem domains.

Future applications of these methodologies extend beyond debugging to encompass system design, feature prioritization, and customer-specific optimization. As AI systems continue to increase in complexity, the competitive advantage will accrue to organizations that effectively leverage AI tools for internal development processes, not merely for customer-facing products. The paradigm of "fighting AI with AI" represents not a temporary expedient but a necessary evolution in software engineering practice for the AI era.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
