Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

By Sean Weldon

Abstract

This synthesis examines the engineering principles underlying extended-duration autonomous agents, focusing on systems capable of operating for 5-6+ hours without human intervention. Analysis of Claude's progression from 20-minute operational windows (Sonnet 3.5) to 12-hour continuous runs (Opus 4.6) reveals that reliable long-running agents require co-evolutionary development of both foundational model capabilities and surrounding harness infrastructure. The generator-evaluator adversarial pattern, in which separate agents assume building and critique roles across isolated context windows, addresses core challenges of context management, planning deficiencies, and self-evaluation failures. Reported results include a 12x improvement in operational duration within one year (from roughly 1-hour to 12-hour runs) and practical implementations that produce functional software artifacts, including a fully playable game built in a 6-hour session at a cost of $200. These findings have immediate implications for autonomous software development, quality assurance workflows, and multi-agent system design.

1. Introduction

The development of autonomous agents capable of sustained operation represents a critical frontier in artificial intelligence research. While short-duration agents have demonstrated competence in constrained tasks, extending operational windows to multiple hours introduces fundamental challenges in context management, planning coherence, and quality assessment. This synthesis examines the technical evolution of long-running agents—systems designed to operate autonomously for 5-6+ hours—through the lens of Claude's development trajectory from Sonnet 3.5 through Opus 4.6.

The magnitude of progress in this domain is substantial. A year prior to this analysis, Claude struggled with basic bash command execution and string escaping, maintaining operational coherence for approximately 20 minutes. Contemporary implementations achieve effectively multi-day execution windows, with the agent development environment itself being constructed primarily by autonomous agents. This represents not merely incremental improvement but a qualitative shift in capability.

The central research question addresses how to architect agent systems that maintain task coherence, avoid context degradation, and produce high-quality outputs across extended timeframes. Rather than treating model capabilities and scaffolding infrastructure as independent variables, this work demonstrates their necessary co-evolution. The analysis introduces the generator-evaluator adversarial pattern as a solution architecture, drawing parallels to Generative Adversarial Networks while adapting the framework for agent harness design. This synthesis proceeds by establishing the historical context of agent evolution, analyzing core technical challenges, presenting the generator-evaluator pattern and its implementation, and concluding with practical implications for autonomous software development.

2. Background and Related Work

The progression toward extended-duration agents required systematic development of core primitives. Key capabilities shipped sequentially include: artifacts (persistent output containers), computer use (direct system interaction), Model Context Protocol (MCP) specification (standardized tool integration), checkpoints (state tracking across sessions), skills (reusable capability packages), server-side compaction (context management without client intervention), and 1 million token context windows (extended memory capacity). These primitives did not emerge independently but rather co-evolved with model releases, treating the system holistically rather than as separable components.

The Ralph Loop framework established an iterative execution pattern: planning, feature selection, implementation, testing, and version control commits, repeated until task completion. While effective for bounded tasks, this framework exhibited limitations in extended contexts. The principle of being "deterministically bad in an undeterministic world" (failing predictably rather than succeeding unpredictably) informed subsequent harness designs that prioritize reliability over optimistic execution paths. The transition from Sonnet 3.7, which achieved 1-hour agent runs with minimal scaffolding at a 50% task completion rate, to Opus 4.6, which reaches 12 hours, represents a 12x improvement in operational duration within a single year.
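The iterative pattern described above can be sketched as a plain loop. This is a minimal illustration, not the Ralph Loop's actual implementation: the callables `plan`, `implement`, `run_tests`, and `commit` are hypothetical stand-ins for agent or tool invocations, and the `max_iterations` cap is an assumption added so the loop always terminates.

```python
from typing import Callable, List


def ralph_loop(
    plan: Callable[[], List[str]],
    implement: Callable[[str], None],
    run_tests: Callable[[str], bool],
    commit: Callable[[str], None],
    max_iterations: int = 50,
) -> List[str]:
    """Repeat plan -> select feature -> implement -> test -> commit.

    All callables are hypothetical placeholders. Returns the features
    that were committed, in order.
    """
    committed: List[str] = []
    for _ in range(max_iterations):
        # Re-plan on every pass so the loop reflects the current state.
        remaining = [f for f in plan() if f not in committed]
        if not remaining:
            break  # nothing left to build: task complete
        feature = remaining[0]  # feature selection: next planned item
        implement(feature)
        if run_tests(feature):
            commit(feature)
            committed.append(feature)
        # On failure nothing is committed; the feature is retried next pass.
    return committed
```

Committing only after tests pass is what keeps a failure predictable: a broken feature never enters version control, it is simply attempted again.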

3. Core Analysis

3.1 Fundamental Challenges in Extended Agent Operation

Extended agent runs expose three critical failure modes that short-duration tasks mask. First, context window limitations manifest as amnesia at session initialization and progressive degradation as context depth increases. Models exhibit context anxiety when approaching window boundaries, producing degraded outputs even with available capacity. Second, models demonstrate systematic planning deficiencies: attempting comprehensive implementation in single iterations, constructing incomplete features, or exhausting context mid-task without completion strategies. Third, models prove inadequate at self-evaluation, exhibiting sycophancy bias, marking incomplete features as finished, and constructing non-functional backends that superficially appear complete.

These challenges admit three solution categories: improving base model weights through training, modifying harness scaffolding architecture, or combining both approaches. Empirical evidence demonstrates that harness-only solutions reach capability ceilings determined by underlying model limitations, while model-only improvements without corresponding harness evolution fail to fully exploit new capabilities. The co-evolutionary approach proves necessary for sustained progress.

3.2 The Generator-Evaluator Adversarial Architecture

The generator-evaluator adversarial pattern addresses self-evaluation failures by decomposing roles into separate context windows with adversarial pressure. The generator agent constructs implementations while the evaluator agent performs quality assessment using live testing frameworks such as Playwright for browser-based validation. This architectural separation exploits an asymmetry in tractability: tuning a standalone critic to apply harsh evaluation standards proves feasible, whereas tuning a builder to maintain self-critical assessment does not.

The evaluator performs actual browser interaction and functional testing rather than merely reviewing code diffs, detecting runtime failures that pass static analysis. Critically, the evaluator maintains authority to reject entire outputs and mandate restarts from scratch when unable to incrementally improve against established rubrics, unlike single-agent loops that persistently attempt to patch failing approaches. This willingness to discard work and pivot fundamentally distinguishes the pattern from iterative refinement strategies.
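The control flow of this pattern, including the evaluator's authority to force a restart, might look like the sketch below. It assumes two hypothetical callables: `generate`, which receives the previous artifact (or `None`) plus feedback, and `evaluate`, which returns a verdict. In a real harness these would be separate agents in isolated context windows, and `evaluate` would drive a live browser rather than inspect a string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    restart: bool = False  # evaluator may demand a from-scratch rebuild
    feedback: str = ""


def adversarial_loop(
    generate: Callable[[Optional[str], str], str],
    evaluate: Callable[[str], Verdict],
    max_rounds: int = 10,
) -> Optional[str]:
    """Alternate generation and evaluation until the evaluator accepts.

    generate(previous_artifact, feedback) -> new artifact
    evaluate(artifact) -> Verdict
    Returns the accepted artifact, or None if no round passed.
    """
    artifact: Optional[str] = None
    feedback = ""
    for _ in range(max_rounds):
        artifact = generate(artifact, feedback)
        verdict = evaluate(artifact)
        if verdict.passed:
            return artifact
        feedback = verdict.feedback
        if verdict.restart:
            artifact = None  # discard the work; next round starts fresh
    return None
```

Setting `artifact` back to `None` is the key difference from a patch-only loop: the generator keeps the critique but loses the failing implementation.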

3.3 Contract Negotiation and Subjective Quality Assessment

Before implementation begins, generator and evaluator agents negotiate contracts specifying completion criteria, test requirements, and edge case handling. This negotiation establishes shared understanding of deliverable expectations, bridging user stories and testable assertions without requiring upfront technical specification. Granular criteria prove necessary for actionable feedback (one reported contract contained 27 of them); vague criteria produce vague critiques that agents disregard.
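One way to make such a contract concrete is to store each criterion as a user story paired with a testable assertion. The data model below is an assumption for illustration; the field names and the rule in `is_agreed` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Criterion:
    id: str
    user_story: str  # what the user should be able to do
    assertion: str   # how the evaluator will check it
    edge_cases: List[str] = field(default_factory=list)


@dataclass
class Contract:
    criteria: List[Criterion]

    def vague_criteria(self) -> List[str]:
        """Return ids of criteria that lack a concrete assertion."""
        return [c.id for c in self.criteria if not c.assertion.strip()]

    def is_agreed(self) -> bool:
        # Illustrative rule: non-empty, and every criterion is testable.
        return bool(self.criteria) and not self.vague_criteria()
```

A harness could refuse to start code generation until `is_agreed()` returns true, which forces the negotiation to turn each vague story into something the evaluator can actually check.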

Subjective qualities including design aesthetics can be systematically graded through detailed rubrics encoding strong opinions. A four-criteria framework evaluates design, originality, craft, and functionality, with weighting toward design and originality to prevent convergence on generic "AI slop" aesthetics. The evaluator calibrates taste through few-shot examples on reference implementations, converging on desired aesthetic standards. This approach enables the generator-evaluator pattern to pivot when blocked on specific criteria, discarding unsuccessful approaches rather than iteratively patching the same solution path.
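The weighting idea can be illustrated with a small scoring function. The four criterion names come from the text; the specific weights are assumptions chosen only to show design and originality counting for more, and the 0-10 scale is likewise hypothetical.

```python
from typing import Dict

# Illustrative weights only: the source says design and originality are
# weighted more heavily but does not give numbers.
WEIGHTS: Dict[str, float] = {
    "design": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}


def rubric_score(scores: Dict[str, float],
                 weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-criterion scores (same scale as the inputs)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing rubric scores: {sorted(missing)}")
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total
```

Under these weights, a polished but generic result (high craft and functionality, low originality) scores below a more distinctive one with rougher execution, which is the pressure the rubric is meant to apply.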

3.4 Three-Agent Harness Implementation

Production implementations extend the two-agent pattern to three specialized roles: planner, generator, and evaluator. The planner decomposes single-line prompts into high-level specifications with sprint-based workflows, deliberately avoiding granular technical details that cascade errors through implementation. The generator and evaluator subsequently negotiate contracts via file-based handoffs before code generation commences. This architecture positions the planner to establish product boundaries while delegating exact feature sets and test criteria to the generator-evaluator negotiation.
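A file-based handoff between the three roles could be wired together roughly as follows. This is a sketch under stated assumptions: the four callables stand in for agent invocations, and the file names `spec.md`, `contract.json`, and `evaluation.json` are hypothetical, not taken from the source.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_three_agent_harness(
    prompt: str,
    workdir: Path,
    planner: Callable[[str], str],
    negotiate: Callable[[str], List[str]],
    generator: Callable[[str, List[str]], str],
    evaluator: Callable[[str, List[str]], Dict[str, bool]],
) -> Dict[str, bool]:
    """Planner -> contract negotiation -> generator -> evaluator, via files."""
    workdir.mkdir(parents=True, exist_ok=True)
    spec_file = workdir / "spec.md"
    contract_file = workdir / "contract.json"
    report_file = workdir / "evaluation.json"

    # 1. Planner: one-line prompt -> high-level spec, no low-level detail.
    spec_file.write_text(planner(prompt))

    # 2. Generator and evaluator agree on criteria before any code exists.
    contract_file.write_text(json.dumps(negotiate(spec_file.read_text())))

    # 3. Later stages read inputs back from disk, not from shared context.
    criteria = json.loads(contract_file.read_text())
    artifact = generator(spec_file.read_text(), criteria)
    report = evaluator(artifact, criteria)
    report_file.write_text(json.dumps(report))
    return report
```

Because every stage reads its inputs from disk, each agent can run in a fresh context window and the run leaves an inspectable trail of what was agreed and what was judged.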

Empirical validation demonstrates substantial quality improvements. A retro game implementation using the three-agent harness ($200 cost, 6-hour execution) produced fully playable gameplay with functional AI opponents, working physics systems, and debug interfaces. Comparable single-agent loops generated non-functional play modes despite similar resource allocation. The harness version's contract included 27 criteria ensuring actionable feedback at each evaluation cycle.

4. Technical Insights

4.1 Model-Specific Harness Adaptation

Harness design must adapt to model-specific behavioral characteristics and capability profiles. Opus 4.5 required explicit context resetting and sprint-based task decomposition, while Opus 4.6 handles continuous 2-hour sessions without forced feature segmentation. Context anxiety present in Opus 4.5 was eliminated in Opus 4.6 through post-training interventions. Evaluator invocation cadence shifted from per-sprint to end-of-generation, simplifying control flow while maintaining output quality. This demonstrates that harness components critical for earlier model generations become unnecessary as capabilities advance. The need for harness work does not shrink, however; it shifts to wherever the new capability frontier lies.
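One way to keep such differences out of the control flow is a per-model profile that the harness consults. The sketch below is hypothetical: the profile keys are illustrative labels rather than real model identifiers, and setting `reset_context_between_sprints` to `False` for the newer model is an inference from the text rather than a stated fact.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HarnessProfile:
    reset_context_between_sprints: bool
    decompose_into_sprints: bool
    evaluator_cadence: str  # "per_sprint" or "end_of_generation"


# Illustrative labels, not API model ids.
PROFILES: Dict[str, HarnessProfile] = {
    "opus-4.5": HarnessProfile(True, True, "per_sprint"),
    "opus-4.6": HarnessProfile(False, False, "end_of_generation"),
}


def evaluation_points(profile: HarnessProfile, n_units: int) -> List[int]:
    """Return the 1-based work units after which the evaluator runs."""
    if profile.decompose_into_sprints and profile.evaluator_cadence == "per_sprint":
        return list(range(1, n_units + 1))
    return [n_units]  # a single evaluation once generation has finished
```

Retiring a scaffold component for a newer model then becomes a one-line profile change instead of a rewrite of the loop.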

4.2 Implementation Architecture and Tooling

Filesystem-based state management proves superior to context window storage for long-running agents. Models demonstrate lower propensity to overwrite JSON-formatted state files compared to markdown-embedded state. Embedding learnings and state in timestamped JSON files creates persistent breadcrumbs for subsequent model iterations. For browser control, Playwright MCP or Claude for Chrome MCP provide reliable interfaces, with contemporary vision capabilities sufficient for identifying UI element overlap and layout issues.
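A minimal sketch of timestamped JSON state files, assuming a hypothetical `state-<timestamp>.json` naming scheme and a caller-chosen directory. Each write creates a new file instead of overwriting the previous one, so earlier states remain available as breadcrumbs.

```python
import json
import time
from pathlib import Path
from typing import Optional


def write_state(state_dir: Path, state: dict) -> Path:
    """Write state to a new timestamped JSON file; never overwrite."""
    state_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.time_ns()
    path = state_dir / f"state-{stamp:020d}.json"
    while path.exists():  # guard against coarse clock resolution
        stamp += 1
        path = state_dir / f"state-{stamp:020d}.json"
    path.write_text(json.dumps({"timestamp_ns": stamp, **state}, indent=2))
    return path


def latest_state(state_dir: Path) -> Optional[dict]:
    """Load the most recent state file, or None if none exist."""
    # Zero-padded timestamps make lexicographic order chronological.
    files = sorted(state_dir.glob("state-*.json"))
    return json.loads(files[-1].read_text()) if files else None
```

A new session can call `latest_state` at startup to recover where the previous one stopped, which addresses the amnesia that otherwise occurs at session initialization.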

Programmatic tool calling reduces context overhead by executing tool invocations and returning final results rather than loading all intermediate outputs into context. Progressive disclosure in skills implements tiered loading: front matter loads initially, full bodies load on instantiation, and code references execute deterministically. Auto mode execution provides safer operation than unlimited permissions, while custom sub-agents with harsh system prompts effectively implement specialized evaluation roles.
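The tiered loading described for skills can be sketched as lazy file reading. The example assumes a skill stored as a text file whose front matter sits between two `---` lines as simple `key: value` pairs; the `Skill` class and its parsing rules are illustrative and are not the actual skills loader.

```python
from pathlib import Path
from typing import Dict, Optional


class Skill:
    """Load front matter eagerly and the full body only on first use."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.meta = self._read_front_matter()
        self._body: Optional[str] = None  # deferred until requested

    def _read_front_matter(self) -> Dict[str, str]:
        meta: Dict[str, str] = {}
        with self.path.open() as fh:
            if fh.readline().strip() != "---":
                return meta  # no front matter present
            for line in fh:
                if line.strip() == "---":
                    break  # stop before the body; it is not read here
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        return meta

    @property
    def body(self) -> str:
        if self._body is None:
            text = self.path.read_text()
            parts = text.split("---", 2)
            has_front = text.startswith("---") and len(parts) == 3
            self._body = parts[2].lstrip("\n") if has_front else text
        return self._body
```

Only the short `meta` dictionary needs to enter the agent's context up front; the body is read when the skill is actually invoked.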

4.3 Debugging and Observability

The primary debugging methodology involves line-by-line trace analysis rather than increased experimentation volume. Effective debugging identifies divergence points between model judgment and human assessment. Claude exhibits poor out-of-box performance as a QA agent due to sycophancy bias, requiring extensive tuning on layout bugs, edge cases, and specific failure modes. A meta-level debugging approach pipes agent transcripts into files for analysis by separate agents, closing the improvement loop on harness design.

Harness development requires empathetic modeling of the agent's working conditions: imagining how one would navigate an interface while seeing only a visual snapshot every 10 seconds provides insight into model limitations. Observability and traceability for multi-agent systems remain an unsolved challenge; current practice relies on manual trace reading and custom prompt-based analysis rather than systematic instrumentation.

5. Discussion

The generator-evaluator adversarial pattern demonstrates that architectural decomposition can address fundamental model limitations that training alone does not resolve. The asymmetry between critique tractability and self-critique tractability suggests broader implications for agent design: separating evaluation from generation may prove necessary across domains where quality assessment differs from production capability. The case study's 27-criterion contract illustrates that vague specifications produce vague outputs regardless of underlying model capability, emphasizing the importance of precise requirement articulation.

The co-evolutionary relationship between models and harnesses challenges the assumption that improved models automatically enable better agent systems. Each model generation exhibits distinct behavioral profiles requiring corresponding harness adaptations. Context anxiety, sprint decomposition requirements, and evaluator cadence all vary across model versions, suggesting that harness design cannot be treated as model-agnostic infrastructure. This coupling implies that as models continue improving, harness architectures must continuously adapt rather than converging on stable patterns.

The pattern's primary applicability to greenfield projects rather than brownfield codebases represents a significant limitation. Adapting the approach to existing codebases requires customized rubrics and increased human oversight, suggesting that the adversarial pattern's benefits derive partly from the structured nature of new implementations. Future work should investigate how contract negotiation and adversarial evaluation extend to monitoring, issue generation, pull request creation, and review workflows within broader software development lifecycles. Additionally, multi-agent collaboration on shared harnesses remains underdeveloped, with current adoption following bottom-up patterns where original implementers maintain composability.

6. Conclusion

This synthesis demonstrates that achieving reliable long-running autonomous agents requires co-evolutionary development of model capabilities and harness architecture. The generator-evaluator adversarial pattern provides a tractable solution to self-evaluation failures by exploiting the asymmetry between critique and self-critique tractability. Reported results show a 12x improvement in operational duration within one year, with practical implementations producing functional software artifacts where single-agent loops with similar resource allocation did not.

Key technical contributions include: (1) identification of context anxiety and planning deficiencies as primary failure modes in extended runs, (2) demonstration that contract negotiation with 27 criteria enables actionable feedback, (3) validation that subjective qualities like design taste can be systematically graded through detailed rubrics, and (4) evidence that harness designs must adapt to model-specific behavioral profiles rather than remaining static. Practical applications span autonomous software development, quality assurance automation, and multi-agent system design. Future research should address brownfield adaptation, multi-agent collaboration frameworks, and systematic observability solutions for complex agent systems.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
