Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, CEO, Trigger.dev
Agents represent a fundamental shift in backend infrastructure from stateless to stateful compute, requiring a dual approach of durable context logging and e...
By Sean WeldonTwo Roads to Durable Agents: Architectural Patterns for Stateful AI Infrastructure
Abstract
This paper examines the architectural transformation required to support durable AI agents, demonstrating that the shift from stateless to stateful backend infrastructure represents a fundamental paradigm change in computing architecture. The analysis reveals that traditional replay-based durability models, while effective for bounded workflows, fail catastrophically when applied to agent loops due to exponentially growing logs and fundamental scalability limits. A dual-architecture solution is proposed: append-only context logging for conversation history combined with snapshot-and-restore mechanisms for execution state. Implementation using Firecracker micro VMs with seekable compression achieves sub-second snapshot times and 200-millisecond restore latency while reducing storage requirements from 512MB to 14MB per snapshot. These findings provide a foundational framework for building production-grade durable agent systems capable of supporting multi-day sessions.
1. Introduction
The emergence of Large Language Models (LLMs) with tool-calling capabilities has fundamentally inverted the traditional relationship between code and artificial intelligence. In conventional software architectures, code orchestrates external services and APIs, maintaining control over execution flow and decision-making. Agent-based systems reverse this paradigm entirely: the LLM becomes the orchestrator, dynamically invoking code execution based on conversational context and reasoning processes. This architectural inversion creates unprecedented requirements for durability, state management, and session persistence.
For thirty years, backend infrastructure has been dominated by stateless compute models. From the Common Gateway Interface (CGI) introduced in 1993 through modern serverless platforms, the shared-nothing architecture has proven remarkably resilient and scalable. In this paradigm, compute layers maintain no meaningful state between requests; all persistence occurs in databases, with individual request handlers spawning, executing, and terminating without memory of prior interactions. This model has successfully powered the LAMP stack, Ruby on Rails, Node.js applications, and contemporary serverless functions.
AI agents fundamentally violate these stateless assumptions. An agent represents not a transaction but a session—an ongoing interaction that persists as long as the user requires, potentially spanning multiple days. Current data suggests that agent meaningful work duration is doubling every 4-7 months, with contemporary agents capable of sustaining work sessions lasting several hours and projections indicating multi-day sessions in the near future. This paper examines why traditional durability mechanisms fail for agent workloads and proposes a first-principles approach combining context logging with execution snapshots to enable truly durable agent infrastructure.
2. Background and Related Work
2.1 Evolution of Backend Architecture
The CGI specification established the foundational pattern for web computing: an HTTP request triggers a process fork, the process executes work and writes output to stdout, then terminates. This stateless model proved sufficiently effective that it persisted through multiple technological generations. The LAMP stack (Linux, Apache, MySQL, PHP) introduced process reuse for performance optimization but maintained strict statelessness principles. Ruby on Rails, Node.js, and contemporary serverless platforms all implement variations of the shared-nothing architecture, where compute layers remain ephemeral and databases provide the sole source of persistent state.
Approximately 10-15 years ago, workflow engines introduced the replay model to handle multi-step operations with side effects. This approach wraps each side effect in a cached step; upon resumption after interruption, the system skips completed steps and executes only remaining work. The replay model provides valuable properties including an execution history audit trail and the ability to pause workflows pending external events. However, this model imposes significant constraints: code must adhere to rigid structural requirements, deterministic behavior must be enforced outside cached steps, and replay journal versioning across code deployments introduces substantial complexity.
3. Core Analysis
3.1 Fundamental Limitations of Replay for Agent Loops
The application of replay models to agent architectures reveals critical scalability failures. In an agent loop, every LLM call and tool invocation becomes a distinct step requiring logging and potential replay. This creates exponentially growing replay logs as agent interactions accumulate. The replay model encounters fundamental limits manifesting in two distinct failure modes: excessive entry counts within replay journals and oversized individual entries that exceed storage or processing capabilities.
The mathematical progression proves unsustainable. A typical agent session involving iterative problem-solving may execute dozens of tool calls, each requiring context from previous interactions. As sessions extend from minutes to hours to days, the replay log grows proportionally while the computational cost of replay operations increases super-linearly. Furthermore, the replay model assumes bounded workflows—transactions with defined endpoints—whereas agents operate as open-ended sessions lasting as long as user engagement continues.
3.2 First-Principles Decomposition of Agent State
A rigorous analysis reveals that agent systems possess two conceptually and technically separable components: context and execution state. The context comprises an append-only log of all interactions including system messages, user messages, tool calls, tool results, and assistant responses. This context represents the conversational and reasoning history that defines the agent's understanding of its task and progress.
Conversely, execution state encompasses the runtime environment: cloned repositories, installed packages, in-memory datasets, running development servers, active sandboxes, and file system modifications. Critically, these two components exhibit fundamentally different durability characteristics. Context, being append-only and serializable, can be made durable using conventional storage mechanisms including databases, object storage systems, or distributed file systems. This durability enables recovery across code version changes, machine crashes, and various failure modes.
Execution state, however, cannot be faithfully recreated from logs alone. The computational cost of rebuilding complex execution environments from initialization through all intermediate states proves prohibitive. Moreover, certain runtime states—particularly those involving external system interactions, network connections, or temporal dependencies—cannot be deterministically reconstructed.
3.3 Snapshot and Restore Architecture
The solution to execution state durability requires abandoning log-based reconstruction in favor of snapshot and restore mechanisms. This approach captures the complete machine state when the agent reaches a stable waiting point (typically awaiting the next user message), persists this snapshot, and restores it upon resumption. This strategy preserves all execution state components while avoiding the prohibitive cost of maintaining continuously running machines during idle periods.
The snapshot-restore pattern provides distinct recovery paths for different failure modes. LLM failures—such as rate limiting, temporary service unavailability, or quality issues—can be addressed by snapshotting the current state and retrying later without losing execution context. Machine crashes or infrastructure failures can be recovered through restoration from the durable context log, potentially sacrificing some execution state but maintaining conversational continuity.
3.4 Implementation Through Firecracker Micro VMs
The technical realization of snapshot-restore functionality evolved through several generations of tooling. CRIU (Checkpoint/Restore In Userspace), introduced in 2011, provides user-space process suspension and restoration through parasite injection techniques. CRIU operates transparently to the target process and maintains compatibility with container environments. However, CRIU exhibits critical limitations: it checkpoints only single processes rather than entire execution environments, misses files not open at snapshot time, and demonstrates poor performance when integrated with container registries.
Firecracker micro VMs, enhanced with snapshot capabilities in 2024, address these limitations by capturing entire machine states including all processes, files, and network state. Naive Firecracker snapshots incur substantial storage costs—a 512MB default memory allocation produces 512MB snapshot sizes. This challenge is resolved through seekable compression, which reduces snapshots to approximately 14MB compressed while enabling on-demand decompression of only required memory pages during restoration.
4. Technical Insights
The implementation achieves remarkable performance characteristics: snapshot operations complete in under one second, restoration requires approximately 200 milliseconds, and the system supports 15,000 VM starts per minute (roughly equivalent to 30 frames per second in video terminology). These metrics demonstrate production-grade performance suitable for interactive agent applications.
The seekable compression mechanism represents a critical innovation. Rather than requiring full decompression of the entire 512MB snapshot before restoration can proceed, the system decompresses only the specific memory pages accessed during execution. This lazy decompression strategy dramatically reduces both restoration latency and memory bandwidth requirements, enabling the observed 200-millisecond restoration times despite working with substantially compressed data.
The architecture provides a tunable compression parameter allowing operators to balance snapshot size against performance requirements. Higher compression ratios reduce storage costs and network transfer times but may increase page fault latency during execution. This flexibility enables optimization for diverse workload characteristics and infrastructure constraints.
The resulting system, FC Run, provides a Docker CLI drop-in replacement for running containers in Firecracker VMs with integrated snapshot-restore capability. This design choice minimizes adoption friction by maintaining compatibility with existing container-based workflows while providing the enhanced durability properties required for agent workloads.
5. Discussion
The transition from stateless to stateful compute infrastructure represents the most significant architectural shift in backend systems since the introduction of CGI in 1993. This transformation extends beyond mere technical implementation details to fundamentally reshape how developers conceptualize, build, and operate backend services. The dual-architecture approach—combining append-only context logs with execution snapshots—provides a theoretically sound and practically viable foundation for durable agent systems.
Several areas warrant further investigation. The interaction between snapshot frequency and system performance remains inadequately characterized; optimal snapshotting strategies likely vary based on agent workload characteristics, user interaction patterns, and infrastructure costs. The versioning and migration challenges associated with restoring snapshots across code deployments require additional research, particularly as agent codebases evolve while long-running sessions remain active. Furthermore, the security implications of persistent execution state, including potential information leakage across snapshot boundaries and the attack surface introduced by snapshot storage, demand rigorous analysis.
The broader industry trend toward longer-duration agent sessions—with meaningful work duration doubling every 4-7 months—suggests that durability mechanisms will become increasingly critical to agent viability. Organizations building agent systems must consider durability requirements from initial architecture decisions rather than attempting to retrofit stateful capabilities onto stateless foundations. The open-source release of FC Run provides a concrete implementation reference for practitioners developing production agent systems.
6. Conclusion
This analysis demonstrates that durable AI agents require a fundamental rethinking of backend infrastructure, moving from three decades of stateless compute to stateful session management. The replay model, while valuable for bounded workflows, fails catastrophically when applied to open-ended agent loops due to exponential log growth and fundamental scalability limits. The proposed dual-architecture solution—append-only context logging for conversational history combined with Firecracker-based snapshot-restore for execution state—provides both theoretical soundness and practical performance.
The technical achievements are substantial: sub-second snapshot times, 200-millisecond restoration latency, and 97% storage reduction through seekable compression. These metrics demonstrate production viability for interactive agent applications requiring multi-hour or multi-day session persistence. As agent capabilities continue advancing and session durations continue extending, the architectural patterns and implementation strategies presented here provide a foundational framework for building the next generation of stateful backend infrastructure. Organizations developing agent systems should prioritize durability as a first-class architectural concern, recognizing that the stateless assumptions underlying modern web infrastructure no longer apply to agent workloads.
Sources
- Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, CEO, Trigger.dev - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.