Your Agent Failed in Prod. Good Luck Reproducing It. - Tisha Chawla & Susheem Koul, Microsoft

Production AI agents require replayability and observability - not bitwise determinism - to debug failures and ensure reliability. By recording execution traces ...

2026-07-02 By Sean Weldon

Abstract

Production AI agents exhibit non-deterministic failures that fundamentally challenge traditional debugging methodologies, as failures occurring in live systems cannot be reproduced in development environments despite identical prompts and model configurations. This analysis examines why temperature-zero configurations and other determinism-seeking approaches fail to address this reproducibility gap, demonstrating that GPU non-determinism, the absence of batch invariance, and mixture-of-experts routing render bitwise determinism unattainable through hosted APIs. The research presents a replayability-based observability framework that records execution traces at system boundaries rather than pursuing deterministic model outputs. The Chronicle framework exemplifies this approach, enabling deterministic replay of production failures without incurring model API costs while preserving stochastic properties essential for agent performance. These findings have immediate implications for production AI reliability engineering, debugging workflows, and continuous integration practices in agent-based systems.

1. Introduction

The deployment of Large Language Model (LLM)-based agents in production environments has exposed a critical engineering challenge: failures that manifest in live systems frequently resist reproduction in development environments, even when developers replicate prompts, model versions, and configuration parameters precisely. This reproducibility gap violates a fundamental principle of software engineering - that effective debugging requires the ability to observe, manipulate, and systematically analyze failure conditions.

The practical consequences of this limitation extend beyond engineering inconvenience to create significant operational and financial risks. A documented production incident illustrates the severity: an agent misinterpreted an instruction to sell "$1,000 worth of stock" as a command to sell 1,000 shares, executing a transaction that resulted in a $190,000 loss. The system returned a standard HTTP 200 success status in 30 milliseconds with zero exceptions, providing no programmatic indication of the catastrophic logical error. Such failures generate both immediate customer liability and long-term technical debt when teams cannot reliably debug root causes or verify that fixes prevent recurrence.

This synthesis examines why conventional approaches to ensuring determinism fail for production agents and proposes an alternative paradigm centered on replayability rather than bitwise determinism. The analysis establishes that recording execution traces at system boundaries - capturing inputs and outputs of discrete reasoning steps - enables effective debugging without requiring deterministic model outputs. This approach preserves the stochastic properties that enable agent creativity and problem-solving capabilities while providing the observability necessary for production reliability. The central thesis posits that the engineering community has pursued the wrong objective: rather than asking "How do I make the model deterministic?" the critical question is "How do I debug and retest a run I cannot reproduce?"

2. Background and Related Work

Traditional software debugging operates on the principle of deterministic execution: given identical inputs and system states, programs produce identical outputs. This property enables developers to reproduce failures locally, identify root causes through controlled experimentation, and verify fixes through regression testing. The debugging workflow assumes that capturing sufficient input context allows faithful recreation of any execution path. However, LLM-based systems introduce fundamental non-determinism at multiple architectural layers that challenge this assumption.

The Mixture of Experts (MoE) architecture exemplifies structural sources of non-determinism in modern language models. MoE models route tokens to specialized sub-networks through learned gating functions, with strict capacity constraints that trigger token rerouting when individual experts reach saturation thresholds. This routing mechanism depends on concurrent batch composition - the other requests processed simultaneously - rather than solely on input content. Consequently, identical prompts submitted at different times can encounter different expert availability patterns and produce divergent outputs even with temperature set to zero.
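The capacity effect can be illustrated with a toy router. This is a minimal sketch under stated assumptions, not how any production MoE layer is implemented: real models use learned gating scores and vendor-specific overflow policies, and the function name `route_with_capacity`, the preference lists, and the capacity value are all hypothetical.

```python
from typing import Dict, List


def route_with_capacity(preferences: List[List[int]], capacity: int) -> List[int]:
    """Assign each token to its most-preferred expert that still has room.

    preferences[i] lists expert ids for token i, best first. Tokens are
    processed in batch order, so earlier tokens fill experts first.
    """
    load: Dict[int, int] = {}
    assignment: List[int] = []
    for prefs in preferences:
        for expert in prefs:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment.append(expert)
                break
        else:
            assignment.append(-1)  # dropped: every preferred expert is full
    return assignment


OUR_TOKEN = [0, 1]  # prefers expert 0, falls back to expert 1

quiet_batch = [OUR_TOKEN]                  # our request is alone in the batch
busy_batch = [[0, 1], [0, 1], OUR_TOKEN]   # two other requests arrive first

expert_when_quiet = route_with_capacity(quiet_batch, capacity=2)[-1]
expert_when_busy = route_with_capacity(busy_batch, capacity=2)[-1]
```

The token's own preferences never change, yet it lands on expert 0 in the quiet batch and on expert 1 in the busy one, because the traffic ahead of it has already filled its first choice.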

Observability frameworks in traditional distributed systems focus on tracing request flows through microservices, capturing timing information, dependency graphs, and error states. These approaches typically operate at the network layer, recording HTTP transactions and packet-level data. For AI agents, however, the semantically meaningful unit of observation differs fundamentally. The critical information resides not in network packets but in logical state transitions - the inputs and outputs of tool calls, retrieval operations, and reasoning steps that constitute the agent's decision-making process. This distinction necessitates observability approaches that capture semantic content at system boundaries rather than network-layer telemetry.

3. Core Analysis

3.1 The Insufficiency of Temperature-Zero Configurations

A prevalent misconception holds that setting sampling temperature to zero ensures deterministic model outputs, providing a straightforward solution to reproducibility challenges. Empirical evidence systematically contradicts this assumption. Running identical prompts with temperature zero across 1,000 iterations can produce dozens of distinct responses, demonstrating that sampling determinism does not guarantee system-level determinism.

Four architectural factors explain this phenomenon. First, sampling determinism represents only one component of a complex system; deterministic token selection does not eliminate non-determinism in other system layers. Second, floating-point arithmetic is non-associative: the order in which values are added affects the rounded result, and minor variations in the sequencing of matrix operations can shift logit values enough to flip the winning token. Third, the lack of batch invariance constitutes the primary culprit - when a request is grouped with concurrent traffic, the batch it lands in can change how the computation is carried out, shifting logits and changing token selection outcomes. Fourth, mixture-of-experts routing imposes capacity limits, so token assignment depends on batched traffic patterns rather than on input content alone.
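The floating-point point can be checked directly in plain Python. The sketch below is illustrative only: the second example uses deliberately exaggerated magnitudes so the effect is visible, whereas real logit differences are tiny and matter only when two tokens are nearly tied. The names `argmax` and `COMPETITOR_LOGIT` are hypothetical.

```python
from typing import List


def argmax(values: List[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Same three numbers, two groupings, two different rounded sums.
small_left = (0.1 + 0.2) + 0.3
small_right = 0.1 + (0.2 + 0.3)

# Exaggerated case: partial sums of one logit accumulated in two orders.
partials = [1e16, 1.0, -1e16]
logit_order_a = (partials[0] + partials[1]) + partials[2]  # the 1.0 is rounded away
logit_order_b = (partials[0] + partials[2]) + partials[1]  # the 1.0 survives

COMPETITOR_LOGIT = 0.5  # logit of a rival token, held fixed
winner_a = argmax([COMPETITOR_LOGIT, logit_order_a])
winner_b = argmax([COMPETITOR_LOGIT, logit_order_b])
```

With greedy (temperature-zero) selection, the rival token wins under the first accumulation order and loses under the second, even though the inputs are identical.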

Furthermore, temperature-zero configurations fail to address the underlying problem: flawed reasoning paths. Setting temperature to zero does not repair broken logic; it merely causes the model to execute the same erroneous reasoning deterministically. The $190,000 stock trading failure would have occurred identically with temperature zero, as the fundamental error lay in the agent's interpretation of the quantity field, not in stochastic variation across multiple runs.

3.2 Replayability as an Alternative Paradigm

The distinction between bitwise determinism and replayability clarifies the appropriate engineering objective. Bitwise determinism - the property that identical inputs produce identical outputs - is a form of control over future runs. It remains unattainable through hosted APIs due to the architectural factors described above, and forcing it is undesirable because stochastic sampling enables model creativity and problem-solving flexibility. Replayability, by contrast, enables re-validation of a specific run that already occurred, providing sufficient capability for debugging without requiring future runs to replicate past outputs exactly.

The replayability paradigm shifts focus from controlling model outputs to recording execution traces at system boundaries. Rather than capturing network-layer telemetry, this approach records what enters and leaves each node in the agent's execution graph - the semantic content of tool calls, LLM responses, and retrieval results. Recording at boundaries captures the meaning of each step rather than the bytes transmitted, preserving the information necessary to understand decision paths and identify failure points.

This boundary-based recording enables a critical capability: deterministic replay of production failures without incurring model API costs. By stubbing LLM nodes with their recorded outputs, teams can rerun entire agent executions while testing changes to deterministic components such as guardrails, tool implementations, and control flow logic. The system executes identical state transitions without regenerating probabilistic model responses, converting production failures into reusable test cases.
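A minimal sketch of the record-then-stub idea follows. It is not the Chronicle API: the class `BoundaryRecorder`, its `boundary` decorator, and the `flaky_model` stand-in are hypothetical names chosen for illustration, and a real system would also persist the trace and handle concurrency.

```python
import functools
import itertools
from typing import Any, Callable, Dict, List, Optional


class BoundaryRecorder:
    """Hypothetical record/replay helper; illustrates the pattern only."""

    def __init__(self, mode: str, trace: Optional[List[Dict[str, Any]]] = None):
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.trace: List[Dict[str, Any]] = list(trace or [])
        self._cursor: Dict[str, int] = {}  # next trace index to search, per node

    def _next_recorded(self, node: str) -> Any:
        for i in range(self._cursor.get(node, 0), len(self.trace)):
            if self.trace[i]["node"] == node:
                self._cursor[node] = i + 1
                return self.trace[i]["output"]
        raise LookupError(f"no recorded output left for node {node!r}")

    def boundary(self, node: str) -> Callable:
        def decorator(fn: Callable) -> Callable:
            @functools.wraps(fn)
            def wrapper(*args: Any, **kwargs: Any) -> Any:
                if self.mode == "replay":
                    return self._next_recorded(node)  # stub: no live call is made
                output = fn(*args, **kwargs)
                self.trace.append(
                    {"node": node, "args": list(args), "kwargs": kwargs, "output": output}
                )
                return output
            return wrapper
        return decorator


_counter = itertools.count()


def flaky_model(prompt: str) -> str:
    """Stand-in for a hosted model: a different answer on every live call."""
    return f"answer-{next(_counter)}"


def build_agent(recorder: BoundaryRecorder) -> Callable[[str], str]:
    call_llm = recorder.boundary("llm")(flaky_model)

    def run(prompt: str) -> str:
        return call_llm(prompt).upper()  # deterministic post-processing step

    return run


live = BoundaryRecorder("record")
production_result = build_agent(live)("summarize the ticket")
second_live_result = build_agent(BoundaryRecorder("record"))("summarize the ticket")
replayed_result = build_agent(BoundaryRecorder("replay", live.trace))("summarize the ticket")
```

A second live run drifts away from the production result, while the replayed run reproduces it exactly and never touches the model, which is what makes it safe to rerun while changing the surrounding deterministic code.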

3.3 The Chronicle Framework Architecture

The Chronicle framework implements replayability through a boundary annotation pattern that wraps methods requiring observability. Developers annotate tool calls, LLM invocations, and RAG retrievals, instructing Chronicle to record all inputs and outputs transparently. This annotation freezes the complete execution state during agent runs, capturing model versions, code versions, sampling parameters, and the full input-output envelope for each operation.

Chronicle generates traces - hyperdetailed JSON metadata structures documenting each node's execution context. These traces include model version identifiers, sampling configuration, input parameters, and output values. The metadata provides the complete information necessary to understand why specific decisions occurred and to replay executions deterministically. Importantly, traces capture not merely prompts but the full envelope of factors influencing responses: retrieval context, conversation history, tool results, and environmental state.
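As a rough picture of what one such record might contain, the sketch below models a single node's trace as a JSON-serializable dataclass. The schema is an assumption made for illustration: the field names, the `NodeTrace` class, and every example value (including the model and build identifiers) are hypothetical and are not Chronicle's actual trace format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class NodeTrace:
    node: str                  # logical name of the boundary
    kind: str                  # "llm", "tool", or "retrieval"
    model_version: str         # hypothetical identifier format
    code_version: str          # e.g. a build id or commit hash
    sampling: Dict[str, Any]   # temperature, top_p, ...
    inputs: Dict[str, Any]     # full envelope, not just the prompt
    output: Any


def to_json(trace: NodeTrace) -> str:
    return json.dumps(asdict(trace), sort_keys=True)


def from_json(payload: str) -> NodeTrace:
    return NodeTrace(**json.loads(payload))


example = NodeTrace(
    node="plan_trade",
    kind="llm",
    model_version="example-model-v1",   # placeholder, not a real model name
    code_version="build-0000",          # placeholder build id
    sampling={"temperature": 0.7, "top_p": 1.0},
    inputs={
        "prompt": "Sell $1,000 worth of stock",
        "history": ["user: hello"],
        "retrieved_chunks": ["chunk-a", "chunk-b"],
    },
    output={"action": "sell", "shares": 1000},
)

restored = from_json(to_json(example))
```

Storing retrieval results and conversation history next to the prompt is the design point: the record has to be sufficient to explain the output without querying any external system again.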

The framework enables two distinct testing modalities. Deterministic testing applies to non-probabilistic nodes such as guardrails and tool calls, using Chronicle to stub LLM outputs and freeze context. These tests execute without model calls, providing fast, cost-free validation that code changes preserve intended behavior. Behavioral testing addresses subjective qualities such as tone and trajectory correctness, employing techniques like LLM-as-judge to evaluate whether responses meet quality criteria. This dual testing approach accommodates both the deterministic and stochastic components of agent systems.
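A deterministic test of this kind might look like the sketch below, which checks a guardrail against a stubbed LLM output modeled on the stock-sale incident. Everything here is assumed for illustration: the `Order` type, the `notional_guardrail` function, the 5% tolerance, and the share price of 100.0 are hypothetical and do not come from the talk or from Chronicle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    action: str
    shares: int


def notional_guardrail(
    order: Order,
    price_per_share: float,
    requested_notional: float,
    tolerance: float = 0.05,
) -> bool:
    """Return True when the order's dollar value matches what the user asked for."""
    order_value = order.shares * price_per_share
    return abs(order_value - requested_notional) <= tolerance * requested_notional


# Stubbed from a recorded trace: the model's output is frozen, no API call is made.
RECORDED_LLM_OUTPUT = Order(action="sell", shares=1000)

ASSUMED_PRICE = 100.0        # illustrative price, not taken from the incident
REQUESTED_NOTIONAL = 1000.0  # the user asked to sell $1,000 worth

bad_order_allowed = notional_guardrail(RECORDED_LLM_OUTPUT, ASSUMED_PRICE, REQUESTED_NOTIONAL)
good_order_allowed = notional_guardrail(Order("sell", 10), ASSUMED_PRICE, REQUESTED_NOTIONAL)
```

Because the model output is fixed, the test result depends only on the guardrail code, so a failure here always signals a code regression rather than sampling noise.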

4. Technical Insights

The implementation of replayability-based debugging requires careful attention to what constitutes a complete execution trace. Session variables - LLM version identifiers, build IDs, RAG chunk selections - must be logged explicitly rather than assumed recoverable from prompts alone. The $190,000 trading failure demonstrates why prompt-only logging proves insufficient: the prompt appeared correct, but the agent's interpretation within its broader context produced catastrophic misunderstanding. Complete traces capture this broader context, enabling post-hoc analysis of reasoning paths.

The absence of batch invariance emerges as the primary technical obstacle to temperature-zero determinism. When API providers batch requests for computational efficiency, concurrent traffic patterns influence individual request processing through shared resource allocation and expert routing. This dependency means that identical requests submitted at different times encounter different computational contexts and can produce divergent outputs. Recording traces at system boundaries circumvents this limitation by capturing actual outputs rather than attempting to recreate computational conditions.

The workflow for replay-based debugging follows a systematic progression: annotation of system boundaries, recording of production traces, visualization of execution graphs, analysis to identify failure points, code fixes to address root causes, replay of traces to verify fixes, and promotion of traces to regression tests. This workflow transforms each production failure into a permanent test case, gradually building comprehensive test coverage from real-world execution patterns rather than synthetic scenarios.
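The last two steps of that workflow, replaying a trace to verify a fix and promoting it to a regression test, can be sketched as a small file-based suite. This is an assumption-laden illustration rather than Chronicle's mechanism: `promote_trace`, `run_regressions`, the trace layout, and the example bug (placed in deterministic post-processing so that replay can exercise it) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def promote_trace(trace: Dict[str, Any], suite_dir: Path) -> Path:
    """Store a recorded trace as a permanent regression case."""
    suite_dir.mkdir(parents=True, exist_ok=True)
    path = suite_dir / f"{trace['trace_id']}.json"
    path.write_text(json.dumps(trace, indent=2))
    return path


def run_regressions(suite_dir: Path, handler: Callable[[Dict[str, Any]], Any]) -> List[str]:
    """Replay every stored trace through handler; return ids that mismatch."""
    failures: List[str] = []
    for path in sorted(suite_dir.glob("*.json")):
        trace = json.loads(path.read_text())
        if handler(trace["llm_output"]) != trace["expected"]:
            failures.append(trace["trace_id"])
    return failures


def buggy_share_count(llm_output: Dict[str, Any]) -> int:
    return int(llm_output["amount"])  # treats a dollar amount as a share count


def fixed_share_count(llm_output: Dict[str, Any]) -> int:
    if llm_output["unit"] == "usd":
        return int(llm_output["amount"] // llm_output["price_per_share"])
    return int(llm_output["amount"])


incident = {
    "trace_id": "incident-001",
    "llm_output": {"amount": 1000, "unit": "usd", "price_per_share": 100.0},
    "expected": 10,
}

with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "regressions"
    promote_trace(incident, suite)
    failures_before_fix = run_regressions(suite, buggy_share_count)
    failures_after_fix = run_regressions(suite, fixed_share_count)
```

Once promoted, the incident keeps running against every future version of the handler, so the suite grows out of real failures instead of synthetic scenarios.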

A critical architectural decision involves recording granularity. Network-layer recording captures packets and HTTP transactions but obscures semantic meaning. Boundary-based recording captures logical operations - what the agent decided and why - providing the information necessary for reasoning about correctness. This semantic granularity enables debugging questions such as "Why did the agent interpret $1,000 as a quantity rather than calculating share count?" that network traces cannot address.

5. Discussion

The replayability paradigm represents a fundamental shift in how production AI systems approach reliability engineering. Rather than attempting to eliminate non-determinism through architectural constraints or configuration parameters, this approach accepts non-determinism as inherent to LLM-based systems and builds observability infrastructure to work within that reality. This philosophical shift aligns with broader trends in distributed systems engineering, where observability-driven development has superseded determinism-seeking approaches for complex, emergent systems.

The implications extend to continuous integration and deployment practices. Traditional CI/CD assumes that tests execute in controlled, reproducible environments where failures indicate code regressions. Agent systems violate this assumption when test failures result from stochastic model variations rather than code changes. Replay-based testing resolves this tension by separating deterministic component testing (which should never fail due to randomness) from behavioral testing (which explicitly measures stochastic properties). This separation enables reliable regression detection while preserving model creativity.

Several areas warrant further investigation. The optimal granularity for boundary recording remains an open question - finer granularity provides more debugging information but increases storage costs and complexity. The relationship between trace-based testing and traditional property-based testing deserves exploration, as recorded traces represent specific execution paths while property-based tests verify invariants across input spaces. Additionally, techniques for trace anonymization and privacy preservation require development before widespread adoption in domains handling sensitive data.

The preservation of generation-time variation emerges as a critical principle. The instinct to pin temperature to zero for reproducibility proves counterproductive, as it eliminates the stochastic properties that enable agents to explore solution spaces creatively. Replayability-based approaches preserve this variation while providing the debugging capabilities necessary for production reliability. This balance - maintaining creative flexibility while ensuring debuggability - represents the core contribution of the replayability paradigm.

6. Conclusion

This analysis establishes that the pursuit of bitwise determinism in production AI agents represents a misalignment between engineering objectives and system capabilities. GPU non-determinism, the absence of batch invariance, and mixture-of-experts routing render true determinism unattainable through hosted APIs, while temperature-zero configurations fail to address underlying reasoning errors and eliminate beneficial stochastic properties. The replayability paradigm offers a viable alternative, enabling effective debugging through boundary-based trace recording without requiring deterministic model outputs.

The Chronicle framework demonstrates practical implementation of these principles, converting production failures into deterministic test cases while preserving the randomness essential for agent performance. The key insight - that debugging requires the ability to re-validate specific runs rather than reproduce arbitrary future runs - reframes the reproducibility challenge in tractable terms. Organizations deploying production agents should prioritize comprehensive boundary logging, explicit session variable capture, and replay-based testing infrastructure over futile attempts to force determinism through API configuration.

Future work should focus on standardizing trace formats for interoperability, developing privacy-preserving trace analysis techniques, and establishing best practices for balancing trace granularity against storage costs. As agent deployments scale, the replayability paradigm provides a foundation for production reliability that acknowledges rather than fights the fundamental nature of LLM-based systems.

Sources

Your Agent Failed in Prod. Good Luck Reproducing It. - Tisha Chawla & Susheem Koul, Microsoft - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
