Why (Senior) Engineers Struggle to Build AI Agents — Philipp Schmid, Google DeepMind

Building agents requires fundamentally different engineering practices than traditional software development, shifting from deterministic control and unit te...

2026-06-03 By Sean Weldon

Engineering Paradigms for Production AI Agents: A Synthesis of Non-Deterministic Software Development Practices

Abstract

The development of Large Language Model (LLM)-powered agents necessitates a fundamental reconceptualization of software engineering practices rather than incremental adaptation of existing methodologies. This synthesis examines the transition from deterministic, state-machine-based systems to autonomous agents operating through semantic understanding and probabilistic reasoning. The analysis identifies six critical transformations: semantic text replacing structured data as primary state representation, dynamic intent recognition superseding predetermined workflows, error-as-input patterns enabling recovery in long-running processes, outcome-based evaluation replacing binary correctness assertions, self-documenting API design for agent consumption, and trust-but-verify principles for reliability despite non-determinism. Production agent deployment requires accepting variable execution paths and token consumption across identical inputs while maintaining outcome quality through iterative refinement and pass-rate metrics. These findings have immediate implications for organizations deploying agents in production environments where traditional testing frameworks prove insufficient for validating non-deterministic systems.

1. Introduction

The emergence of Large Language Models as computational primitives has precipitated a fundamental architectural shift in software development. Traditional systems operate through explicit control structures, structured data representations, and deterministic execution paths that enable binary correctness verification. Agent-based architectures, conversely, function through natural language instructions, contextual semantic understanding, and inherently probabilistic outcomes that resist conventional testing methodologies.

This transformation presents significant challenges for engineering organizations accustomed to deterministic software practices. The central tension emerges between the need for production reliability and the acceptance of non-deterministic behavior patterns. Whereas traditional customer support systems employ classification models routing users through predetermined workflows—such as churn detection triggering predefined cancellation flows—agent-based systems must dynamically interpret evolving user intent and adjust behavior accordingly without explicit workflow definitions.

This synthesis examines the engineering practices required for production-grade agent deployment, analyzing six fundamental transformations in software development methodology. The analysis demonstrates that successful agent engineering requires relinquishing deterministic control while maintaining outcome reliability through verification mechanisms, treating failures as recoverable data rather than exceptional states, and designing tools explicitly for agent consumption rather than human developer expertise. The core thesis posits that these changes represent wholesale reconceptualizations of software engineering fundamentals, not incremental improvements to existing practices.

2. Background and Related Work

Traditional software engineering operates on established principles where state representation utilizes data structures mapped to booleans, integers, and predefined enumerations. Control flow follows explicit branching logic, and correctness verification occurs through deterministic unit and integration testing. This paradigm assumes that identical inputs produce identical outputs, enabling assertion-based validation of system behavior.

The iterative prompt refinement loop has emerged as the foundational development pattern for agent systems, contrasting sharply with traditional compile-test-deploy workflows. This pattern consists of defining instructions specifying desired agent behavior, executing the agent, observing outcomes, adjusting prompts or available tools, and repeating the cycle. This introduces continuous adaptation as a core development activity rather than a pre-deployment phase, fundamentally altering the software development lifecycle.

LLM-as-judge evaluation frameworks represent an emerging methodology for assessing subjective agent outputs where traditional assertions prove inadequate. These frameworks employ separate language models to evaluate agent performance against context-dependent criteria, acknowledging that success metrics vary based on use case requirements. The distinction between deterministic traffic control—where engineers specify exact execution steps—and agent-based dispatching—where engineers specify desired outcomes without prescribing implementation paths—captures the fundamental architectural transformation examined in this analysis.

3. Core Analysis

3.1 Semantic State Representation and Contextual Understanding

Traditional software systems represent state through structured data fields mapped to predefined types. Agent-based systems, by contrast, operate on semantic meaning embedded in text and context. This transformation enables capabilities impossible within rigid data structures. For instance, a user can approve an agent's research plan while adding constraints such as "focus on US market, ignore California" in a single natural-language reply, whereas a traditional system would require multiple decline-and-replan cycles through predefined workflow states.

The implications extend to personalization and memory management. User preferences such as temperature unit display cannot be effectively mapped to boolean flags when context determines appropriate representation—Celsius for weather forecasts but Fahrenheit for cooking instructions for the same user. Agent systems handle such contextual variations dynamically through semantic understanding rather than exhaustive enumeration of conditional logic.
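The contrast can be pictured with a minimal Python sketch. The names FlagPreferences, SemanticMemory, and build_context are hypothetical illustrations, not taken from the talk. The sketch shows only that free-text notes can carry the condition under which a preference applies, which a single boolean cannot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlagPreferences:
    """Traditional state: one boolean, with no room for 'it depends'."""
    use_celsius: bool


@dataclass
class SemanticMemory:
    """Agent state (hypothetical): free-text notes that keep each preference's context."""
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)


def build_context(memory: SemanticMemory, request: str) -> str:
    """Assemble the text handed to the model. The model, not branching code,
    decides which remembered note applies to this request."""
    remembered = "\n".join(f"- {note}" for note in memory.notes)
    return f"Known user preferences:\n{remembered}\n\nUser request: {request}"


memory = SemanticMemory()
memory.remember("Prefers Celsius for weather forecasts.")
memory.remember("Prefers Fahrenheit for cooking instructions.")
print(build_context(memory, "What oven temperature for roast chicken?"))
```

Both notes reach the model unchanged; choosing the right unit for a cooking question is left to its semantic understanding rather than to an enumerated conditional.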

Furthermore, state representation now encompasses multimodal data including images, video, and audio alongside text. This expansion beyond structured fields necessitates that agents process semantic meaning across modalities rather than relying on predefined schema mappings. The shift from structured to semantic state representation fundamentally alters how systems store, retrieve, and reason about information.

3.2 Dynamic Intent Recognition and Workflow Flexibility

Traditional customer support systems employ classification models that map detected intents to predetermined workflows. A churn detection classification triggers a predefined cancellation flow with fixed decision points. This approach fails when user intent evolves during interaction. Agent-based systems must recognize that offering subscription alternatives may change user intent from cancellation to plan modification, requiring dynamic workflow adjustment rather than rigid path adherence.

The fundamental limitation of predetermined workflows lies in the impossibility of modeling all stateful variations and user uniqueness in advance. Engineers cannot anticipate every conversational trajectory or contextual nuance that influences appropriate system response. Consequently, agent architectures must trust the LLM to understand semantic meaning and offer contextually appropriate alternatives rather than forcing users through predetermined decision trees.

This transformation shifts engineering responsibility from defining explicit control flow to specifying desired outcomes. The traffic controller metaphor—where engineers dictate exact execution steps—gives way to the dispatcher model where engineers communicate destinations without prescribing routes. This requires fundamental reconceptualization of how engineers specify system behavior and evaluate correctness.

3.3 Error Recovery and Resilient Long-Running Processes

Agent systems introduce the error-as-input pattern, treating failures as normal data inputs rather than exceptional states requiring process termination. This approach proves essential for long-running agents executing 5-15 minute workflows where restarting from initial state upon encountering errors wastes accumulated context and computational resources.

The pattern parallels error handling in languages like Go, where a function returns an error as an ordinary value alongside its result and the caller decides how to proceed. When an agent encounters a failure, such as an API timeout or an invalid parameter, the error information is passed back to the model along with potential workarounds and additional validation checks. The agent then continues forward in the execution flow rather than restarting entirely.
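A minimal sketch of this pattern in Python follows. Everything in it is an assumption made for illustration: the scripted_model function stands in for a real LLM call, fetch_price is a hypothetical tool, and the action dictionary format is invented for this example rather than taken from any framework.

```python
from typing import Callable, Dict, List, Optional, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs kept for the whole run
Tool = Callable[[dict], str]


def call_tool(tools: Dict[str, Tool], name: str, args: dict) -> str:
    """Run a tool; on any failure, return the error as text instead of raising."""
    try:
        return tools[name](args)
    except Exception as exc:  # deliberately broad: every failure becomes model input
        return f"ERROR {type(exc).__name__}: {exc}. Fix the arguments or try another approach."


def run_agent(model: Callable[[History], dict], tools: Dict[str, Tool],
              task: str, max_steps: int = 8) -> Tuple[Optional[str], History]:
    history: History = [("user", task)]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":
            return action["answer"], history
        # Success or failure, the observation is appended and the loop continues,
        # so context accumulated in earlier steps is never thrown away.
        history.append(("tool", call_tool(tools, action["tool"], action["args"])))
    return None, history  # step budget exhausted


def fetch_price(args: dict) -> str:
    """Hypothetical tool that rejects calls without a 'sku' argument."""
    if "sku" not in args:
        raise ValueError("missing required argument 'sku'")
    return f"price for {args['sku']}: 19.99"


def scripted_model(history: History) -> dict:
    """Stand-in for an LLM: makes one bad call, reads the error, then corrects it."""
    role, text = history[-1]
    if role == "user":
        return {"type": "tool", "tool": "fetch_price", "args": {"id": "A1"}}
    if text.startswith("ERROR"):
        return {"type": "tool", "tool": "fetch_price", "args": {"sku": "A1"}}
    return {"type": "final", "answer": text}


answer, history = run_agent(scripted_model, {"fetch_price": fetch_price}, "What does A1 cost?")
print(answer)
```

The failed call appears in the history as one more observation, and the run finishes on the third step without restarting. A real system would also cap retries per tool so a persistent failure cannot consume the whole step budget.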

This approach requires designing systems for recovery rather than assuming error-free execution. Traditional exception handling assumes errors represent unexpected states requiring escalation or termination. Agent systems, conversely, must anticipate that agents will behave unexpectedly and build resilience mechanisms enabling continuation despite failures. The preservation of accumulated context and intermediate results becomes critical for maintaining efficiency in probabilistic execution environments.

3.4 Evaluation Frameworks for Non-Deterministic Systems

Agent non-determinism fundamentally undermines traditional testing methodologies. Identical inputs do not guarantee identical execution steps or outputs, rendering conventional unit and integration tests insufficient for validation. An agent may require four additional research steps for one user while completing the task more efficiently for another, consuming variable token quantities but achieving acceptable outcomes in both cases.

This necessitates measuring success as pass rate across multiple runs rather than binary correctness assertions. Production deployment requires determining acceptable success thresholds—an agent achieving desired outcomes in only one of ten executions proves unsuitable for production regardless of individual success quality. Organizations must establish minimum pass rates balancing reliability requirements against inherent model variability.
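The pass-rate idea can be sketched in a few lines of Python. The functions one_in_ten and nine_in_ten are deterministic stand-ins for real agent runs plus an outcome check, and the 0.8 threshold is a hypothetical value chosen only for the example; the source does not prescribe one.

```python
from typing import Callable


def pass_rate(run_case: Callable[[int], bool], runs: int) -> float:
    """Fraction of runs whose outcome was judged acceptable."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return sum(1 for i in range(runs) if run_case(i)) / runs


def ready_for_production(rate: float, threshold: float) -> bool:
    """Compare a measured pass rate against the minimum the team has agreed on."""
    return rate >= threshold


def one_in_ten(run_index: int) -> bool:
    """Stand-in agent that reaches the desired outcome in 1 of every 10 runs."""
    return run_index % 10 == 0


def nine_in_ten(run_index: int) -> bool:
    """Stand-in agent that reaches the desired outcome in 9 of every 10 runs."""
    return run_index % 10 != 3


THRESHOLD = 0.8  # hypothetical minimum pass rate

for name, agent in (("one_in_ten", one_in_ten), ("nine_in_ten", nine_in_ten)):
    rate = pass_rate(agent, runs=10)
    print(name, rate, ready_for_production(rate, THRESHOLD))
```

In practice run_case would execute the agent on one test case and return the verdict of an outcome check, so the same harness works regardless of which execution path the agent took.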

Furthermore, evaluation criteria prove inherently subjective and context-dependent. A research report generation agent requires different success metrics than a customer feedback summarization agent. Traditional assertions cannot capture these nuanced quality dimensions. LLM-as-judge frameworks or human expert evaluation become necessary for qualitative assessment, determining whether outcomes meet context-specific quality standards rather than matching predetermined outputs.
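A rough sketch of an LLM-as-judge check, in Python, might look as follows. The rubric texts, the PASS/FAIL reply format, and the stub_judge function are all assumptions made for illustration. In a real setup the judge argument would be a call to a separate language model, which is not shown here.

```python
from typing import Callable, Tuple

# Hypothetical per-use-case criteria; real rubrics would come from domain experts.
RUBRICS = {
    "research_report": "Every claim is attributed to a source and the question is answered.",
    "feedback_summary": "Main complaint themes are covered and no customer is named.",
}


def build_judge_prompt(use_case: str, task: str, output: str) -> str:
    return (
        "You are grading an agent's result.\n"
        f"Criteria: {RUBRICS[use_case]}\n"
        f"Task: {task}\n"
        f"Output:\n{output}\n\n"
        "Reply with PASS or FAIL on the first line, then one sentence of reasoning."
    )


def judge_output(judge: Callable[[str], str], use_case: str,
                 task: str, output: str) -> Tuple[bool, str]:
    """Ask the judge model for a verdict and parse it into (passed, reasoning)."""
    reply = judge(build_judge_prompt(use_case, task, output)).strip()
    verdict, _, reasoning = reply.partition("\n")
    return verdict.strip().upper() == "PASS", reasoning.strip()


def stub_judge(prompt: str) -> str:
    """Stand-in for a judge model: passes any output that cites a source."""
    graded = prompt.split("Output:\n", 1)[1]
    if "Source:" in graded:
        return "PASS\nThe output cites a source."
    return "FAIL\nNo source is cited."


print(judge_output(stub_judge, "research_report", "Summarize topic X", "Claim A. Source: doc-1."))
print(judge_output(stub_judge, "research_report", "Summarize topic X", "Claim A."))
```

Keeping the rubric per use case reflects the point that a research report and a feedback summary are judged against different criteria. The boolean verdict can feed directly into a pass-rate calculation across repeated runs.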

The shift from step-based to outcome-based measurement represents a fundamental change in verification philosophy. Engineers must trace agent behavior and measure success by result quality rather than execution path consistency, accepting that multiple valid execution paths may achieve equivalent outcomes.

4. Technical Insights

Production agent deployment reveals several critical implementation considerations. Deep research agents demonstrate the viability of semantic plan approval, where users can accept a proposed research strategy while adding constraints in natural language—"focus on US market, ignore California"—without requiring structured input forms or multiple interaction cycles. This capability relies on the model's semantic understanding rather than predetermined conditional logic.

Long-running agent processes (5-15 minutes) require explicit error recovery mechanisms to preserve accumulated context and avoid computational waste. Restarting entire workflows upon encountering failures loses both context and compute resources. Implementing error-as-input patterns enables agents to incorporate failure information and continue execution, significantly improving efficiency in production environments.

Agent evaluation infrastructure must support pass-rate calculation across multiple execution runs. Organizations deploying production agents report measuring success rates across 10-50 runs per test case to establish confidence in outcome reliability. This introduces significant testing overhead compared to traditional deterministic assertion-based testing but proves necessary for validating non-deterministic systems.
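One way to see why the number of runs matters is the standard error of a measured pass rate. The Python sketch below assumes each run is an independent trial with the same underlying success probability (a binomial model), which is a simplification; the 0.8 pass rate is a hypothetical figure chosen for the example.

```python
import math


def pass_rate_standard_error(rate: float, runs: int) -> float:
    """Standard error of a pass rate measured over independent runs: sqrt(p(1-p)/n)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if runs <= 0:
        raise ValueError("runs must be positive")
    return math.sqrt(rate * (1.0 - rate) / runs)


for runs in (10, 50):
    print(runs, round(pass_rate_standard_error(0.8, runs), 3))
```

Under these assumptions, an observed 0.8 pass rate carries a standard error of about 0.126 at 10 runs and about 0.057 at 50 runs, so the larger sample separates a genuinely reliable agent from a lucky one far more clearly, at five times the execution cost.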

Function schema documentation must be explicit and self-contained for agent consumption. Traditional API design assumes developer expertise and organizational context—an endpoint named delete_item with an id parameter appears self-explanatory to human developers familiar with the system. Agents, however, only access function schemas and docstrings without broader code context or developer background knowledge. Agent-ready APIs require comprehensive semantic documentation within schema definitions, describing not just parameter types but semantic meaning and usage constraints.
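The difference can be illustrated with two versions of a schema for delete_item, written here as Python dictionaries in the general JSON Schema shape that several function-calling APIs use. The exact wrapper fields vary by provider, and the described behavior of delete_item (a shopping cart, a list_items tool) is invented for the example. The documentation_gaps helper is a hypothetical lint check, not part of any library.

```python
from typing import List

# What a human developer might consider sufficient.
BARE_SCHEMA = {
    "name": "delete_item",
    "description": "",
    "parameters": {
        "type": "object",
        "properties": {"id": {"type": "string"}},
        "required": ["id"],
    },
}

# The same function, documented for a reader with no access to the codebase.
DOCUMENTED_SCHEMA = {
    "name": "delete_item",
    "description": (
        "Permanently remove one item from the current user's shopping cart. "
        "This cannot be undone, so confirm with the user before calling."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string",
                "description": (
                    "Cart item ID exactly as returned by list_items, "
                    "for example 'item_123'. This is not the product SKU."
                ),
            }
        },
        "required": ["id"],
    },
}


def documentation_gaps(schema: dict, min_chars: int = 20) -> List[str]:
    """List the function and parameters whose descriptions are missing or too short."""
    gaps = []
    if len(schema.get("description", "")) < min_chars:
        gaps.append(f"{schema['name']}: function description missing or too short")
    for param, spec in schema["parameters"]["properties"].items():
        if len(spec.get("description", "")) < min_chars:
            gaps.append(f"{schema['name']}.{param}: parameter description missing or too short")
    return gaps


print(documentation_gaps(BARE_SCHEMA))
print(documentation_gaps(DOCUMENTED_SCHEMA))
```

A length check like this only catches absent documentation; whether a description conveys the right meaning and constraints still has to be verified by running the agent against it.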

The technical trade-off between agent autonomy and reliability manifests in prompt engineering and tool design decisions. Overly constrained prompts reduce agent flexibility and ability to handle edge cases. Insufficient constraints increase variability and reduce pass rates. Organizations must iteratively refine this balance through empirical evaluation rather than theoretical optimization.

5. Discussion

The findings synthesized in this analysis demonstrate that agent-based software development represents a paradigm shift requiring fundamental reconceptualization of engineering practices. Together, the transitions from deterministic control to outcome-based verification, from structured state to semantic understanding, and from exception handling to error-as-input patterns constitute a wholesale transformation in how software systems are designed, implemented, and validated.

Several implications emerge for organizations deploying production agents. First, traditional software development timelines and testing methodologies prove inadequate for agent systems requiring iterative refinement cycles and statistical validation across multiple runs. Organizations must allocate resources for continuous prompt engineering and evaluation infrastructure development. Second, the "build to delete" principle—acknowledging that agent implementations will be rebuilt multiple times as models improve—necessitates architectural decisions prioritizing adaptability over long-term optimization of specific implementations.

Knowledge gaps remain regarding optimal evaluation methodologies for complex multi-step agents, standardization of agent-ready API design patterns, and frameworks for balancing agent autonomy against reliability requirements across different application domains. Furthermore, the relationship between model capability improvements and required prompt complexity remains underexplored, with implications for long-term maintenance burden as foundation models evolve.

The convergence of these findings with broader trends in AI system deployment suggests that the engineering practices outlined here will become increasingly critical as agent-based architectures proliferate across production environments. Organizations that successfully navigate the transition from deterministic to probabilistic software development methodologies will gain significant competitive advantages in deploying reliable autonomous systems.

6. Conclusion

This synthesis has examined the fundamental engineering transformations required for production agent deployment, demonstrating that successful implementation necessitates abandoning deterministic control paradigms in favor of semantic understanding, dynamic intent recognition, and outcome-based evaluation. The core contributions include identifying six critical shifts in engineering practice: semantic state representation, dynamic workflow adaptation, error-as-input recovery patterns, pass-rate evaluation metrics, agent-ready API design, and trust-but-verify reliability principles.

Practical takeaways for engineering organizations include the necessity of implementing iterative refinement workflows, establishing evaluation infrastructure supporting statistical validation across multiple runs, designing self-documenting APIs for agent consumption, and accepting variable execution paths while maintaining outcome quality standards. The recognition that software becomes disposable in rapidly evolving model landscapes should inform architectural decisions prioritizing adaptability.

Future investigation should focus on developing standardized evaluation frameworks for complex agent behaviors, establishing best practices for balancing autonomy and reliability across application domains, and creating tooling that reduces the engineering overhead of iterative prompt refinement. As agent-based architectures become increasingly prevalent in production systems, the engineering practices synthesized in this analysis will prove essential for organizations seeking to deploy reliable autonomous systems despite inherent non-determinism.

Sources

Why (Senior) Engineers Struggle to Build AI Agents — Philipp Schmid, Google DeepMind - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
