How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust

2026-06-01 By Sean Weldon

Abstract

Agent observability represents a fundamental departure from traditional observability paradigms, necessitating novel architectural approaches and organizational structures to measure both technical performance and qualitative behavior in non-deterministic systems. While traditional observability focuses on uptime metrics and deterministic control flows using established tools, agent systems demand measurement of AI-specific quality dimensions across semi-structured traces that can exceed 1GB in size. This analysis examines the architectural, methodological, and organizational distinctions that characterize agent observability, including custom database designs employing write-ahead logs and full-text indexing frameworks, the convergence of observability and evaluation workflows, and the essential role of non-technical domain experts in assessment processes. The findings demonstrate that effective agent observability requires purpose-built infrastructure capable of supporting simultaneous real-time and batch query patterns, unified evaluation frameworks that bridge production monitoring and offline experimentation, and cross-functional participation spanning technical and domain expertise.

1. Introduction

The proliferation of large language model (LLM)-based agent systems has exposed fundamental limitations in traditional observability frameworks. Traditional observability, the practice of measuring system health through metrics, logs, and traces, was designed for deterministic applications with known control flows and constrained operational parameters. These frameworks prioritized uptime monitoring and technical performance measurement through established tools such as Grafana and Datadog, and they served primarily technical personas focused on maintaining service availability.

However, the non-deterministic nature of agent systems, characterized by high behavioral variety and qualitative outcomes, demands a reconceptualization of observability itself. As Hetzel articulates, "Agents are non-deterministic, whereas applications are deterministic. The reason why we love LLMs so much is because they have high variety." This fundamental property introduces measurement challenges that extend beyond traditional metrics to encompass contextual grounding, tool usage appropriateness, and domain-specific compliance standards.

This synthesis examines the structural differences between traditional and agent observability across three critical dimensions: scope and metrics, technical architecture, and organizational participation. The central thesis posits that agent observability constitutes a distinct problem domain requiring specialized infrastructure, unified evaluation frameworks, and cross-functional expertise that bridges technical and domain knowledge. The analysis proceeds by establishing traditional observability foundations, delineating agent-specific requirements, examining architectural solutions implemented in production systems, and exploring the methodological convergence of observability with evaluation practices.

2. Background and Related Work

Traditional observability emerged from distributed systems engineering to monitor uptime and technical performance through three foundational pillars: metrics (quantitative measurements), logs (discrete event records), and traces (complete workflow interactions composed of individual spans representing discrete steps). This paradigm assumes deterministic behavior with predictable control flows, enabling engineers to establish known metrics and thresholds for system health monitoring.

The distributed tracing model provides the conceptual foundation subsequently adapted for agent systems. In traditional implementations, traces capture end-to-end request flows while spans represent individual operations within those flows. However, these frameworks were optimized for structured data with predictable sizes and query patterns, serving technical personas such as systems engineers and product engineers focused on operational reliability.
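
The trace and span relationship described above can be sketched as a small data model. This is a minimal illustration, not the schema of any particular tool: the class names, field names, and the sample trace are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One discrete step in a workflow (an LLM call, a tool call, and so on)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    # Free-form payload; in agent traces this is where large text blobs live.
    attributes: dict[str, Any] = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """A complete workflow interaction, stored as a flat list of linked spans."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def root(self) -> Span:
        return next(s for s in self.spans if s.parent_id is None)

    def children(self, span_id: str) -> list[Span]:
        return [s for s in self.spans if s.parent_id == span_id]


# Hypothetical trace: one request that makes an LLM call and a tool call.
trace = Trace("t1", [
    Span("a", None, "handle_request", 0, 1200),
    Span("b", "a", "llm_call", 10, 900, {"prompt": "Summarize the claim"}),
    Span("c", "a", "tool:lookup_policy", 910, 1150, {"args": {"policy_id": "P-1"}}),
])
```

Here `trace.root().duration_ms` gives the end-to-end duration and `trace.children("a")` returns the two child steps. The free-form `attributes` field is where an agent trace departs from the traditional case, since prompts and completions are stored there as unstructured text.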

The agent paradigm introduces fundamental challenges to this model. Agent systems built on LLMs generate value precisely through behavioral variety—the capacity to produce diverse, contextually appropriate responses rather than executing predetermined logic paths. This non-determinism, combined with the semi-structured nature of agent interactions and the necessity of measuring qualitative outcomes, renders traditional observability tools insufficient for comprehensive agent system monitoring and improvement.

3. Core Analysis

3.1 Scope Expansion and Metric Heterogeneity

Agent observability extends beyond traditional uptime metrics to encompass both technical performance indicators and AI-specific quality dimensions. Traditional metrics such as latency, error rates, and operational status remain relevant but are insufficient for comprehensive agent assessment. Agent systems require additional technical metrics, including time to first token, total token consumption, and duration, that reflect how LLM-based architectures behave. Cache hit rates are also captured automatically when an agent application is traced, which provides insight into opportunities for efficiency optimization.
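
As a rough illustration of how these technical metrics could be derived from recorded LLM calls, consider the sketch below. The `LLMCall` record and its fields are hypothetical, and defining cache hit rate as cached prompt tokens divided by total prompt tokens is an assumption made for this example, not a definition given in the talk.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    request_ms: int      # when the request was sent
    first_token_ms: int  # when the first streamed token arrived
    done_ms: int         # when the response completed
    prompt_tokens: int
    completion_tokens: int
    cached_prompt_tokens: int = 0  # prompt tokens served from a cache


def summarize(calls: list[LLMCall]) -> dict[str, float]:
    """Aggregate per-trace metrics; assumes at least one call."""
    first = min(calls, key=lambda c: c.request_ms)
    prompt = sum(c.prompt_tokens for c in calls)
    cached = sum(c.cached_prompt_tokens for c in calls)
    return {
        "time_to_first_token_ms": first.first_token_ms - first.request_ms,
        "duration_ms": max(c.done_ms for c in calls) - first.request_ms,
        "total_tokens": prompt + sum(c.completion_tokens for c in calls),
        # Assumed definition: share of prompt tokens that hit the cache.
        "cache_hit_rate": cached / prompt if prompt else 0.0,
    }


# Invented sample data for two sequential calls in one trace.
calls = [
    LLMCall(0, 350, 1800, prompt_tokens=1000, completion_tokens=200),
    LLMCall(1900, 2100, 3000, prompt_tokens=1200, completion_tokens=100,
            cached_prompt_tokens=880),
]
metrics = summarize(calls)
```

For this sample, `metrics` reports a time to first token of 350 ms, a duration of 3000 ms, 2500 total tokens, and a cache hit rate of 0.4.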

More significantly, agent observability must address qualitative dimensions absent from traditional frameworks. These include contextual grounding (whether agent responses appropriately reference provided context), tool usage alignment (whether agents select and apply tools correctly), and domain-specific compliance standards such as brand voice consistency or regulatory adherence. As Hetzel notes, "The scope of traditional observability is actually quite different from the scope of agent observability," reflecting the necessity of measuring outcomes that resist simple quantification.

This metric heterogeneity necessitates assessment methodologies that combine automated technical measurement with human judgment. Domain experts—including clinicians, nurses, wealth advisors, and lawyers—provide evaluations grounded in specialized knowledge that technical metrics cannot capture. Their participation represents a fundamental departure from traditional observability, where involvement was restricted to technical roles focused on system uptime.

3.2 Architectural Requirements and Data Characteristics

Agent traces exhibit data characteristics that fundamentally challenge traditional observability infrastructure. The talk describes these traces as "really nasty": they are highly semi-structured with embedded unstructured text, a single trace can exceed 1 GB in total size, and an individual span can reach 20 MB. This contrasts sharply with traditional traces, which typically contain structured data with predictable sizes and schemas.

The semi-structured nature of agent traces, combining structured metadata with extensive unstructured text content, requires architectural approaches that support both traditional analytical queries and full-text search capabilities. Traditional OLAP tools such as ClickHouse, while effective for structured metric analysis, lack sufficient text-based indexing capabilities for agent trace workloads. This limitation prompted Braintrust to design a custom database specifically for agent traces rather than adapting existing observability infrastructure.

Furthermore, agent observability must support dual access patterns simultaneously: real-time read/ingest workflows common in operational monitoring, and batch SQL query patterns necessary for automated improvement workflows. This requirement for simultaneous support of online and offline analysis patterns distinguishes agent observability from traditional approaches optimized primarily for real-time alerting and dashboard visualization.
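
The two access patterns can be illustrated against a single store. The sketch below uses an in-memory SQLite database from the Python standard library purely as a stand-in; it is not the database discussed in the talk, and the table layout and sample rows are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE spans (
        trace_id TEXT, span_id TEXT, name TEXT,
        duration_ms INTEGER, total_tokens INTEGER, output TEXT
    )
""")


def ingest(row: tuple) -> None:
    """Real-time path: each span is written and committed as it arrives."""
    with db:
        db.execute("INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)", row)


def latest(trace_id: str) -> list[tuple]:
    """Real-time path: read a trace back immediately after ingest."""
    return db.execute(
        "SELECT span_id, name FROM spans WHERE trace_id = ? ORDER BY rowid",
        (trace_id,),
    ).fetchall()


def tokens_by_step() -> list[tuple]:
    """Batch path: aggregate across all traces with plain SQL."""
    return db.execute(
        "SELECT name, COUNT(*), SUM(total_tokens) "
        "FROM spans GROUP BY name ORDER BY name"
    ).fetchall()


# Invented sample spans.
ingest(("t1", "a", "llm_call", 900, 1200, "Your policy covers water damage."))
ingest(("t1", "b", "tool:lookup_policy", 240, 0, '{"policy_id": "P-1"}'))
ingest(("t2", "c", "llm_call", 700, 800, "I cannot help with that."))
```

`latest("t1")` serves the operational view of one trace, while `tokens_by_step()` is the kind of aggregate an automated improvement workflow might run over the same rows. A single-file embedded database would not cope with the trace sizes described above; the point is only that both query shapes target the same data.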

3.3 Custom Database Architecture for Agent Traces

The architectural solution implemented by Braintrust demonstrates the infrastructure requirements for production agent observability. The system employs three key components: a write-ahead log for immediate trace visibility, a Tantivy index for full-text search, and a unified query interface supporting SQL or SQL-like languages.

The write-ahead log enables instant trace visibility upon ingestion, supporting real-time monitoring requirements critical for production agent systems. This component ensures that traces become immediately accessible for inspection and analysis without waiting for batch processing or index updates, facilitating rapid issue identification and debugging workflows.

The Tantivy index—an open-source full-text indexing framework written in Rust, similar to Apache Lucene—addresses the text search problem central to agent observability. Traditional observability tools lack mechanisms for searching and analyzing the extensive unstructured text embedded within agent traces. Full-text indexing across trace data enables filtering, pattern identification, and anomaly detection within conversational content, tool invocations, and reasoning chains that characterize agent behavior.

The unified query interface provides a consistent access mechanism across diverse data shapes and access patterns. By supporting SQL or SQL-like query languages, the architecture enables both technical users executing analytical queries and automated systems performing programmatic trace analysis to interact with agent observability data through familiar interfaces.
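
How a write-ahead log and a lagging full-text index can sit behind one read path is easier to see in a toy model. The sketch below is a conceptual illustration only: it keeps the log in memory (so it is not durable), substitutes a hand-rolled inverted index for Tantivy, and every class and method name is hypothetical.

```python
import json
import re
from collections import defaultdict


class TraceStore:
    def __init__(self) -> None:
        self.wal: list[str] = []  # append-only log of JSON records
        self.index: dict[str, set[int]] = defaultdict(set)  # token -> log offsets
        self.indexed_upto = 0  # log offset already covered by the index

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def append(self, span: dict) -> int:
        """Write path: a record is visible as soon as it is in the log."""
        self.wal.append(json.dumps(span))
        return len(self.wal) - 1

    def flush_index(self) -> None:
        """Background path: fold unindexed log entries into the index."""
        for offset in range(self.indexed_upto, len(self.wal)):
            text = json.loads(self.wal[offset])["text"]
            for token in self._tokens(text):
                self.index[token].add(offset)
        self.indexed_upto = len(self.wal)

    def search(self, term: str) -> list[dict]:
        """Read path: index hits plus a scan of the not-yet-indexed tail."""
        term = term.lower()
        offsets = set(self.index.get(term, set()))
        for offset in range(self.indexed_upto, len(self.wal)):
            if term in self._tokens(json.loads(self.wal[offset])["text"]):
                offsets.add(offset)
        return [json.loads(self.wal[o]) for o in sorted(offsets)]

    @property
    def index_lag(self) -> int:
        return len(self.wal) - self.indexed_upto


store = TraceStore()
store.append({"span_id": "a", "text": "User asked about a refund for order 1234"})
store.flush_index()
store.append({"span_id": "b", "text": "Tool call: issue_refund(order=1234)"})
```

At this point `store.index_lag` is 1, yet `store.search("refund")` already returns both spans because the read path also scans the unindexed tail of the log. That is the property the write-ahead log is meant to provide: visibility does not wait for indexing.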

3.4 Convergence of Observability and Evaluation

A critical insight emerging from agent observability practice is the fundamental unity of observability and evaluation workflows. As Hetzel articulates, "Observability and evals are the same problem with one key difference: evals run in batch with known inputs ahead of time, observability runs in real-time with unknown inputs." This recognition enables unified infrastructure and methodologies across production monitoring and offline experimentation.

Human annotation serves as the foundational methodology bridging these workflows. Domain experts grade agent outputs and, critically, justify their assessments. These justifications provide the semantic understanding necessary to develop scalable scoring functions. LLM-based scoring systems trained on human annotations and justifications can identify failure modes and implement automated evaluation at scale, enabling continuous assessment of agent behavior in production.
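
One way to picture the role of annotations is as a calibration set for a candidate automated scorer. In the sketch below, a keyword rule stands in for an LLM-based judge so that the example runs without any model; the `Annotation` record, the rule, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Annotation:
    output: str         # agent output that the expert reviewed
    passed: bool        # the expert's grade
    justification: str  # why the expert gave that grade


def agreement(scorer: Callable[[str], bool], annotations: list[Annotation]) -> float:
    """Fraction of annotated outputs where the scorer matches the expert."""
    matches = sum(scorer(a.output) == a.passed for a in annotations)
    return matches / len(annotations)


def disagreements(scorer: Callable[[str], bool],
                  annotations: list[Annotation]) -> list[Annotation]:
    """Cases to review: the justifications show what the scorer misses."""
    return [a for a in annotations if scorer(a.output) != a.passed]


def cites_policy(output: str) -> bool:
    """Placeholder for an LLM judge: a naive keyword rule."""
    return "policy" in output.lower()


annotations = [
    Annotation("Per policy P-1, water damage is covered.", True,
               "Grounded in the policy text."),
    Annotation("Water damage is probably covered.", False,
               "No source cited; speculative."),
    Annotation("Policy details aside, just file the claim.", False,
               "Mentions policy but gives no grounding."),
    Annotation("Section 4 of your contract covers this.", True,
               "Cites the contract section."),
]
```

With this data, `agreement(cites_policy, annotations)` is 0.5, and the two justifications returned by `disagreements` indicate why the naive rule fails: it confuses mentioning a policy with being grounded in one. In practice those justifications would inform the prompt or rubric of a stronger scorer.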

The bidirectional relationship between observability and evaluation creates a continuous improvement cycle. Traces collected in production can be added to offline datasets for experimentation, enabling teams to reproduce and address issues identified through operational monitoring. Conversely, evaluation methodologies developed offline can be deployed as automated scoring functions in production observability pipelines, providing real-time quality assessment.
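
The cycle can be sketched with one scoring function shared by both workflows. Everything below is hypothetical: the toy `grounded` scorer, the `echo_agent` stub, and the trace dictionaries exist only to show the same scorer running in batch over known inputs and incrementally over production traces, with failures flowing back into the offline dataset.

```python
from typing import Callable, Iterable, Iterator

Scorer = Callable[[str, str], float]  # (input, output) -> score in [0, 1]


def grounded(question: str, answer: str) -> float:
    """Toy scorer: does the answer reuse any word from the question?"""
    overlap = set(question.lower().split()) & set(answer.lower().split())
    return 1.0 if overlap else 0.0


def run_eval(agent: Callable[[str], str], dataset: list[str], scorer: Scorer) -> float:
    """Offline: inputs are known ahead of time; return the mean score."""
    scores = [scorer(q, agent(q)) for q in dataset]
    return sum(scores) / len(scores)


def score_online(traces: Iterable[dict], scorer: Scorer) -> Iterator[dict]:
    """Production: inputs are unknown; score each trace as it arrives."""
    for t in traces:
        yield {**t, "score": scorer(t["input"], t["output"])}


def promote_failures(scored: Iterable[dict], dataset: list[str],
                     threshold: float = 0.5) -> None:
    """Close the loop: low-scoring production inputs join the offline dataset."""
    for t in scored:
        if t["score"] < threshold and t["input"] not in dataset:
            dataset.append(t["input"])


def echo_agent(question: str) -> str:
    """Stub agent used only to make the offline run executable."""
    return f"You asked: {question}"


dataset = ["reset my password"]
live = [
    {"input": "cancel my order", "output": "Sure, I will cancel it."},
    {"input": "refund status", "output": "Please hold."},
]
scored = list(score_online(live, grounded))
promote_failures(scored, dataset)
```

After this runs, the production input that scored 0.0 ("refund status") has been appended to `dataset`, so the next call to `run_eval` exercises a case that was first observed in production.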

3.5 Automated Insight Derivation Through Topic Modeling

Recent developments in agent observability extend beyond known metrics to automated discovery of unknown patterns and issues. Approximately one month before the presentation, Braintrust deployed a lightweight LLM-based analysis that embeds and clusters agent traces, enabling automated topic modeling for user intent extraction, sentiment analysis, and issue identification.

This capability reflects a distinction between two forms of production analysis: online scoring of "known unknowns" with quantified scores, and open-ended insight derivation addressing "unknown unknowns." Traditional observability focuses exclusively on the former, monitoring predefined metrics against established thresholds. Agent observability additionally requires mechanisms to identify emerging patterns, novel failure modes, and unexpected user behaviors not anticipated during system design.

Automated topic modeling accelerates the iteration loop between identifying production problems and testing fixes through experimentation. By surfacing clusters of related issues or user intents, these systems enable teams to prioritize improvements based on empirical production data rather than assumptions about agent behavior.
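
The embed-then-cluster idea can be demonstrated in miniature. The sketch below swaps in word counts for a learned embedding and a greedy single-pass grouping for a real clustering algorithm, so it shows the shape of the pipeline rather than the method used in production; the function names, threshold, and messages are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(texts: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters: list[tuple[Counter, list[str]]] = []
    for text in texts:
        vec = embed(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster matched, so this text seeds a new one.
            clusters.append((vec, [text]))
    return [members for _, members in clusters]


# Invented user messages pulled from hypothetical traces.
messages = [
    "refund not received",
    "refund still not received",
    "cannot log in to account",
    "account log in fails",
]
topics = cluster(messages)
```

For these four messages, `topics` contains two groups: the refund complaints and the login problems. Cluster sizes of this kind give a team an empirical basis for deciding which issue to investigate first.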

4. Technical Insights

The architectural requirements for agent observability yield several actionable technical findings for practitioners implementing production agent systems. First, the scale characteristics of agent traces—exceeding 1GB total size with individual spans reaching 20MB—necessitate infrastructure designed specifically for these workloads rather than adaptation of traditional observability tools. Organizations should anticipate that general-purpose OLAP databases will prove insufficient without robust full-text indexing capabilities.

Second, the dual access pattern requirement—supporting both real-time operational monitoring and batch analytical queries—demands architectural patterns such as write-ahead logs combined with asynchronous indexing. This approach enables immediate trace visibility for debugging while maintaining query performance for analytical workloads. Implementation considerations include managing index lag, ensuring consistency between real-time and indexed views, and optimizing for the specific query patterns common in agent observability (full-text search, filtering on semi-structured fields, aggregation across qualitative assessments).

Third, the convergence of observability and evaluation workflows suggests that infrastructure investments should support both use cases from inception. Systems designed exclusively for production monitoring or offline evaluation will require costly refactoring to support unified workflows. Key capabilities include trace export and import mechanisms, consistent data models across environments, and evaluation frameworks that can execute both synchronously (for offline batch evaluation) and asynchronously (for production scoring).

Fourth, the integration of LLM-based analysis for automated insight derivation introduces computational and cost considerations. Lightweight embedding and clustering approaches balance insight generation with resource consumption, but organizations must architect for the computational overhead of performing LLM inference on production trace volumes. Trade-offs include sampling strategies, asynchronous processing pipelines, and cost-aware model selection.
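
A sampling strategy and a back-of-the-envelope cost estimate might look like the following sketch. Hash-based sampling on the trace identifier is one common approach and is an assumption here, not something the talk prescribes; the volumes, token counts, and price in the usage note are invented placeholders.

```python
import hashlib


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on the trace id."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def estimated_cost(traces_per_day: int, rate: float, tokens_per_trace: int,
                   usd_per_million_tokens: float) -> float:
    """Daily spend on LLM analysis for the sampled share of traffic."""
    tokens = traces_per_day * rate * tokens_per_trace
    return tokens * usd_per_million_tokens / 1_000_000


# Hypothetical trace ids; about 5% of them should be selected.
ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in ids if sampled(t, 0.05)]
```

Because the decision depends only on the trace identifier, every component in an asynchronous pipeline makes the same choice for a given trace without coordination. As a worked example with placeholder numbers, `estimated_cost(100_000, 0.05, 4_000, 0.5)` comes to about 10 USD per day: 5,000 sampled traces at 4,000 tokens each is 20 million tokens, priced at 0.50 USD per million.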

5. Discussion

The findings presented demonstrate that agent observability constitutes a distinct problem domain requiring specialized approaches across infrastructure, methodology, and organization. The architectural requirements—particularly the need for full-text indexing, dual access patterns, and large-scale semi-structured data handling—explain why traditional observability tools prove insufficient for agent systems. Organizations investing in agent development should anticipate building or adopting purpose-built observability infrastructure rather than extending existing monitoring solutions.

The convergence of observability and evaluation represents a methodological insight with significant practical implications. By recognizing these as unified problems distinguished primarily by timing (real-time versus batch) rather than fundamental approach, teams can develop integrated workflows that accelerate iteration cycles. The bidirectional flow between production traces and offline datasets enables empirical, data-driven improvement processes grounded in actual agent behavior rather than synthetic test cases alone.

Furthermore, the essential role of non-technical domain experts in agent observability signals an organizational shift. As Hetzel observes, "The best teams that are building agents have both technical and non-technical people in the fold performing this work because it's the non-technical people that are either closest to the users or have knowledge closest to the problem space." This cross-functional requirement distinguishes agent development from traditional software engineering, where observability remained primarily a technical discipline. Organizations must develop processes and tools that enable domain experts to contribute effectively despite lacking traditional engineering backgrounds.

Several areas warrant further investigation. The scalability limits of full-text indexing approaches for agent traces remain unclear as trace volumes increase. The optimal balance between automated scoring and human annotation requires empirical study across different domains and use cases. Additionally, the effectiveness of automated topic modeling for identifying actionable insights versus generating noise requires systematic evaluation.

6. Conclusion

This analysis establishes that agent observability represents a fundamental departure from traditional observability paradigms, requiring specialized infrastructure, unified evaluation methodologies, and cross-functional organizational structures. The key contributions include: (1) delineation of scope expansion from technical metrics to qualitative assessment dimensions; (2) identification of architectural requirements including full-text indexing, write-ahead logs, and dual access pattern support; (3) recognition of observability-evaluation convergence as a unified problem distinguished by timing; and (4) articulation of the essential role of non-technical domain expertise.

Practical takeaways for organizations developing agent systems include the necessity of purpose-built observability infrastructure, the value of unified workflows spanning production monitoring and offline experimentation, and the importance of enabling cross-functional participation in observability processes. As agent systems proliferate across domains requiring specialized knowledge—healthcare, finance, legal services—the organizational capability to integrate technical and domain expertise in observability workflows will increasingly differentiate successful implementations. Future work should focus on scaling these approaches, developing standardized frameworks for agent observability, and empirically validating the effectiveness of automated insight derivation mechanisms across diverse application domains.

Sources

How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
