Engineering voice agents: Latency, quality, and scale (Rishabh Bhargava, Together AI)

Building production-grade voice agents requires solving latency, intelligence, voice quality, and reliability simultaneously through a pipeline architecture,...

By Sean Weldon

Abstract

Production-grade voice agents present a multifaceted engineering challenge that requires simultaneous optimization of latency, conversational intelligence, voice naturalness, and system reliability. This analysis examines the technical architecture and implementation considerations for voice AI systems capable of real-time conversation at scale, targeting the billions of annual customer interactions currently handled by human operators. Contemporary production systems predominantly employ pipeline architectures that cascade speech-to-text, large language model, and text-to-speech components, while emerging speech-to-speech models offer a promising alternative that is currently limited in instruction following and tool calling. Human-acceptable response latency thresholds of 300-500 milliseconds necessitate careful optimization across all system components, with LLMs dominating both the latency and the computational cost budgets. Voice agent development has transitioned from a research problem to an engineering challenge with immediate practical applications in customer support, appointment scheduling, and transactional services.

1. Introduction

Voice-based interfaces represent a fundamental modality in human-computer interaction, leveraging humanity's most natural communication mechanism. The commercial imperative for automated voice agents is substantial: billions of phone calls annually remain handled by human operators for customer support, appointment scheduling, and order status inquiries. Unlike text-based systems, voice interfaces align with innate human communication patterns—humans acquire spoken language developmentally before literacy—making them inherently more accessible for broad user populations.

The development of high-quality conversational voice agents has evolved from a research challenge to primarily an engineering problem. This transition reflects maturation in underlying technologies, particularly speech-to-text (STT) systems, large language models (LLMs), and text-to-speech (TTS) synthesis. However, production deployment requires solving multiple technical challenges simultaneously rather than sequentially: real-time latency constraints that mirror human conversational expectations, conversational intelligence capable of handling ambiguous instructions and complex workflows, voice naturalness across linguistic variations including multilingual support and emotional expression, and reliability at scale spanning hundreds to thousands of concurrent interactions.

This synthesis examines the architectural approaches, technical constraints, and evaluation methodologies for production voice agents. The analysis focuses on pipeline architectures currently dominating production deployments while exploring emerging speech-to-speech models that may represent the next evolutionary stage of voice AI. The central thesis posits that successful voice agent deployment requires holistic system optimization rather than component-level excellence, with specific attention to latency budgets, model selection trade-offs, and infrastructure considerations.

2. Background and Related Work

Two primary architectural paradigms have emerged for voice agent systems. The pipeline architecture (also termed cascading architecture) decomposes the problem into sequential components: speech-to-text conversion, language model processing, and text-to-speech synthesis, coordinated by an agent orchestrator. Frameworks such as Pipecat and LiveKit provide infrastructure for this approach, though many production systems employ proprietary implementations. This architecture enables independent optimization of each component and leverages mature, well-understood models for each transformation stage.
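
The cascade described above can be sketched in a few lines. This is a minimal illustration only, assuming three hypothetical stub functions (fake_stt, fake_llm, fake_tts) in place of real model clients. It does not reflect the API of any framework named in the text.

```python
from dataclasses import dataclass
from typing import Callable


# All three stage functions are hypothetical stand-ins for real model clients.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


def fake_llm(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the encoded text is synthesized audio


@dataclass
class PipelineOrchestrator:
    """Runs one conversational turn through STT, then the LLM, then TTS."""

    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


agent = PipelineOrchestrator(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = agent.handle_turn(b"where is my order")
```

Because each stage sits behind a plain callable, any one component can be swapped or tuned without touching the others, which is the modularity argument made for this architecture.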

The alternative speech-to-speech architecture employs a single end-to-end model that maps speech input directly to speech output, potentially with function calling capabilities. Examples include OpenAI's Realtime API and Nvidia Voice Chat. This approach eliminates the text intermediary, in principle preserving prosodic information such as tone, emotion, and hesitation patterns that is lost in text conversion. Furthermore, speech-to-speech models enable full-duplex communication (simultaneous bidirectional audio transmission), allowing natural conversational phenomena such as backchanneling (acknowledgment tokens like "I see" or "aha") and interruption handling.

Foundational models in the pipeline architecture include Whisper, OpenAI's canonical speech-to-text system trained on 30-second audio clips. This training paradigm necessitates chunking and padding logic for streaming applications, introducing architectural complexity. Nvidia's streaming-native encoder addresses this limitation through variable look-ahead windows (80 milliseconds to 1 second) with activation caching for computational efficiency, representing a significant advancement in streaming-optimized speech recognition.
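
To make the chunking and padding requirement concrete, the sketch below splits an audio buffer into fixed 30-second windows and zero-pads the final one. It assumes 16 kHz mono samples held in a plain Python list. The function name chunk_and_pad is illustrative and is not part of Whisper's own tooling.

```python
SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz mono audio
WINDOW_SECONDS = 30    # fixed window length the model was trained on
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def chunk_and_pad(samples: list[float]) -> list[list[float]]:
    """Split audio into 30-second windows, zero-padding the last one."""
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        window = samples[start:start + WINDOW_SAMPLES]
        padding = WINDOW_SAMPLES - len(window)
        windows.append(window + [0.0] * padding)
    return windows


# 45 seconds of silent audio: one full window plus one half-padded window.
windows = chunk_and_pad([0.0] * (45 * SAMPLE_RATE))
```

In a streaming setting this logic runs repeatedly on a growing buffer, so even a short utterance is padded out to a full window before inference. That overhead is what streaming-native encoders are designed to avoid.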

3. Core Analysis

3.1 Latency Requirements and Human Perception Thresholds

Human conversational dynamics establish stringent latency requirements for voice agents. In natural human dialogue, participants respond to conversational cues within approximately 300 milliseconds. Response delays exceeding 500 milliseconds become perceptually noticeable, while latencies of 1-2 seconds lead callers to abandon the call. These thresholds are not arbitrary preferences; they reflect the turn-taking rhythm that people expect from spoken conversation.

The latency budget must be distributed across multiple system components. In pipeline architectures, speech-to-text systems achieve state-of-the-art word error rates of approximately 6% on open benchmarks, with a P90 (90th percentile) latency of 100 milliseconds for time-to-complete-transcript on optimized implementations. The LLM component requires 200-300 milliseconds for time-to-first-token (TTFT), making it the dominant contributor to overall system latency. Text-to-speech synthesis must maintain a real-time factor (RTF), defined as processing time divided by the duration of the audio produced, below 1.0 to avoid buffering. For example, generating 10 seconds of audio in 5 seconds of processing time yields an RTF of 0.5, which is acceptable.
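
The arithmetic in this paragraph can be checked with a short sketch. The figures are the ones quoted above. The 500 ms target is the upper end of the perceptual threshold from the preceding paragraph, and the helper names are illustrative.

```python
TARGET_MS = 500            # upper end of the 300-500 ms threshold
STT_P90_MS = 100           # time-to-complete-transcript, 90th percentile
LLM_TTFT_MS = (200, 300)   # quoted range for time-to-first-token


def remaining_budget_ms(target: int, stt: int, llm_ttft: int) -> int:
    """Milliseconds left for TTS start-up and network hops."""
    return target - stt - llm_ttft


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by the duration of the audio produced."""
    return processing_seconds / audio_seconds


best_case = remaining_budget_ms(TARGET_MS, STT_P90_MS, LLM_TTFT_MS[0])
worst_case = remaining_budget_ms(TARGET_MS, STT_P90_MS, LLM_TTFT_MS[1])
rtf = real_time_factor(processing_seconds=5, audio_seconds=10)

print(f"left for TTS and network: {worst_case}-{best_case} ms, RTF={rtf}")
```

Under these numbers, between 100 and 200 ms remain for speech synthesis start-up and all network hops combined, which is why the LLM's time-to-first-token receives the most optimization attention.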

Network latency is a frequently underestimated contributor to overall system delay. Co-locating models within the same data center reduces network latency from 75 milliseconds to 5 milliseconds, yielding an approximately 30% reduction in total latency for already-optimized systems. This finding underscores the importance of infrastructure topology in meeting stringent latency requirements.

3.2 Model Selection and Intelligence Trade-offs

The selection of LLM size presents a critical trade-off between conversational intelligence and latency constraints. Optimal model sizes for production voice agents range from 8 to 30 billion parameters. Models larger than this range exceed latency budgets despite superior instruction-following capabilities, while smaller models lack sufficient intelligence for reliable tool calling and complex workflow execution. This parameter range represents a practical equilibrium between competing requirements.

Tool calling accuracy emerges as a critical evaluation dimension for production voice agents. Tool call structure—the syntactic correctness of function invocations—should approach 100% accuracy, as structural errors cascade through the system and cannot be recovered. Output feasibility—whether the semantic content of tool calls is appropriate for the conversational context—requires domain-specific evaluation. Production systems frequently employ fine-tuning of models within the 8-30 billion parameter range on use-case-specific data to improve tool calling quality while maintaining latency budgets.
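
A structural check of the kind described here can be sketched with the standard library alone. The tool names and argument schemas below are hypothetical examples, and a production system would more likely validate against a full JSON Schema. Feasibility (whether a structurally valid call makes sense in context) is deliberately out of scope.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "book_appointment": {"patient_name": str, "date": str},
    "get_order_status": {"order_id": str},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check that a model-emitted tool call is structurally well formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return False, f"unknown tool: {name!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    for field, expected_type in TOOL_SCHEMAS[name].items():
        if not isinstance(args.get(field), expected_type):
            return False, f"missing or mistyped argument: {field}"
    return True, "ok"


good = validate_tool_call('{"name": "get_order_status", "arguments": {"order_id": "A123"}}')
truncated = validate_tool_call('{"name": "get_order_status", "arguments": {')
```

Running a check like this over a held-out set of conversations gives a structural accuracy figure that can be tracked separately from the domain-specific feasibility review.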

The thinker-talker pattern has emerged as an architectural solution to this trade-off. In this design, a small LLM handles conversational flow with responses such as "let me think" while delegating complex tool calls to larger models with superior instruction-following capabilities and more sophisticated guardrails. This pattern maintains conversational responsiveness while accessing greater intelligence for critical decision points.
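
One possible shape for this pattern uses asyncio: the large model starts working immediately while the small model fills the silence. Both model functions are hypothetical stubs with simulated delays, and the event tuples are an illustrative convention, not a protocol from the source.

```python
import asyncio


# Hypothetical stubs: real versions would call inference endpoints.
async def small_talker(user_text: str) -> str:
    await asyncio.sleep(0.01)   # fast, small model
    return "Let me check that for you."


async def large_thinker(user_text: str) -> dict:
    await asyncio.sleep(0.05)   # slower, more capable model
    return {"name": "get_order_status", "arguments": {"order_id": "A123"}}


async def handle_turn(user_text: str) -> list:
    events = []
    # Start the slow tool-calling model first so that it runs in the background.
    thinker_task = asyncio.create_task(large_thinker(user_text))
    # The fast model's filler reply is ready to be spoken while the thinker works.
    events.append(("speak", await small_talker(user_text)))
    events.append(("tool_call", await thinker_task))
    return events


events = asyncio.run(handle_turn("Where is my order?"))
```

The user hears the filler phrase after the small model's latency rather than the large model's, so the larger model's delay is hidden behind speech instead of silence.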

3.3 Voice Quality and Linguistic Variation

Voice naturalness encompasses multiple dimensions beyond acoustic fidelity. Production systems must support multiple languages, correct pronunciation of proper nouns (particularly names), emotional expression, and accent variation. The criticality of pronunciation accuracy is particularly acute in domains such as healthcare, where incorrect transcription of drug names can have serious consequences. As noted in the source material, if the transcript is incorrect—whether for a person's name or a pharmaceutical term—there exists no downstream mechanism to correct the error.

Text-to-speech evaluation requires subjective assessment of naturalness, pronunciation accuracy, and emotional control, as these qualities resist purely objective quantification. Metrics such as time-to-first-audio (TTFA) and the real-time factor provide objective latency measurements, but voice quality assessment necessitates human evaluation protocols. This requirement introduces complexity into continuous integration and deployment pipelines, as automated testing cannot fully validate TTS quality.
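
The two objective metrics named above can be measured around any streaming synthesizer. The sketch below assumes the synthesizer yields raw 16-bit mono PCM chunks at 16 kHz. The stub fake_streaming_tts is hypothetical and merely emits silence.

```python
import time
from typing import Iterable, Iterator


def fake_streaming_tts(text: str) -> Iterator[bytes]:
    """Hypothetical TTS stub: yields 100 ms of 16-bit silence per word."""
    for _ in text.split():
        time.sleep(0.001)           # simulated synthesis work
        yield b"\x00\x00" * 1600    # 1600 samples at 16 kHz = 100 ms


def measure_tts(chunks: Iterable[bytes], sample_rate: int = 16_000,
                bytes_per_sample: int = 2) -> dict:
    """Return time-to-first-audio, audio duration, and real-time factor."""
    start = time.perf_counter()
    ttfa = None
    total_bytes = 0
    for chunk in chunks:
        if ttfa is None:
            ttfa = time.perf_counter() - start
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if total_bytes == 0:
        raise ValueError("synthesizer produced no audio")
    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    return {"ttfa_s": ttfa, "audio_s": audio_seconds, "rtf": elapsed / audio_seconds}


metrics = measure_tts(fake_streaming_tts("one two three"))
```

Measurements like these can run in automated pipelines. Naturalness and pronunciation still need a human listening pass.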

Turn detection—determining when a speaker has finished speaking versus merely pausing—remains a partially unsolved problem critical to avoiding agent interruptions. Premature turn detection causes the agent to interrupt users, while delayed detection creates awkward silences. This challenge is particularly acute in pipeline architectures where turn detection must occur before LLM processing begins.
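
The trade-off between interrupting and waiting too long is easiest to see in the simplest baseline: declare the turn over after a fixed run of silence. This is a sketch of that baseline only, with hypothetical thresholds and per-frame energy values as input. It is not the method used by any system named in the text.

```python
def detect_end_of_turn(frame_energies, frame_ms=20, energy_threshold=0.01,
                       min_silence_ms=600):
    """Return the index of the frame where the turn is declared over, or None."""
    silence_ms = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silence_ms = 0              # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return i
    return None


# Speech, a 200 ms mid-sentence pause, more speech, then trailing silence.
speech, quiet = 0.5, 0.0
frames = [speech] * 10 + [quiet] * 10 + [speech] * 10 + [quiet] * 40

patient = detect_end_of_turn(frames, min_silence_ms=600)
eager = detect_end_of_turn(frames, min_silence_ms=150)
```

With the 600 ms setting the detector waits through the mid-sentence pause and fires only in the trailing silence. With the 150 ms setting it fires inside the pause, which is the premature interruption described above. No single fixed threshold avoids both failure modes for every speaker.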

3.4 Scale and Reliability Considerations

Reliability at scale introduces distinct challenges beyond single-conversation optimization. Systems must maintain performance characteristics across 100, 1,000, or 10,000 concurrent calls. Auto-scaling presents particular complexity for voice agents due to the stateful nature of long-lived connections. Aggressive scale-up is necessary to prevent request backlog, but scale-down requires careful orchestration: conversations must complete before pod termination, as abrupt disconnection degrades user experience.
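
The scale-down requirement amounts to connection draining: stop accepting new calls, then wait for in-flight calls to finish before the process exits. The sketch below models this with asyncio. CallWorker and its methods are hypothetical names, and a real deployment would tie drain() to the orchestrator's termination signal and grace period.

```python
import asyncio


class CallWorker:
    """Hypothetical worker process that hosts long-lived voice calls."""

    def __init__(self):
        self.accepting = True
        self._active = set()

    def start_call(self, call_id: str, duration_s: float) -> bool:
        if not self.accepting:
            return False                      # route this call to another worker
        task = asyncio.create_task(self._run_call(call_id, duration_s))
        self._active.add(task)
        task.add_done_callback(self._active.discard)
        return True

    async def _run_call(self, call_id: str, duration_s: float) -> str:
        await asyncio.sleep(duration_s)       # stands in for the conversation
        return call_id

    async def drain(self, timeout_s: float) -> bool:
        """Stop accepting calls, then wait for in-flight calls to finish."""
        self.accepting = False
        if not self._active:
            return True
        _, pending = await asyncio.wait(set(self._active), timeout=timeout_s)
        return not pending                    # True if every call completed


async def demo():
    worker = CallWorker()
    accepted = worker.start_call("call-1", 0.02)
    drained = await worker.drain(timeout_s=1.0)
    rejected = worker.start_call("call-2", 0.02)
    return accepted, drained, rejected


result = asyncio.run(demo())
```

The timeout matters in practice: it has to be longer than a typical conversation, or scale-down will still cut calls off.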

Global deployment necessitates positioning models geographically proximate to end users to minimize network latency. Furthermore, data residency requirements in various jurisdictions mandate multi-region deployment capability, adding operational complexity. The latency and cost budgets reveal that LLMs dominate both dimensions, followed by text-to-speech, then speech-to-text. This distribution informs optimization priorities and infrastructure investment decisions.

Guardrailing placement within the pipeline architecture serves dual purposes. Pre-LLM classifiers enable routing decisions, while post-LLM guardrails (positioned before TTS) prevent unauthorized or inappropriate responses from reaching users. This layered approach to safety and policy enforcement reflects the reality that LLMs, despite fine-tuning and prompt engineering, cannot be relied upon exclusively for content moderation.
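
The two guardrail positions can be sketched as thin wrappers around the LLM call. The keyword rules below are hypothetical placeholders for trained classifiers, and the blocked phrase is an invented example policy.

```python
BLOCKED_PHRASES = {"refund approved"}   # hypothetical policy list


def pre_llm_route(transcript: str) -> str:
    """Pre-LLM classifier (keyword stand-in): decide where the turn goes."""
    if "speak to a person" in transcript.lower():
        return "human_handoff"
    return "llm"


def post_llm_guardrail(reply: str) -> str:
    """Post-LLM check, applied before the text is handed to TTS."""
    if any(phrase in reply.lower() for phrase in BLOCKED_PHRASES):
        return "I can't help with that directly, but I can connect you to an agent."
    return reply


def respond(transcript: str, llm) -> str:
    if pre_llm_route(transcript) == "human_handoff":
        return "Transferring you to a human agent."
    return post_llm_guardrail(llm(transcript))


# Hypothetical LLM stubs: one well behaved, one that violates policy.
safe = respond("Where is my order?", lambda t: "Your order ships tomorrow.")
blocked = respond("I want my money back", lambda t: "Refund approved, enjoy!")
routed = respond("Let me speak to a person", lambda t: "unused")
```

Placing the second check before TTS means a policy violation is caught while it is still text, before any audio reaches the caller.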

4. Technical Insights

The transition from pipeline architectures to speech-to-speech models presents both opportunities and current limitations. Speech-to-speech models eliminate text intermediaries and preserve speech nuances such as tone, hesitation, and emotional content. They enable full-duplex communication, allowing natural conversational phenomena including backchanneling and sophisticated interruption handling. However, contemporary speech-to-speech models struggle with instruction following and tool calling, requiring extensive prompt engineering and often failing to match pipeline architecture performance on these dimensions.

Production systems employing speech-to-speech models benefit from parallel transcription models for auditability and observability. Even when the primary inference path bypasses text representation, running a separate transcription model enables visibility into input and output audio for debugging, compliance, and quality assurance purposes. This approach represents a pragmatic compromise between the theoretical elegance of pure speech-to-speech processing and operational requirements.

Evaluation methodologies must evolve for speech-to-speech systems. Traditional component-level evaluations (STT word error rate, LLM perplexity, TTS mean opinion scores) become less relevant. Instead, evaluation shifts toward full-duplex conversation assessments examining entire multi-turn interactions. These conversation-level evaluations focus on emergent properties such as interruption handling, emotional appropriateness, and conversational coherence that cannot be assessed through component-level metrics.
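
For reference, the component-level STT metric mentioned here, word error rate, is the word-level edit distance between a reference transcript and the system's hypothesis, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


exact = word_error_rate("take ten milligrams of metformin",
                        "take ten milligrams of metformin")
garbled = word_error_rate("take ten milligrams of metformin",
                          "take ten milligrams of met forming")
```

The second call scores one substitution and one insertion against five reference words. A single mangled drug name therefore moves the score noticeably, but the metric still says nothing about interruption handling or conversational coherence, which is the gap conversation-level evaluation is meant to fill.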

The streaming-native architecture employed by models such as Nvidia's encoder represents a significant advancement over the chunking approach necessitated by Whisper's 30-second training windows. Variable look-ahead windows (80 milliseconds to 1 second) with activation caching enable efficient streaming without the engineering complexity of chunking logic, padding strategies, and the associated latency overhead. This architectural innovation demonstrates the importance of training paradigms aligned with deployment requirements.

5. Discussion

The findings presented in this analysis reveal that voice agent development has fundamentally transitioned from a research challenge to an engineering optimization problem. The maturation of foundational technologies—speech recognition, language models, and speech synthesis—means that production deployment success hinges on system-level integration rather than algorithmic breakthroughs in individual components. This transition has profound implications for organizational capabilities required to deploy voice agents: expertise in distributed systems, infrastructure optimization, and production engineering becomes as critical as machine learning knowledge.

The tension between pipeline architectures and speech-to-speech models reflects a broader pattern in applied machine learning: the trade-off between modular, interpretable systems and end-to-end learned systems. Pipeline architectures offer independent component optimization, mature tooling, and clear debugging surfaces. Speech-to-speech models promise theoretical advantages in preserving information and enabling natural interaction patterns but currently lack the instruction-following and tool-calling capabilities required for production workflows. The resolution of this tension will likely determine the next generation of voice AI architecture.

Several knowledge gaps warrant further investigation. Turn detection remains partially unsolved, particularly in multilingual contexts and for speakers with varied communication styles. The evaluation of conversation-level properties in speech-to-speech systems lacks standardized methodologies and benchmarks. Furthermore, the optimal allocation of computational budgets between inference quality and latency reduction remains context-dependent, suggesting opportunities for adaptive systems that adjust this trade-off based on conversational context.

The practical implications extend beyond technical architecture to business model considerations. The dominance of LLM costs in both latency and computational budgets suggests that voice agent economics will track closely with LLM inference cost trajectories. Innovations in efficient inference—quantization, speculative decoding, mixture-of-experts architectures—will directly impact voice agent viability for cost-sensitive applications.

6. Conclusion

This analysis has demonstrated that production-grade voice agents require holistic optimization across latency, intelligence, voice quality, and reliability dimensions simultaneously. The finding that human conversational expectations impose a 300-500 millisecond response latency threshold establishes a firm constraint that cascades through all architectural decisions. Pipeline architectures currently dominate production deployments due to mature component models and reliable tool-calling capabilities, while speech-to-speech models represent a promising but not yet production-ready alternative.

Key technical contributions include the quantification of latency budgets across system components, the identification of optimal LLM parameter ranges (8-30 billion) for voice applications, and the demonstration that infrastructure topology (co-location) can yield 30% latency reductions in optimized systems. The thinker-talker pattern emerges as a practical architectural solution to the intelligence-latency trade-off, enabling conversational responsiveness while accessing greater reasoning capability for complex decisions.

For practitioners, the primary takeaway is that voice agent deployment success depends on engineering discipline rather than algorithmic innovation. Careful attention to latency budgets, infrastructure optimization, evaluation methodology, and scale considerations determines production viability. As speech-to-speech models mature in instruction-following and tool-calling capabilities, a gradual transition from pipeline to end-to-end architectures appears likely, though this transition will be measured in years rather than months. Organizations investing in voice AI should maintain architectural flexibility to accommodate this evolution while optimizing current pipeline-based systems for immediate production deployment.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
