Voice AI: when is the "Her" moment? — Neil Zeghidour, CEO, Gradium AI

Voice AI is still far from achieving natural human-like conversation despite recent advances; the path forward requires solving latency, full-duplex capability, and scalability economics.

By Sean Weldon

Abstract

Contemporary voice AI systems remain substantially distant from achieving human-like conversational capabilities despite recent technical advances. This synthesis examines the fundamental architectural, performance, and economic barriers preventing voice AI from reaching the fluidity depicted in aspirational benchmarks such as the film "Her." Through analysis of cascaded systems and speech-to-speech models, this work identifies critical trade-offs: cascaded architectures provide reliability and tool-calling capabilities but suffer from inherent latency constraints exceeding 200 milliseconds, while speech-to-speech models reduce latency but lack production-readiness and full-duplex capabilities necessary for natural conversation. The analysis reveals that achieving human-level voice interaction requires hybrid approaches combining full-duplex conversational models with the intelligence of cascaded systems, alongside on-device processing innovations to address scalability economics. These findings have immediate implications for voice AI system design and commercial deployment strategies.

1. Introduction

The development of natural voice-based human-computer interaction represents a fundamental challenge in artificial intelligence research. While recent advances in speech recognition, natural language processing, and speech synthesis have produced systems with improved acoustic quality, contemporary voice AI implementations fail to replicate the responsiveness, naturalness, and conversational fluidity characteristic of human dialogue. This performance gap stems from fundamental technical constraints spanning latency, duplex communication, reliability, and economic scalability.

Voice AI encompasses the integrated technologies enabling machines to perceive, process, and respond to human speech in real-time conversational contexts. The field has evolved from command-and-control interfaces toward systems attempting sustained dialogue, yet substantial barriers remain. Current demonstrations, while exhibiting improved acoustic naturalness, typically operate in controlled environments and break down under real-world conditions involving ambient noise, overlapping speech, and the temporal dynamics of natural conversation.

This synthesis examines voice AI technology through the lens of foundational model development, focusing on architectural choices, performance constraints, and deployment economics. The central thesis posits that achieving human-level conversational capability requires not selecting between cascaded and speech-to-speech approaches, but rather developing hybrid architectures that combine the natural interaction of full-duplex models with the reliability and tool-calling capabilities of cascaded systems. The analysis proceeds by establishing the technical landscape, examining limitations of existing paradigms, quantifying performance constraints, and identifying innovations required to bridge the capability gap.

2. Background and Related Work

2.1 Foundational Voice Model Taxonomy

Contemporary voice AI systems comprise several foundational model types operating independently or in combination. Speech-to-text (STT) models perform acoustic-to-linguistic transcription, converting spoken utterances into text representations. Text-to-speech (TTS) models synthesize natural-sounding speech from text input, often incorporating prosody modeling and voice cloning capabilities. Speech-to-speech models directly transform input audio to output audio without intermediate text representation, potentially reducing latency through elimination of transcription and synthesis stages. Dialogue models manage conversational context, turn-taking, and response generation. The architectural integration of these components fundamentally determines system performance characteristics.
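These component roles can be summarized as minimal interfaces; the class and method names below are illustrative sketches, not any specific library's API:

```python
from typing import Protocol

# Minimal interface sketch of the foundational model types described
# above; names and signatures are illustrative assumptions.
class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class SpeechToSpeech(Protocol):
    def respond(self, audio: bytes) -> bytes: ...  # no intermediate text

class DialogueModel(Protocol):
    def reply(self, transcript: str, history: list[str]) -> str: ...

class EchoS2S:
    """Trivial stand-in satisfying the SpeechToSpeech contract."""
    def respond(self, audio: bytes) -> bytes:
        return audio  # a real model transforms audio; this only echoes

s2s: SpeechToSpeech = EchoS2S()
print(s2s.respond(b"hello"))
```

The key structural distinction is visible in the types: the speech-to-speech contract never surfaces a text representation.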

2.2 Architectural Paradigms and Trade-offs

Two primary architectural paradigms structure current voice AI development. Cascaded systems employ sequential pipelines where speech input undergoes transcription to text, processing by language models, and synthesis back to speech. This approach leverages mature text-based language models and enables integration of tool-calling and retrieval-augmented generation, but introduces cumulative latency at each pipeline stage. Advanced implementations utilize streaming variants with semantic voice activity detection and voice cloning to reduce latency, yet fundamental constraints remain. Alternatively, speech-to-speech models process audio directly, eliminating intermediate text representations and reducing theoretical latency. However, these models face challenges in reliability, observability, and integration with external tools and knowledge sources.
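The cumulative nature of cascaded latency can be sketched in a few lines; the per-stage figures below are illustrative assumptions, not measurements from any particular system:

```python
from dataclasses import dataclass

@dataclass
class CascadedPipeline:
    # Illustrative first-output latencies per stage (milliseconds);
    # exact figures vary by implementation, the additive structure is the point.
    stt_ms: float = 150.0   # streaming speech-to-text (assumed)
    llm_ms: float = 300.0   # language-model first-token latency (assumed)
    tts_ms: float = 200.0   # text-to-speech first-audio latency (assumed)

    def response_latency_ms(self) -> float:
        # Stages run sequentially, so latencies add rather than overlap.
        return self.stt_ms + self.llm_ms + self.tts_ms

pipeline = CascadedPipeline()
print(pipeline.response_latency_ms())  # 650.0, well above the ~200 ms human budget
```

Streaming each stage shrinks the individual terms but cannot remove the sum, which is why the paradigm has a latency floor.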

2.3 Human Conversational Benchmarks

Human conversation exhibits several characteristics serving as performance benchmarks for voice AI systems. Conversational turn-taking in human dialogue typically occurs with latencies under 200 milliseconds for the complete perception-cognition-production cycle. Full-duplex communication, wherein participants simultaneously process incoming speech while producing outgoing speech, enables natural conversational flow including back-channeling—brief vocalizations such as "mhm" or "yeah" that signal attention without interrupting the speaker's turn. Back-channeling frequency and form vary substantially across cultures; Japanese conversation, for instance, exhibits up to 20% overlapped speaking time as a politeness signal. Additionally, human conversation leverages paralinguistic information including tone, emotion, and vocal cues signaling comfort or distress. These characteristics establish the performance envelope that voice AI systems must achieve for natural interaction.

3. Core Analysis

3.1 Latency Constraints in Cascaded Architectures

Cascaded systems face inherent latency constraints that prevent achievement of human conversational timing. The classic cascade architecture sequences speech-to-text transcription, language model processing, and text-to-speech synthesis. Even with streaming implementations, TTS latency alone exceeds 200 milliseconds—equivalent to the entire human conversational cycle from perception through response production. This fundamental constraint means cascaded systems cannot match human responsiveness regardless of optimization efforts in individual components.

The introduction of tool-calling capabilities has shifted the latency bottleneck from TTS to function execution. Tool calls introduce unpredictable latency ranging from 500 milliseconds to 4 seconds, creating perceptible gaps in conversational flow. This challenge has motivated development of filler strategies wherein language models generate natural conversational continuations while awaiting tool results, subsequently integrating retrieved information into ongoing dialogue. This approach maintains conversational naturalness but acknowledges rather than solves the fundamental latency constraint. The cumulative effect of these delays renders cascaded systems unsuitable for applications requiring human-level conversational responsiveness.
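A filler strategy of this kind can be sketched with a hypothetical slow tool; the utterances, tool, and timing here are illustrative, not a production prompt design:

```python
import asyncio

async def slow_tool_call() -> str:
    # Hypothetical external tool; 500 ms to 4 s latency in practice.
    await asyncio.sleep(0.1)  # shortened for the sketch
    return "your order ships Tuesday"

async def respond_with_filler() -> list[str]:
    """Speak a natural filler while the tool call is in flight,
    then weave the retrieved result into the ongoing turn."""
    utterances: list[str] = []
    task = asyncio.create_task(slow_tool_call())
    utterances.append("Let me check that for you...")  # filler masks latency
    result = await task
    utterances.append(f"Okay, {result}.")
    return utterances

print(asyncio.run(respond_with_filler()))
```

The filler does not reduce tool latency; it keeps the audio channel occupied so the gap is not perceived as silence.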

3.2 Speech-to-Speech Models and Duplex Limitations

Speech-to-speech models reduce latency by eliminating intermediate text representations, directly transforming input audio into output speech. However, with the notable exception of Moshi, current speech-to-speech implementations operate in half-duplex mode, wherein the model alternates between listening and speaking states but cannot do both simultaneously. This architectural limitation prevents handling of natural conversational phenomena including overlapping speech, back-channeling, and environmental interruptions such as coughing or ambient noise.

Half-duplex constraints fundamentally limit conversational naturalness. Human dialogue involves substantial ambiguity in turn-taking, with speakers frequently beginning utterances before previous speakers complete their turns, or producing back-channel responses during another speaker's extended turn. Half-duplex models cannot process these patterns, forcing rigid turn-taking that feels unnatural to human users. The Moshi model represents the sole full-duplex implementation, enabling simultaneous processing of incoming speech while producing output, thereby supporting natural overlapping speech patterns and back-channeling. This capability demonstrates technical feasibility of full-duplex speech-to-speech models, though production-readiness challenges remain.
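The behavioral difference can be illustrated with a toy stream handler: a full-duplex-style loop can emit a back-channel while the user's turn is still in progress, which a strict listen-then-speak alternation cannot do. Everything below (the queue protocol, frame names, trigger condition) is an illustrative sketch, not Moshi's actual mechanism:

```python
import asyncio

async def full_duplex_turn(incoming: asyncio.Queue, outgoing: asyncio.Queue) -> None:
    """Interleave listening and speaking on one stream: emit a
    back-channel mid-utterance instead of waiting for the turn to end."""
    frames_heard = 0
    while True:
        frame = await incoming.get()
        if frame is None:          # end of the user's turn
            await outgoing.put("full response")
            return
        frames_heard += 1
        if frames_heard == 2:      # mid-utterance: still listening, yet speaking
            await outgoing.put("mhm")

async def demo() -> list[str]:
    incoming, outgoing = asyncio.Queue(), asyncio.Queue()
    for frame in ["frame1", "frame2", "frame3", None]:
        incoming.put_nowait(frame)
    await full_duplex_turn(incoming, outgoing)
    return [outgoing.get_nowait() for _ in range(outgoing.qsize())]

print(asyncio.run(demo()))  # ['mhm', 'full response']
```

A half-duplex model would produce only the final response: nothing can be emitted until the input side goes quiet.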

3.3 Intelligence and Reliability Trade-offs

A fundamental tension exists between conversational naturalness and system intelligence. Speech-to-speech models trained on audio renditions of factual question-answering datasets produce natural-sounding speech but have no training incentive to capture the paralinguistic information present in their data. While these models technically possess the capacity for paralinguistic understanding (detecting tone, emotion, and conversational cues), this capability remains unexploited because the training objective does not reward it. Furthermore, speech-to-speech models lack the reliability, observability, tool-calling capabilities, and personalization features that are mature in cascaded systems.

Conversely, cascaded systems built on text-based language models inherit robust tool integration, retrieval-augmented generation, and fine-tuning capabilities, but operate with high latency and limited access to paralinguistic information lost during transcription. This creates a binary choice in current implementations: natural-sounding but unreliable speech-to-speech models versus intelligent but high-latency cascaded systems. The path forward requires architectures that combine full-duplex conversational naturalness with the reliability, intelligence, and tool-calling capabilities of cascaded approaches.

3.4 Scalability Economics and On-Device Processing

Voice AI faces substantial economic barriers to scalable deployment. Most hyperscaler voice model offerings operate at a loss, serving as marketing investments rather than profitable products. While language model costs have decreased to negligible levels and speech-to-text transcription remains affordable, TTS dominates cost structures. Developers report exhausting entire funding rounds on TTS API costs before achieving user growth, rendering current pricing models unsustainable for consumer-scale voice applications.

This economic constraint has motivated development of on-device processing capabilities. The Gradium Phonon model represents a technical approach to this challenge, implementing TTS with under 100 million parameters capable of execution on smartphone CPUs rather than requiring GPU acceleration. This model includes voice cloning capabilities while maintaining performance competitive with existing on-device solutions. On-device processing eliminates per-request API costs, enabling zero-marginal-cost scaling for voice applications. Additionally, local processing addresses privacy concerns, as users increasingly prefer local computation over cloud-based processing that creates data breach vulnerabilities. The viability of on-device models suggests a path toward economically sustainable voice AI deployment.
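A back-of-envelope cost model shows why on-device synthesis changes the economics; the per-character price and speaking rate below are assumptions for illustration, not quoted figures from any provider:

```python
# Illustrative assumptions, not real provider pricing.
CLOUD_TTS_USD_PER_1K_CHARS = 0.015   # assumed cloud TTS API rate
CHARS_PER_MINUTE_OF_SPEECH = 750     # rough speaking-rate estimate

def monthly_tts_cost_usd(users: int, minutes_per_user: float,
                         on_device: bool = False) -> float:
    """Marginal TTS cost for a month of usage. On-device inference has
    zero per-request cost once the model ships inside the app."""
    if on_device:
        return 0.0
    chars = users * minutes_per_user * CHARS_PER_MINUTE_OF_SPEECH
    return chars / 1000 * CLOUD_TTS_USD_PER_1K_CHARS

# 100k users speaking 30 minutes each per month:
print(monthly_tts_cost_usd(100_000, 30))                   # 33750.0
print(monthly_tts_cost_usd(100_000, 30, on_device=True))   # 0.0
```

Whatever the exact rates, the structure is what matters: cloud TTS cost scales linearly with usage, while the on-device marginal cost is flat at zero.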

4. Technical Insights

4.1 Quantitative Performance Characteristics

Empirical measurements establish concrete performance benchmarks for voice AI systems. Cascaded TTS latency exceeds 200 milliseconds, while human conversational response occurs within 200 milliseconds for the complete perception-cognition-production cycle. Tool call latency ranges from 500 milliseconds to 4 seconds, representing the primary bottleneck in modern voice agent systems. These measurements indicate that achieving human-level responsiveness requires order-of-magnitude latency reductions across multiple system components.
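Using the figures above, a quick calculation makes the responsiveness gap concrete (values in milliseconds, taken from the ranges cited):

```python
# Gap between voice-agent response latency and the human turn-taking
# budget, using the figures cited in the text (milliseconds).
HUMAN_TURN_BUDGET_MS = 200
TTS_FIRST_AUDIO_MS = 200          # cascaded TTS lower bound
TOOL_CALL_RANGE_MS = (500, 4000)  # tool-call latency range

def turn_latency_ms(tool_ms: float) -> float:
    # The tool call and synthesis both sit on the response path, so they add.
    return TTS_FIRST_AUDIO_MS + tool_ms

best = turn_latency_ms(TOOL_CALL_RANGE_MS[0])   # 700
worst = turn_latency_ms(TOOL_CALL_RANGE_MS[1])  # 4200
print(best / HUMAN_TURN_BUDGET_MS)   # 3.5x the human budget
print(worst / HUMAN_TURN_BUDGET_MS)  # 21.0x the human budget
```

Even the best case overshoots the human perception-to-production cycle several times over, which is the order-of-magnitude gap the text refers to.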

4.2 Architectural Implementation Considerations

Practical implementation of advanced voice AI systems requires specific architectural choices. Streaming implementations of speech-to-text and text-to-speech with semantic voice activity detection reduce latency compared to batch processing but remain insufficient for human-level performance. Filler strategies enabling language models to maintain conversational flow during tool execution require careful prompt engineering to ensure natural integration of retrieved information. Full-duplex speech-to-speech models necessitate architectural innovations enabling simultaneous input processing and output generation, with Moshi demonstrating feasibility through specialized training approaches.

4.3 Model Scaling and Efficiency

On-device deployment imposes strict efficiency constraints. The Gradium Phonon model's sub-100-million parameter count represents the scale necessary for CPU execution on consumer devices. This parameter budget requires careful architecture design and training procedures to maintain quality competitive with larger models. Voice cloning capabilities in resource-constrained models demand efficient speaker embedding approaches that minimize computational overhead while preserving voice characteristics. These efficiency requirements distinguish on-device models from cloud-based implementations optimized for quality without resource constraints.
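To see why the sub-100-million parameter budget matters for CPU deployment, consider the raw weight memory at different precisions; this is a rough estimate that ignores activations, KV caches, and runtime overhead:

```python
def model_memory_mb(params: int, bytes_per_param: float) -> float:
    """Approximate weight memory for a model at a given precision."""
    return params * bytes_per_param / (1024 ** 2)

PARAMS = 100_000_000  # upper bound of the sub-100M budget

# fp32 vs int8: quantization is part of what makes a phone-resident
# TTS model plausible alongside the rest of an app's memory use.
print(round(model_memory_mb(PARAMS, 4), 1))  # 381.5 MB at fp32
print(round(model_memory_mb(PARAMS, 1), 1))  # 95.4 MB at int8
```

At around 100 MB quantized, the weights fit comfortably in a smartphone's memory budget; a multi-billion-parameter cloud model would not.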

5. Discussion

The analysis reveals that achieving human-level voice AI requires simultaneous advancement across multiple technical dimensions rather than incremental improvement of existing approaches. The fundamental tension between conversational naturalness and system intelligence cannot be resolved through optimization within current architectural paradigms. Instead, hybrid approaches combining full-duplex speech-to-speech conversational capabilities with cascaded system reliability and tool integration represent the necessary path forward.

Several critical gaps remain in current voice AI research and development. Paralinguistic understanding capabilities, while technically present in speech-to-speech models, require training procedures and objectives that incentivize capture and utilization of tonal, emotional, and contextual cues. The integration of tool-calling and retrieval-augmented generation into full-duplex speech-to-speech models presents both architectural and training challenges. Furthermore, cross-cultural variation in conversational norms, exemplified by differing back-channeling patterns, necessitates culturally adaptive models rather than a single universal implementation.

The economic dimension of voice AI deployment merits particular attention. The current cost structure, dominated by TTS API fees, creates a barrier to consumer-scale applications that on-device processing may alleviate. However, on-device deployment introduces challenges in model updating, personalization, and capability extension that cloud-based systems handle straightforwardly. The optimal architecture may involve hybrid approaches where lightweight on-device models handle latency-critical conversational flow while cloud-based systems provide computationally intensive capabilities such as complex reasoning and knowledge retrieval.

6. Conclusion

This synthesis establishes that contemporary voice AI systems remain substantially distant from human-level conversational capability due to fundamental constraints in latency, duplex communication, reliability, and scalability economics. Cascaded systems provide intelligence and tool integration but cannot achieve human conversational timing due to inherent latency exceeding 200 milliseconds. Speech-to-speech models reduce latency but predominantly operate in half-duplex mode incompatible with natural conversational flow, with full-duplex implementations lacking production reliability.

The path toward human-level voice AI requires hybrid architectures combining full-duplex conversational models with cascaded system intelligence, alongside on-device processing innovations addressing economic scalability. Practical implications for system designers include prioritizing full-duplex capability in speech-to-speech models, developing filler strategies for tool-calling latency management in cascaded systems, and implementing on-device models for cost-effective deployment. Future work should focus on training procedures that incentivize paralinguistic understanding, architectural innovations enabling tool integration in speech-to-speech models, and culturally adaptive conversational systems. Voice AI remains a technically challenging domain where the final increments toward human-level performance represent the most difficult engineering and scientific problems.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
