TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google


By Sean Weldon

Tiny Language Models and Agentic Systems at the Edge: Infrastructure, Optimization, and Deployment Strategies

Abstract

Edge-based deployment of generative AI systems represents a fundamental architectural shift from cloud-centric inference paradigms. This synthesis examines two complementary deployment strategies emerging in production systems: system-level GenAI utilizing 2-5 billion parameter foundation models integrated into operating systems, and in-app GenAI employing sub-1 billion parameter models fine-tuned for specific tasks. The analysis evaluates Google's LiteRT-LM infrastructure, the Gemma 3n model family optimized for edge execution, and novel architectural patterns including progressive skill disclosure and constrained decoding for agentic workflows. Key findings demonstrate that models below 500 million parameters require task-specific fine-tuning to achieve production reliability, with 20-40 point improvements in evaluation metrics. Performance benchmarks indicate thousands of tokens per second on high-end mobile GPUs and 133 tokens per second on resource-constrained platforms such as the Raspberry Pi. These advances enable latency-sensitive applications, preserve user privacy through on-device processing, and reduce operational costs while supporting multimodal capabilities and function calling at the edge.

1. Introduction

The deployment of generative artificial intelligence has traditionally depended upon cloud-based infrastructure, where massive computational resources serve large language models through network APIs. However, this centralized paradigm introduces fundamental constraints: network latency incompatible with real-time applications such as live voice translation, privacy vulnerabilities from transmitting sensitive user data, dependency on continuous connectivity, and escalating operational costs as usage scales. Edge AI—the execution of machine learning inference directly on user devices—addresses these limitations while introducing novel technical challenges in model optimization, memory management, and cross-platform deployment.

Recent advances in model compression, quantization techniques, and specialized hardware acceleration have enabled the deployment of increasingly capable language models on resource-constrained devices. This synthesis examines the current state of edge AI deployment through the lens of production systems, focusing on infrastructure that supports billions of deployed instances across mobile, desktop, IoT, and web platforms. The analysis reveals two distinct architectural approaches that balance model capability against device constraints: system-level foundation models of 2-5 billion parameters built into operating systems, and application-specific tiny models under 1 billion parameters that ship with individual applications.

Central questions addressed include: What technical trade-offs distinguish system-level versus in-app GenAI deployment strategies? How do model size, fine-tuning approaches, and hardware acceleration interact to determine production viability? What novel capabilities—particularly agentic skills and function calling—become feasible at the edge, and through what technical mechanisms? The following sections establish the theoretical foundation for edge AI deployment, analyze the technical infrastructure enabling cross-platform execution, examine specific model architectures and their performance characteristics, and synthesize practical insights for implementing edge-based generative systems.

2. Background and Related Work

Edge AI deployment builds upon decades of research in model compression, quantization, and efficient inference. TensorFlow Lite (TF Lite), since rebranded as LiteRT, emerged as the standard inference framework for mobile and embedded systems, providing a runtime optimized for resource-constrained environments. This infrastructure enables deployment of neural networks across heterogeneous hardware including CPUs, GPUs, and specialized neural processing units (NPUs). The framework ships as part of Android system services and supports cross-platform deployment to iOS, macOS, Linux, Windows, web, and IoT devices.

The theoretical foundation distinguishes between just-in-time (JIT) compilation workflows, which produce single artifacts deployable across CPU and GPU targets, and ahead-of-time (AOT) compilation, which generates device-specific binaries for NPU acceleration through vendor-specific compiler plugins. This architectural distinction reflects fundamental trade-offs between deployment flexibility and hardware-specific optimization. The MediaPipe technology stack, which powers production features such as effects in YouTube Shorts and photo processing in Google Photos, extends this infrastructure to support large language model inference through the LiteRT-LM runtime, designed specifically for cross-platform LLM deployment.

Prior work in edge deployment has primarily focused on computer vision and lightweight classification tasks. The emergence of capable language models under 5 billion parameters, particularly the Gemma model family with its permissive open-weights licensing, has expanded the scope of feasible on-device applications to include multimodal understanding, function calling, and agentic workflows previously restricted to cloud-based systems.

3. Core Analysis

3.1 Deployment Paradigms: System-Level vs. In-App GenAI

Two distinct architectural paradigms have emerged for edge-based generative AI deployment, each optimized for different use cases and device constraints. System-level GenAI integrates 2-5 billion parameter foundation models directly into operating systems—exemplified by Android's AICore and Apple Intelligence—where these models serve as shared resources customizable through prompting and skills. This approach provides sophisticated capabilities but restricts deployment to premium devices with sufficient memory and computational resources.

In contrast, in-app GenAI employs custom tiny models below 1 billion parameters that load with individual applications, enabling deployment across all device tiers including low-end and mid-range hardware. These application-specific models achieve production reliability through fine-tuning for narrow task domains rather than attempting general-purpose capabilities. The architectural choice reflects a fundamental trade-off: system models offer broader capabilities and shared infrastructure benefits, while in-app models maximize device reach and enable offline functionality on resource-constrained platforms.

The distinction has practical implications for deployment strategy. System models leverage their larger parameter counts (2-5B) to handle diverse tasks through prompting alone, while tiny models below 500 million parameters require fine-tuning to achieve comparable reliability on specific functions. Empirical results from Function Gemma (270M parameters) demonstrate this principle: fine-tuning for voice-to-function calling across 10 Android functions achieved 85-90% reliability overall, with the 8 simpler functions reaching 90-93% reliability, performance levels unattainable without task-specific optimization at this model scale.
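To make such per-function reliability numbers concrete, the sketch below shows one way they could be computed from labeled (utterance, expected call) pairs; the function names and the predict() stub are illustrative assumptions, not the actual Function Gemma evaluation harness.

```python
# Minimal sketch: measuring per-function reliability of a voice-to-function model.
# Function names and the predict() stub are illustrative assumptions.
from collections import defaultdict

def evaluate(examples, predict):
    """examples: list of (utterance, expected_call) pairs;
    predict: callable mapping an utterance to a function-call string."""
    hits, totals = defaultdict(int), defaultdict(int)
    for utterance, expected in examples:
        fn = expected.split("(")[0]          # group accuracy by target function
        totals[fn] += 1
        if predict(utterance) == expected:
            hits[fn] += 1
    return {fn: hits[fn] / totals[fn] for fn in totals}

# Usage with a trivial stand-in predictor:
examples = [
    ("turn on the flashlight", "set_flashlight(on=True)"),
    ("set an alarm for 7 am", "set_alarm(time='07:00')"),
]
print(evaluate(examples, predict=lambda u: "set_flashlight(on=True)"))
```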

3.2 Memory Architecture and Optimization Techniques

The Gemma 3n E2B and E4B models demonstrate novel memory management strategies that enable deployment of billion-parameter models on memory-constrained devices. The E2B model keeps only 2 billion parameters resident in RAM during inference, while the remaining parameters are handled through per-layer embeddings with memory-mapped access. This architecture requires loading only hundreds to thousands of bytes per inference step rather than holding the entire parameter set in active memory, dramatically reducing the memory footprint while preserving model capability.
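The sketch below illustrates the memory-mapping idea in plain NumPy: embedding tables live on disk and only the rows needed for the current tokens are paged in. The shapes, dtypes, and file layout are tiny illustrative assumptions, not the actual Gemma 3n storage format.

```python
# Minimal sketch of the per-layer embedding idea: keep bulk embedding tables on
# disk, memory-map them, and read only the rows needed for the current tokens.
import numpy as np

LAYERS, VOCAB, DIM = 4, 8_000, 256   # illustrative sizes only

def write_tables(path: str) -> None:
    # One embedding table per layer, stored contiguously in a single .npy file.
    tables = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float16, shape=(LAYERS, VOCAB, DIM))
    tables[:] = 0.01
    tables.flush()

def gather(path: str, layer: int, token_ids: list[int]) -> np.ndarray:
    # Memory-map the file; only the pages actually touched are read into RAM.
    tables = np.load(path, mmap_mode="r")
    return np.asarray(tables[layer, token_ids])  # a few hundred bytes per token

write_tables("ple.npy")
print(gather("ple.npy", layer=2, token_ids=[5, 17, 901]).shape)  # (3, 256)
```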

Hardware acceleration strategies vary by target platform and available computational resources. CPU optimization leverages XNNPACK, a specialized kernel library that ensures consistent performance across diverse device types. GPU acceleration uses the ML Drift engine for mobile and edge GPUs, with the JIT workflow producing a single .litertlm artifact deployable to both CPU and GPU targets without separate compilation. NPU deployment requires AOT compilation through vendor-specific compiler plugins, generating device-specific binaries dispatched through device drivers. Despite the different build workflows, the runtime exposes a consistent API across JIT and AOT execution paths.
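A hypothetical wrapper below shows how an application might keep a single call site while choosing between the JIT (CPU/GPU) and AOT (NPU) execution paths. EngineConfig, select_backend, and generate are assumptions made for illustration, not the actual LiteRT-LM API surface.

```python
# Hypothetical sketch of one call site across JIT (CPU/GPU) and AOT (NPU) paths.
from dataclasses import dataclass

@dataclass
class EngineConfig:
    model_path: str   # a .litertlm artifact (JIT) or an AOT-compiled NPU binary
    backend: str      # "cpu" | "gpu" | "npu"

def select_backend(has_npu_binary: bool, has_gpu: bool) -> str:
    # NPU requires a device-specific, ahead-of-time compiled binary; the CPU and
    # GPU paths share a single JIT artifact, so they serve as safe fallbacks.
    if has_npu_binary:
        return "npu"
    return "gpu" if has_gpu else "cpu"

def generate(cfg: EngineConfig, prompt: str) -> str:
    # Placeholder for the runtime call; only cfg.backend changes between devices.
    return f"[{cfg.backend}] completion for {prompt!r}"

cfg = EngineConfig("gemma-3n-e2b.litertlm", select_backend(False, True))
print(generate(cfg, "Summarize today's notes"))
```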

Performance benchmarks reveal substantial variation across hardware platforms. The E2B model achieves thousands of tokens per second on high-end Android GPUs, enabling real-time interactive applications. On resource-constrained platforms, the same model produces 133 tokens/second on Raspberry Pi hardware—sufficient for simple image analysis use cases and asynchronous processing workflows. Qualcomm IoT robotics platforms with NPU acceleration demonstrate compelling performance improvements, though specific metrics were not disclosed for competitive reasons.
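Decode throughput figures like these are typically measured by timing a fixed number of generated tokens and dividing. The sketch below shows the basic calculation with a stand-in token generator rather than a real runtime; sleeping roughly 7.5 ms per token reproduces the 133 tokens/second figure.

```python
# Sketch of measuring decode throughput (tokens/second) with a stub generator.
import time

def measure_tps(generate_token, n_tokens: int = 128) -> float:
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    return n_tokens / (time.perf_counter() - start)

print(f"{measure_tps(lambda: time.sleep(0.0075)):.0f} tok/s")
```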

3.3 Agentic Skills and Function Calling Architecture

The integration of function calling and reasoning capabilities into edge models enables agentic workflows previously restricted to cloud-based systems. The Gemma 3n E2B and E4B models incorporate built-in function calling and "thinking" capabilities, allowing the models to decompose complex tasks, invoke external tools, and synthesize results. This functionality relies on three core tools: load_skill for dynamic capability loading, run_javascript for executing computational scripts, and run_intent for invoking Android native intents.

A critical innovation enabling reliable function calling on small models is the progressive disclosure pattern. Rather than requiring the model to maintain descriptions of all available functions simultaneously—as in traditional Model Context Protocol (MCP) workflows—skills are structured with one-line descriptions in metadata files (skill.md). Detailed function signatures, parameters, and execution logic load on-demand only when the model selects a specific skill. This architectural pattern improves token efficiency and reliability by reducing the context burden during tool selection.
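A minimal sketch of the progressive disclosure pattern follows, assuming a directory-per-skill layout where the first line of each skill.md is the one-line summary. The layout and field choices are assumptions for illustration, not a defined skill format.

```python
# Progressive disclosure sketch: one-line summaries at selection time, full
# skill.md bodies loaded only on demand.
from pathlib import Path

class SkillRegistry:
    def __init__(self, root: Path):
        self.root = root

    def summaries(self) -> str:
        # What the model reasons over at tool-selection time: one line per skill.
        lines = []
        for skill_dir in sorted(self.root.iterdir()):
            first_line = (skill_dir / "skill.md").read_text().splitlines()[0]
            lines.append(f"- {skill_dir.name}: {first_line}")
        return "\n".join(lines)

    def load_skill(self, name: str) -> str:
        # Full signatures, parameters, and instructions: loaded only on demand.
        return (self.root / name / "skill.md").read_text()

# Typical flow: prompt the model with summaries(); once it picks "weather",
# append load_skill("weather") to the context and continue decoding.
```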

Constrained decoding during tool generation further enhances reliability when the tool set is finite. By restricting the model's output space to valid function calls during tool invocation, the system prevents malformed calls and improves accuracy for small models with limited reasoning capacity. Skills enable three primary augmentation patterns: input augmentation through external data sources (Wikipedia lookup, weather services), output customization through rich rendering (cards, maps, music synthesis), and domain-specific knowledge integration through specialized assets. A community skill ecosystem on GitHub facilitates sharing and curation of capabilities, with featured skills selected from user submissions.
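The toy example below illustrates constrained decoding over a finite tool set at the character level for readability; a production implementation would mask tokenizer logits instead, and the tool names are invented for this example.

```python
# Toy constrained decoding: only characters that keep the output a prefix of a
# valid call are allowed at each step.
VALID_CALLS = ["set_alarm(time=", "set_flashlight(on=", "send_sms(to="]

def allowed_next(prefix: str) -> set[str]:
    # Characters that keep the output a prefix of at least one valid call.
    return {c[len(prefix)] for c in VALID_CALLS
            if c.startswith(prefix) and len(c) > len(prefix)}

def decode(score_char) -> str:
    out = ""
    while True:
        options = allowed_next(out)
        if not options:          # a complete call stem has been emitted
            return out
        # Choose the highest-scoring character among the *allowed* ones only.
        out += max(options, key=lambda ch: score_char(out, ch))

# Even a scorer that would happily emit junk yields a well-formed call stem:
print(decode(lambda prefix, ch: 1.0 if ch == "a" else 0.5))
```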

3.4 Fine-Tuning Workflows and Synthetic Data Generation

Models below 500 million parameters require fine-tuning to achieve production-level reliability on narrow tasks, with empirical results demonstrating 20-40 point improvements in evaluation metrics over base models. The deployment workflow follows a structured pipeline: models originate in standard transformers format, undergo optimization and quantization through LiteRT-Torch (a PyTorch-based conversion package), and are compiled into .litertlm files for deployment across target platforms.
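The pipeline shape can be sketched as follows. The loading and dynamic-quantization steps use standard Hugging Face and PyTorch APIs, while export_to_litertlm is a named stub standing in for the LiteRT conversion tooling, whose exact interface is not covered here; the model ID is likewise an assumption.

```python
# Pipeline-shape sketch: transformers checkpoint -> quantization -> edge artifact.
import torch
from transformers import AutoModelForCausalLM

def export_to_litertlm(model: torch.nn.Module, path: str) -> None:
    # Placeholder only: the real tooling compiles the model into a .litertlm file.
    torch.save(model.state_dict(), path)

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")  # assumed ID
quantized = torch.ao.quantization.quantize_dynamic(   # int8 weights for Linear layers
    model, {torch.nn.Linear}, dtype=torch.qint8)
export_to_litertlm(quantized, "gemma-270m.litertlm")
```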

Synthetic data generation from larger cloud models enables efficient fine-tuning of tiny models without extensive human annotation. This approach leverages the superior capabilities of billion-parameter cloud models to generate training examples that capture desired behaviors, which are then distilled into tiny models through instruction fine-tuning. The technique embeds desired behavior patterns directly into model weights rather than relying on prompting strategies that consume context windows and introduce variability.
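A minimal sketch of this distillation loop, with the cloud teacher replaced by a stub, might look like the following; the JSONL field names and file path are assumptions.

```python
# Synthetic-data distillation sketch: a larger "teacher" model labels raw inputs,
# and the resulting pairs become instruction-tuning data for the tiny student.
import json

def teacher(utterance: str) -> str:
    # Stand-in for a cloud model call producing the desired target behavior.
    return utterance.rstrip(".").capitalize() + " (polished)"

raw_inputs = [
    "um so the meeting is uh moved to friday.",
    "send uh the report to priya.",
]

with open("train.jsonl", "w") as f:
    for text in raw_inputs:
        f.write(json.dumps({"prompt": text, "completion": teacher(text)}) + "\n")
# train.jsonl then drives instruction fine-tuning of the sub-500M student model.
```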

The AI Edge Eloquent application exemplifies this approach in production. The iOS app combines speech recognition with text polishing using fine-tuned Gemma 3 270M derivatives in a two-stage pipeline. An ASR engine produces unfiltered transcription, which a text polishing engine then processes to remove interjections and correct speech patterns. Personalization occurs through a biasing dictionary where users add uncommon names and technical terms to improve recognition accuracy. The entire system operates offline with no cloud dependency or usage costs, demonstrating the viability of complex workflows on tiny models through appropriate fine-tuning.
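Structurally, the two-stage pipeline can be sketched as below, with both engines replaced by stubs and the biasing dictionary applied after polishing; this mimics the interface described above, not the Eloquent app's actual code.

```python
# Two-stage sketch: raw ASR text feeds a polishing step, with a user-maintained
# biasing dictionary fixing uncommon names and terms. Both stages are stubs.
def asr(audio) -> str:
    # Stand-in for the on-device speech recognition engine.
    return "um so please ping cormac uh about the lite r t demo"

def polish(text: str, bias: dict[str, str]) -> str:
    # A fine-tuned 270M model would do this generatively; here we just strip
    # fillers and apply the biasing dictionary.
    words = [w for w in text.split() if w not in {"um", "uh", "so"}]
    out = " ".join(words)
    for term, replacement in bias.items():
        out = out.replace(term, replacement)
    return out[0].upper() + out[1:] + "."

bias = {"cormac": "Cormac", "lite r t": "LiteRT"}
print(polish(asr(None), bias))  # Please ping Cormac about the LiteRT demo.
```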

LoRA (Low-Rank Adaptation) fine-tuning provides parameter-efficient adaptation, with adapter sizes ranging from roughly 8 MB to 100 MB depending on the chosen rank. These adapters can be hot-swapped without reloading the base model, enabling applications to switch between specialized capabilities dynamically. This modularity allows a single base model to serve multiple fine-tuned features within one application, amortizing its memory cost across diverse functionality.
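As one server-side illustration of the hot-swapping pattern, the Hugging Face PEFT library (chosen here purely for illustration; the source describes the pattern, not this API) supports loading several adapters against one base model and switching between them at runtime. The model ID and adapter paths below are assumptions.

```python
# Adapter hot-swapping sketch with Hugging Face PEFT: one base model in memory,
# small task adapters switched at runtime.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")  # assumed ID

# Attach one adapter, then load a second; only the small adapter weights move.
model = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")
model.load_adapter("adapters/polish", adapter_name="polish")

model.set_adapter("polish")      # feature A: text polishing
# ... run inference ...
model.set_adapter("summarize")   # feature B: summarization, no base model reload
```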

4. Technical Insights

The technical findings synthesized from production deployment experience reveal several actionable insights for edge AI implementation. First, model size thresholds critically determine deployment strategy: models below 500M parameters require task-specific fine-tuning for production reliability, models between 500M-1B parameters can handle narrow general-purpose tasks, and models above 2B parameters support broader capabilities through prompting and skills alone. This size-capability relationship directly influences the system-level versus in-app architectural decision.

Second, memory management strategies enable deployment of models larger than available RAM through per-layer embedding approaches. The Gemma 3n E2B architecture demonstrates that keeping only the core parameters (2B) in active memory while memory-mapping the remainder reduces memory requirements by orders of magnitude with minimal latency impact. Implementation requires careful consideration of storage I/O characteristics and caching strategies to avoid performance degradation from repeated disk access.

Third, progressive disclosure and constrained decoding substantially improve reliability of function calling on small models. Traditional approaches that require models to reason over complete function signatures simultaneously exceed the effective context capacity of tiny models. Progressive disclosure reduces this burden to one-line descriptions during tool selection, loading detailed specifications only after selection. Constrained decoding during tool generation prevents malformed outputs by restricting the generation space to valid function calls.

Fourth, hardware acceleration strategies must account for the JIT versus AOT compilation trade-off. JIT workflows provide maximum deployment flexibility with single artifacts supporting CPU and GPU across platforms, while AOT compilation enables NPU acceleration at the cost of device-specific binaries and more complex deployment pipelines. The consistent API across execution paths allows applications to adapt dynamically to available hardware without code changes.

Limitations include the requirement for fine-tuning infrastructure and expertise when deploying sub-500M parameter models, the device-specific nature of NPU deployment requiring separate compilation for each target platform, and the need for careful safety evaluation, whose surface area differs between open-ended generative applications and regenerative applications that rewrite existing user content. The 32K context window of the E2B and E4B models, while substantial for edge deployment, constrains certain long-context applications possible with larger cloud models supporting 128K tokens.

5. Discussion

The emergence of viable edge-based generative AI deployment represents a fundamental shift in system architecture with implications extending beyond immediate technical considerations. The demonstrated capability to achieve thousands of tokens per second on mobile GPUs and maintain functional performance on resource-constrained platforms such as Raspberry Pi (133 tokens/second) suggests that latency-sensitive applications previously requiring cloud infrastructure can now execute entirely on-device. This architectural shift has cascading effects on privacy preservation, offline functionality, and operational cost structures.

The bifurcation into system-level and in-app deployment paradigms reflects broader trends in platform evolution. System-level GenAI, integrated into operating systems like Android's AICore and Apple Intelligence, follows the platform services model where shared infrastructure amortizes resource costs across applications. However, this approach inherently restricts deployment to premium devices, creating a capability divide between device tiers. In-app GenAI democratizes access by enabling deployment across all device classes through smaller, specialized models, but requires individual applications to bear the memory and storage costs of model distribution.

Several knowledge gaps warrant further investigation. The optimal balance between model size and fine-tuning investment remains underexplored—while empirical results demonstrate that sub-500M models require fine-tuning, the precise relationship between model capacity, training data volume, and task complexity requires systematic study. The progressive disclosure pattern for skill loading shows promise but lacks comprehensive evaluation across diverse task domains and model sizes. Additionally, the safety implications of edge-deployed generative models, particularly regarding adversarial inputs and output validation in offline contexts, require deeper analysis as deployment scales.

The demonstrated viability of multimodal capabilities (audio, image, text) in 2-4B parameter edge models suggests convergence toward unified multimodal architectures at the edge. This trajectory aligns with broader industry trends toward multimodal foundation models but introduces novel challenges in memory management, hardware acceleration, and cross-modal reasoning on resource-constrained devices. The permissive open-weights licensing of the Gemma model family facilitates research and deployment across organizational contexts, potentially accelerating innovation in edge AI architectures.

6. Conclusion

This synthesis has examined the technical foundations, deployment strategies, and practical considerations for edge-based generative AI systems through analysis of production infrastructure and deployed applications. The key finding establishes a clear size-capability relationship: models below 500 million parameters require task-specific fine-tuning, which yields 20-40 point improvements in evaluation metrics, while 2-5 billion parameter models support agentic workflows through built-in function calling and multimodal capabilities. The architectural distinction between system-level GenAI (2-5B parameters integrated into operating systems) and in-app GenAI (<1B parameters shipped with applications) reflects fundamental trade-offs between capability and device reach.

Technical innovations enabling this deployment paradigm include per-layer embedding memory management reducing active memory requirements, progressive disclosure patterns improving function calling reliability on small models, and cross-platform compilation workflows supporting heterogeneous hardware acceleration. Performance benchmarks demonstrate viability across device tiers from high-end mobile GPUs (thousands of tokens/second) to resource-constrained platforms (133 tokens/second on Raspberry Pi), enabling diverse application contexts from real-time interaction to asynchronous processing.

Practical applications should consider model size thresholds when selecting between system-level and in-app deployment, invest in fine-tuning infrastructure for sub-500M parameter models, leverage progressive disclosure and constrained decoding for reliable function calling, and evaluate JIT versus AOT compilation trade-offs based on target hardware. Future work should systematically characterize the model size-task complexity-fine-tuning relationship, evaluate progressive disclosure across diverse domains, and develop comprehensive safety frameworks for edge-deployed generative systems operating in offline contexts. The demonstrated capabilities suggest that edge AI deployment will increasingly serve latency-sensitive, privacy-preserving, and cost-conscious applications previously dependent on cloud infrastructure.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.

LinkedIn | Website | GitHub