'From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google'
By Sean Weldon
Optimizing On-Device Language Model Deployment: Architecture and Performance Analysis of Google AI Edge
Abstract
This synthesis examines Google AI Edge, a deployment framework for running language models efficiently on resource-constrained mobile and embedded devices. The architecture takes a bifurcated approach: system-level generative AI through pre-installed models (Gemini Nano) versus application-specific Tiny Language Models (TLMs) under one billion parameters. The underlying LiteRT runtime is integrated into Android and supports over 2.7 billion devices, while benchmarks show Function Gemma (270M parameters) reaching approximately 2,000 tokens per second in prefill and 140 tokens per second in decode on Pixel 7 hardware. Fine-tuning on synthetic datasets improved function calling accuracy from a 46% baseline to over 90% on eight of ten tested functions. The framework prioritizes latency reduction, privacy preservation, and offline functionality while maintaining cross-platform compatibility through standardized model formats. These findings matter for developers who must balance model capability against deployment constraints in production environments.
1. Introduction
The deployment of large language models on edge devices is an engineering challenge that requires balancing competing constraints. While cloud-based inference maximizes model capacity and computational resources, on-device execution provides critical operational advantages: lower latency through elimination of network round-trips, data privacy via local processing, offline availability independent of connectivity, improved reliability without external service dependencies, and substantial cost reduction through elimination of per-query inference charges. However, mobile and embedded platforms impose severe restrictions on model size, memory footprint, and computational throughput that necessitate architectural compromises.
Tiny Language Models (TLMs), defined as models containing fewer than one billion parameters, represent a specialized category optimized for edge deployment scenarios. These models sacrifice some capability breadth to achieve computational efficiency suitable for resource-constrained environments. The central engineering question concerns how to maximize functional utility within these parameter constraints while maintaining acceptable performance characteristics for production applications.
Google AI Edge addresses this deployment challenge through a stack comprising MediaPipe for multimodal processing, LiteRT-LM as the LLM execution harness, and the LiteRT runtime (formerly TensorFlow Lite) for cross-platform inference. The framework offers developers two distinct deployment paradigms: System GenAI, which uses pre-installed, hardware-optimized models accessible via standardized APIs, and App GenAI, which integrates a custom TLM directly within the application binary. This analysis examines the architectural foundations, performance characteristics, and practical deployment patterns of the Google AI Edge stack, with particular emphasis on function calling capabilities, agent skill composition, and quantified performance improvements through fine-tuning.
2. Background and Related Work
The LiteRT runtime (formerly TensorFlow Lite) is mature infrastructure integrated into the Android operating system, running on over 2.7 billion devices and invoked daily in production. It provides hardware abstraction across CPU, GPU, and Neural Processing Unit (NPU) backends, allowing platform-specific optimization behind a consistent API. The runtime targets Android, iOS, web, and embedded systems, giving it broad coverage of the mobile ecosystem.
The Gemma model family serves as the foundational architecture for both system-level and application-level implementations within the Google AI Edge framework. Gemini Nano, the system-integrated variant accessible through AI Core, utilizes Gemma 4 E2B and E4B as base models. This approach follows established patterns of system-level AI services where shared infrastructure amortizes optimization costs across the device ecosystem. Applications utilizing System GenAI benefit from zero binary size increase, as models remain pre-installed at the operating system level, while also receiving ongoing optimization updates without requiring application modifications.
The deployment workflow follows a standardized pipeline: models start in Hugging Face Transformers format, are converted with the LiteRT Torch export tooling, and are packaged in the .litertlm format, a unified standard that bundles tokenizer and model weights in a single file. This format replaces legacy .task files for LLM-specific deployments, though .task files remain in use for multi-component models that require additional code integration, such as face mesh processing.
3. Core Analysis
3.1 Architectural Trade-offs: System versus Application GenAI
The framework presents developers with a fundamental architectural decision between System GenAI and App GenAI deployment patterns, each offering distinct advantages aligned with specific use case requirements. System GenAI through Gemini Nano provides pre-installed, highly optimized models that impose no application size penalty while delivering general-purpose language understanding capabilities. This approach proves optimal for applications requiring standard natural language processing functionality without specialized domain knowledge or custom behavior patterns.
Conversely, App GenAI via LiteRT-LM enables higher customization and broader device reach for boutique use cases requiring specialized functionality. This deployment pattern requires more development effort, including model selection or training, integration testing, and binary size management. In return, it provides complete control over model behavior, update cycles, and feature availability, independent of system-level model deployment schedules. The framework documentation explicitly recommends System GenAI as the default starting point, with migration to App GenAI only when customization requirements exceed system model capabilities.
3.2 Function Calling Performance and Fine-Tuning Methodology
Function Gemma, a specialized 270M parameter model based on Gemma 3 architecture, demonstrates the viability of robust function calling on resource-constrained devices when combined with appropriate fine-tuning methodologies. Baseline performance benchmarks on Pixel 7 hardware—representative of legacy mobile devices—achieved approximately 2,000 tokens per second during prefill operations and 140 tokens per second during decode phases. These performance characteristics enable interactive applications despite the computational limitations of older hardware platforms.
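As a rough back-of-the-envelope illustration of what these throughput figures imply for interactivity, the sketch below estimates response time from the reported prefill and decode rates. The prompt and output lengths are illustrative assumptions rather than figures from the talk, and the estimate ignores model load time, sampling overhead, and thermal throttling.

```python
# Approximate rates reported for Function Gemma (270M) on Pixel 7.
PREFILL_TOKENS_PER_SEC = 2000.0
DECODE_TOKENS_PER_SEC = 140.0


def estimate_latency_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate response time as prefill time plus decode time."""
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode = output_tokens / DECODE_TOKENS_PER_SEC
    return prefill + decode


if __name__ == "__main__":
    # Hypothetical workload: a 500-token prompt (instructions plus function
    # declarations) and a 35-token function call as output.
    total = estimate_latency_seconds(500, 35)
    print(f"estimated latency: {total:.2f} s")
```

Under these assumed lengths, prefill and decode each contribute about a quarter of a second, so keeping the function-call output short matters as much as keeping the prompt short.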
Out of the box, Function Gemma achieved only a 46% success rate on application intent recognition tasks, which is insufficient for production deployment without additional optimization. However, fine-tuning on synthetically generated datasets, created using larger models such as Gemini Flash, improved success rates to over 90% for eight of ten tested functions. The methodology involves defining target functions, generating diverse training examples covering edge cases and error conditions, and conducting focused fine-tuning on the compact model. While this approach requires more labor than prompting larger cloud-based models, it enables robust on-device deployment with predictable performance characteristics suitable for production environments.
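To make the "eight of ten functions" style of reporting concrete, here is a minimal sketch of how per-function success rates could be computed from evaluation results. The function names and sample outcomes are hypothetical, and the comparison checks only the selected function name, not its arguments; the evaluation used in the talk may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Each record is (expected_function, predicted_function). A prediction of
# None means the model produced no parseable function call.
Record = Tuple[str, Optional[str]]


def per_function_success(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each expected function."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for expected, predicted in records:
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {name: correct[name] / totals[name] for name in totals}


def functions_meeting_threshold(rates: Dict[str, float], threshold: float) -> int:
    """Count functions whose success rate is at or above the threshold."""
    return sum(1 for rate in rates.values() if rate >= threshold)


if __name__ == "__main__":
    # Hypothetical function names and outcomes, for illustration only.
    sample = [
        ("set_alarm", "set_alarm"),
        ("set_alarm", "set_alarm"),
        ("send_message", "send_message"),
        ("send_message", None),
    ]
    rates = per_function_success(sample)
    print(rates)
    print(functions_meeting_threshold(rates, 0.9))
```

Reporting per function rather than as a single average makes it visible when a high overall score hides one or two functions that still fail often.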
3.3 Agent Skills Framework and Modular Capability Composition
The agent skills architecture implements a prompt-based approach enabling modular capability composition through selective skill loading. The system employs a load_skill tool call mechanism allowing models to dynamically load functionality on demand without exposing complete function signatures or implementation details in the base prompt. This architecture reduces context window consumption while enabling extensible capability sets that can grow beyond initial deployment.
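The on-demand loading idea can be sketched as follows. Only the load_skill tool name comes from the talk; the Skill and SkillRegistry classes, their fields, and the prompt wording are hypothetical and do not reflect the actual Google AI Edge implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str       # one-line description always visible to the model
    instructions: str  # full instructions, added to context only after loading


class SkillRegistry:
    def __init__(self, skills: List[Skill]) -> None:
        self._skills: Dict[str, Skill] = {s.name: s for s in skills}
        self._loaded: List[str] = []

    def base_prompt(self) -> str:
        """List skill names and summaries only, keeping the base prompt short."""
        lines = ["Call load_skill(name) to load one of these skills:"]
        lines += [f"- {s.name}: {s.summary}" for s in self._skills.values()]
        return "\n".join(lines)

    def load_skill(self, name: str) -> str:
        """Handle a load_skill tool call and return the text to add to context."""
        skill = self._skills.get(name)
        if skill is None:
            return f"Unknown skill: {name}"
        if name not in self._loaded:
            self._loaded.append(name)
        return skill.instructions

    def loaded(self) -> List[str]:
        """Return the names of skills loaded so far, in load order."""
        return list(self._loaded)


if __name__ == "__main__":
    registry = SkillRegistry([
        Skill("maps", "Show a place on a map.", "Long map instructions..."),
        Skill("picker", "Pick a random option.", "Long picker instructions..."),
    ])
    print(registry.base_prompt())
    print(registry.load_skill("maps"))
```

The base prompt grows by one short line per skill, while the longer instructions cost context only for skills the model actually requests.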
Skills can incorporate custom JavaScript for UI rendering, enabling rich interactive experiences such as animated selection wheels or embedded map displays directly within the conversational interface. Gemma 4 demonstrates capability to handle multiple skills within single conversations, switching between them sequentially as user intent shifts. However, the framework documentation notes that single-interaction multi-skill calling—where multiple skills execute simultaneously within one turn—remains under development for improved robustness.
The community-driven aspect of skills development demonstrates rapid adoption, with developers creating example skills within one week of Gemma 4 release. The architecture supports publishing skills to GitHub repositories with loading via URL within the Google AI Edge Gallery application, enabling distributed skill development without centralized approval processes. This approach mirrors successful patterns from browser extension ecosystems and package management systems.
3.4 Production Implementation Case Studies
The Eloquent transcription application demonstrates a practical dual-engine architecture that uses TLMs for specialized tasks. The system employs a separate ASR (Automatic Speech Recognition) engine and a text-polishing engine, both built on Gemma 3-based models with several hundred million parameters. This architecture enables personalization through custom dictionaries for domain-specific terminology while performing real-time filler word removal (eliminating "ums" and "ahs" from transcribed output). The application currently supports iOS exclusively and is not available in European markets, which suggests that regional rollout is still in progress.
The Google AI Edge Gallery application serves as both demonstration platform and development tool, showcasing skills functionality, AI chat, image analysis, audio transcription, and third-party model support including Llama and Phi architectures. Both applications maintain open-source codebases, with iOS implementations scheduled for release following completion of Swift API development. This open-source strategy enables community validation of deployment patterns while providing reference implementations for developers building similar applications.
4. Technical Insights
Performance optimization for TLM deployment requires careful consideration of hardware acceleration capabilities across target device populations. The LiteRT runtime's support for CPU, GPU, and NPU execution paths enables platform-specific optimization, though performance characteristics vary substantially across architectures. FastVLM, Apple's 500M parameter vision-language model, running with Qualcomm NPU acceleration illustrates the importance of architecture-specific tuning for maximal throughput.
Standardization on the .litertlm format represents significant progress toward simpler deployment workflows. By bundling tokenizer and model weights in a single package with open developer tooling, the format removes the integration complexity of managing separate components. Performance data published on the Gemma LiteRT-LM model cards gives developers empirical benchmarks for capacity planning across device generations.
Fine-tuning with synthetic data generation emerges as the recommended approach for models in the sub-300M parameter range, such as the 270M parameter Function Gemma, where base model capabilities prove insufficient for specialized tasks. This methodology enables creation of large, diverse training datasets without manual annotation costs, though it requires access to larger models for data generation and careful validation to prevent synthetic data artifacts from degrading production performance.
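A minimal sketch of the generate-then-validate loop might look like the following. The target functions, prompt wording, and record layout are hypothetical, and the larger "teacher" model is represented by a plain callable so that no particular API is assumed; a real pipeline would substitute an actual model call and likely add stricter checks.

```python
import json
from typing import Callable, Dict, List

# Hypothetical target functions mapped to their required argument names.
FUNCTION_SPECS: Dict[str, List[str]] = {
    "set_alarm": ["time"],
    "send_message": ["recipient", "body"],
}


def build_teacher_prompt(function_name: str, n: int) -> str:
    """Build the request sent to the larger model for one target function."""
    args = ", ".join(FUNCTION_SPECS[function_name])
    return (
        f"Write {n} varied user requests that should trigger "
        f"{function_name}({args}). Return a JSON list of objects with keys "
        f"'utterance' and 'arguments'."
    )


def validate_examples(function_name: str, raw_json: str) -> List[dict]:
    """Keep well-formed, non-duplicate examples that have all required arguments."""
    try:
        candidates = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []
    required = set(FUNCTION_SPECS[function_name])
    seen = set()
    kept: List[dict] = []
    for item in candidates:
        if not isinstance(item, dict):
            continue
        utterance = item.get("utterance")
        arguments = item.get("arguments")
        if not isinstance(utterance, str) or not isinstance(arguments, dict):
            continue
        key = utterance.strip().lower()
        if not key or key in seen or not required.issubset(arguments):
            continue
        seen.add(key)
        kept.append({
            "utterance": utterance.strip(),
            "function": function_name,
            "arguments": arguments,
        })
    return kept


def generate(function_name: str, n: int, teacher: Callable[[str], str]) -> List[dict]:
    """Ask the teacher for n examples and return only those that pass validation."""
    return validate_examples(function_name, teacher(build_teacher_prompt(function_name, n)))
```

Filtering malformed, duplicate, and incomplete records before training is one simple guard against synthetic data artifacts; it does not address distribution shift between generated and real user requests.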
The skills framework's JavaScript integration capability presents both opportunities and security considerations. While custom rendering enables rich interactive experiences, it also introduces potential attack surfaces requiring careful sandboxing and permission management. The framework's approach of executing JavaScript within skill contexts necessitates robust isolation mechanisms to prevent malicious skills from accessing sensitive application state or device resources.
5. Discussion
The Google AI Edge framework demonstrates that specialized TLMs under 300M parameters can achieve production-grade performance for focused tasks when combined with appropriate fine-tuning. The improvement from 46% to over 90% accuracy on eight of ten tested functions through synthetic data fine-tuning is a substantial capability gain, but it trades upfront development effort for improved inference efficiency and deployment flexibility. This trade-off is favorable for applications requiring consistent low-latency responses, offline operation, or privacy-preserving local processing.
The bifurcated architecture between System GenAI and App GenAI reflects broader industry trends toward tiered AI service models. System-level integration provides commoditized capabilities with minimal integration burden, while application-level deployment enables differentiation through specialized functionality. This pattern mirrors the evolution of graphics APIs, where standardized system services coexist with application-specific rendering engines for specialized use cases.
Several areas warrant further investigation. The framework documentation notes ongoing work to improve single-interaction multi-skill calling robustness, indicating current limitations in parallel skill execution. Additionally, the regional availability restrictions on applications like Eloquent suggest regulatory or technical challenges in global deployment that merit examination. The reliance on synthetic data for fine-tuning raises questions about potential distribution shift between training data and real-world usage patterns, particularly for edge cases not adequately represented in generated datasets.
The community-driven skills ecosystem presents interesting parallels to browser extension markets and package repositories. Success of this model depends critically on discoverability mechanisms, quality assurance processes, and security review capabilities that remain under development. The framework's open-source positioning facilitates community contribution while potentially fragmenting the skills ecosystem across incompatible forks or platform-specific implementations.
6. Conclusion
This analysis demonstrates that Google AI Edge provides a comprehensive framework for deploying language models on resource-constrained devices through careful architectural partitioning between system-level and application-level integration patterns. The LiteRT runtime's presence on over 2.7 billion devices establishes substantial production deployment scale, while performance benchmarks confirm that models under 300M parameters can achieve interactive latency on legacy hardware platforms.
The documented improvement from 46% to over 90% function calling accuracy through fine-tuning methodologies provides empirical validation that TLMs can achieve production-grade performance for specialized tasks despite parameter constraints. The agent skills framework demonstrates viable approaches for modular capability composition, enabling extensible functionality without proportional context window consumption. Practical implementations such as Eloquent transcription validate these architectural patterns in production applications serving real user populations.
For practitioners, these findings suggest that on-device deployment merits serious consideration for applications prioritizing latency, privacy, offline operation, or cost efficiency. The framework's standardized tooling and open-source reference implementations reduce integration barriers, while comprehensive performance documentation enables informed capacity planning. Future work should focus on expanding multi-skill coordination capabilities, developing robust security frameworks for community-contributed skills, and establishing best practices for synthetic data generation that minimize distribution shift in production deployments.
Sources
- From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.