How We Built Zeta2: Training an Edit Prediction Model in Production — Ben Kunkle, Zed

Zeta2, Zed's edit prediction model, is trained through a distillation pipeline using production data, frontier models as teachers, and iterative filtering to creat...

2026-06-03 By Sean Weldon

Abstract

This paper examines the training methodology for Zeta2, a specialized edit prediction model designed for real-time code completion within the Zed editor. The system employs knowledge distillation from frontier models combined with production data collection to generate training examples for keystroke-level prediction. A key innovation involves using the student model itself to filter noisy settled data, reducing computational costs from one million frontier model requests per 100,000 examples to near-zero while maintaining quality. The approach identifies optimal training examples in a "middle zone" of edit distance, neither trivially obvious nor clearly erroneous, to capture novel coding patterns beyond the model's training cutoff. Evaluation combines offline metrics, including Levenshtein-based n-gram comparison and reversal ratios, with production acceptance rates. The methodology demonstrates that specialized, task-focused models trained through iterative distillation and intelligent filtering can approach teacher model quality while meeting stringent latency requirements for real-time prediction.

1. Introduction

Modern integrated development environments increasingly incorporate predictive models to accelerate software development through intelligent code completion. However, deploying such systems at production scale presents fundamental challenges: predictions must execute on every keystroke, requiring extreme latency optimization, while simultaneously maintaining sufficient accuracy to provide value to developers. This tension between performance constraints and prediction quality necessitates specialized approaches distinct from general-purpose language models.

Edit prediction represents a specific formulation of the code completion problem: given a code region surrounding the cursor position, recent editing history, type and variable definitions, and diagnostic information, the model must predict the developer's next edit. Unlike general code generation tasks, edit prediction operates within tightly constrained contexts and must deliver results within milliseconds to maintain editor responsiveness. As noted in the source material, the model "obviously needs to be very fast cuz it runs on every keystroke," fundamentally shaping architectural decisions.

This paper analyzes the training pipeline for Zeta2, an edit prediction model that addresses these constraints through knowledge distillation from frontier models, production data collection, and novel filtering techniques. The central research question examines how to efficiently generate high-quality training data from inconsistent teacher model outputs and noisy production usage patterns while maintaining the speed requirements for real-time operation. The analysis proceeds through four main components: the data collection and distillation pipeline, prediction quality filtering mechanisms, training example selection strategies, and evaluation methodologies for both offline and production environments.

2. Background and Related Work

2.1 Knowledge Distillation Framework

Knowledge distillation transfers capabilities from large, computationally expensive teacher models to smaller, faster student models. In this context, frontier models (large-scale language models with broad capabilities) serve as teachers, while Zeta2 functions as the specialized student. This approach enables the student model to focus exclusively on edit prediction rather than maintaining general-purpose capabilities, allowing architectural optimizations for speed. The task-specific design represents a deliberate trade-off: sacrificing generality to achieve the millisecond-level latency required for keystroke-responsive prediction.

2.2 Edit Distance Metrics and Evaluation

Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another, providing a quantitative basis for comparing predicted edits to actual developer actions. The delta chrF metric extends this concept through n-gram comparison of various sizes, enabling more nuanced evaluation of prediction quality beyond simple character-level matching. These metrics form the foundation for both filtering training data and evaluating model performance in offline settings.
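To make the two kinds of comparison concrete, the following is a minimal sketch rather than the pipeline's actual code. The Levenshtein function follows the standard definition; `char_ngram_f1` is a generic character n-gram F-score written for illustration, and its name, the `max_n` parameter, and the simple averaging are assumptions, since the source does not spell out how its n-gram metric is computed.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop ch_a
                               current[j - 1] + 1,       # add ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def char_ngram_f1(predicted: str, reference: str, max_n: int = 3) -> float:
    """Average F1 of character n-gram overlap for n = 1..max_n (illustrative only)."""
    scores = []
    for n in range(1, max_n + 1):
        pred = Counter(predicted[i:i + n] for i in range(len(predicted) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not pred or not ref:
            continue  # one of the strings is shorter than n
        overlap = sum((pred & ref).values())
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(pred.values())
        recall = overlap / sum(ref.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```

Edit distance counts how much typing separates two strings, while the n-gram score rewards partial overlap even when the strings differ in length or ordering, which is why the two are useful together.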

2.3 Production Data Collection

The system employs opt-in production data collection, capturing complete contextual snapshots at prediction time. This includes not only the code text but also semantic information such as type definitions, variable scopes, and compiler diagnostics. This comprehensive context enables both accurate teacher model predictions during distillation and rich input representations for the student model during inference.

3. Core Analysis

3.1 Data Collection and Distillation Pipeline Architecture

The training pipeline begins with production data snapshots that capture all contextual information present at prediction time. These snapshots serve as inputs to the frontier model distillation process, where teacher models generate predicted edits given the collected contextual data. However, a critical challenge emerges: frontier models produce highly inconsistent outputs. As characterized in the source material, "If you ask them 100,000 times, they're going to give you 100,001 answers." This inconsistency necessitates extensive prompt engineering to improve prediction quality and requires downstream filtering mechanisms to identify reliable training examples.

The pipeline employs a JSONL format where each processing stage adds or rearranges fields, which keeps the process flexible and easy to adapt for experimentation. Critically, all work up to the repair step is cacheable and reusable across experiments, enabling efficient iteration on model architectures and training procedures. Training datasets reach roughly 100,000 examples at their largest, with smaller experiments operating in the 10,000-50,000 example range.
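A staged JSONL pipeline with reusable intermediate results might look roughly like the sketch below. The function names, the cache file naming, and the choice to key the cache on a hash of the input records are all assumptions made for illustration; the source only states that stages add or rearrange fields and that earlier work is cacheable.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Iterable

Record = dict
Stage = Callable[[Record], Record]


def read_jsonl(path: Path) -> list[Record]:
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path: Path, records: Iterable[Record]) -> None:
    with path.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def run_stage(name: str, stage: Stage, records: list[Record], cache_dir: Path) -> list[Record]:
    """Apply `stage` to every record, reusing a cached JSONL file when the inputs match.

    The cache key covers the stage name and the input records only, so a change
    to the stage's code requires a new name (an assumption of this sketch).
    """
    digest = hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()[:16]
    cache_path = cache_dir / f"{name}-{digest}.jsonl"
    if cache_path.exists():
        return read_jsonl(cache_path)
    output = [stage(dict(record)) for record in records]  # copy so inputs stay untouched
    cache_dir.mkdir(parents=True, exist_ok=True)
    write_jsonl(cache_path, output)
    return output
```

Because each stage reads and writes plain records, an expensive stage such as teacher prediction runs once and later experiments start from its cached output.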

3.2 Prediction Quality Filtering and Repair Mechanisms

Offline static evaluation employs heuristics to detect problematic predictions before they enter the training dataset. These heuristics identify predictions that undo recent typing, violate editable region boundaries, or exhibit other pathological behaviors. When predictions fail these quality checks, a repair step sends the failed prediction back to the frontier model with explicit context: "Hey, it failed in this way. Can you fix it?" The repaired predictions are then converted to the expected student model output format.
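The check-then-repair loop described above can be sketched as follows. This is an illustration under stated assumptions: the `Prediction` fields, the two heuristics as written, the `ask_teacher_to_fix` callback, and the retry limit are hypothetical stand-ins, and the real heuristics are likely more involved.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prediction:
    before_recent_edit: str  # editable region before the user's latest edit
    current: str             # editable region at prediction time
    predicted: str           # editable region after applying the prediction
    touches_outside_region: bool = False


def check(prediction: Prediction) -> Optional[str]:
    """Return a description of the failure, or None when the prediction passes."""
    if prediction.touches_outside_region:
        return "edit falls outside the editable region"
    if (prediction.current != prediction.before_recent_edit
            and prediction.predicted == prediction.before_recent_edit):
        return "prediction undoes the user's most recent edit"
    return None


def check_and_repair(
    prediction: Prediction,
    ask_teacher_to_fix: Callable[[Prediction, str], Prediction],  # hypothetical teacher call
    max_attempts: int = 2,
) -> Optional[Prediction]:
    """Send failing predictions back with the failure reason; drop them if repair fails."""
    for _ in range(max_attempts):
        failure = check(prediction)
        if failure is None:
            return prediction
        prediction = ask_teacher_to_fix(prediction, failure)
    return prediction if check(prediction) is None else None
```

Passing the failure description back to the teacher mirrors the "it failed in this way" message from the text, and returning `None` lets the caller discard examples that never pass.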

Prompt formatting varies across experiments, with different configurations including or excluding diagnostics and varying the length of edit history provided to the model. This experiment-specific approach enables systematic exploration of which contextual features most improve prediction quality while maintaining the speed requirements for production deployment.

3.3 Settled Data Collection and Cost-Efficient Filtering

Settled data captures the final code state after a user stops editing a region, providing ground truth for evaluating prediction quality. The system employs a 10-second pause heuristic to detect when editing has stabilized, then snapshots the final code state. However, settled data is inherently noisy: users change their minds, autonomous agents rewrite code, and predictions that appeared reasonable at prediction time may no longer seem appropriate given the final code state.
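The pause heuristic can be illustrated with a small tracker. This is a sketch, not the editor's implementation: the class name, the polling interface, and the use of caller-supplied timestamps are assumptions; only the 10-second threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

SETTLE_SECONDS = 10.0  # pause length quoted in the text


@dataclass
class SettleTracker:
    """Tracks one edited region and reports its text once editing pauses."""
    last_edit_at: Optional[float] = None
    text: str = ""

    def on_edit(self, text: str, now: float) -> None:
        # Every edit restarts the pause timer.
        self.text = text
        self.last_edit_at = now

    def poll(self, now: float) -> Optional[str]:
        """Return the settled text once, after SETTLE_SECONDS without edits."""
        if self.last_edit_at is None or now - self.last_edit_at < SETTLE_SECONDS:
            return None
        self.last_edit_at = None  # report each settled state only once
        return self.text
```

A fixed pause cannot tell whether the user finished or merely stepped away, which is one source of the noise described above.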

The initial filtering approach generated 10 teacher predictions for each example and checked whether any matched the settled state using Levenshtein distance. This proved prohibitively expensive: 100,000 examples multiplied by 10 predictions equals one million frontier model requests. The solution represents a key innovation: substituting the trained student model (Zeta2) for the teacher during filtering. Since the student model runs locally with negligible cost, the filtering step no longer requires frontier model requests, and its cost drops to near-zero while filtering effectiveness is maintained. As the source notes, "Our student models or Zeta2 is approaching the teacher in terms of quality of prediction," which supports this substitution.
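The cost arithmetic and the sampling-based filter can be sketched as below. The function name, the `predict` and `distance` callbacks, and the `max_distance` threshold are hypothetical; the point of the sketch is that `predict` can be either an expensive teacher call or a cheap local student call without changing the filter.

```python
from typing import Callable

EXAMPLES = 100_000
PREDICTIONS_PER_EXAMPLE = 10
FRONTIER_REQUESTS = EXAMPLES * PREDICTIONS_PER_EXAMPLE  # 1,000,000 teacher calls


def settled_state_is_plausible(
    context: str,
    settled: str,
    predict: Callable[[str], str],         # teacher or student; the student is the cheap option
    distance: Callable[[str, str], int],   # e.g. a Levenshtein implementation
    samples: int = PREDICTIONS_PER_EXAMPLE,
    max_distance: int = 0,                 # hypothetical tolerance for "matches"
) -> bool:
    """True when at least one sampled prediction lands within max_distance of settled."""
    return any(distance(predict(context), settled) <= max_distance for _ in range(samples))
```

With the student as `predict`, the one million requests become local inference calls, so the filter can be rerun freely as thresholds change.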

3.4 Optimal Training Example Selection Strategy

Analysis of edit distance distributions reveals three distinct zones relative to the settled state. Predictions far from the settled state represent confident noise—cases where the model's prediction clearly diverged from the user's intent. Predictions very close to the settled state represent obvious completions, exemplified by the case: "You typed function add A plus, it's obviously B." The middle zone contains ideal training examples: predictions that are neither trivially obvious nor clearly erroneous.

This middle zone captures novel patterns beyond the student model's training cutoff, including new library functions, updated APIs, and previously unseen code patterns. Rather than training on the settled state itself, the system trains on the closest prediction to the settled state. This approach reduces noise by avoiding cases where the settled state differs from reasonable predictions due to user mind-changes or subsequent refactoring rather than prediction errors.
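One way to read the selection rule is sketched below: take the prediction closest to the settled state and keep it only if its distance falls in the middle zone. The function name and both thresholds are assumptions for illustration, and the source does not state the actual cutoffs or whether zones are judged on the closest prediction alone.

```python
from typing import Optional


def pick_training_target(
    scored: list[tuple[str, int]],  # (prediction, edit distance to the settled state)
    trivial_below: int,             # hypothetical cutoff: closer than this is "obvious"
    noise_above: int,               # hypothetical cutoff: farther than this is "noise"
) -> Optional[str]:
    """Return the closest prediction when it sits in the middle zone, else None."""
    if not scored:
        return None
    prediction, distance = min(scored, key=lambda item: item[1])
    if distance < trivial_below:
        return None  # near-exact match: the model already knows this completion
    if distance > noise_above:
        return None  # nothing came close: the settled state is likely unrelated
    return prediction  # train on the prediction itself, not on the settled state


# Example with made-up distances and thresholds: the closest candidate is in the middle zone.
target = pick_training_target([("candidate A", 12), ("candidate B", 7)], 3, 20)
```

Returning the prediction rather than the settled text reflects the choice described above: the settled state only decides which examples are kept.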

4. Technical Insights

4.1 Distillation Pipeline Implementation

The distillation pipeline demonstrates several critical implementation considerations. Frontier model inconsistency requires generating multiple predictions during evaluation—the system generates three distinct teacher predictions since multiple correct answers typically exist for any given context. Prompt engineering significantly impacts teacher model quality, though specific prompt formulations remain experiment-dependent. The cacheable pipeline architecture enables rapid experimentation by avoiding redundant computation when testing new model architectures or training procedures.

4.2 Evaluation Methodology and Metrics

The evaluation framework combines multiple complementary signals. The delta chrF metric provides offline assessment through Levenshtein-based n-gram comparison at various sizes. The reversal ratio tracks predictions that undo recent user typing, serving as a quality signal for pathological model behavior. Diagnostic error counts compare error states before and after applying predictions, providing insight into whether predictions introduce or resolve compilation errors.
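Two of these signals reduce to simple aggregates over per-prediction records, sketched here under assumptions: the `EvalRecord` fields and function names are hypothetical, and the source does not say how the diagnostic comparison is aggregated, so a mean change in error count is used.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    reverses_recent_typing: bool  # did the prediction undo what the user just typed?
    errors_before: int            # diagnostic errors before applying the prediction
    errors_after: int             # diagnostic errors after applying the prediction


def reversal_ratio(records: list[EvalRecord]) -> float:
    """Share of predictions that undo recent typing (lower is better)."""
    if not records:
        return 0.0
    return sum(r.reverses_recent_typing for r in records) / len(records)


def mean_error_delta(records: list[EvalRecord]) -> float:
    """Average change in diagnostic errors; negative means predictions fix more than they break."""
    if not records:
        return 0.0
    return sum(r.errors_after - r.errors_before for r in records) / len(records)
```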

Production evaluation employs held-out test sets to prevent training-testing overlap and tracks the kept rate—the proportion of predictions users accept in real-world usage. A dashboard monitors both acceptance rates and latency for live experiments. Importantly, offline evaluation metrics do not necessarily correlate with user preferences in the editor, necessitating production experimentation for final model selection.

4.3 Production Deployment Strategy

Deployment follows a gradual rollout procedure: experiments begin at 15% of traffic, incrementally increase to 20%, then proceed to full deployment if metrics remain positive. This conservative approach enables early detection of issues while limiting user impact. The model designated V0211 seed coder was ultimately released as Zeta2 following successful production validation.
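A percentage rollout of this kind is commonly implemented with deterministic hashing, sketched below. The source does not describe how traffic is split, so the hashing scheme, the function name, and the per-user bucketing are assumptions; only the stage percentages come from the text.

```python
import hashlib

ROLLOUT_STAGES = (15, 20, 100)  # percentages quoted in the text


def in_experiment(user_id: str, experiment: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and admit the first `percent`."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the experiment and user identifiers, everyone admitted at 15% stays admitted at 20%, so raising the percentage only adds users.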

5. Discussion

The training methodology for Zeta2 demonstrates several broader implications for deploying specialized machine learning models in latency-critical production environments. The use of student models for filtering teacher-generated training data represents a form of bootstrapping where the model being trained contributes to its own training data curation. This approach becomes viable only when the student model approaches teacher quality, suggesting an iterative training process where early models enable more efficient training of subsequent versions.

The three-zone distance distribution for training example selection addresses a fundamental challenge in learning from production data: distinguishing signal from noise in environments where ground truth is ambiguous. The middle zone strategy implicitly recognizes that both extremes—predictions very close to and very far from observed outcomes—provide limited learning signal. This insight may generalize to other domains where production usage data exhibits similar noise characteristics.

The disconnect between offline metrics and production acceptance rates highlights persistent challenges in machine learning evaluation. While offline metrics enable rapid iteration and systematic comparison, they serve as proxies for user value rather than direct measurements. The necessity of production experimentation for final model selection suggests that some aspects of model quality remain difficult to capture in offline evaluation frameworks, particularly those related to user experience and workflow integration.

Future investigation might explore several directions. The optimal ratio between teacher-generated and settled-data-filtered examples remains an open question. The relationship between edit distance thresholds and training effectiveness could be systematically characterized across different programming languages and code patterns. Additionally, the extent to which this methodology generalizes to other predictive tasks in interactive environments warrants examination.

6. Conclusion

This analysis examined the training pipeline for Zeta2, revealing a methodology that balances prediction quality, computational efficiency, and latency constraints through knowledge distillation, intelligent filtering, and production data utilization. The key innovation, using student models to filter settled data, removes roughly one million frontier model requests per 100,000 examples from the filtering step while maintaining training data quality. The three-zone training example selection strategy identifies novel patterns that extend beyond the model's training cutoff, focusing learning on genuinely informative cases rather than obvious completions or clear noise.

The practical implications extend beyond code completion to any domain requiring real-time prediction with tight latency constraints. The methodology demonstrates that specialized, task-focused models can approach teacher model quality while meeting performance requirements infeasible for general-purpose systems. Organizations deploying similar systems should consider the interplay between offline metrics and production acceptance, recognizing that final validation requires real-world experimentation despite the efficiency advantages of offline evaluation. The success of Zeta2 supports the approach of trading generality for specialization when task requirements demand extreme optimization along specific dimensions such as latency or domain specificity.

Sources

How We Built Zeta2: Training an Edit Prediction Model in Production — Ben Kunkle, Zed - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
