Evolutionary Prompt Optimization for AI Agents: A Genetic Algorithm Approach to Production Deployment
By Sean Weldon
Abstract
This paper examines evolutionary prompt optimization methodologies for AI agents through the integration of GEPA, a genetic algorithm-based optimization library, with Logfire's managed variables system for production deployment. The research demonstrates that systematic prompt optimization using Pareto frontier selection achieves substantial performance improvements—from 87% baseline accuracy to 96.7% on political relations extraction tasks—while enabling runtime configuration changes without code redeployment. The analysis reveals that optimization provides greatest value in private data scenarios where context engineering is critical, though state-of-the-art models often achieve high baseline performance with sufficient information. The methodology combines evolutionary algorithms with structured evaluation frameworks and dynamic variable management, offering practitioners a cost-effective alternative to fine-tuning. Documented production deployments demonstrate cost reductions from $5M to $73K annually through optimized prompt and model selection strategies.
1. Introduction
The proliferation of Large Language Models (LLMs) in production environments has created demand for systematic methodologies to optimize agent performance without extensive model retraining. While fine-tuning remains computationally expensive—often requiring tens of thousands of dollars per iteration—prompt engineering offers a more accessible optimization pathway. However, manual prompt refinement lacks reproducibility, systematic evaluation frameworks, and mechanisms for continuous improvement based on production feedback.
This synthesis examines an integrated approach to prompt optimization combining GEPA, a genetic algorithm library, with Logfire's observability platform and managed variables system. The methodology addresses a fundamental challenge in AI engineering: improving agent performance on domain-specific tasks while maintaining deployment flexibility and avoiding costly model retraining. The approach proves particularly valuable for organizations working with proprietary datasets, where 98% of data remains private and unavailable to foundation model training processes.
The analysis focuses on three interconnected components: (1) evolutionary prompt optimization through genetic algorithms employing Pareto frontier selection, (2) structured evaluation frameworks for measuring agent performance with deterministic validators, and (3) production deployment mechanisms enabling runtime configuration changes through managed variables. These elements collectively enable iterative improvement of AI agents in production environments, addressing the gap between initial deployment and optimal performance.
The political relations extraction task serves as the primary case study, demonstrating measurable improvements from 87% accuracy with simple prompts to 96.7% with optimized configurations. This 9.7 percentage point improvement illustrates the practical value of systematic optimization, particularly for structured output tasks where deterministic evaluation is feasible.
2. Background and Related Work
2.1 The Pydantic Ecosystem and Observability Infrastructure
The Pydantic ecosystem comprises three integrated products that collectively support the optimization workflow. The Pydantic validation library provides data modeling capabilities, Pydantic AI offers an agent framework for structured output generation, and Logfire delivers observability infrastructure. Logfire extends traditional observability pillars—logs, metrics, and traces—to incorporate evaluations (evals) and managed variables, capabilities specifically designed for AI agent optimization workflows. Built on OpenTelemetry standards, Logfire maintains compatibility with general-purpose observability infrastructure while providing AI-specific features. The platform's future roadmap includes autonomous agent optimization directly from the observability interface, enabling closed-loop improvement cycles.
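As a minimal sketch of how an application plugs into this infrastructure (assuming the logfire package is installed and a write token is configured; both are elided here), instrumentation amounts to two calls at startup:

```python
# Minimal sketch: route Pydantic AI agent activity into Logfire's
# OpenTelemetry-based tracing. Token/project configuration is elided.
import logfire

logfire.configure()               # picks up LOGFIRE_TOKEN from the environment
logfire.instrument_pydantic_ai()  # record agent runs, model requests, and tool calls as traces
```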
2.2 Prompt Optimization and Fine-Tuning Alternatives
DSPy pioneered deterministic evaluation-based prompt selection, establishing a foundation for systematic agent optimization through automated prompt engineering. The broader optimization landscape includes multiple competing approaches: fine-tuning for task-specific model adaptation, prompt engineering for context manipulation, and hybrid strategies combining both methodologies. Fine-tuning typically costs tens of thousands of dollars per iteration and risks obsolescence when next-generation foundation models are released. Consequently, prompt optimization represents the preferred approach for most use cases where baseline model capabilities suffice, with model providers explicitly recommending improved harness design over fine-tuning for many applications.
2.3 Evaluation Frameworks and Golden Datasets
Evaluation methodologies for AI agents fall into two primary categories: LLM-as-judge approaches using language models to assess output quality, and deterministic evaluators implementing rule-based validation logic. For structured output tasks, custom deterministic evaluators demonstrate superior reliability compared to LLM-based judgment, which introduces additional variance and potential bias. The creation of golden datasets—reference datasets with validated correct outputs—proves essential for systematic evaluation. These datasets can be constructed through human annotation, high-quality model outputs with manual validation, or implicit user feedback signals from production systems.
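For structured output tasks this distinction can be made concrete: a deterministic evaluator is a pure function comparing the agent's typed output against the golden reference. The sketch below illustrates the idea; the PoliticalRelation schema and the Jaccard-style overlap score are assumptions for illustration, not the exact evaluator from the talk.

```python
# Illustrative deterministic evaluator for structured extraction: compare
# the relations an agent emits against a golden reference. The schema and
# scoring rule are assumptions for this sketch.
from pydantic import BaseModel

class PoliticalRelation(BaseModel):
    person: str
    ancestor: str
    relation: str  # e.g. "father", "great-grandmother"

def score(predicted: list[PoliticalRelation], golden: list[PoliticalRelation]) -> float:
    """Jaccard overlap between predicted and golden relation sets (1.0 = exact match)."""
    pred = {(r.person, r.ancestor, r.relation) for r in predicted}
    gold = {(r.person, r.ancestor, r.relation) for r in golden}
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)
```

Because the function is deterministic, any run-to-run variance in measured accuracy is attributable to the model rather than the evaluator.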
3. Core Analysis
3.1 GEPA: Genetic Algorithm Architecture for Prompt Optimization
GEPA implements a genetic algorithm specifically designed for optimizing string values, including text prompts and JSON-structured data. Unlike random search or grid search approaches, GEPA employs Pareto frontier selection, identifying the set of non-dominated candidate solutions that represent optimal trade-offs between competing objectives. The algorithm takes candidate solutions exclusively from this Pareto frontier rather than incorporating random variations, analogous to breeding racehorses by selecting only the best performers rather than introducing inferior genetic material into the breeding pool.
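The selection mechanism can be sketched directly. The function below computes a non-dominated set over per-objective score vectors; it illustrates the concept and is not GEPA's internal implementation.

```python
# Illustrative Pareto frontier selection: keep candidates whose score
# vectors are not dominated on any objective. Concept sketch only.
def pareto_frontier(candidates: list[dict], scores: list[tuple[float, ...]]) -> list[dict]:
    """Return candidates whose score vectors no other candidate dominates.

    A vector a dominates b if a >= b on every objective and a > b on at least one.
    """
    def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        cand for cand, s in zip(candidates, scores)
        if not any(dominates(other, s) for other in scores if other is not s)
    ]
```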
The optimization process operates on key-value dictionaries, enabling simultaneous optimization of multiple variables including system prompts, model selection parameters, temperature settings, and maximum token limits. A proposer agent—itself an LLM—generates new prompt candidates based on evaluation feedback and examples from the Pareto frontier. This meta-optimization approach allows the system to generate entirely new system prompts rather than selecting from predefined options, enabling discovery of novel prompt structures but introducing risk of verbosity through prompt expansion rather than refinement.
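The resulting loop has roughly the following shape. This is a hedged sketch: evaluate and propose_variant are hypothetical stand-ins for the evaluation harness and the proposer agent, and pareto_frontier is the illustrative function above.

```python
# Sketch of the optimization loop shape: candidates are key-value dicts,
# a proposer LLM mutates frontier members, and scores drive selection.
# evaluate() and propose_variant() are hypothetical helpers.
seed = {
    "system_prompt": "Extract ancestral political relations from the page.",
    "model": "openai:gpt-4o",
    "temperature": 0.0,
    "max_tokens": 1024,
}

population = [seed]
for generation in range(10):
    scores = [evaluate(cand) for cand in population]             # per-objective score tuples
    frontier = pareto_frontier(population, scores)               # keep only non-dominated candidates
    children = [propose_variant(parent) for parent in frontier]  # proposer LLM rewrites prompts
    population = frontier + children
```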
The political relations extraction task demonstrates GEPA's effectiveness: optimization improved accuracy from 87% with a simple baseline prompt to 96.7%, surpassing a manually-crafted expert prompt that achieved 92% accuracy. This 4.7 percentage point improvement over expert human engineering validates the genetic algorithm approach for structured output tasks with deterministic evaluation criteria.
3.2 Political Relations Extraction: A Case Study in Structured Output Optimization
The political relations extraction task requires identifying ancestral political relations from Wikipedia pages of UK Members of Parliament while filtering out non-ancestral relations such as spouses, children, and siblings. Foundation models consistently struggle with this constraint-respecting behavior without explicit post-processing filters, making it an ideal optimization target.
The golden dataset was constructed using claude-opus-4.6 with manual validation across approximately 650 UK MPs. Initial experimentation with a simple prompt achieved 87% accuracy, while an expert-written prompt improved performance to 92%. GEPA, optimizing against a 65-case subset, achieved 96.7% accuracy on the full evaluation set. However, optimization revealed a notable side effect: the proposer agent generated increasingly verbose system prompts by appending instructions rather than refining existing content, suggesting that prompt length constraints may improve optimization outcomes.
Analysis of optimization behavior revealed overfitting risk when the optimizer observes limited subsets of test cases. Specifically, the optimized prompt excluded uncles and aunts from the ancestor definition despite their presence in the golden dataset, because few test cases during optimization included these relation types. This finding emphasizes the importance of maintaining separate training and validation datasets with representative distribution of edge cases.
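One practical mitigation is a stratified split of the golden dataset, so that rare relation types appear on both sides of the optimization/validation boundary. The sketch below assumes each case carries a relation-type label; the case structure and key name are illustrative.

```python
# Illustrative guard against the overfitting failure described above:
# split so every relation type (including rare ones like "uncle"/"aunt")
# is represented in both the optimization and validation sets.
import random
from collections import defaultdict

def stratified_split(cases: list[dict], key: str, holdout: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    by_type: dict[str, list[dict]] = defaultdict(list)
    for case in cases:
        by_type[case[key]].append(case)
    train, val = [], []
    for group in by_type.values():
        rng.shuffle(group)
        cut = max(1, int(len(group) * holdout))  # singleton types land in validation
        val.extend(group[:cut])
        train.extend(group[cut:])
    return train, val
```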
3.3 Evaluation Infrastructure and Variance Management
Logfire's evaluation interface enables comparison of multiple experimental runs with detailed case-level analysis, supporting systematic optimization workflows. The platform captures agent traces, tool calls, and nested agent execution patterns, providing observability into the optimization process itself. Custom deterministic evaluators compare agent outputs against golden reference data, computing accuracy metrics and identifying specific failure modes.
Variance reduction emerges as a critical consideration for reliable evaluation. Model outputs exhibit stochastic variation across runs, requiring multiple evaluations per test case to obtain stable performance estimates. Production deployments at hedge funds reportedly execute comprehensive evaluation suites 100+ times nightly at approximately $20,000 per evaluation cycle, illustrating the resource commitment required for high-confidence performance measurement. The evaluation batch interface returns lists of float scores, enabling custom cost functions that combine multiple metrics such as accuracy, precision, recall, and computational cost.
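A custom cost function over these score lists might average repeated runs to damp variance before trading accuracy against spend, along the lines of the following sketch (the run_eval_suite helper and the weighting are assumptions):

```python
# Sketch of a custom cost function: average per-case scores across N runs
# to damp stochastic variance, then combine accuracy with a token-cost
# penalty. run_eval_suite() and the weights are illustrative assumptions.
from statistics import mean

def combined_cost(candidate: dict, n_runs: int = 5) -> float:
    accuracies, token_costs = [], []
    for _ in range(n_runs):
        scores, tokens = run_eval_suite(candidate)  # hypothetical: (list[float], total tokens)
        accuracies.append(mean(scores))
        token_costs.append(tokens)
    # Lower is better: trade one point of accuracy against ~10k tokens of spend.
    return (1.0 - mean(accuracies)) + 1e-5 * mean(token_costs)
```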
3.4 Managed Variables for Production Deployment
Managed variables extend beyond simple text prompts to encompass any Pydantic model with multiple fields, enabling runtime configuration of model selection, temperature, maximum tokens, and system instructions without code redeployment. This capability proves particularly valuable for organizations with complex CI/CD pipelines where deployment friction inhibits rapid experimentation. The system implements the OpenFeature open standard for feature flagging and variable management, ensuring compatibility with existing infrastructure.
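Under the OpenFeature model, a managed variable can be resolved at request time and validated into a typed configuration object. The sketch below uses the OpenFeature Python SDK with Pydantic validation; the provider registration that binds the client to Logfire is elided, and the variable name is hypothetical.

```python
# Sketch: resolve a managed variable through the OpenFeature SDK and
# validate it into a typed config. Provider registration (binding the
# client to Logfire) is elided; "extractor-config" is a hypothetical name.
from openfeature import api
from pydantic import BaseModel

class AgentConfig(BaseModel):
    model: str = "openai:gpt-4o"
    temperature: float = 0.0
    max_tokens: int = 1024
    system_prompt: str = "Extract ancestral political relations."

client = api.get_client()
raw = client.get_object_value("extractor-config", AgentConfig().model_dump())
config = AgentConfig.model_validate(raw)  # defaults apply if the variable is unset
```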
Managed variables support A/B testing through targeting controls, allowing gradual rollout of optimized configurations to production traffic. This deployment strategy mitigates risk by enabling performance validation on production data before full adoption. The implementation requires separate API key configuration in Logfire settings, though future development plans include unification with project-level API keys for simplified authentication management.
4. Technical Insights
4.1 Implementation Considerations and Trade-offs
Pydantic AI agents generate structured output by setting output_type to the desired schema, such as List[PoliticalRelation], ensuring type-safe responses that conform to predefined data models. The GEPA adapter requires asynchronous context management, necessitating creation of new HTTPX connections per proposer agent invocation to maintain thread safety. This architectural decision introduces connection overhead but ensures isolation between concurrent optimization runs.
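A minimal sketch of the structured output pattern follows; the schema fields and the model string are illustrative assumptions.

```python
# Minimal sketch of Pydantic AI structured output via output_type.
# Schema fields and model choice are illustrative.
from pydantic import BaseModel
from pydantic_ai import Agent

class PoliticalRelation(BaseModel):
    person: str
    ancestor: str
    relation: str

agent = Agent(
    "openai:gpt-4o",  # illustrative model choice
    output_type=list[PoliticalRelation],
    instructions=(
        "Extract only ancestral political relations; "
        "exclude spouses, children, and siblings."
    ),
)

result = agent.run_sync("<Wikipedia page text for one MP>")
relations = result.output  # validated list[PoliticalRelation]
```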
The optimization process reveals several critical trade-offs. First, model obsolescence poses a risk: providers ship improved foundation models on 6-12 month cycles, and a next-generation model may supersede a carefully optimized prompt, eroding the return on optimization investment. Second, optimization proves most valuable with large volumes of private data where context engineering is critical and foundation models lack domain-specific training data. Third, state-of-the-art models like claude-opus-4.6 often solve tasks correctly without optimization if provided sufficient information, suggesting that optimization efforts should target cost-sensitive scenarios or cases where baseline performance proves inadequate.
4.2 Cost Optimization and Model Selection
The Shopify case study demonstrates dramatic cost reduction through combined prompt optimization and model downgrade strategies. The organization reduced invoice analysis costs from $5M annually to $73K by optimizing prompts and migrating from GPT-5 to qwen-3.5, a smaller, more cost-effective model. This 98.5% cost reduction illustrates that optimization enables use of less capable but more economical models while maintaining acceptable performance levels. The finding suggests that prompt optimization should be evaluated not solely on accuracy improvements but on total cost of ownership, including inference costs, latency, and operational complexity.
4.3 When Optimization Matters Versus When It Does Not
The analysis identifies specific scenarios where optimization provides substantial value versus cases where alternative approaches prove more effective. Optimization matters most in: (1) private data scenarios where foundation models lack domain-specific training, (2) cost-sensitive applications where inference costs dominate operational expenses, (3) structured output tasks with deterministic evaluation criteria, and (4) production systems with established CI/CD pipelines where deployment friction inhibits rapid iteration.
Conversely, optimization provides limited value when: (1) state-of-the-art models already achieve target performance with well-engineered prompts, (2) task spaces exhibit high sparsity and variance, making comprehensive evaluation infeasible (e.g., coding agents typically employ 'vibes-based' assessment rather than formal evals), (3) rapid model evolution suggests that next-generation models will supersede optimized configurations, and (4) organizations lack infrastructure for systematic evaluation and golden dataset creation.
5. Discussion
The integration of genetic algorithms with managed variables represents a significant advancement in production AI agent optimization, addressing the gap between initial deployment and optimal performance. The methodology demonstrates that systematic optimization can achieve meaningful improvements over expert human engineering—4.7 percentage points in the political relations extraction task—while maintaining deployment flexibility through runtime configuration management.
The findings reveal important limitations and considerations for practitioners. Overfitting risk necessitates careful dataset partitioning and representative sampling of edge cases during optimization. The verbosity side effect observed in GEPA-optimized prompts suggests that constrained optimization with length penalties may improve outcomes. Furthermore, the tension between optimization investment and model obsolescence requires organizations to evaluate optimization efforts based on expected lifespan of deployed models and anticipated performance improvements from next-generation foundation models.
The broader optimization landscape continues evolving, with future work targeting optimization across the full agent configuration space—model selection, prompt engineering, tool registration, compaction strategies, and code mode settings—simultaneously rather than in isolation. The Logfire roadmap includes autonomous agent optimization directly from the observability platform, enabling closed-loop improvement cycles where production feedback automatically triggers optimization runs. This vision of self-improving agents represents a natural evolution from manual optimization workflows to automated continuous improvement systems.
The availability of open-source and self-hosted Logfire options addresses privacy concerns for organizations working with sensitive data, enabling optimization workflows without cloud data exfiltration. This deployment flexibility proves essential for regulated industries and organizations with strict data governance requirements.
6. Conclusion
This analysis demonstrates that evolutionary prompt optimization using genetic algorithms combined with managed variables for production deployment provides a practical methodology for improving AI agent performance without costly model retraining. The GEPA library's Pareto frontier selection approach achieves measurable improvements over expert human engineering, with documented accuracy gains from 92% to 96.7% on structured output tasks. The integration with Logfire's observability infrastructure and managed variables system enables organizations to iterate on agent configurations in production environments without code redeployment.
Key practical takeaways include: (1) systematic optimization proves most valuable for private data scenarios where context engineering is critical, (2) deterministic evaluation frameworks outperform LLM-as-judge approaches for structured output tasks, (3) cost optimization through combined prompt engineering and model downgrade strategies can achieve order-of-magnitude expense reductions, and (4) overfitting risk necessitates careful dataset partitioning and representative sampling during optimization. Organizations should evaluate optimization efforts based on total cost of ownership, expected model lifespan, and baseline performance of state-of-the-art models on their specific tasks.
Future research directions include multi-objective optimization across full agent configuration spaces, automated golden dataset generation from production feedback, and closed-loop autonomous optimization systems that continuously improve agent performance based on observability data. As the AI engineering discipline matures, systematic optimization methodologies like those presented here will become standard practice for production agent deployment.
Sources
- Playground in Prod: Optimising Agents in Production Environments — Samuel Colvin, Pydantic - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.