Shipping AI That Works: A Product Manager's Guide to Evaluation Frameworks
By Sean Weldon
TL;DR
AI product managers must move beyond prototype "vibe coding" to production-ready systems using systematic evaluation frameworks. Both OpenAI and Anthropic CPOs confirm their models hallucinate, making LLM-as-a-judge evaluations critical infrastructure. Modern PMs need structured observability, data-driven iteration, and eval datasets that serve as the new requirements documents for shipping reliable AI applications.
Key Takeaways
Product management expectations have taken a step-function jump beyond traditional roles: PMs now deliver working prototypes and technical specifications rather than Google docs, and they directly control prompt engineering for final product outcomes.
LLM systems require fundamentally different testing approaches than deterministic software: agents can be convinced that 1+1=3 and can execute multiple valid paths, so they require evaluations grounded in your proprietary data rather than traditional integration tests against an existing codebase.
Effective LLM-as-a-judge evaluations use text labels instead of numeric scores—LLMs perform poorly with numbers even at PhD-level tasks, so evaluators should output "toxic/not toxic" labels that map to scores post-evaluation.
Evaluations need their own evaluations through human validation. LLM judges can match human labels at a near-0% rate on subjective criteria like "friendliness," requiring iterative refinement with few-shot examples and stricter definitions.
Eval datasets function as the new product requirements documents—engineering teams receive acceptance criteria through evaluation datasets rather than traditional PRDs, accelerating development cycles from idea to production in a single day.
Why Do AI Product Managers Face a "Confidence Slump"?
The transition from prototype to production creates a confidence crisis for AI product managers. PMs can quickly build impressive demos using LLM APIs, but shipping reliable systems requires entirely new tooling and frameworks that most haven't learned.
Traditional product management skills don't translate directly to AI systems. Expectations have taken a step-function jump compared to standard PM roles: teams now expect higher-resolution requirements than a Google doc can provide. PMs must deliver working prototypes and technical specifications while taking direct responsibility for prompt engineering and system reliability.
Both Kevin Weil (OpenAI CPO) and Mike Krieger (Anthropic CPO) explicitly state that their models hallucinate and that evaluations are critical. When the people selling you the product tell you it isn't fully reliable, you should listen. This acknowledgment from leaders representing roughly 95% of LLM market share makes evaluation frameworks essential infrastructure rather than optional tooling.
How Are AI Evaluations Different from Traditional Software Testing?
Software operates deterministically—1+1 always equals 2. LLM agents are non-deterministic and can be convinced that 1+1=3 through prompt manipulation or context confusion.
Agent systems execute multiple paths rather than following single deterministic flows. Traditional unit tests verify one expected output for each input. Agent evaluations must account for various execution paths that could all be "correct" depending on context and reasoning approach.
Integration tests rely on your existing codebase to verify component interactions. Agent evaluations depend on your proprietary data—the specific use cases, edge cases, and domain knowledge unique to your application. The goal isn't eliminating all hallucinations but achieving "controlled creativity"—you want agents to hallucinate in the right way for your specific needs.
What Makes an Effective LLM-as-a-Judge Evaluation?
LLM-as-a-judge evaluations contain four essential components: role setting, task definition, context (marked in curly braces), and goal with explicit terminology and labels. These components give the evaluator LLM clear instructions for assessing outputs.
Use text labels instead of numeric scores. LLMs are "really bad at numbers" even at PhD-level mathematical tasks. An evaluator should output labels like "toxic/not toxic" or "helpful/unhelpful" rather than scores like 1-5. Map these text labels to numeric scores after evaluation if you need quantitative analysis.
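A minimal sketch of these ideas in code, assuming nothing beyond the structure described above: the prompt text, label set, and function names are illustrative, not from any specific eval library. The judge prompt carries the four components (role, task, context in curly braces, goal with explicit labels), and numeric mapping happens only after the judge has produced a text label.

```python
# Illustrative LLM-as-a-judge prompt: role, task, context ({output}),
# and goal with explicit text labels. Wording is a sketch, not a spec.
JUDGE_PROMPT = (
    "You are an expert content moderator.\n"                      # role
    "Your task is to judge whether the reply below is toxic.\n"   # task
    "Reply to evaluate: {output}\n"                               # context in braces
    'Answer with exactly one label: "toxic" or "not toxic".'      # goal + labels
)

# Map text labels to numbers only AFTER the judge has produced them.
LABEL_SCORES = {"not toxic": 1.0, "toxic": 0.0}

def build_judge_prompt(output: str) -> str:
    """Fill the context slot with the output being evaluated."""
    return JUDGE_PROMPT.format(output=output)

def label_to_score(label: str) -> float:
    """Post-evaluation mapping from categorical label to score."""
    return LABEL_SCORES[label.strip().lower()]
```

The judge itself never sees or emits a number; quantitative analysis happens entirely in `label_to_score` after the fact.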
LLM-as-a-judge provides the scalable approach for production evaluations. Code-based evaluators and human annotations supplement this framework but can't replace it at scale. The combination of automated LLM judging with strategic human validation creates reliable quality assurance for AI systems.
How Do You Instrument Multi-Agent Systems for Observability?
Multi-agent systems require comprehensive observability to understand execution flows. A trip planner example demonstrates this architecture: parallel agents handle budget, local experiences, and research, then feed results into an itinerary agent that synthesizes the final output.
Traces consist of spans—units of work with time components and type classifications. Each span falls into three categories: agent (decision-making), tool (structured data operations), or LLM (text generation). Agent visualization shows starting points, parallel execution paths, and data flow between components.
Single-line instrumentation captures everything automatically. The langchain_instrument() function uses the OpenTelemetry standard to log traces without scattering manual logging code throughout your application. Traces include metadata such as user_id and session_id beyond basic latency metrics, enabling filtering and analysis of specific user journeys or session patterns.
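To make the trace/span vocabulary concrete, here is a toy model of the structure described above. Real systems emit this data through the OpenTelemetry SDK; the class names, fields, and trip-planner span names below are illustrative, not an SDK API.

```python
# Toy model of traces and spans: each span is a unit of work with a
# time component, a type (agent / tool / llm), and optional metadata.
from dataclasses import dataclass, field

SPAN_KINDS = {"agent", "tool", "llm"}  # the three span categories

@dataclass
class Span:
    name: str
    kind: str                                       # "agent" | "tool" | "llm"
    start: float = 0.0                              # time component (seconds)
    end: float = 0.0
    attributes: dict = field(default_factory=dict)  # e.g. user_id, session_id

    def __post_init__(self):
        if self.kind not in SPAN_KINDS:
            raise ValueError(f"unknown span kind: {self.kind}")

@dataclass
class Trace:
    spans: list = field(default_factory=list)

    def by_kind(self, kind: str) -> list:
        """Filter spans by type, e.g. all tool calls in a user journey."""
        return [s for s in self.spans if s.kind == kind]

# Sketch of the trip-planner flow: a budget agent, a tool call, and an
# LLM span that synthesizes the itinerary.
trace = Trace()
trace.spans.append(Span("budget", "agent", attributes={"user_id": "u1"}))
trace.spans.append(Span("search_flights", "tool"))
trace.spans.append(Span("itinerary", "llm"))
```

Because attributes like `user_id` travel with each span, filtering a trace down to one user's journey is just a lookup over span metadata.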
Why Should Product Managers Control Prompt Writing?
PMs should own prompt engineering because they're responsible for final product outcomes. Delegating prompts to engineers creates a disconnect between product vision and implementation. Prompt playgrounds should combine data and prompts in one interface for rapid iteration by product teams.
A/B testing prompts across complete datasets reveals actual performance differences. Testing on single examples produces misleading results; you need systematic evaluation across representative data. Changing a prompt from verbose output to "500 characters or less" and "always offer a discount for email" demonstrates how specificity measurably changes behavior.
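A hedged sketch of what dataset-wide A/B testing looks like: `call_model` is a stub standing in for a real LLM call, and the pass criterion (replies of 500 characters or less) echoes the example above. Everything here is illustrative.

```python
# Compare two prompt variants across a whole dataset, not one example.
def call_model(prompt: str, example: str) -> str:
    # Stub: a real system would call an LLM API here. The verbose
    # variant is simulated as rambling past the length limit.
    reply = f"Answer for {example}."
    if "500 characters or less" not in prompt:
        reply += " " + "padding " * 100
    return reply

def pass_rate(prompt: str, dataset: list) -> float:
    """Fraction of dataset examples whose reply meets the length criterion."""
    passed = sum(len(call_model(prompt, ex)) <= 500 for ex in dataset)
    return passed / len(dataset)

dataset = [f"question {i}" for i in range(10)]
verbose = "Answer the user's question in full detail."
concise = "Answer in 500 characters or less and always offer a discount for email."
```

Running `pass_rate` for both variants over the full dataset surfaces the systematic difference that a single hand-picked example could easily hide.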
Development cycles have gotten "a lot faster" with proper tooling. Teams can move from idea to updated prompt to production deployment in a single day. PMs serve as "keepers of the end product experience" and ensure data quality for the development team, making prompt control a core PM responsibility.
What Are Datasets and How Do You Build Them?
Datasets are collections of examples in tabular format—essentially structured Google sheets for systematic evaluation. Each row contains an input example and potentially expected outputs or metadata for testing AI system responses.
Production traces can be pulled into datasets for offline experimentation. Synthetic data generation (like using Cursor to hit the same server repeatedly) bootstraps initial datasets when you lack production data. Continuously sample production data and add hard examples—borderline cases where the system struggles—to datasets over time.
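A minimal sketch of a dataset as described above: a tabular CSV of inputs and expected outputs, bootstrapped synthetically and extended with a hard example. Column names and row contents are assumptions for illustration.

```python
# Eval datasets as structured tables: input + expected output per row.
import csv
import io

FIELDS = ["input", "expected_output"]

def write_dataset(rows, fh):
    """Serialize dataset rows as CSV with a header."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def read_dataset(fh):
    """Load dataset rows back as a list of dicts."""
    return list(csv.DictReader(fh))

rows = [
    # Synthetic bootstrap row (stand-in for script-generated traffic).
    {"input": "Plan a 3-day trip to Lisbon", "expected_output": "itinerary"},
    # A "hard example" sampled from production: a borderline case the
    # system struggled with, appended to the dataset over time.
    {"input": "Trip to Lisbon, budget $0", "expected_output": "ask for a realistic budget"},
]

buf = io.StringIO()           # in-memory stand-in for a dataset.csv file
write_dataset(rows, buf)
buf.seek(0)
loaded = read_dataset(buf)
```

The append-a-hard-example pattern is the whole workflow in miniature: the file format never changes, only the coverage of tricky cases grows.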
The self-driving car analogy applies here: first solve straight roads, then left turns, then left turns with pedestrians. Incrementally build datasets of edge cases as you discover them. This approach mirrors how autonomous vehicle teams systematically expanded their capability envelope through targeted data collection.
How Do You Validate That Your Evaluators Are Accurate?
You need "evals for your evals"—human labels that validate LLM-as-a-judge accuracy. Code-based evaluators can check whether eval labels match human annotation labels, creating a meta-evaluation layer.
In one demo example, the LLM judge disagreed with human labels on "friendliness" almost entirely, a near-0% match rate. This complete misalignment revealed that the eval prompt needed fundamental revision. When evals don't match human labels, iterate on the eval prompt with few-shot examples and stricter definitions of subjective criteria.
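The code-based meta-evaluation itself is just a match rate between judge labels and human labels. The sketch below uses made-up "friendliness" labels chosen to reproduce the near-0% agreement failure mode described above.

```python
# Meta-evaluation: how often does the LLM judge agree with humans?
def match_rate(judge_labels, human_labels):
    """Fraction of examples where judge and human assigned the same label."""
    assert len(judge_labels) == len(human_labels)
    hits = sum(j == h for j, h in zip(judge_labels, human_labels))
    return hits / len(judge_labels)

# Illustrative labels showing total misalignment on a subjective criterion.
human = ["friendly", "friendly", "not friendly", "friendly"]
judge = ["not friendly", "not friendly", "friendly", "not friendly"]
```

A 0% result like this one is the signal to rework the eval prompt (few-shot examples, stricter definitions) and re-measure until the rate is acceptable.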
Temperature parameters reduce variance in LLM judge outputs. Rerun evaluations multiple times to profile variance and understand consistency. You cannot escape human validation—LLMs hallucinate, agents built on LLMs hallucinate, and LLM judges evaluating those agents also hallucinate. Human labels provide ground truth for the entire system.
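Variance profiling can be sketched the same way: rerun the identical evaluation many times and measure how often the majority label appears. `run_judge` below is a stub that simulates a slightly unstable judge (the 10% flip rate is an assumption for illustration, not a measured property of any model).

```python
# Profile judge consistency by rerunning the same evaluation N times.
import random
from collections import Counter

def run_judge(example: str, rng: random.Random) -> str:
    # Stub: a real judge would call an LLM; here 10% of runs flip the label
    # to simulate residual non-determinism even at low temperature.
    return "friendly" if rng.random() > 0.1 else "not friendly"

def label_consistency(example: str, runs: int = 20, seed: int = 0) -> float:
    """Share of reruns that agree with the majority label (1.0 = stable)."""
    rng = random.Random(seed)
    counts = Counter(run_judge(example, rng) for _ in range(runs))
    return counts.most_common(1)[0][1] / runs
```

A consistency score well below 1.0 tells you the judge's verdict on that example is partly noise, which is exactly the variance that lowering temperature and tightening the eval prompt should shrink.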
What Does the Modern AI Development Loop Look Like?
The development cycle starts with CSV data and moves through systematic stages. PMs curate datasets, iterate experiments in prompt playgrounds, gain team confidence through evaluation results, ship to production, sample production data, and repeat the cycle.
Evals are "the new type of requirement doc" for AI teams. Instead of traditional PRDs, PMs give engineering teams eval datasets and acceptance criteria. Engineers know exactly what success looks like because they can run the evaluation and see pass/fail results immediately.
This workflow fundamentally changes team dynamics. Engineers receive concrete, testable requirements rather than ambiguous product descriptions. The eval dataset becomes the source of truth for both what the system should do and how to measure whether it does it correctly.
What the Experts Say
"We've got Kevin who's chief product officer at OpenAI. We have Mike at Anthropic CPO. This is probably 95% of the LLM market share. And both of the product leaders of those companies are telling you that their models hallucinate and that it's really important to write eval."
This acknowledgment from the leaders of the two dominant LLM providers signals that evaluations aren't optional—they're fundamental infrastructure for any AI application. When vendors explicitly warn about their product's limitations, evaluation frameworks become your primary defense against unreliable outputs.
"Imagine if you could go to your engineering team and instead of giving them a PRD, you give them an eval as requirements and here's the eval data set and here's the eval we want to use to test the system as an acceptance criteria."
This quote captures the paradigm shift in AI product management. Eval datasets provide concrete, executable specifications that eliminate ambiguity from traditional requirements documents. Engineers can immediately test their work against objective criteria rather than interpreting subjective product descriptions.
Frequently Asked Questions
Q: What's the difference between deterministic software testing and AI evaluation?
Traditional software produces predictable outputs—1+1 always equals 2. AI agents are non-deterministic and can be convinced 1+1=3 through prompts. Agent systems execute multiple valid paths rather than single flows, requiring evaluation approaches based on proprietary data rather than codebase integration tests.
Q: Why should evaluators use text labels instead of numeric scores?
LLMs perform poorly with numbers even at PhD-level tasks. Evaluators should output text labels like "toxic/not toxic" rather than 1-5 scores. You can map these text labels to numeric scores after evaluation if quantitative analysis is needed, but the LLM judge should work with categorical labels.
Q: How do you know if your LLM-as-a-judge evaluator is accurate?
Run human validation on a sample of evaluations to compare LLM judge outputs with human labels. Code-based evaluators can automatically check match rates. In practice, LLM judges can show near 0% agreement with humans on subjective criteria, requiring iterative refinement with few-shot examples and stricter definitions.
Q: What is a span in AI system tracing?
A span is a unit of work with a time component and type classification. Spans fall into three categories: agent (decision-making), tool (structured data operations), or LLM (text generation). Multiple spans combine to form traces that show complete execution flows through multi-agent systems.
Q: How do you build datasets when you don't have production data yet?
Use synthetic data generation to bootstrap initial datasets—tools like Cursor can repeatedly hit your server to generate examples. Start with CSV files of expected inputs and outputs. Once in production, continuously sample real data and add hard examples (borderline cases) to expand dataset coverage.
Q: Why should product managers write prompts instead of engineers?
PMs are responsible for final product outcomes and must control the primary interface to AI behavior. Prompt playgrounds should combine data and prompts for PM-led iteration. Development cycles can compress to a single day when PMs directly modify prompts, test against datasets, and deploy without handoff delays.
Q: What does it mean to "hallucinate in the right way"?
The goal isn't eliminating all hallucinations but achieving controlled creativity. AI systems should generate novel outputs within acceptable boundaries for your use case. Complete factual accuracy isn't always necessary—sometimes creative interpretation or synthesis provides more value than rigid fact retrieval.
Q: How has the AI development cycle changed with proper evaluation tooling?
Teams can move from idea to updated prompt to production in a single day with systematic evaluation frameworks. Evals serve as executable requirements that eliminate ambiguity. Engineers receive concrete acceptance criteria through eval datasets rather than traditional PRDs, accelerating iteration and reducing miscommunication.
The Bottom Line
Shipping reliable AI products requires systematic evaluation frameworks that match the non-deterministic nature of LLM systems. The confidence slump between prototype and production disappears when PMs adopt structured observability, LLM-as-a-judge evaluations validated by human labels, and eval datasets that serve as executable requirements.
The paradigm shift is clear: evals are emerging as a real moat for AI startups. Traditional software testing approaches fail for agent systems that execute multiple paths and require proprietary data for validation. Modern AI PMs must own prompt engineering, build comprehensive datasets, and treat evaluations as critical infrastructure rather than optional tooling.
Start by instrumenting your AI system with OpenTelemetry-based tracing to capture execution flows. Build initial datasets from synthetic data or early production samples. Create LLM-as-a-judge evaluations with text labels, validate them against human annotations, and iterate until accuracy matches expectations. Give your engineering team eval datasets as requirements and watch development cycles compress from weeks to days.
Sources
- Shipping AI That Works - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.