Does GenAI "belong" to data scientists? — Phil Hetzel, Braintrust

Agentic development should not be exclusively owned by data scientists and ML engineers, but rather requires a diverse team combining technical experts (prod...

2026-05-30 By Sean Weldon

Organizational Ownership of Agentic AI Development: Beyond Data Science Silos

Abstract

The deployment of agentic AI systems presents a critical organizational challenge regarding team composition and technical ownership. This analysis examines the thesis that agentic development requires cross-functional collaboration rather than exclusive ownership by data scientists and machine learning engineers. Through comparative examination of traditional enterprise and AI-native organizational structures, this synthesis demonstrates that generative AI development fundamentally diverges from classical machine learning pipelines, as foundation models arrive pre-trained and value creation occurs through prompt and context engineering rather than feature engineering and model retraining. The findings indicate that optimal team composition integrates product and systems engineers for implementation, data scientists for evaluation pipeline validation and guardrail establishment, and non-technical domain experts for prompt engineering and human annotation workflows. This collaborative approach addresses the broader functional performance requirements of agentic systems while maintaining appropriate technical rigor and safety constraints.

1. Introduction

Organizations have demonstrated considerable proficiency in creating generative AI proof-of-concept applications, yet a substantial implementation gap persists between prototype development and production deployment. This challenge derives partially from organizational ambiguity regarding appropriate team composition and ownership structures for agentic development initiatives. Traditional enterprises typically assign such projects to established machine learning and data science teams, leveraging existing infrastructure and accumulated expertise. Conversely, AI-native companies construct small, cross-functional engineering teams without pre-existing ML organizational hierarchies, resulting in what has been characterized as superior "proximity to the problem" through comprehensive understanding of the agent's intended function.

Agentic development refers to the construction of autonomous AI systems capable of executing complex, multi-step tasks through reasoning, tool utilization, and dynamic planning. Unlike traditional software engineering or classical machine learning, agentic systems employ large language models (LLMs) as foundational components, fundamentally restructuring the development lifecycle and requisite skill sets. The central question examined in this analysis is: which organizational roles should own agentic development, and what team composition optimizes both technical rigor and practical effectiveness?

This synthesis proceeds by establishing the fundamental distinctions between traditional ML and generative AI development paradigms, evaluating arguments supporting and opposing exclusive data scientist ownership, examining the capabilities non-data science roles contribute, and proposing an integrated team structure that balances technical expertise with domain knowledge proximity.

2. Background and Related Work

2.1 Traditional Machine Learning Development Pipeline

Classical ML development adheres to an established sequence: data collection, feature engineering, model training, cross-validation, testing, and deployment. Value creation occurs primarily through feature engineering—the transformation of raw data into predictive variables—and iterative model retraining cycles. Data scientists employ rigorous validation frameworks including k-fold cross-validation, A/B testing, and continuous monitoring of performance metrics such as precision, recall, and F1 scores. This pipeline necessitates deep statistical knowledge, expertise in model architecture selection, and disciplined processes for deploying model assets to production environments.

2.2 Generative AI Development Paradigm Shift

Generative AI development diverges fundamentally from traditional ML workflows. Foundation models are pre-trained by organizations including Anthropic, OpenAI, and Mistral, eliminating the training phase entirely from the developer's workflow. As one industry observer noted, "The model's already built. So much of what data scientists and machine learning engineers is going through that data pipeline of training a model. What do we do when the model's already built?" LLMs function as application programming interfaces (APIs) that accept natural language inputs and generate outputs based on learned patterns. Consequently, value creation shifts from feature engineering to prompt engineering (crafting effective instructions) and context engineering (providing relevant information to guide model behavior). This paradigm shift enables value addition "not necessarily with feature engineering, but with natural language, which could bring in a different skill set to the conversation."

2.3 Agent Quality Infrastructure

Modern agentic development relies on specialized infrastructure built on two foundational pillars: evals (experimentation and confidence-building through systematic evaluation) and observability (production monitoring and performance tracking). These platforms enable continuous feedback loops where production data is gathered to update offline evaluation datasets, enabling drift detection between evaluator agreement and actual system performance. This infrastructure supports both technical validation and domain-specific performance assessment across the broader functional surface area that agentic systems must address.

3. Core Analysis

3.1 Organizational Approaches and Problem Proximity

Two distinct organizational patterns have emerged in agentic development. Traditional enterprises delegate agent development to existing ML and data science platform teams, leveraging established organizational structures and tooling investments. AI-native companies, conversely, construct entire product offerings around agents using small, cross-functional engineering teams without pre-existing ML infrastructure. The critical differentiator is problem proximity: AI-native teams demonstrate superior understanding of "what the end agent is meant to solve" because cross-functional integration spans both product engineering and AI engineering rather than maintaining specialty-based silos.

This proximity advantage manifests in more effective prompt engineering, context selection, and evaluation criteria definition. When teams maintain organizational distance from the problem domain, evaluation frameworks risk fixating on traditional ML metrics rather than functional performance requirements. The structural integration of domain expertise within development teams enables more accurate specification of success criteria and failure modes.

3.2 Arguments Supporting Data Scientist Involvement

Data scientists contribute three critical capabilities to agentic development initiatives. First, their deep understanding of neural networks and LLM architectures provides superior appreciation of inherent risks associated with complex, probabilistic technology. This expertise enables more realistic assessment of model capabilities and limitations, preventing overly aggressive implementations that exceed current technical boundaries.

Second, data scientists bring rigorous processes for deploying model assets to production and implementing comprehensive testing frameworks. This disciplined mindset around validation and testing protocols helps maintain end-user safety and system reliability. Third, data scientists provide essential validation for LLM-as-judge evaluation approaches by creating labeled datasets and applying traditional precision, recall, and F1 metrics to evaluation processes themselves. This meta-evaluation prevents blind trust in model-based judgments and ensures evaluation frameworks maintain accuracy over time.

Additionally, data scientists add substantial technical value when fine-tuning open-source models for specific use cases, a scenario where traditional ML expertise directly applies to generative AI contexts.

3.3 Limitations of Exclusive Data Scientist Ownership

Despite these contributions, several factors argue against exclusive data scientist ownership of agentic development. The pre-built nature of foundation models eliminates the need for traditional training, testing, and cross-validation workflows that constitute core data science competencies. As observed in practice, "Does an ML engineer or a data scientist really know what they're testing for? They will really lock on to the traditional ML engineer metrics precision, recall, F1, and they'll obsess over those metrics because that is what has gotten them there up to that point."

However, agent evaluation requires assessment across a substantially broader functional performance surface area. Technical metrics alone prove insufficient for evaluating whether agents successfully accomplish intended tasks, interact appropriately with users, or handle edge cases gracefully. Data scientists may lack the domain expertise necessary to define comprehensive evaluation criteria or recognize subtle failure modes in agent behavior.

Furthermore, the statistical and mathematical backgrounds typical of data science training may not prepare practitioners for the distributed systems engineering challenges inherent in complex agent architectures. Supervisor agents calling sub-agents across different infrastructure components represent systems engineering problems requiring different skill sets than traditional ML development.

3.4 Contributions from Product Engineers and Domain Experts

Product engineers contribute directly applicable skills to agentic development. As noted, "LLMs are just APIs. Product engineers are very used to using APIs as they build applications." This existing expertise in API integration, distributed systems architecture, and production software engineering translates directly to LLM implementation contexts. Complex agent architectures with multiple components communicating across infrastructure boundaries align more closely with systems engineering competencies than statistical modeling expertise.

Non-technical domain experts provide irreplaceable value through three mechanisms. First, subject matter experts and product managers possess optimal proximity to the problems agents are designed to solve, enabling them to craft more effective prompts and select appropriate context. Second, these experts can perform human annotation workflows, evaluating agent traces to determine performance quality and identify failure reasons with domain-specific judgment that technical metrics cannot capture. Third, domain experts can experiment directly with prompts using agent playground environments, iterating on natural language instructions without requiring programming expertise.

This democratization of prompt engineering enables those with deepest problem understanding to directly influence agent behavior, rather than translating requirements through technical intermediaries who may lack nuanced domain knowledge.

4. Technical Insights

4.1 Evaluation Architecture and Validation

Robust agentic systems require evaluation frameworks that extend beyond traditional ML metrics to encompass functional performance assessment. While precision, recall, and F1 scores provide useful signals, comprehensive evaluation must assess task completion, interaction quality, error handling, and domain-specific success criteria. Implementation of LLM-as-judge evaluation requires careful validation: data scientists should create labeled datasets and apply traditional metrics to the evaluation process itself, ensuring evaluator accuracy and detecting drift between evaluator judgments and ground truth.

4.2 Production Feedback Loops

Effective agentic systems implement continuous improvement through production-to-experimentation feedback loops. Production data is systematically gathered to expand offline evaluation datasets, enabling grounded data collection for self-checking mechanisms. This infrastructure must distinguish between evaluator drift (changes in how performance is assessed) and actual system performance degradation (changes in agent behavior quality). Without this distinction, organizations risk misinterpreting evaluation signals and implementing counterproductive modifications.

4.3 Optimal Team Composition

The technical evidence supports a tripartite team structure. Product, application, and systems engineers should own implementation, leveraging their expertise in API integration and distributed systems architecture. Data scientists should own evaluation and observability pipeline development, validate LLM-as-judge approaches, establish guardrails based on LLM limitations, and lead fine-tuning initiatives when required. Non-technical domain experts should drive human annotation, prompt engineering, and context engineering, maintaining closest proximity to problem domains and functional requirements.

This structure distributes responsibilities according to comparative advantage: technical implementation to those with systems engineering expertise, evaluation rigor to those with statistical validation skills, and behavioral specification to those with deepest domain knowledge.

5. Discussion

The findings presented demonstrate that effective agentic development requires reconceptualizing team composition beyond traditional ML organizational structures. The paradigm shift from model training to prompt and context engineering fundamentally alters the skill sets that create value, expanding relevant expertise beyond statistical modeling to encompass systems engineering, domain knowledge, and natural language instruction crafting.

This analysis reveals a broader trend in AI development: as foundation models mature and become increasingly accessible through API interfaces, competitive advantage shifts from model architecture expertise toward effective integration, evaluation, and domain-specific adaptation. Organizations that recognize this transition and restructure teams accordingly gain advantages in problem proximity, implementation velocity, and functional performance.

Several areas warrant further investigation. The optimal balance between technical rigor and domain expertise likely varies across application domains, with safety-critical applications requiring greater data scientist involvement than lower-risk use cases. Additionally, the effectiveness of non-technical experts in prompt engineering may depend on the complexity of required instructions and the sophistication of agent architectures. Future research should examine how team composition requirements scale with system complexity and domain criticality.

The emergence of specialized agent quality platforms suggests an evolving infrastructure landscape that may further democratize agentic development. As evaluation and observability tooling matures, the technical barriers to effective agent deployment may decrease, potentially expanding the range of organizational roles capable of contributing to agentic initiatives.

6. Conclusion

This analysis demonstrates that agentic development should not reside exclusively within data science and machine learning engineering organizations. The fundamental differences between traditional ML pipelines and generative AI development—particularly the pre-trained nature of foundation models and the shift from feature engineering to prompt and context engineering—necessitate cross-functional team structures that integrate diverse expertise.

The optimal approach combines product and systems engineers for implementation, data scientists for evaluation validation and technical guardrails, and non-technical domain experts for prompt engineering and functional assessment. This structure maximizes problem proximity while maintaining technical rigor, addressing both the systems engineering challenges of distributed agent architectures and the domain-specific performance requirements that purely technical metrics cannot capture.

Organizations seeking to deploy agentic systems effectively should audit current team compositions, identify gaps in domain expertise or systems engineering capabilities, and establish collaborative workflows that enable non-technical experts to directly influence agent behavior through prompt engineering. By distributing ownership according to comparative advantage rather than traditional organizational boundaries, enterprises can bridge the gap between proof-of-concept proliferation and production deployment success.

Sources

Does GenAI "belong" to data scientists? — Phil Hetzel, Braintrust - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.

LinkedIn | Website | GitHub