Your Coding Agent Should Do AI System Engineering — Ben Burtenshaw, Hugging Face

Coding agents can tackle complex AI systems engineering problems like writing CUDA kernels and automating ML research, but require open, distributable primit...

By Sean Weldon

Abstract

This paper examines the application of autonomous coding agents to complex AI systems engineering challenges, specifically focusing on CUDA kernel optimization, model fine-tuning, and automated machine learning research. The analysis presents three progressive levels of agent autonomy and demonstrates that agents can successfully execute verifiable engineering tasks when provided with appropriate infrastructure primitives. Key findings include a 94% performance improvement in GPU kernel optimization for Qwen 3 8B on H100 hardware and the successful implementation of a multi-agent research system capable of autonomous hypothesis generation and testing. The work emphasizes that agent effectiveness depends critically on access to open, distributable primitives rather than abstracted APIs, with the Hugging Face Hub providing essential capabilities in storage, tracking, and compute resources necessary for scaling agent-driven workflows.

1. Introduction

The rapid advancement of coding agents—autonomous systems capable of generating, testing, and optimizing software—has reached a critical inflection point in adoption across the AI engineering community. As these agents become increasingly capable, a fundamental question emerges regarding their optimal application domains and the infrastructure requirements necessary to support their deployment at scale.

This synthesis examines the thesis that coding agents can effectively address complex AI systems engineering problems, including low-level hardware optimization through CUDA kernel development and high-level automated machine learning research. The analysis identifies three distinct levels of agent autonomy: hybrid interactive approaches that combine human oversight with agent execution, zero-shot task completion where agents independently fulfill specified objectives, and multi-agent systems that conduct autonomous research through coordinated collaboration.

The investigation focuses on three progressively complex engineering challenges, termed "bosses" in reference to their difficulty: writing optimized CUDA kernels for GPU acceleration, executing model fine-tuning workflows, and implementing distributed multi-agent research systems. Each challenge reveals specific requirements for agent infrastructure and highlights the importance of open primitives in enabling agent capabilities. The central thesis posits that as coding agents mature, engineering practitioners must address increasingly fundamental problems closer to hardware-level operations, with agents serving as force multipliers for tackling AI systems engineering complexity.

2. Background and Related Work

2.1 GPU Performance Characteristics and Optimization

Modern GPU architectures present a fundamental performance paradox that shapes optimization strategies. The H100 GPU exemplifies it, delivering petaflop-scale computational throughput while providing only about 3 TB/s of memory bandwidth. Because of this disparity, memory bandwidth rather than computational capacity typically constrains performance, and GPU cores sit idle while waiting for tensor data to arrive.

Arithmetic intensity, the ratio of arithmetic operations performed to bytes moved to and from memory, is the critical metric for GPU optimization. Custom CUDA kernels address memory bottlenecks by maximizing the work done per memory read or write; Flash Attention is a canonical example of this pattern. The challenge of kernel optimization extends beyond algorithm design to distribution, because installing a kernel means navigating a compatibility matrix spanning hardware configurations, software versions, CUDA versions, and GPU generations.
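The relationship between arithmetic intensity and the memory bottleneck can be made concrete with a small roofline-style calculation. The sketch below is illustrative only: the peak figures are round numbers taken from the description above (petaflop-scale compute, 3 TB/s bandwidth), not vendor specifications, and the function names are hypothetical.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from GPU memory."""
    return flops / bytes_moved


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Intensity above which a kernel stops being memory-bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float,
                     peak_bandwidth: float) -> float:
    """Roofline bound: capped by compute or by memory, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


# Assumed round numbers taken from the text, not vendor specifications.
PEAK_FLOPS = 1e15       # "petaflop-scale" compute, in FLOP/s
PEAK_BANDWIDTH = 3e12   # 3 TB/s memory bandwidth, in bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)

# Elementwise add of two fp16 vectors: 1 FLOP per element,
# 2 reads + 1 write of 2 bytes each = 6 bytes per element.
n = 1_000_000
elementwise = arithmetic_intensity(flops=n, bytes_moved=6 * n)
bound = attainable_flops(elementwise, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumed figures the ridge point is roughly 333 FLOPs per byte, while the elementwise add performs about 0.17 FLOPs per byte, so it can reach only about 0.05% of peak compute. Fusing operations so that more work happens per byte loaded is what moves a kernel toward the ridge point.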

2.2 Agent Autonomy and Research Frameworks

The progression from single-agent to multi-agent systems reflects increasing sophistication in autonomous research capabilities. Karpathy's Auto Research project established a foundation for single-agent iterative improvement, wherein a solitary agent generates hypotheses, implements experiments, and refines approaches based on results. The AutoLab framework extends this paradigm through distributed multi-agent collaboration, enabling parallel hypothesis exploration and specialized role assignment across distinct agent types.

Infrastructure frameworks supporting agent operations have evolved to emphasize transparency and accessibility. The Skills framework introduces file-based context that transforms zero-shot tasks into few-shot scenarios by providing agents with concrete examples—a familiar paradigm from machine learning transferred to agent orchestration. The OpenCode framework provides configuration mechanisms for agent definition, including skill specifications, prompt templates, and sub-agent role assignment.

3. Core Analysis

3.1 CUDA Kernel Optimization Through Agent-Driven Development

The first engineering challenge demonstrates agent capability in low-level hardware optimization through CUDA kernel generation. Agents successfully produce both valid and optimized CUDA kernels, as evidenced in GPU MODE hackathons and kernel benchmarking exercises. Applied to Qwen 3 8B on H100 hardware, this approach achieved a 94% performance improvement, primarily through compatibility-focused enhancements rather than algorithmic innovations.

The optimization process leverages the Hugging Face Kernels library, which distributes kernels via TOML configuration files specifying hardware compatibility and CUDA version requirements. This distribution mechanism addresses the fundamental challenge that kernel installation represents a complex matrix of dependencies spanning multiple hardware and software dimensions. The Skills framework plays a crucial role by providing agents with file-based context containing examples of kernel writing and usage patterns, effectively converting zero-shot kernel generation tasks into few-shot scenarios with concrete reference implementations.
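The compatibility matrix described above can be pictured as a lookup from the local environment to a prebuilt variant. The following is a minimal sketch of that idea in plain Python; the `Build` fields, the version strings, and `select_build` are hypothetical and do not reproduce the Kernels library's actual configuration format or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Build:
    """One prebuilt kernel variant (hypothetical fields)."""
    cuda: str            # CUDA toolkit the build targets, e.g. "12.4"
    torch: str           # PyTorch minor version, e.g. "2.5"
    capabilities: tuple  # GPU compute capabilities, e.g. ("9.0",)


def select_build(builds, cuda: str, torch: str,
                 capability: str) -> Optional[Build]:
    """Return the first build matching the local environment, else None."""
    for build in builds:
        if (build.cuda == cuda and build.torch == torch
                and capability in build.capabilities):
            return build
    return None


# Illustrative matrix; the version combinations are made up.
BUILDS = [
    Build(cuda="12.4", torch="2.5", capabilities=("8.0", "9.0")),
    Build(cuda="12.6", torch="2.6", capabilities=("9.0",)),
]

# Compute capability 9.0 corresponds to Hopper GPUs such as the H100.
match = select_build(BUILDS, cuda="12.6", torch="2.6", capability="9.0")
missing = select_build(BUILDS, cuda="11.8", torch="2.6", capability="9.0")
```

Here `match` resolves to the second build and `missing` is `None`, which is the case a distribution mechanism has to handle explicitly, for example by falling back to a reference implementation.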

The Upskill library introduces a comparative evaluation framework enabling systematic assessment of different models executing identical skills. This capability allows practitioners to optimize cost-performance tradeoffs by empirically measuring model accuracy and token efficiency across kernel generation tasks. The framework reveals that agent effectiveness in kernel optimization stems not merely from code generation capabilities but from understanding the relationship between memory access patterns and computational efficiency—specifically, the principle of keeping GPUs computationally active while minimizing idle time during memory operations.
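The cost-performance tradeoff described here amounts to choosing the cheapest model that still clears an accuracy bar. The sketch below shows one way to express that selection; it is not the Upskill API, the model labels are hypothetical, and the numbers are placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class SkillResult:
    model: str    # hypothetical model label
    passed: int   # test cases the model's output passed
    total: int    # test cases attempted
    tokens: int   # tokens consumed across all attempts

    @property
    def accuracy(self) -> float:
        return self.passed / self.total


def cheapest_meeting(results, min_accuracy: float):
    """Pick the lowest-token model whose accuracy clears the threshold."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    return min(eligible, key=lambda r: r.tokens) if eligible else None


# Placeholder numbers for illustration only; not measured results.
results = [
    SkillResult("large-model", passed=9, total=10, tokens=120_000),
    SkillResult("small-model", passed=8, total=10, tokens=30_000),
    SkillResult("tiny-model", passed=4, total=10, tokens=10_000),
]

choice = cheapest_meeting(results, min_accuracy=0.8)
```

With these placeholder values the selection lands on `small-model`: it meets the 80% bar at a quarter of the larger model's token cost, while the cheapest model is excluded for falling below the threshold.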

3.2 Zero-Shot Model Fine-Tuning Workflows

The second engineering challenge examines agent capability in executing complete model fine-tuning workflows without human intervention. Agents successfully perform zero-shot fine-tuning tasks, demonstrated by training Qwen 3 0.6B on chain-of-thought datasets. This capability integrates multiple infrastructure components: GPU access provisioning, Hugging Face CLI operations, and training orchestration.

The fine-tuning workflow demonstrates full integration with Hugging Face Hub infrastructure, providing agents with direct access to computational resources and model repositories. Integration with Unsloth extends capabilities through optimized model variants and cost-reduced training options. The success of zero-shot fine-tuning reveals that agents can orchestrate complex multi-step workflows when provided with appropriate primitives, moving beyond simple code generation to complete task execution including resource allocation, training monitoring, and artifact management.

3.3 Multi-Agent Automated Research Systems

The most sophisticated engineering challenge implements AutoLab, a distributed multi-agent research system that advances beyond single-agent iterative improvement through role specialization and parallel exploration. The architecture comprises four distinct agent types, each with specialized responsibilities: the Researcher agent searches academic literature via HF Papers, the Planner agent maintains a job queue and coordinates work distribution, Worker agents implement hypotheses as executable training scripts, and the Reporter agent monitors job execution and maintains experimental dashboards.
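The division of labor between the Planner and the Worker agents can be sketched as a simple job queue. This is a single-process illustration under stated assumptions, not AutoLab's implementation: the class names, the status values, and the `train` callable (standing in for a generated training script) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    hypothesis: str
    status: str = "queued"  # queued -> running -> done


@dataclass
class Planner:
    """Holds the job queue and hands work to idle workers."""
    queue: deque = field(default_factory=deque)

    def submit(self, hypothesis: str) -> Job:
        job = Job(hypothesis)
        self.queue.append(job)
        return job

    def next_job(self):
        if not self.queue:
            return None
        job = self.queue.popleft()
        job.status = "running"
        return job


def run_worker(planner: Planner, train) -> list:
    """Drain the queue, calling `train` once per hypothesis."""
    finished = []
    while (job := planner.next_job()) is not None:
        train(job.hypothesis)
        job.status = "done"
        finished.append(job)
    return finished


planner = Planner()
# Stand-ins for hypotheses a Researcher agent might propose.
for idea in ["raise learning rate", "add warmup"]:
    planner.submit(idea)

done = run_worker(planner, train=lambda hypothesis: None)
```

In a distributed setting several workers would pull from the same queue concurrently, and a Reporter would read the job statuses rather than the workers' internal state.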

Agents operate within Git repositories where the main branch contains training scripts and data structures tracking experimental scores. This repository-centric architecture enables version control of agent-generated code and provides a shared state mechanism for coordination. Worker agents integrate with HF Jobs to execute training on Hub infrastructure, submitting patches back to repositories upon completion. This distributed execution model enables parallel hypothesis testing, with agents working simultaneously on different experimental approaches over extended periods.
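One way to read the repository-centric design is as a keep-if-better rule: every attempt is recorded, but the main branch only advances when a patch improves the tracked score. The sketch below models that rule with a plain dictionary standing in for repository state. The field names are hypothetical, and it assumes lower scores are better (for example, validation loss), which the source does not specify.

```python
def apply_if_better(state: dict, patch_name: str, score: float) -> bool:
    """Record every attempt; advance 'main' only on improvement.

    Assumes lower scores are better (e.g. validation loss).
    """
    state["history"].append({"patch": patch_name, "score": score})
    if score < state["best_score"]:
        state["best_score"] = score
        state["main"] = patch_name
        return True
    return False


# Dictionary standing in for the repository's tracked state.
state = {"main": "baseline", "best_score": 2.0, "history": []}

accepted = [
    apply_if_better(state, "patch-a", 2.3),  # worse than baseline: rejected
    apply_if_better(state, "patch-b", 1.5),  # better: becomes the new main
    apply_if_better(state, "patch-c", 1.9),  # worse than new best: rejected
]
```

Keeping rejected attempts in the history matters as much as the accepted ones, since later agents can use it to avoid repeating failed experiments.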

The Trackio open-source dashboard provides critical infrastructure through a completely open parquet data layer, allowing both agents and humans to access raw metrics without requiring dashboard mediation. This design choice proves essential for agent operation, as it eliminates API abstraction layers that could constrain agent capabilities. Trackio supports events, warnings, notifications, and custom visualizations including Gantt charts for tracking parallel execution, without imposing specific dashboard structures on agent-generated data.

Agent communication occurs through structured tables and data structures rather than natural language, enabling precise coordination. Configuration templates incorporate current state, job history, successful experiments, and hyperparameters, providing agents with comprehensive context for decision-making. The use of HF bucket storage maintains all artifacts in a single location, avoiding repeated upload/download operations that would impede distributed workflows.
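A configuration template of the kind described above can be filled from structured state before each agent turn. The following sketch shows the general shape; the template text, column names, and `build_context` function are illustrative assumptions rather than AutoLab's actual templates.

```python
def render_table(rows: list, columns: list) -> str:
    """Format dict rows as a pipe-delimited table with a header line."""
    lines = [" | ".join(columns)]
    for row in rows:
        lines.append(" | ".join(str(row[c]) for c in columns))
    return "\n".join(lines)


# Hypothetical template; the field names are assumptions.
TEMPLATE = """Current best score: {best}
Job history:
{history}
Hyperparameters: {hparams}"""


def build_context(best: float, jobs: list, hparams: dict) -> str:
    """Fill the template with the shared state an agent reads before acting."""
    return TEMPLATE.format(
        best=best,
        history=render_table(jobs, ["id", "status", "score"]),
        hparams=", ".join(f"{k}={v}" for k, v in sorted(hparams.items())),
    )


context = build_context(
    best=1.5,
    jobs=[
        {"id": 1, "status": "done", "score": 2.3},
        {"id": 2, "status": "done", "score": 1.5},
    ],
    hparams={"lr": 0.001, "batch_size": 32},
)
```

Because the table is generated from the same records the agents write, every agent sees an identical, unambiguous view of the experiment state instead of a free-text summary.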

4. Technical Insights

4.1 Infrastructure Requirements for Agent Scalability

The analysis reveals that agent effectiveness depends fundamentally on infrastructure characteristics rather than solely on model capabilities. Three essential infrastructure components emerge as requirements for scaling agent-driven workflows: storage systems for artifact persistence, tracking mechanisms for experiment monitoring, and compute resources for execution. The Hugging Face Hub provides integrated implementations of these components, enabling agents to operate without requiring complex infrastructure orchestration.

A critical architectural principle emerges regarding API design for agent systems: abstracted APIs create capability ceilings that limit agent effectiveness, whereas exposure of underlying primitives enables agents to compose novel workflows. This principle contradicts conventional API design wisdom that emphasizes abstraction, suggesting that systems designed for agent operation require fundamentally different interface characteristics than those designed for human developers. The observation that "agents work really well with primitives and open primitives" indicates that exposing full control proves more valuable than adding abstraction layers in agent contexts.

4.2 Efficiency Considerations in Deep Learning Systems

The analysis identifies three distinct categories of efficiency in deep learning systems: compute efficiency measured in FLOPs, memory efficiency concerning data movement latency, and overhead efficiency encompassing Python environment and PyTorch dispatch costs. Agent-driven optimization addresses all three categories, though memory efficiency typically provides the largest optimization opportunity given hardware characteristics of modern GPUs.
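The three efficiency categories can be combined into a rough per-step time estimate that names the dominant cost. The model below is a deliberate simplification, assuming compute and memory traffic overlap perfectly while host-side overhead is paid serially per kernel launch. The function name and all input values are illustrative, not measurements.

```python
def step_time(flops, bytes_moved, launches,
              peak_flops, peak_bandwidth, overhead_per_launch):
    """Estimate per-step time and name the dominant cost.

    Simplifying assumption: compute and memory traffic overlap perfectly,
    while host-side overhead (Python, dispatch) is paid serially per launch.
    """
    costs = {
        "compute": flops / peak_flops,
        "memory": bytes_moved / peak_bandwidth,
        "overhead": launches * overhead_per_launch,
    }
    total = max(costs["compute"], costs["memory"]) + costs["overhead"]
    bottleneck = max(costs, key=costs.get)
    return total, bottleneck, costs


# Illustrative inputs, not measurements.
total, bottleneck, costs = step_time(
    flops=2e12, bytes_moved=3e10, launches=100,
    peak_flops=1e15, peak_bandwidth=3e12, overhead_per_launch=1e-5,
)
```

With these inputs the memory term (10 ms) dominates the compute term (2 ms) and the overhead term (1 ms), matching the observation that memory efficiency is usually the largest opportunity. Raising the launch count or shrinking the tensors shifts the bottleneck toward overhead instead.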

The verification of agent-generated work proves relatively straightforward for certain task categories. Model training and CUDA kernel writing represent verifiable experiments where correctness and performance can be objectively measured, enabling automated evaluation without human judgment. This characteristic makes these domains particularly suitable for agent operation, as agents can autonomously validate their outputs through empirical testing.
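What makes these tasks verifiable is that a candidate implementation can be checked against a trusted reference and timed, with no human judgment involved. The sketch below shows that loop in pure Python, using a sum-of-squares function as a stand-in for a kernel. A real harness would compare GPU tensors and use device-side timing, and every name here is hypothetical.

```python
import math
import random
import time


def verify(candidate, reference, make_input, trials=20, rel_tol=1e-6):
    """Check that `candidate` matches `reference` on random inputs."""
    for _ in range(trials):
        x = make_input()
        expected, actual = reference(x), candidate(x)
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-9):
            return False
    return True


def best_time(fn, x, repeats=5):
    """Best-of-N wall-clock time for one call, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        timings.append(time.perf_counter() - start)
    return min(timings)


def reference_sum_squares(xs):
    total = 0.0
    for v in xs:
        total += v * v
    return total


def candidate_sum_squares(xs):
    return math.fsum(v * v for v in xs)


def broken_sum_squares(xs):
    return sum(xs)  # deliberately wrong, to show a rejection


random.seed(0)


def make_input():
    return [random.uniform(-1.0, 1.0) for _ in range(1000)]


ok = verify(candidate_sum_squares, reference_sum_squares, make_input)
rejected = not verify(broken_sum_squares, reference_sum_squares, make_input)
elapsed = best_time(candidate_sum_squares, make_input())
```

The correctness check gates the timing: a candidate that fails `verify` is discarded no matter how fast it runs, which is what lets an agent iterate on performance without a human in the loop.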

5. Discussion

The findings demonstrate that coding agents have progressed beyond simple code generation to tackle genuine systems engineering challenges requiring multi-step reasoning, hardware understanding, and workflow orchestration. The 94% performance improvement in kernel optimization and successful autonomous research system implementation provide empirical evidence that agents can deliver measurable value in production engineering contexts.

The emphasis on open primitives over abstracted APIs suggests a broader principle for infrastructure design in the age of autonomous systems. Traditional software engineering prioritizes abstraction to reduce complexity and prevent errors, but agent systems may benefit from opposing design principles that maximize flexibility and composability. This tension between human-oriented and agent-oriented design philosophies represents an important consideration for infrastructure developers.

Several limitations and areas for future investigation emerge from this analysis. The focus on verifiable experiments—tasks with objective success criteria—raises questions about agent applicability to less structured engineering problems. The multi-agent research system demonstrates capability in hypothesis generation and testing, but the analysis does not address how agents handle ambiguous requirements or creative problem-solving scenarios lacking clear verification mechanisms. Additionally, the cost-benefit analysis of agent-driven development remains underexplored, particularly regarding the computational costs of agent operation versus human engineering time.

The integration of agents with existing development workflows presents both opportunities and challenges. The Git-centric architecture of AutoLab demonstrates one approach to incorporating agent-generated code into version control systems, but questions remain regarding code review processes, testing standards, and maintenance responsibilities for agent-generated artifacts. As agents become more capable, the boundary between human and agent contributions may blur, necessitating new practices for attribution, accountability, and quality assurance.

6. Conclusion

This analysis demonstrates that coding agents can effectively address complex AI systems engineering challenges when provided with appropriate infrastructure primitives. The progression from CUDA kernel optimization through model fine-tuning to multi-agent research systems illustrates increasing levels of agent autonomy and capability. Key contributions include empirical validation of agent-driven performance optimization, identification of infrastructure requirements for agent scalability, and articulation of design principles emphasizing open primitives over abstracted APIs.

Practical takeaways for practitioners include the recognition that verifiable engineering tasks represent optimal domains for initial agent deployment, the importance of infrastructure supporting storage, tracking, and compute in integrated form, and the value of file-based context mechanisms for transforming zero-shot tasks into few-shot scenarios. The success of distributed multi-agent systems suggests that future development will increasingly involve coordinated autonomous agents working in parallel on complex problems, with human engineers focusing on problem formulation and result validation rather than implementation details. As agents continue to mature, the engineering community must adapt by addressing increasingly fundamental problems closer to hardware operations, leveraging agents as force multipliers for tackling AI systems engineering complexity at scale.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
