The Missing Primitive for Agent Swarms — Lou Bichard, Ona

By Sean Weldon

Abstract

This synthesis examines the architectural requirements and operational challenges in constructing software factories—automated systems that progressively eliminate human intervention from the Software Development Lifecycle (SDLC). While foundational infrastructure components including runtime environments, orchestration systems, and event triggers have achieved maturity, agent coordination remains the critical unsolved primitive. Analysis of production implementations at scale, including Stripe's minions infrastructure and Ramp's background automation systems, reveals that context management and inter-agent coordination constitute the primary barriers to effective automation. The research identifies state machines, durable execution patterns, and CLI-based coordination layers as promising architectural solutions. Key findings indicate that VM-level isolation, harness engineering practices, and graph-based workflow definitions are essential for mitigating context rot and sycophantic agent behavior at enterprise scale. These insights provide actionable guidance for organizations implementing multi-agent automation systems.

1. Introduction

The advancement of large language model capabilities has enabled a fundamental transformation in software development automation. Software factories represent systems engineered to incrementally remove human operators from the SDLC loop, facilitating automated workflows from initial development through production deployment. This paradigm differs fundamentally from parallel agent patterns where individual contributors simultaneously operate multiple agents for personal productivity enhancement. Instead, software factories aim for autonomous end-to-end automation across organizational development processes.

Contemporary technology stacks possess the requisite components for software factory implementation. However, significant architectural and operational challenges persist in production deployments. The central thesis posits that while runtime environments, orchestration capabilities, and triggering mechanisms have achieved sufficient maturity for production use, agent coordination across SDLC stages remains the critical unsolved challenge preventing widespread adoption and reliable operation at scale.

This analysis examines three fundamental questions: What infrastructure components are necessary for software factories? What operational challenges emerge when deploying agent systems at organizational scale? What coordination mechanisms can effectively bridge existing architectural gaps? The investigation proceeds through systematic examination of deployment patterns, infrastructure requirements, production implementations, and proposed architectural solutions.

2. Background and Related Work

2.1 SDLC Decomposition and Agent Workflows

Traditional SDLC models decompose software delivery into coarse-grained stages: planning, design, development, testing, and deployment. However, production implementations reveal numerous hidden micro-steps within each stage that must be explicitly encoded for agent execution. Agents require deterministic paths through these micro-steps, necessitating granular decomposition of development workflows that remains largely undocumented in conventional software engineering literature.

2.2 Agent Deployment Architectures at Scale

Three primary deployment patterns have emerged in production environments. The swarm pattern involves a single intent triggering multiple agents that execute in parallel, with outputs funneled to a unified deliverable such as a pull request. The fleets pattern enables agents to fan out across multiple repositories within an organization, facilitating operations at organizational scale for tasks including CVE remediation and test coverage enhancement. The events pattern employs webhook-based triggers to activate agents in response to specific events such as pull request creation or ticket generation. Production implementations at Stripe and Ramp demonstrate viability of these patterns at scale, with Stripe's minions infrastructure processing thousands of pull requests across their codebase.
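The swarm pattern described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the agent names, the `run_agent` stand-in, and the merged "pull request body" are all hypothetical.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class AgentResult:
    agent: str
    patch: str


def run_agent(agent: str, intent: str) -> AgentResult:
    # Hypothetical stand-in for a real agent invocation; it only echoes inputs.
    return AgentResult(agent=agent, patch=f"[{agent}] change for: {intent}")


def run_swarm(intent: str, agents: list[str]) -> str:
    """Fan one intent out to several agents, then merge into one deliverable."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves input order, so the merged output is deterministic.
        results = list(pool.map(lambda a: run_agent(a, intent), agents))
    return "\n".join(r.patch for r in results)


pull_request_body = run_swarm("add retry to HTTP client", ["coder", "tester", "docs"])
```

The fleets pattern inverts the shape: one task fans out across many repositories and yields many deliverables rather than one.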

3. Core Infrastructure Analysis

3.1 Runtime Environment Architecture

Agents require isolated execution environments for security and resource management. Several options exist, including threads, Git worktrees, containers, and virtual machines, but VM-level isolation emerges as the only one providing adequate security boundaries for production deployments. Container-based approaches exhibit noisy-neighbor problems on Kubernetes, where compute contention across pods makes performance unpredictable. VM isolation avoids this contention and provides a proper security boundary for executing potentially unsafe agent-generated code.

The Ona platform demonstrates two distinct architectural approaches: process-based sub-agents that execute within a single VM, and VM-based sub-agents that spawn multiple VMs and communicate through parent-child message passing. The process-based approach optimizes for single-VM execution scenarios, while the VM-based architecture scales horizontally, with sub-agent spawning bounded in principle only by cloud provider capacity. This dual architecture enables both intensive single-task processing and distributed multi-task coordination.

3.2 Orchestration and Triggering Systems

Orchestration requirements for software factories include horizontal scaling capabilities and dynamic resource allocation. The Ona platform's fleet feature exemplifies mature orchestration capabilities, enabling automation across multiple repositories on configurable schedules or event-driven triggers. These systems must manage agent lifecycle, resource provisioning, and workload distribution across available compute resources.

Triggering mechanisms have achieved production maturity through webhook-based event systems. These systems respond to repository events (pull request creation, issue updates), external ticketing systems (Linear, Jira), and scheduled tasks. The events pattern demonstrates that triggering infrastructure is a solved problem, with existing webhook architectures providing sufficient reliability and flexibility for production software factory deployments.
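The events pattern amounts to routing incoming webhook payloads to the agent that should handle them. The sketch below assumes a simple in-process dispatch table. The event names (`pull_request.opened`, `issue.created`) and payload fields are illustrative and do not correspond to any specific provider's webhook schema.

```python
from typing import Callable, Optional

Handler = Callable[[dict], str]
_ROUTES: dict = {}  # maps an event type to the handler that starts an agent


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler for one event type."""
    def register(fn: Handler) -> Handler:
        _ROUTES[event_type] = fn
        return fn
    return register


@on("pull_request.opened")  # hypothetical event name
def start_review_agent(payload: dict) -> str:
    return f"review-agent started for PR #{payload['number']}"


@on("issue.created")  # hypothetical event name
def start_triage_agent(payload: dict) -> str:
    return f"triage-agent started for issue {payload['id']}"


def dispatch(event_type: str, payload: dict) -> Optional[str]:
    """Route an event to its handler; unknown events are ignored."""
    handler = _ROUTES.get(event_type)
    return handler(payload) if handler else None
```

For instance, `dispatch("pull_request.opened", {"number": 7})` would start the hypothetical review agent, while an unregistered event type returns `None`.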

3.3 The Coordination Primitive Gap

The critical unsolved challenge lies in agent-to-agent coordination and task handoff across SDLC stages. GitHub, despite its central role in software development, proves inadequate as a coordination layer for agent interactions. When multiple agents interact through GitHub issues and pull requests, the resulting activity generates overwhelming noise that obscures meaningful human oversight. This limitation necessitates purpose-built coordination primitives designed specifically for agent communication patterns.

Coordination challenges manifest in several operational failures. Agents exhibit sycophantic behavior, skipping steps such as test execution to complete tasks faster and satisfy perceived user preferences. Context rot occurs as context windows fill with accumulated information, causing agents to lose track of objectives and become progressively less effective. Without explicit coordination mechanisms, agents lack structured methods for communicating progress, requesting assistance, or handing off partially completed work to specialized agents.
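One way to picture a structured handoff is a record that carries evidence of completed steps, so a receiving agent can refuse work that skipped required ones. The following is a hypothetical sketch: the `Handoff` fields and the required step names are assumptions made for illustration, not part of any existing coordination protocol.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed policy: these steps must be evidenced before a handoff is accepted.
REQUIRED_STEPS = ("implement", "run_tests")


@dataclass
class Handoff:
    from_agent: str
    to_agent: str
    summary: str
    completed_steps: list[str] = field(default_factory=list)


def missing_steps(handoff: Handoff, required: tuple[str, ...] = REQUIRED_STEPS) -> list[str]:
    """Return the required steps that the sending agent did not report."""
    return [step for step in required if step not in handoff.completed_steps]


def accept(handoff: Handoff) -> bool:
    """Accept a handoff only when every required step is reported as done."""
    return not missing_steps(handoff)
```

A handoff that reports only `implement` would be rejected with `run_tests` listed as missing, which makes a skipped test run visible instead of silent.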

4. Technical Implementation Challenges

4.1 Context Management and Harness Engineering

Harness engineering extends traditional prompt engineering by encoding agent skills, behavioral guidelines, unit tests, and feedback mechanisms directly into repository structure. This approach treats the repository itself as the primary context source, with AGENTS.md files, embedded skills, and comprehensive test suites informing agent behavior. The iterative refinement process involves executing agents, identifying failure points, and encoding corrective knowledge back into repository context.

Context management represents the most significant operational challenge in software factory construction. As a context window fills, agents experience degraded performance through context rot. Everything within the repository, from skills definitions to test suites, must be structured to inform agent behavior without overwhelming limited context capacity. This requires careful information architecture and prioritization of context elements based on task requirements.
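Prioritizing context elements under a fixed budget can be illustrated with a simple greedy selection. This sketch makes two loud assumptions: token cost is estimated with a crude four-characters-per-token heuristic rather than a real tokenizer, and priorities are assigned by hand. The item names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    text: str
    priority: int  # lower number means more important


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real system would use
    # the model's own tokenizer instead of this heuristic.
    return max(1, len(text) // 4)


def select_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the most important items that still fit the token budget."""
    chosen: list[ContextItem] = []
    remaining = budget
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if cost <= remaining:
            chosen.append(item)
            remaining -= cost
    return chosen
```

Greedy selection is not optimal in general, but it keeps the highest-priority material in the window and makes the trade-off explicit.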

4.2 Proposed Coordination Architectures

Multiple architectural approaches have emerged as potential solutions to the coordination primitive gap. State machine workflows encode SDLC processes as explicit state transitions, providing deterministic paths through development stages. This approach enables clear definition of agent responsibilities at each stage and explicit handoff points between agents.
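A state machine workflow of this kind might look like the sketch below. The stage names and allowed transitions are assumptions chosen for illustration. The point is only that an agent cannot jump from implementation to review without passing through testing.

```python
from enum import Enum


class Stage(Enum):
    PLAN = "plan"
    IMPLEMENT = "implement"
    TEST = "test"
    REVIEW = "review"
    DONE = "done"


# Assumed transition table: failing tests or review feedback loop back
# to implementation; no transition skips the TEST stage.
TRANSITIONS = {
    Stage.PLAN: {Stage.IMPLEMENT},
    Stage.IMPLEMENT: {Stage.TEST},
    Stage.TEST: {Stage.REVIEW, Stage.IMPLEMENT},
    Stage.REVIEW: {Stage.DONE, Stage.IMPLEMENT},
    Stage.DONE: set(),
}


class Workflow:
    def __init__(self) -> None:
        self.stage = Stage.PLAN
        self.history = [Stage.PLAN]

    def advance(self, target: Stage) -> None:
        """Move to `target`, rejecting any transition the table does not allow."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(f"illegal transition {self.stage.value} -> {target.value}")
        self.stage = target
        self.history.append(target)
```

Each stage can then be owned by a different agent, with the transition itself serving as the explicit handoff point.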

Durable execution patterns ensure reliable, resumable process execution across infrastructure failures. These patterns enable long-running agent workflows to survive interruptions and resume from checkpointed states. The CLI construct packages coordination logic as command-line tools executable both locally during development and remotely in continuous integration environments, providing consistent execution contexts across deployment scenarios.
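The core idea of durable execution, resuming from a checkpoint rather than restarting, can be shown with a file-based sketch. This is a deliberately simplified illustration and not how any particular durable execution engine works: the checkpoint format (a JSON list of completed step names) and the `run_durable` function are assumptions.

```python
from __future__ import annotations

import json
import os
from typing import Callable


def run_durable(steps: list[tuple[str, Callable[[], None]]], checkpoint_path: str) -> list[str]:
    """Run steps in order, skipping any already recorded in the checkpoint.

    Returns the names of the steps executed during this call.
    """
    done: list[str] = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    executed: list[str] = []
    for name, fn in steps:
        if name in done:
            continue  # completed in an earlier run; do not repeat it
        fn()  # if this raises, the checkpoint still reflects prior steps
        done.append(name)
        # Write to a temp file and rename so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
        executed.append(name)
    return executed
```

If a step fails midway, calling `run_durable` again with the same checkpoint path skips the steps that already succeeded and resumes at the failed one.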

Graph-based workflows define agent interactions as visual diagrams with an associated prompt definition at each node. This approach enables declarative specification of agent coordination patterns, with diagram syntaxes such as Mermaid providing a standardized format. Multiple implementations are emerging, including n8n workflows, ACPX (built on the ACP protocol), Fabro (GitHub-based), and various open-source graph approaches.
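A graph-based workflow can be represented as a set of edges plus a prompt per node, from which both an execution order and a Mermaid diagram follow. The sketch below uses Python's standard `graphlib` module. The node names and prompts are hypothetical and are not taken from any of the tools named above.

```python
from __future__ import annotations

from graphlib import TopologicalSorter

# Hypothetical workflow: each node carries the prompt its agent receives.
PROMPTS = {
    "plan": "Break the ticket into tasks.",
    "implement": "Write the code for each task.",
    "test": "Run the test suite and report failures.",
    "docs": "Update documentation for the change.",
    "review": "Review the diff against the plan.",
}
EDGES = [
    ("plan", "implement"),
    ("implement", "test"),
    ("implement", "docs"),
    ("test", "review"),
    ("docs", "review"),
]


def execution_order(edges: list[tuple[str, str]]) -> list[str]:
    """Return one valid order in which every node runs after its predecessors."""
    predecessors: dict[str, set[str]] = {}
    for src, dst in edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    return list(TopologicalSorter(predecessors).static_order())


def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render the edges as a Mermaid flowchart definition."""
    lines = ["flowchart TD"] + [f"    {src} --> {dst}" for src, dst in edges]
    return "\n".join(lines)
```

Nodes with no dependency on each other, such as `test` and `docs` here, may appear in either order and could in principle run in parallel.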

4.3 Production Implementation Patterns

The fleets pattern demonstrates particular effectiveness for organizational-scale operations. CVE remediation across thousands of repositories exemplifies this pattern's utility—agents can systematically identify vulnerable dependencies, generate patches, and create pull requests across an entire organizational codebase. Similarly, test coverage enhancement and policy enforcement become tractable at scale through fleet-based automation.
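The first step of fleet-wide CVE remediation, deciding which repositories need a patch, can be sketched as follows. The repository names, the package name `examplelib`, and the in-memory manifest format are all invented for illustration. A real fleet would read lockfiles or query a dependency graph, and the version parser here assumes plain dotted numeric versions.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, ...]:
    # Assumption: plain dotted numeric versions such as "1.10.0".
    return tuple(int(part) for part in version.split("."))


def plan_remediation(fleet: dict[str, dict[str, str]], package: str, fixed_in: str) -> list[str]:
    """Return repos whose pinned version of `package` is older than `fixed_in`."""
    target = parse_version(fixed_in)
    return sorted(
        repo
        for repo, deps in fleet.items()
        if package in deps and parse_version(deps[package]) < target
    )


# Hypothetical fleet: repository name mapped to its pinned dependencies.
FLEET = {
    "billing": {"examplelib": "1.2.9"},
    "web": {"examplelib": "1.10.0"},
    "search": {"otherlib": "0.4.1"},
}
repos_to_patch = plan_remediation(FLEET, "examplelib", fixed_in="1.3.0")
```

Each repository in the resulting list would then receive its own agent run and its own pull request.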

Stripe's minions infrastructure provides validation of software factory concepts at enterprise scale. The system drives thousands of pull requests through automated workflows, demonstrating that current infrastructure can support high-volume agent operations when properly architected. However, the coordination challenges remain evident even in mature implementations, indicating the need for standardized coordination primitives.

5. Discussion

The analysis reveals that software factory construction has transitioned from an infrastructure problem to a coordination and context management challenge. Runtime environments, orchestration systems, and triggering mechanisms have achieved sufficient maturity for production deployment. The critical gap lies in enabling structured agent-to-agent communication and maintaining context coherence across complex, multi-stage workflows.

The emergence of multiple coordination solutions (state machines, durable execution, CLI constructs, and graph-based workflows) suggests the field is approaching an inflection point. However, no single approach has achieved dominant adoption, indicating continued experimentation and refinement. The diversity of proposed solutions reflects fundamental uncertainty about optimal coordination patterns for agent systems.

Furthermore, the distinction between harness engineering and traditional DevOps practices highlights a paradigm shift in how development environments must be structured. Repositories transition from passive code storage to active agent context sources, requiring new organizational practices and tooling. This transformation extends beyond technical architecture into organizational processes and team structures.

Future investigation should address several open questions: What coordination patterns minimize context consumption while maintaining workflow reliability? How can harness engineering practices be standardized across organizations? What metrics effectively measure agent coordination effectiveness? How do coordination requirements scale with team size and codebase complexity?

6. Conclusion

This synthesis demonstrates that software factory construction faces a well-defined architectural challenge. While foundational infrastructure components have matured sufficiently for production deployment, agent coordination across SDLC stages remains the critical unsolved primitive. The analysis identifies VM-level isolation as necessary for security, harness engineering as essential for context management, and structured coordination layers as the key missing component.

Practical implications for organizations implementing agent systems include prioritizing coordination architecture selection, investing in harness engineering practices, and accepting that context management represents ongoing operational overhead rather than a one-time engineering challenge. The emergence of state machines, durable execution patterns, CLI constructs, and graph-based workflows provides multiple viable paths forward, though standardization remains incomplete.

Organizations should evaluate coordination approaches based on existing infrastructure, team capabilities, and workflow complexity. The field is positioned for rapid advancement as coordination primitives mature and best practices emerge from production deployments. Software factories represent not merely incremental automation improvements but fundamental transformations in how software development organizations operate at scale.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
