Full Walkthrough: Workflow for AI Coding - Matt Pocock
Engineering Workflows for LLM-Assisted Software Development: A Framework for Human-AI Collaboration
By Sean Weldon
Abstract
This research synthesis examines software engineering methodologies optimized for Large Language Model (LLM) integration in development workflows. The central thesis posits that traditional software engineering fundamentals remain essential when working with AI agents, but require systematic adaptation to accommodate LLM-specific constraints, particularly context window limitations and attention degradation patterns. The analysis presents a structured framework combining human-led architectural planning with autonomous agent implementation, utilizing vertical slicing, test-driven development, and parallel execution patterns. Key findings demonstrate that LLM performance degrades predictably beyond approximately 100,000 tokens due to quadratic attention scaling, necessitating task decomposition strategies that maintain operations within the "smart zone." The framework emphasizes code visibility through deep module architecture, rigorous testing protocols, and strategic human oversight at critical decision points. Practical implications include significant productivity gains through parallelized agent workflows while preserving code quality and architectural integrity through deliberate human touchpoints in planning and review phases.
1. Introduction
The integration of Large Language Models into software development workflows represents a paradigm shift in engineering practice, yet the fundamental principles of software design retain their applicability. The primary challenge lies not in abandoning established methodologies, but in understanding how LLM-specific constraints—particularly context limitations and attention mechanisms—necessitate systematic adaptations to traditional approaches. As AI agents assume increasing responsibility for code generation, the question becomes not whether software engineering principles remain relevant, but rather how they must be restructured to accommodate the unique characteristics of LLM-based development assistants.
Context window constraints emerge as the primary consideration when delegating tasks to AI agents. While contemporary LLMs advertise context windows ranging from 200,000 to 1,000,000 tokens, practical performance considerations reveal a more nuanced operational reality. The smart zone, operationally defined as the initial approximately 100,000 tokens of context, represents the range where attention relationships remain computationally manageable and output quality remains consistently high. This limitation arises from the quadratic scaling of attention relationships: each token added to context must establish relationships with all existing tokens, analogous to adding teams to a sports league where the number of matchups grows quadratically rather than linearly.
This analysis presents a comprehensive framework for structuring AI-assisted development workflows that maintains human control over architectural decisions while enabling autonomous agent implementation. The methodology synthesizes established software engineering principles from Frederick P. Brooks, John Ousterhout, and The Pragmatic Programmer, adapting these concepts to the unique operational characteristics of LLM-based development agents. The framework emphasizes maintaining code as the primary artifact of focus, avoiding specifications-based approaches that treat AI as a black-box compiler.
2. Background and Related Work
The theoretical foundation for this approach integrates several established software engineering concepts adapted for human-AI collaboration. Frederick P. Brooks' notion of design concept from The Design of Design emphasizes the necessity of shared understanding between all participants when building novel systems. This principle becomes particularly salient when one participant is an AI agent with fundamentally different cognitive characteristics than human developers—specifically, the tendency to reset to base state when context is cleared, behaving analogously to a system with severe short-term memory limitations.
John Ousterhout's philosophy of deep modules versus shallow modules provides critical architectural guidance for AI-navigable codebases. Deep modules expose minimal interfaces while encapsulating substantial functionality, whereas shallow modules proliferate exports with limited individual capability. This architectural distinction proves essential for AI agent navigation, as shallow module structures require manual dependency tracking and present unclear testing boundaries, while deep modules enable comprehensive testing with well-defined boundaries.
The tracer bullet methodology, originating from The Pragmatic Programmer, advocates for vertical slices that traverse all system layers (database, API, frontend), providing integrated feedback early in development cycles. This contrasts with horizontal slicing approaches that implement complete layers sequentially, delaying feedback until final integration phases. The tracer bullet metaphor—phosphorescent ammunition that glows in flight, providing immediate aim feedback—captures the value of early integrated feedback across the entire system stack.
3. Core Analysis
3.1 LLM Operational Constraints and Task Decomposition
The operational characteristics of LLM attention mechanisms impose concrete constraints on task sizing and workflow structure. Performance degradation beyond the smart zone threshold occurs predictably due to the quadratic scaling of attention relationships. Each token added to context creates relationships with all existing tokens, resulting in computational complexity that grows as O(n²) rather than linearly. This mathematical reality necessitates task decomposition strategies that maintain operations within the smart zone.
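This scaling can be made concrete with a back-of-the-envelope sketch. It counts pairwise relationships only, which is illustrative rather than exact: real transformers compute a full n² grid of query-key scores per layer, but the quadratic growth rate is the same.

```typescript
// Number of pairwise relationships among n tokens: n * (n - 1) / 2.
// Illustrative only; the point is the quadratic growth, not the exact count.
function pairwiseRelationships(n: number): number {
  return (n * (n - 1)) / 2;
}

// Doubling the context roughly quadruples the relationship count:
const atSmartZone = pairwiseRelationships(100_000);
const atDouble = pairwiseRelationships(200_000);
console.log(atDouble / atSmartZone); // ≈ 4
```

This is the sports-league analogy in code: doubling the number of teams roughly quadruples the number of matchups, which is why task sizing cannot be treated as linear in token count.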
Context management strategies reveal important trade-offs between different approaches. Compacting, which involves summarizing conversation history to reduce token count, creates inconsistent state representations and unpredictable behavior. In contrast, clearing context provides predictable, repeatable behavior by returning the agent to its base system prompt state. This predictability proves essential for reliable agent behavior, despite the apparent disadvantage of losing conversation history. The analogy to severe anterograde amnesia captures this characteristic: the agent continually forgets and resets, making context clearing preferable to maintaining degraded, summarized state.
Task sizing must therefore conform to the principle of not "biting off more than you can chew"—a constraint applicable both to human developers and AI agents. Tasks structured to fit within the smart zone enable consistent performance, while tasks exceeding this threshold produce degraded outputs as the agent operates in the "dumb zone" where attention relationships become strained.
3.2 Establishing Shared Understanding Through Structured Interviewing
The Grill Me Skill represents a systematic approach to establishing shared understanding between human developers and AI agents. This technique involves the AI conducting relentless interviews covering every aspect of a planned implementation, walking down each branch of the design tree and resolving dependencies systematically. Sessions typically span 40-100 questions and generate substantial conversation history that functions as a design concept asset.
The purpose of this grilling process extends beyond producing documentation—it establishes genuine shared understanding analogous to Brooks' design concept. This distinction proves critical: the goal is not to generate a specification document that will be handed off to a different entity, but rather to ensure that the AI agent participating in implementation possesses the same contextual understanding as the human architect. This shared understanding enables more effective autonomous implementation, as the agent operates from the same conceptual foundation as the human designer.
This technique adapts effectively to various scenarios, including pair programming with domain experts or validation of assumptions from meeting transcripts. The conversation history generated during grilling sessions becomes a reusable asset that can be referenced during implementation phases, though the primary value lies in the shared understanding achieved rather than the artifact produced.
3.3 Vertical Slicing and Parallelization Strategies
The decomposition of work into vertical slices rather than horizontal layers enables both early feedback and parallelization. Horizontal slicing (implementing complete layers sequentially: database → API → frontend) delays feedback loop closure until final integration phases. Vertical slicing crosses all layers for a single feature, providing integrated feedback immediately and enabling independent parallel execution.
This architectural approach maps directly to Kanban board structures with blocking relationships, creating directed acyclic graphs (DAGs) where independent tasks can execute simultaneously. Sequential phase plans (Phase 1 → Phase 2 → Phase 3) can only be executed serially by a single agent, whereas Kanban boards with explicit dependency relationships allow multiple agents to work on independent slices concurrently. The first vertical slice should include schema changes, service creation, and minimal frontend representation, establishing the complete flow through all system layers.
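A minimal sketch of this DAG structure (task names invented, no particular tool's API assumed) shows how explicit blocking relationships expose the set of slices that independent agents could pick up concurrently:

```typescript
// Each backlog item lists the tasks that block it, forming a DAG.
type Task = { id: string; blockedBy: string[] };

// ready() returns every unfinished task whose blockers are all done -
// i.e. the tasks that parallel agents could start right now.
function ready(tasks: Task[], done: Set<string>): string[] {
  return tasks
    .filter((t) => !done.has(t.id))
    .filter((t) => t.blockedBy.every((b) => done.has(b)))
    .map((t) => t.id);
}

const backlog: Task[] = [
  { id: "schema", blockedBy: [] },
  { id: "feed-api", blockedBy: ["schema"] },
  { id: "search-api", blockedBy: ["schema"] },
  { id: "feed-ui", blockedBy: ["feed-api"] },
];

console.log(ready(backlog, new Set<string>()));           // ["schema"]
console.log(ready(backlog, new Set(["schema"])));         // ["feed-api", "search-api"]
```

Once the schema slice completes, two independent slices unblock at once, which is exactly the parallelism a sequential phase plan cannot express.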
The Ralph loop (or Ralph Wiggum pattern) provides an alternative approach: specify the end destination in a Product Requirements Document, then repeatedly instruct the AI to make small incremental changes toward that destination. While this loop works, multi-phase plans with vertical slices provide superior structure and control, enabling more predictable outcomes and clearer progress tracking.
3.4 Test-Driven Development as Quality Guardrail
Test-Driven Development (TDD) emerges as essential for maintaining AI-generated code quality, following the red-green-refactor cycle: write a failing test, implement until it passes, then refactor. This approach prevents AI agents from "cheating" at tests by writing implementation and tests simultaneously. When tests exist before implementation, the agent must genuinely satisfy the test's requirements rather than generating tests that trivially pass against an existing implementation.
The quality of feedback loops establishes the ceiling for AI output quality. AI agents code "blind" without robust feedback mechanisms; therefore, the quality of testing infrastructure, type checking, and other automated verification directly determines output quality. This observation emphasizes the importance of investing in comprehensive testing frameworks and type systems as foundational infrastructure for AI-assisted development.
Implementation becomes an AFK (away from keyboard) task once planning completes—humans exit the loop while agents work autonomously. The implementation prompt structure prioritizes tasks systematically: critical bugs, then infrastructure issues, then tracer bullets, then quick wins. This prioritization ensures that foundational elements are addressed before feature work proceeds.
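The prioritization described above can be sketched as a simple ordering; the category names mirror the prose, while the issues themselves are invented for illustration:

```typescript
// Priority tiers from the implementation prompt structure:
// critical bugs, then infrastructure, then tracer bullets, then quick wins.
type Category = "critical-bug" | "infrastructure" | "tracer-bullet" | "quick-win";
type Issue = { id: number; title: string; category: Category };

const priority: Record<Category, number> = {
  "critical-bug": 0,
  "infrastructure": 1,
  "tracer-bullet": 2,
  "quick-win": 3,
};

function orderBacklog(issues: Issue[]): Issue[] {
  return [...issues].sort((a, b) => priority[a.category] - priority[b.category]);
}

const ordered = orderBacklog([
  { id: 1, title: "polish empty state", category: "quick-win" },
  { id: 2, title: "crash on save", category: "critical-bug" },
  { id: 3, title: "first end-to-end slice", category: "tracer-bullet" },
  { id: 4, title: "flaky CI step", category: "infrastructure" },
]);
console.log(ordered.map((i) => i.id)); // [2, 4, 3, 1]
```

Encoding the ordering explicitly means an autonomous agent resuming from a cleared context can re-derive what to work on next without any conversational memory.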
3.5 Human Touchpoints: Code Review and Quality Assurance
While implementation can proceed autonomously, code review and QA represent critical human touchpoints where taste and architectural judgment are imposed on the codebase. Automating these phases entirely produces "slop"—technically functional but aesthetically and architecturally compromised code lacking human refinement. Code review volume increases significantly with AI delegation, necessitating balance through small, self-contained pull requests.
Critically, review should occur in the smart zone (with clear, manageable context) rather than the dumb zone, ensuring the reviewer operates with superior cognitive capacity compared to the implementer. This inverts potential scenarios where the reviewer operates under more constrained conditions than the implementer, leading to inadequate review quality.
Manual QA proves essential for verifying integrated system behavior end-to-end, not merely individual component functionality. QA phases generate new issues for the Kanban board, creating continuous feedback loops that refine implementation quality iteratively.
4. Technical Insights
The architecture of AI-navigable codebases requires deliberate attention to module depth and interface design. Deep modules with small interfaces but substantial encapsulated functionality enable straightforward testing and comprehension. A single test boundary can wrap the entire module, catching integration issues within the module scope. Conversely, shallow modules with numerous exports and limited individual functionality prove difficult for AI navigation, requiring manual dependency tracking and presenting unclear test boundaries.
The "improve codebase architecture" skill provides systematic identification of opportunities to deepen modules. This involves scanning for clusters of related modules that could be tested as single units, identifying coupling patterns and dependencies. Wrapping entire flows (frontend to backend) in single testable modules using discriminated unions or similar patterns dramatically improves AI capability to reason about and modify code. One implementation example demonstrated that wrapping a browser-based video editor flow in a single module substantially enhanced the AI's ability to make coherent changes across the entire flow.
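The pattern can be sketched with a hypothetical export flow (all names invented): a discriminated union makes every outcome of the entire flow a single value that one test boundary can assert on:

```typescript
// Every outcome of the whole flow is one member of a discriminated union.
type ExportResult =
  | { status: "success"; url: string }
  | { status: "validation-error"; field: string; message: string }
  | { status: "encoder-error"; message: string };

// The deep module: one exported function, the entire flow behind it.
function exportVideo(input: { title: string; durationMs: number }): ExportResult {
  if (input.title.trim() === "") {
    return { status: "validation-error", field: "title", message: "required" };
  }
  if (input.durationMs <= 0) {
    return { status: "encoder-error", message: "nothing to encode" };
  }
  return { status: "success", url: `/exports/${encodeURIComponent(input.title)}.mp4` };
}

// One test boundary wraps the whole flow:
const result = exportVideo({ title: "demo", durationMs: 1200 });
console.log(result.status); // "success"
```

The narrow interface (one function, one result type) is what makes the module deep: an agent can exercise validation, encoding, and success paths without tracking dependencies across many shallow exports.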
Push versus pull strategies for coding standards reveal important trade-offs. Push strategies include instructions in configuration files (e.g., CLAUDE.md) that are always sent to agents, while pull strategies make information available for retrieval when needed. For implementation phases, coding standards should be available via pull mechanisms, allowing agents to reference them as needed. For automated review, coding standards should be pushed to the reviewer alongside the code, enabling direct comparison. Different models can be employed for different phases: faster models like Claude Sonnet for implementation, and more capable models like Claude Opus for review requiring deeper analysis.
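One way to picture this split is a per-phase configuration; this is a hedged sketch with illustrative file paths and model identifiers, not any specific tool's schema:

```typescript
type Phase = "implementation" | "review";

interface PhaseConfig {
  model: string;            // illustrative model identifier
  pushedContext: string[];  // always included in the prompt
  pullableContext: string[];// retrievable on demand via tools
}

const phases: Record<Phase, PhaseConfig> = {
  implementation: {
    model: "claude-sonnet",
    pushedContext: [],
    pullableContext: ["docs/coding-standards.md"], // pulled only when needed
  },
  review: {
    model: "claude-opus",
    pushedContext: ["docs/coding-standards.md"],   // pushed alongside the code
    pullableContext: [],
  },
};

console.log(phases.review.pushedContext.length); // 1
```

The asymmetry is the point: implementation keeps the context lean (smart zone), while review receives the standards up front so violations can be checked directly against the diff.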
Sand Castle, a TypeScript library for parallel agent loops in Docker sandboxes, enables true parallelization. A planner agent analyzes the backlog and blocking relationships, determining which issues can be worked concurrently. Each issue receives its own sandbox (git branch) and dedicated implementation agent. A merger agent combines completed work and resolves type conflicts and test failures, enabling systematic parallel execution across multiple independent vertical slices.
5. Discussion
The framework presented synthesizes traditional software engineering principles with LLM-specific operational constraints, suggesting that the fundamental challenge is not replacing established methodologies but rather adapting them to accommodate AI agent characteristics. The emphasis on maintaining code visibility and avoiding specs-to-code approaches—where specifications are iteratively refined while ignoring actual code—reflects a critical insight: code remains the battleground, and developers must maintain understanding of and control over the codebase rather than treating AI as a black-box compiler.
Several areas warrant further investigation. The precise threshold for smart zone degradation likely varies across models and may shift with architectural improvements to attention mechanisms. Systematic empirical measurement of performance degradation curves across different models and task types would provide more precise guidance for task sizing. Additionally, the optimal balance between human oversight and autonomous execution likely varies across project types, team compositions, and domain complexities.
The tension between maximizing AI autonomy for productivity gains and maintaining sufficient human oversight for quality and architectural integrity represents an ongoing calibration challenge. The framework presented suggests specific touchpoints (planning, code review, QA) where human judgment proves essential, but the optimal frequency and depth of these interventions likely depends on contextual factors including codebase maturity, team experience, and domain criticality.
6. Conclusion
This analysis presents a comprehensive framework for integrating LLM-based agents into software development workflows while preserving code quality and architectural integrity. The key contributions include: (1) identification of the smart zone threshold and its implications for task decomposition; (2) structured methodologies for establishing shared understanding through systematic interviewing; (3) vertical slicing and parallelization strategies enabling concurrent agent execution; (4) test-driven development as essential quality guardrail for autonomous implementation; and (5) strategic human touchpoints in planning and review phases.
Practical takeaways emphasize that software engineering fundamentals remain applicable but require systematic adaptation to LLM constraints. Task sizing must respect context limitations, architectural decisions should favor deep modules over shallow ones, and comprehensive testing infrastructure establishes the quality ceiling for AI output. The framework enables significant productivity gains through parallel autonomous implementation while maintaining human control over architectural decisions and quality standards. Organizations implementing AI-assisted development should focus on establishing robust testing infrastructure, defining clear vertical slices with explicit dependencies, and maintaining disciplined human oversight at planning and review phases rather than attempting to automate the entire development lifecycle.
Sources
- Original video: -QFHIoCo-Ko (YouTube), by the original creator
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.