Local AI Coding Workflow


By Sean Weldon

Local AI Coding in 2026: Practical Workflows with Qwen Models and Claude Code

TL;DR

Local AI coding workflows in 2026 combine Qwen 3.5 models (35B parameters), LM Studio linking, and Claude Code integration to achieve practical productivity on consumer hardware like the RTX 5090 with 32GB VRAM. Full-stack applications can be built in roughly 30 minutes at 100-140 tokens per second thanks to mixture-of-experts architecture, though developers must raise context windows to 80,000+ tokens and accept more debugging than cloud-based models require.

What Hardware Do I Need for Local AI Coding in 2026?

I'm running an RTX 5090 with 32GB VRAM that achieves 100-140 tokens per second with Qwen 3.5, a model with 35 billion parameters. This performance comes from the mixture-of-experts architecture: only a subset of parameters activates for each request, which makes large models viable on consumer hardware.

The critical requirement is that the model must fit entirely in GPU memory. Spilling model parameters into system RAM causes severe performance degradation, because data must constantly shuttle between RAM and the GPU. This constraint matters most for agentic coding with large context windows, where attention compute scales quadratically with context length.

Just because you can technically fit a model on your system by putting some parameters in system RAM doesn't mean it's usable in practice. You have to experiment to find which model truly fits on your GPU at acceptable speeds before you can build real solutions with it.
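A back-of-the-envelope estimate helps before downloading anything. The sketch below is illustrative only: the quantization level, layer count, KV-head count, and head dimension are assumptions for a hypothetical 35B model, not the actual configuration of any shipped model.

```python
# Rough VRAM estimate: quantized weights plus the KV cache for a long context.
# All parameters below are illustrative assumptions, not measured values.

def model_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, fp16 elements."""
    return 2 * context_tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical 35B model at 4-bit quantization with an 80k-token context.
weights = model_vram_gb(35, 4.0)
cache = kv_cache_gb(80_000, layers=40, kv_heads=4, head_dim=128)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB, "
      f"total ~{weights + cache:.1f} GB")
```

If the total lands near your card's VRAM ceiling, expect spillover into system RAM and the slowdown described above; the only reliable check is loading the model and measuring throughput.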

How Does LM Studio Linking Work Across Devices?

LM Studio's linking functionality enables encrypted connections between devices to expose local models across machines. Setup requires opening LM Studio on both devices—linked models then appear automatically in the interface without additional configuration.

I run models on my high-performance GPU Ubuntu machine while accessing them from my MacBook seamlessly. GPU utilization spikes briefly when processing queries through linked connections, confirming that compute remains on the host machine. This cross-device setup allows lower-power devices to benefit from high-performance GPU inference without local hardware requirements.

The encrypted connection ensures security while maintaining the convenience of cloud-based workflows. Privacy-focused developers gain the benefits of distributed access without sending data to external servers.
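Linking itself needs no manual endpoint configuration, but a quick sanity check from the client machine confirms the host is reachable. This assumes the host has LM Studio's local server enabled on its default port 1234 and visible on the LAN; the IP address and model name below are placeholders.

```shell
# List the models the GPU host is serving (192.168.1.50 is a placeholder IP):
curl http://192.168.1.50:1234/v1/models

# Send a small request through the OpenAI-compatible endpoint from the laptop;
# GPU utilization should spike on the host, not the client:
curl http://192.168.1.50:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-3.5", "messages": [{"role": "user", "content": "hello"}]}'
```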

How Do I Connect Claude Code to Local Models?

LM Studio exposes three API endpoints: native chat interface, OpenAI-compatible endpoint (/v1/chat/completions), and Anthropic-compatible endpoint (/v1/messages). Claude Code connects to local models via environment variable overrides—specifically ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY.
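A minimal configuration sketch, assuming LM Studio's server runs on its default port 1234 on the same machine and does not validate the API key (so any placeholder value works):

```shell
# Point Claude Code at the local LM Studio server instead of Anthropic's API.
export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_API_KEY="lm-studio"   # placeholder; the local server ignores it

# Launch Claude Code as usual; requests now go to the local model.
claude
```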

This configuration enables drop-in replacement of cloud services with local inference. However, Claude Code injects a substantial system prompt that consumes 3,000+ tokens before any user interaction begins. This overhead significantly slows response times compared to empty chat sessions.

Language models adopt the behavior described in the system prompts they receive. Even though I'm running a Qwen model, the system prompt says it is Claude, so the model believes it is Claude. Models don't always have self-awareness of what they actually are; the system prompt they are fed largely dictates their behavior.

What Context Window Settings Work Best?

With the default 4,000-token context window, Claude Code hangs indefinitely, with no clear error message, once its system prompt plus file ingestion exceeds the available space. Set the context window to at least 80,000 tokens to accommodate Claude Code's system prompt and file ingestion without hitting limits.

I found that a 200,000-token context window lets longer tasks run to completion in exchange for reduced speed. Context overflow behavior is configurable in LM Studio: truncating the middle of the conversation history preserves the initial exploration while freeing space for new interactions.

Claude Code can proactively summarize conversation history to manage limited context windows. This feature helps maintain continuity when working with constrained local hardware that can't match cloud-based context capacities.
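The middle-truncation strategy can be sketched in a few lines. This is an illustrative implementation, not LM Studio's actual algorithm; the whitespace-split token count is a deliberate simplification.

```python
# Middle-truncation sketch: when history exceeds a token budget, drop messages
# from the middle so the opening exploration and the latest turns both survive.

def count_tokens(msg: dict) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-split word.
    return len(msg["content"].split())

def truncate_middle(messages: list[dict], budget: int) -> list[dict]:
    if len(messages) <= 4:
        return messages  # nothing sensible to cut from the middle
    if sum(count_tokens(m) for m in messages) <= budget:
        return messages
    head, tail = messages[:2], messages[-2:]   # keep first and last turns
    middle = messages[2:-2]
    # Drop middle messages oldest-first until the history fits the budget.
    while middle and sum(count_tokens(m) for m in head + middle + tail) > budget:
        middle.pop(0)
    return head + middle + tail
```

In practice the head would hold the system prompt and initial repository exploration, and the tail the turns the agent is actively working on.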


What Is the Sub-Agent Strategy for Local AI Coding?

Creating sub-agents for each task generates fresh context windows for individual pieces of work, maximizing limited context window capacity. This sub-agent approach is recommended for local AI coding to work around context limitations more effectively than single-agent approaches.

Each sub-agent starts with a clean slate, avoiding the context pollution that accumulates in long-running coding sessions. Bypass all permissions mode can be safely used within dev containers to allow autonomous code generation without manual approval for each action.

Dev containers provide isolation for safe execution of Claude Code in bypass mode. This combination of sub-agents and containerization enables productive workflows despite the hardware constraints of local inference.
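The essence of the sub-agent pattern is that every task starts from a fresh message list instead of accumulating into one long-running context. A minimal sketch, where `run_model` is a hypothetical stand-in for a real completion call (e.g. a POST to the local server):

```python
# Sub-agent sketch: each task gets its own fresh context window.

def run_model(messages: list[dict]) -> str:
    # Placeholder: a real implementation would call the local model's
    # chat-completions endpoint with this message list.
    return f"completed: {messages[-1]['content']}"

def run_subagent(task: str, system_prompt: str) -> str:
    # Fresh context per task: no history carried over from other tasks,
    # so the full window is available for this task's files and reasoning.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task},
    ]
    return run_model(messages)

tasks = ["write the schema", "build the proxy route", "add tests"]
results = [run_subagent(t, "You are a focused coding sub-agent.") for t in tasks]
```

The parent agent only needs to hold the task list and each sub-agent's summary, not the full working context of every task.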

How Do I Reduce Bugs and Hallucinations?

Local models produce more bugs and hallucinations than state-of-the-art cloud models like Claude Opus. Hard-coded values appear when specifications aren't detailed—I've seen models insert incorrect GPU models and maximum context windows that require manual fixes.

Providing detailed API documentation as markdown grounds models in specific endpoint behavior and improves accuracy. Enabling agents to call backend APIs directly allows self-assessment of API call correctness and output format alignment.

Passing explicit instructions for API integration reduces hallucination and improves bug-free code generation. Even with these optimizations, local models still require more debugging than cloud alternatives. The trade-off is privacy and control over your development environment versus the superior accuracy of cloud-based models.
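Grounding can be as simple as inlining the endpoint documentation into the prompt so the model generates calls against documented behavior instead of guessing. A sketch, where the docs content is a hypothetical example:

```python
# Grounding sketch: inline API docs (as markdown) into the coding prompt.

API_DOCS = """\
## GET /v1/models
Returns a JSON list of loaded models.

## POST /v1/chat/completions
Body: {"model": str, "messages": [{"role": str, "content": str}]}
"""

def grounded_prompt(task: str, docs: str) -> str:
    # Explicit instruction plus the documentation keeps the model from
    # inventing endpoints, fields, or response shapes.
    return (
        "Use ONLY the endpoints documented below. Do not invent fields.\n\n"
        f"{docs}\n"
        f"Task: {task}\n"
    )

prompt = grounded_prompt("build a proxy route for chat completions", API_DOCS)
```

Pairing this with the ability to actually call the backend lets the agent verify its own request and response handling instead of leaving hard-coded guesses in the code.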

What Can I Actually Build with Local AI Coding?

I built a full-stack application, a Next.js and TypeScript dashboard plus a proxy backend, in approximately 30 minutes using local models. This demonstrates that full-stack development is achievable with local models, though the workflow requires ongoing debugging and refinement.

Response times are much slower with Claude Code connected to local models compared to cloud-based Claude Opus, especially as repository size grows. The 30-minute build time shows feasibility but represents a significantly longer timeline than equivalent cloud-based workflows.

The local AI coding workflow is significantly more powerful than it was two years ago, but it is not equivalent to state-of-the-art cloud models. Privacy-focused developers should adopt it despite the performance trade-offs, as the capability gap continues to narrow with each model generation.

What the Experts Say

"This is the benefit of a mixture of expert model which is very common with modern local AI systems."

Mixture-of-experts architecture represents a fundamental shift in how local models achieve performance on consumer hardware. By activating only relevant parameters per query, these models deliver practical inference speeds that would be impossible with traditional architecture.

"Just because you can fit a model on your system by putting some of the parameters on your system RAM doesn't mean that it's actually going to be usable in practice."

This insight highlights the critical difference between theoretical compatibility and practical usability. Many developers waste time attempting to run models that technically fit in memory but perform too slowly for productive coding workflows.

"They don't always have self-awareness of the model that they actually are. The system prompt that they are fed really dictates their behavior very much."

Understanding how system prompts shape model behavior is essential for debugging unexpected responses. Local models adopt the identity and behavior patterns described in their system prompts, regardless of their underlying architecture.

Frequently Asked Questions

Q: What GPU do I need for local AI coding in 2026?

An RTX 5090 with 32GB VRAM achieves 100-140 tokens per second with Qwen 3.5 (35B parameters). Models must fit entirely in GPU memory; using system RAM causes severe performance degradation. Experiment with different models to find which truly fits your GPU at acceptable speeds for production coding.

Q: How fast are local AI models compared to cloud services?

Local models with Qwen 3.5 achieve 100-140 tokens per second on RTX 5090 hardware, significantly slower than cloud-based Claude Opus. Response times degrade further as repository size grows. Full-stack applications take approximately 30 minutes to build locally, longer than equivalent cloud-based workflows, though local models offer privacy benefits.

Q: Can I access local models from multiple devices?

LM Studio's linking functionality creates encrypted connections between devices, exposing local models across machines. Run models on high-performance GPU machines while accessing them from laptops seamlessly. Setup requires opening LM Studio on both devices—linked models appear automatically without additional configuration.

Q: Why does Claude Code hang with local models?

Claude Code's system prompt consumes 3,000+ tokens before user interaction. Default 4,000 token context windows cause indefinite hangs with no error messaging when the system prompt plus file ingestion exceeds available space. Set context windows to minimum 80,000 tokens to accommodate Claude Code's requirements.

Q: What is the sub-agent strategy for local AI coding?

Creating sub-agents for each task generates fresh context windows for individual work pieces, maximizing limited context capacity. Each sub-agent starts clean, avoiding context pollution from long-running sessions. This approach works more effectively than single-agent approaches for local AI coding with hardware constraints.

Q: How do I reduce hallucinations in local AI coding?

Provide detailed API documentation as markdown to ground models in specific endpoint behavior. Enable agents to call backend APIs directly for self-assessment of correctness. Pass explicit instructions for API integration. Despite optimization, local models still produce more bugs than cloud models, requiring increased debugging.

Q: What context window size should I use with Claude Code?

Set context windows to minimum 80,000 tokens to accommodate Claude Code's 3,000-token system prompt plus file ingestion. Configure up to 200,000 tokens for longer task completion at reduced speed. Enable context overflow truncation to remove middle conversation history while preserving initial exploration context.

Q: Are local AI coding workflows worth it in 2026?

Local AI coding is significantly more powerful than 2 years ago but not equivalent to state-of-the-art cloud models. Full-stack applications are achievable in 30 minutes with increased debugging. Privacy-focused developers should adopt local AI coding despite performance trade-offs, as the capability gap continues narrowing.

The Bottom Line

Local AI coding workflows in 2026 achieve practical productivity through careful hardware optimization, context management, and realistic expectations about performance gaps compared to cloud services.

Privacy-focused developers now have viable alternatives to cloud-based coding assistants, with full-stack applications achievable in 30-minute build times using consumer hardware like the RTX 5090 with 32GB VRAM. The sub-agent strategy, proper context window configuration (80,000+ tokens), and detailed API documentation reduce the extra debugging burden that local models carry. While local models produce more bugs and hallucinations than Claude Opus, the mixture-of-experts architecture in models like Qwen 3.5 delivers 100-140 tokens per second, fast enough for productive coding workflows.

Start by experimenting with which models truly fit on your GPU at acceptable speeds, configure LM Studio linking for cross-device access, and adopt the sub-agent approach to maximize your limited context windows. The capability gap between local and cloud AI coding continues narrowing with each model generation, making now the right time to establish privacy-preserving development workflows that keep your code and data under your control.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
