Stanford CS146S: The Complete Guide to AI Coding Agents (9-Part Series)

Stanford CS146S unpacked — coding LLMs, MCP, AI IDEs, agent patterns, security, code review, and post-deploy ops in a 9-part field guide.

2026-03-28 By Sean Weldon

Stanford's CS146S is one of the first university courses dedicated entirely to AI-assisted software development. Across 9 weeks, the course traces the full lifecycle: from how LLMs work under the hood, through the tools and protocols that make them useful, to the security, testing, and operational challenges of deploying agent-built software in production.

I processed the full lecture playlist through my YouTube Scout pipeline and distilled each lecture into a standalone chapter. What follows is the complete course, condensed.

Chapter 1: Introduction to Coding LLMs and AI Development

Large language models are multi-stage engineering pipelines that transform raw internet data into professional AI assistants through pre-training, supervised fine-tuning, and reinforcement learning.

How LLMs Are Built

Pre-training builds the foundational model on internet-scale datasets. After filtering, representative datasets like FineWeb contain 44 terabytes of text - approximately 15 trillion tokens. LLMs are autoregressive neural networks with billions to trillions of adjustable parameters (GPT-2: 1.6 billion; GPT-4: reported 1.8 trillion). The result is a lossy compression of the internet that captures syntax but loses nuance and the distinction between truthful and merely common information.

Supervised fine-tuning is computationally cheaper (hours versus months) and replaces raw internet text with curated datasets of high-quality dialogue examples. The resulting assistant is fundamentally a simulation of an average highly skilled human labeler. SFT teaches what to say but does not teach reasoning.

Reinforcement learning teaches models how to think by practicing problem-solving in verifiable domains like mathematics and coding. Models learn to create internal monologue through chain of thought, distributing reasoning across multiple tokens. For subjective domains, RLHF trains a reward model to replicate human rankings - but models learn to exploit the reward system rather than optimize for genuine quality.

Prompt Engineering Techniques That Matter

In-context learning (k-shot): Provide 1-5 examples within the prompt for instant adaptation
Zero-shot chain of thought: "Let's think step by step" dramatically improves accuracy on logical tasks
RAG: Forces use of current facts instead of stale training memory
Self-consistency: Ask the same question 5 times, take the majority answer
Reflection: Feed error messages back for self-critique and revision

The Swiss Cheese Model

LLMs exhibit unpredictable capability gaps. They solve Olympiad-level math but fail to compare 9.1 and 9.9. Capability gaps don't align across techniques, so multiple approaches must compensate for specific limitations. Models must be treated as stochastic tools where work is always verified.

Chapter 2: Turning LLMs Into Autonomous Doers (MCP)

The Model Context Protocol addresses a fundamental limitation: LLMs cannot access real-time, dynamic data. MCP replaces M*N custom integrations with a unified M+N standard - the "USB-C of the AI world."

Architecture

MCP's four layers: Host (user-facing app), Client (stateful session manager), Server (lightweight tool wrapper), and Tools (actual functions). Built on JSON RPC 2.0, language-agnostic, and the LLM never handles secrets - security is abstracted by the server-client relationship.

How It Works

The client calls tools_list to discover available tools. Servers respond with JSON describing tool names, summaries, and input schemas. The host injects these into the LLM's context. When the model emits a tool call, the client translates it into an MCP request, the server executes, and returns results.

The Ecosystem

Reference implementations span web fetch, git, file systems, and memory. Enterprise integrations include Atlassian, GitHub, Cloudflare, Stripe, Postgres, MongoDB, AWS, and Azure. Community servers extend to blockchain, gaming, browser automation, government data, and specialized legal domains.

OAuth 2.0 and Security

Early MCP lacked standardized auth, forcing local execution. OAuth 2.0 adoption enabled dynamic client registration, automatic endpoint discovery, and short-lived token management. Users authorize once; the system guarantees access only to explicitly permitted resources within defined scopes.

Current Limitations

LLMs struggle with large numbers of available tools - reasoning degrades with too many options. Verbose API results overload context windows. Developers must design AI-native APIs with simplified, curated datasets rather than exporting rigid legacy interfaces.

Chapter 3: AI IDEs - From Execution to Management

Autonomous coding agents shift engineers from execution to management, enabling 6-12x productivity gains through delegation, specification writing, and context management.

Synchronous vs. Asynchronous Workflows

Synchronous tools (Copilot, Cursor) deliver responses in 20-90 seconds. Asynchronous agents (Devon, Codeex, Jules) run for 10 minutes to hours. The 30-second to 5-minute "semi-async zone" destroys productivity - too long to watch, too short to context-switch.

Delegation as Engineering

Treat AI as junior coding partners needing explicit instructions. Instead of "add unit tests," specify "add unit tests for zero input, negative values, and use mock service v2 structure." Multi-hour tasks spanning database, backend, and frontend require explicit checkpoints to prevent building on incorrect assumptions.

Agents excel at tasks engineers typically defer: bisecting old commits, updating documentation, refreshing READMEs after shipping. They resolve analysis paralysis by implementing competing approaches in parallel.

The Prompt as Source Code

The specification becomes the new source code. Current practice inverts traditional development: engineers craft perfect prompts then discard them while version-controlling the generated code. This "shreds the source and version controls the binary."

Context Window Failure Modes

Four critical failures at scale:

Poisoning: Early hallucinations cause fixation on bad data
Distraction: Past 100K tokens, agents repeat past actions instead of generating new plans
Confusion: Oversized tool lists cause irrelevant API calls
Clash: Contradictory information causes 39% accuracy drops

The counterintuitive lesson: longer context is not always better. When poisoning occurs, a fresh prompt is faster than conversation repair.

Chapter 4: Coding Agent Patterns

Success now depends on mastering context engineering rather than coding syntax. Stock App achieved 10.6 successful PRs per person per week versus an industry average of 1.

Context Engineering > Prompt Engineering

Prompt engineering - crafting the perfect single instruction - fails for autonomous agents executing multi-step plans. Context engineering defines the system's entire knowledge base, behavioral constraints, and guardrails. The code repository becomes a shared workspace for humans and agents. Natural language artifacts are now as critical as code itself.

Progressive Hierarchical Development

Architecture Phase: Human architects define business requirements
Design Phase: Design agent drafts documents; human review is mandatory before approval
Implementation Phase: Agent converts approved designs into task series
Backstop Phase: Committed failing tests act as objective constraints - agents cannot commit unless tests pass
Review Phase: Agents update design docs, READMEs, and schemas for subsequent agents

Key Workflows

Explore-Plan-Code-Commit: Instruct agents to read files and form plans before coding. Trigger phrases ("think," "think hard," "ultra think") signal progressively higher computational budgets.

Test-Driven Agentic Development: Write comprehensive tests first, confirm they fail, commit tests, then write code to make them pass.

Verification Loop: Two parallel instances - one writes code, a second (different model) reviews. Stock App found Gemini significantly better at finding security issues than the code-writing Claude instance.

The CLAUDE.md File

The agent's personalized instruction manual, automatically pulled into context at session start. Documents project rules, build processes, test execution, and repository etiquette. Must be treated as first-class infrastructure - when it drifts out of sync, agents rely on outdated assumptions.

Chapter 5: Modern AI Terminal

AI developer tools are shifting from code suggestion to autonomous agent workflows. The market velocity: Cursor's parent valued at $9.9 billion, Bolt.new reached $40M ARR in 5 months, Replit adds $1M ARR every 5-6 days.

Seven Product Principles

Usability: Start with familiar interfaces. Optimize for flow with five-minute time-to-value.

Control: Chat as first-class citizen - code is a translation layer for human intent. Configuration flexibility for power users. MCP as the extensible tool ecosystem giving LLMs "eyes and hands."

Speed: Rapid feedback loops. Full autonomy with variable human-in-loop involvement.

Strategic vs. YOLO Agents

The Strategic Agent enforces tight guardrails, requires permissions, generates 14-step plans with validation layers. The YOLO Agent disables planning, allows unrestricted execution, skips validation.

In an NFL predictor test, the Strategic Agent failed despite flawless planning - real-world data URLs returned 404s. The YOLO Agent succeeded by pivoting to stable data sources. If prototyping for a pitch deck, speed is everything. If pushing a security patch, correctness is the only metric.

The Leadership Imperative

A 97% acceptance rate of agent suggestions means developers already commit AI-generated code directly to production. Organizations must explicitly codify risk tolerance into default agent profiles. The tools exist to optimize for either speed or correctness. Leadership must decide which is the default posture.

Chapter 6: AI Testing and Security

Agentic AI systems create amplified security threats through new attack vectors while operating at machine speed.

The Amplified Attack Surface

Agentic systems inherit all classic vulnerabilities (SQL injection, SSRF, XSS) but with catastrophic amplification because the AI autonomously decides when to execute actions. Remote code execution is the most catastrophic risk: tricking a code-executing agent into running malicious instructions grants access to the environment, host network, and file system.

Real Attack Demonstrations

SSRF: Instructed a news agent to read from internal IP 192.168.0.25, turning the agent into an internal spy
Credential theft: Instructed code interpreter to search for high Shannon entropy strings, then base64-encoded results to bypass safeguards
CVE-2025-3773: Prompt injection hidden in source files that instructs Copilot to enable "YOLO mode" - disabling all confirmations for shell execution. This leads directly to RCE on developer machines.

Context Rot

LLMs do not handle the 10,000th piece of information as well as the first. Performance degrades with task complexity and irrelevant distractor information. Counterintuitively, shuffling content - destroying logical flow - improves model performance by spreading key facts more evenly.

Code Security Analysis: Catastrophic Failure Rates

Claude Code achieved only 22% true positive rate for IDOR bugs and 5% for SQL injection. Codex scored zero on both SQL injection and XSS. False positive rates were 82-86%. Identical scans of the same code produce wildly different results across runs.

Defense in Depth

Prompt hardening: Treat prompts as source code with XML/JSON separation of user input from system instructions
Content filtering: Real-time layers blocking injection, tool misuse, and data leakage
Tool input sanitization: Never trust input from agents - strict validation regardless of LLM confidence
Sandboxing (non-negotiable): Strong isolation with containers, restricted networking, blocked metadata endpoints

Chapter 7: Modern Software Support (AI Code Review)

AI excels at consistency and catching routine issues. Humans excel at context, architecture, and tribal knowledge. A well-tuned AI already matches human effectiveness.

The Numbers

Human peer review achieves 55-60% error detection, outperforming unit testing (25%) and integration testing (45%). Formal review reduces defects by 80%. Google's Auto Commenter achieved 52% action rate, matching human efficacy.

The Gold Zone Framework

Maps AI effectiveness across two dimensions:

Gold zone: AI catches issues AND humans welcome feedback (simple bugs, performance, security, style). AI accuracy: 70-90%.
Human-only zone: Tribal knowledge, institutional memory, nuanced business logic
Annoyance zone: Issues AI can technically find but developers find pedantic
Blind spot zone: Neither reliably catches these

The Paradox of Less

Suppressing 17 pedantic rules raised acceptance rates from 54% to over 80%. By saying less, the tool became more useful. Dynamic suppression mechanisms filter deprecated rules and low-value recommendations to prevent trust erosion.

Beam Search vs. Greedy Search

Beam search for PR analysis explores multiple paths, tripling posting frequency. Greedy search for IDE integration achieves sub-second latency. Choose the algorithm based on the feedback timing requirement.

Chapter 8: Automated UI and App Building

AI-powered development tools eliminate manual infrastructure configuration as the primary bottleneck, shifting developer value from execution to idea validation.

The Complexity Arc

LAMP stack to MERN to Jamstack to serverless created dependency diagrams resembling city maps. AWS serverless alone required IAM, API Gateway, Lambda, DynamoDB, Cognito, and CloudFormation before writing application code. This manual configuration was an economic tax on innovation.

The AI-Powered Wave

Modern platforms (Lovable, Replit, Base, Cursor, Claude, Vercel v0) generate production-grade codebases from natural language. Unlike low-code predecessors, these output clean, extensible React and Next.js components. MVPs ship in days instead of months.

Stream Manipulation

The critical innovation: intercepting code generation in real-time to fix framework errors before users see them. Large creative models generate; small fast models police syntax. The specialized v0 1.5 achieves 93.87% error-free generation versus Claude Opus at 78% and Gemini 2.5 Flash at 60%.

The Fundamental Shift

Technical complexity no longer constrains building new products. The new bottleneck is the quality of ideas themselves. The market will discover flawed ideas at unprecedented speed. The ability to write code becomes commoditized; the ability to determine what code should be written becomes paramount.

Chapter 9: Agents Post-Deployment

Operations management has progressed from traditional sysadmin through SRE to emerging AI-native operations.

Why Traditional Ops Failed

System administrators deployed code and maintained uptime, but scaling required proportional hiring. Developers optimized for velocity while ops prioritized stability, creating direct goal opposition and trench warfare.

SRE's Core Innovations

50% toil cap: SREs spend maximum 50% on operational work. Excess returns to developers, forcing them to own consequences of unstable code.
Error budget framework: Rejects 100% availability. Teams strategically spend budgets on feature launches, balancing innovation with stability.
Pre-written playbooks: 3x improvement in Mean Time to Repair versus improvising under pressure
Alert limits: 2 actionable events per 8-12 hour shift prevents fatigue

The Modern Complexity Wall

Coding is only 30% of engineer time; running code in production consumes the challenging 70%. A single user request can traverse 10+ different teams' code. Human operators cannot synthesize logs, metrics, traces, runbooks, Slack conversations, and tribal knowledge fast enough.

AI-Native Operations

Single AI models fail due to "irreducible interdependence" - no individual model masters 50+ microservices simultaneously. Multi-agent specialist systems solve this: database agents, infrastructure agents, and trace agents test hypotheses concurrently, converging on root causes faster than human teams.

AI production engineers build dynamic knowledge graphs, respond autonomously to alerts, and create just-in-time runbooks. They target handling 70% of operational grunt work, freeing humans for system design and architecture.

The Limiting Factor

AI effectiveness remains bounded by data quality. Incomplete monitoring prevents AI from finding answers. Robust observability infrastructure remains essential regardless of AI capabilities.

The Through-Line

Across all 9 weeks, one theme dominates: the developer role is being restructured, not replaced. The skills that matter are shifting from execution (writing code, configuring infrastructure, manual testing) to management (context engineering, specification writing, risk calibration, agent supervision, and idea validation).

The engineers who will thrive are those who embrace this as a management role - treating AI agents as capable but unreliable junior developers who need clear specifications, objective test constraints, and continuous oversight.

The engineers who will struggle are those clinging to execution as their identity. When AI can write 1,000 lines in 5 minutes, typing speed is no longer a competitive advantage. Judgment is.