Building a Chess Coach — Anant Dole and Asbjorn Steinskog, Take Take Take

Play Magnus built an AI chess coach that explains game moves by combining classical chess engines with LLMs, solving the problem that LLMs hallucinate at che...

2026-05-17 By Sean Weldon

Grounding LLM-Based Chess Coaching Through Structured Analysis Pipelines

Abstract

This paper examines the architecture and methodology behind Play Magnus's AI-powered chess coaching system, which addresses a fundamental limitation of Large Language Models (LLMs): their tendency to hallucinate when performing domain-specific reasoning tasks. The system employs a hybrid architecture that separates structured data extraction from natural language generation, using classical chess engines and specialized detectors to establish ground truth before engaging LLMs solely for translation into human-readable commentary. Evaluation across multiple models demonstrates that Gemini Flash achieves approximately 75% accuracy on chess scenario test cases while maintaining sub-3-second latency requirements for consumer applications. The implementation incorporates an autonomous agent feedback loop that enables continuous quality improvement through human-in-the-loop validation. This approach offers generalizable insights for building reliable AI systems in specialized domains where pure LLM reasoning proves insufficient.

1. Introduction

The intersection of artificial intelligence and domain expertise presents persistent challenges in building systems that combine computational reasoning with natural language explanation. While Large Language Models have demonstrated remarkable capabilities in language understanding and generation, their application to specialized domains requiring precise logical reasoning remains problematic. Chess, as a domain with well-defined rules, deterministic outcomes, and extensive computational history, provides an ideal testbed for examining these limitations and developing architectural solutions.

Play Magnus, a mobile application for iOS and Android platforms, implements an AI-powered game review system that automatically analyzes user chess games. It classifies move quality (brilliant, good, bad) and generates contextual commentary that explains tactical patterns and strategic reasoning, along with insights about the user's play such as accuracy by game phase, an estimated current rating, and opening preparation depth. The system must satisfy competing requirements: computational accuracy in chess analysis, natural language fluency in explanations, and consumer-grade latency of under three seconds for a complete game review.

This synthesis examines the architectural decisions, technical implementations, and evaluation methodologies employed to build a production chess coaching system that leverages both classical chess engines and modern LLMs. The central thesis posits that separating data pipelines from language generation—using deterministic systems for reasoning and LLMs solely for translation—provides a viable architectural pattern for building reliable AI systems in specialized domains. The analysis demonstrates how this approach addresses fundamental limitations in LLM reasoning capabilities while maintaining the natural language interface users expect from AI systems.

2. Background and Related Work

2.1 Historical Evolution of Chess AI

Claude Shannon's 1949 taxonomy established two fundamental approaches to chess computation: Type A engines employ brute-force search through all possible move sequences, while Type B engines use selective, intuitive search to evaluate promising variations. Deep Blue's 1997 victory over Garry Kasparov validated the Type A approach, effectively halting research into Type B architectures until the complexity of Go necessitated alternative methods.

The emergence of AlphaGo and AlphaZero demonstrated that neural network approaches could achieve superhuman performance through selective evaluation rather than exhaustive search. DeepMind subsequently trained transformer architectures on chess positions to predict Stockfish evaluations, achieving grandmaster-level play without explicit programming of chess rules. However, these position-trained transformers, while computationally proficient, lack the language training necessary to explain their evaluations in natural language.

2.2 The LLM Reasoning Problem

Contemporary LLMs demonstrate poor chess performance despite exposure to chess notation and game records during training. The fundamental issue stems from their training objective: language modeling rather than calculation. LLMs hallucinate moves—generating syntactically valid but strategically nonsensical sequences—because they lack the computational substrate for multi-step logical reasoning. Even reasoning models that employ chain-of-thought techniques "still fall apart" when attempting chess calculation, as observed in the source material. This limitation extends beyond chess to any domain requiring precise logical inference rather than pattern-based language generation.

3. Core Analysis

3.1 Architectural Separation of Concerns

The Play Magnus system implements a fundamental architectural principle: the separation of data extraction from language generation. This design explicitly constrains the LLM's role to translation rather than reasoning, addressing the hallucination problem at the architectural level rather than through prompt engineering alone.

The pipeline architecture operates in three distinct phases. First, Stockfish, a classical chess engine, analyzes the entire game to establish ground truth for optimal moves at each position. This deterministic analysis provides the computational foundation that LLMs cannot reliably produce. Second, a context extraction layer employs multiple specialized detectors to identify tactical patterns (forks, pins, skewers), positional themes (doubled pawns, weak squares), threats, and strategic plans. These detectors produce structured data—large JSON objects—that encode the chess-specific reasoning the system requires.

Third, the LLM receives this structured context and performs translation into natural language commentary. As articulated in the source material: "The LLM's job is only to translate this information into English, because we really don't want it to try to figure out too much on its own, because it quickly leads to hallucination." This architectural constraint prevents the model from engaging in independent chess reasoning, confining it to the linguistic task for which it was trained.
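The hand-off between the second and third phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: the `MoveContext` type, its field names, and the prompt wording are all hypothetical, and the engine and detector outputs are hard-coded stand-ins.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class MoveContext:
    """Hypothetical structured context for one move (field names are assumptions)."""
    played: str        # move the user played, in SAN
    best_move: str     # engine's preferred move, in SAN
    eval_cp_loss: int  # centipawns lost versus the engine's choice
    motifs: list[str] = field(default_factory=list)  # detector output, e.g. ["fork"]


def build_prompt(ctx: MoveContext) -> str:
    """Build the phase-3 input: the LLM translates facts, it does not analyze."""
    facts = json.dumps(asdict(ctx), indent=2, sort_keys=True)
    return (
        "Explain the move below to a club player in two sentences.\n"
        "Use ONLY the facts in the JSON. Do not suggest moves or evaluations "
        "that are not listed.\n\n"
        f"{facts}"
    )


# Phases 1 and 2 are stubbed here: in a real system these values would come
# from a chess engine and from the pattern detectors.
ctx = MoveContext(played="Qd2", best_move="Nxe5", eval_cp_loss=180, motifs=["fork"])
prompt = build_prompt(ctx)
```

The point of the design is that every chess claim the model can make is already present in the JSON, so an incorrect statement in the output can be traced to either a detector bug or a translation error.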

3.2 Human-Centric Move Evaluation

A notable innovation in the system involves the integration of the Maia engine, developed at the University of Toronto, which predicts human move probabilities at specific rating levels. This component enables the system to distinguish between moves that are objectively optimal according to Stockfish and moves that are subjectively difficult for human players to identify.

By comparing Stockfish's evaluation with Maia's probability distribution for a given rating level (e.g., 1,500 Elo), the system can identify positions where the best move is "objectively best but subjectively difficult to find." This distinction is crucial for pedagogical effectiveness: a chess coach must explain not only what the best move is, but also why a human player at a particular skill level might have missed it. The integration of both engine types can be read as a synthesis of Shannon's Type A (brute-force) and Type B (selective, human-like) approaches.
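One way to express this comparison in code is shown below. It is a sketch only: the function names, the label strings, and the 0.10 probability threshold are assumptions made for illustration, and the probabilities are invented rather than taken from a real human-move model.

```python
from __future__ import annotations


def is_hard_to_find(best_move: str, human_probs: dict[str, float],
                    threshold: float = 0.10) -> bool:
    """True when few players at the target rating are predicted to play best_move.

    human_probs maps candidate moves to predicted play probabilities at one
    rating level; the 0.10 threshold is an arbitrary illustrative choice.
    """
    return human_probs.get(best_move, 0.0) < threshold


def classify_miss(played: str, best_move: str, human_probs: dict[str, float]) -> str:
    """Label a move so the commentary can adjust its tone."""
    if played == best_move:
        return "found_best"
    if is_hard_to_find(best_move, human_probs):
        return "missed_hard"      # objectively best, but rarely played at this level
    return "missed_findable"      # played often enough at this level to be findable


# Made-up probabilities for a hypothetical 1,500-rated player.
probs_1500 = {"Qd2": 0.55, "Nf3": 0.30, "Nxe5": 0.05}
label = classify_miss(played="Qd2", best_move="Nxe5", human_probs=probs_1500)
```

A label such as `missed_hard` could then be passed to the LLM as one more structured fact, so the commentary can acknowledge that the best move was a difficult one to see.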

3.3 Autonomous Quality Improvement Loop

The system implements a feedback mechanism that enables continuous improvement through human-in-the-loop autonomous agents. When users download games or report problematic commentary, the feedback triggers an automated pipeline: the resulting events are posted to Slack and forwarded into a Claude Code channel configured as an MCP (Model Context Protocol) server.

The autonomous agent executes a commentary triage workflow that investigates the flagged position, modifies prompts, adjusts detector configurations, regenerates commentary, and verifies output quality. Critically, the agent can "ask clarifying questions back to Slack," enabling human domain experts to guide the investigation without directly modifying code. Approved changes can be submitted as pull requests directly from mobile devices, dramatically reducing the iteration cycle for quality improvements.
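The control flow of such a triage loop, including the human approval gate, can be modelled as a small state machine. The stage names and transition rules below are assumptions chosen to mirror the description above; they are not taken from the actual agent configuration.

```python
from __future__ import annotations

from enum import Enum, auto
from typing import Optional


class Stage(Enum):
    INVESTIGATE = auto()   # inspect the flagged position and current output
    REGENERATE = auto()    # adjust prompt or detectors, regenerate commentary
    VERIFY = auto()        # check the new commentary against the complaint
    AWAIT_HUMAN = auto()   # question or proposal is waiting on a human reply
    OPEN_PR = auto()       # approved change is submitted as a pull request
    CLOSED = auto()


def next_stage(stage: Stage, *, verified: bool = False,
               approved: Optional[bool] = None) -> Stage:
    """Return the stage that follows `stage` (hypothetical transition rules)."""
    if stage is Stage.INVESTIGATE:
        return Stage.REGENERATE
    if stage is Stage.REGENERATE:
        return Stage.VERIFY
    if stage is Stage.VERIFY:
        # A failed check sends the agent back to investigate again.
        return Stage.AWAIT_HUMAN if verified else Stage.INVESTIGATE
    if stage is Stage.AWAIT_HUMAN:
        if approved is None:
            return Stage.AWAIT_HUMAN  # no reply yet, keep waiting
        return Stage.OPEN_PR if approved else Stage.INVESTIGATE
    return Stage.CLOSED


# Walk one successful pass through the loop.
path = [Stage.INVESTIGATE]
path.append(next_stage(path[-1]))
path.append(next_stage(path[-1]))
path.append(next_stage(path[-1], verified=True))
path.append(next_stage(path[-1], approved=True))
path.append(next_stage(path[-1]))
```

The property worth noting is that no path reaches `OPEN_PR` without passing through `AWAIT_HUMAN` with an explicit approval.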

This architecture demonstrates a practical implementation of human-AI collaboration where the AI system handles routine investigation and code modification while human experts provide domain knowledge and final validation. The approach addresses a common challenge in production AI systems: maintaining quality as edge cases emerge in real-world usage.

4. Technical Insights

4.1 Model Selection and Latency Trade-offs

The system's consumer-facing requirements impose strict latency constraints: a complete game review should return in under 3 seconds to match user expectations of near-instant feedback. This constraint significantly influences model selection and architectural decisions.

Evaluation across multiple models via OpenRouter reveals distinct trade-offs. Gemini 3 Flash achieves approximately 1 second time-to-first-token and approximately 3 seconds average end-to-end latency while maintaining ~75% accuracy on the 16-scenario evaluation suite. Reasoning models such as Claude with extended thinking are described as providing superior quality, although the reported accuracy figure (~60%) is lower than that of Gemini Flash and may reflect different evaluation criteria. Their latency is unpredictable, which makes them unsuitable for immediate post-game review, so the team reserves them for future chat-based coaching experiences where users expect longer response times. GPT-4o mini offers lower latency but reduced accuracy compared to Gemini Flash.
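The selection logic implied here (meet the latency budget first, then maximize accuracy) can be sketched as below. The model names and all numbers are placeholders invented for the example, not the reported measurements, and the choice of a 95th-percentile latency field is an assumption intended to capture unpredictable latency better than an average would.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelStats:
    name: str
    accuracy: float        # fraction of evaluation scenarios passed
    p95_latency_s: float   # 95th-percentile end-to-end latency in seconds


def pick_model(candidates: list[ModelStats], budget_s: float) -> Optional[ModelStats]:
    """Pick the most accurate model whose tail latency fits the budget."""
    eligible = [m for m in candidates if m.p95_latency_s <= budget_s]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m.accuracy)


# Placeholder figures for three hypothetical models.
candidates = [
    ModelStats("model_a", accuracy=0.75, p95_latency_s=2.8),
    ModelStats("model_b", accuracy=0.80, p95_latency_s=12.0),
    ModelStats("model_c", accuracy=0.60, p95_latency_s=1.5),
]
review_model = pick_model(candidates, budget_s=3.0)   # immediate post-game review
chat_model = pick_model(candidates, budget_s=20.0)    # slower chat-style coaching
```

Running the same function with two budgets reflects the idea of using different models for different interaction modes within one application.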

These findings highlight a fundamental tension in production AI systems: the models with the strongest reasoning capabilities often fail to meet consumer application latency requirements. The architectural solution—pre-computing analysis with classical engines and using fast LLMs for translation—represents a pragmatic response to this constraint.

4.2 Evaluation Methodology

The evaluation framework employs 16 chess scenario test cases covering tactical patterns, blunders, and hallucination-limiting situations extracted from real games. The system uses an LLM-as-judge technique to evaluate whether models correctly identify and explain specific chess concepts in each scenario.
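A harness of this kind reduces to a loop over scenarios with a judge deciding pass or fail. The sketch below uses assumed names (`run_eval`, `expected_concept`) and replaces both the generator and the judge with offline stand-ins; in practice both would be LLM calls. For scale, an accuracy of roughly 75% on 16 scenarios corresponds to about 12 passing cases.

```python
from __future__ import annotations

from typing import Callable


def run_eval(scenarios: list[dict],
             generate: Callable[[dict], str],
             judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of scenarios whose commentary the judge accepts."""
    passed = sum(1 for s in scenarios if judge(s["expected_concept"], generate(s)))
    return passed / len(scenarios)


# Stand-ins so the sketch runs offline.
def fake_generate(scenario: dict) -> str:
    """Pretend model: returns commentary stored with the scenario."""
    return scenario["canned_commentary"]


def keyword_judge(expected_concept: str, commentary: str) -> bool:
    """Pretend judge: a keyword match in place of an LLM verdict."""
    return expected_concept.lower() in commentary.lower()


scenarios = [
    {"expected_concept": "fork", "canned_commentary": "The knight fork wins the queen."},
    {"expected_concept": "pin", "canned_commentary": "The bishop is stuck in a pin."},
    {"expected_concept": "skewer", "canned_commentary": "A strong developing move."},
    {"expected_concept": "blunder", "canned_commentary": "This blunder hangs the rook."},
]
accuracy = run_eval(scenarios, fake_generate, keyword_judge)
```

Keeping `generate` and `judge` as injected callables makes it straightforward to rerun the same suite against different candidate models.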

Importantly, the evaluation incorporates domain expert validation as the final quality gate. Both presenters, identified as strong chess players, compare LLM-generated analysis to their own calculation and play to verify correctness. This multi-layered evaluation—automated LLM-based assessment combined with human expert validation—addresses a critical challenge in specialized domains: automated metrics alone may miss subtle domain-specific errors that experts immediately recognize.

The context extraction layer produces comprehensive JSON structures that are "iteratively pruned based on quality metrics." This approach—starting with maximal context and removing elements that don't improve output quality—provides a systematic method for optimizing the information provided to the LLM while avoiding premature optimization that might remove crucial signals.
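The pruning step resembles greedy backward elimination over context fields. The sketch below is one possible reading of it, not the team's procedure: `prune_context`, the `tolerance` parameter, and the toy scoring function are assumptions, and a real score would come from rerunning the evaluation suite.

```python
from __future__ import annotations

from typing import Callable


def prune_context(context: dict, score: Callable[[dict], float],
                  tolerance: float = 0.0) -> dict:
    """Drop each field whose removal does not lower the quality score.

    Fields are tried one at a time in their original order; a field is removed
    only if the score stays within `tolerance` of the best score seen so far.
    """
    kept = dict(context)
    baseline = score(kept)
    for key in list(context):
        trial = {k: v for k, v in kept.items() if k != key}
        trial_score = score(trial)
        if trial_score >= baseline - tolerance:
            kept = trial
            baseline = max(baseline, trial_score)
    return kept


# Toy score: pretend only two fields actually help the commentary.
USEFUL = {"best_move", "motifs"}


def toy_score(ctx: dict) -> float:
    return len(USEFUL & ctx.keys()) / len(USEFUL)


full_context = {
    "best_move": "Nxe5",
    "motifs": ["fork"],
    "clock_seconds": 312,
    "piece_count": 24,
}
pruned = prune_context(full_context, toy_score)
```

Because each trial requires a full evaluation run, the order in which fields are tried and the size of the tolerance are practical tuning decisions rather than fixed rules.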

5. Discussion

The Play Magnus architecture demonstrates a generalizable pattern for building reliable AI systems in specialized domains: use deterministic or specialized models for domain reasoning, and use LLMs exclusively for natural language interface. This approach acknowledges fundamental limitations in current LLM architectures while leveraging their strengths in language generation.

The findings suggest that the transformer architecture itself is not inherently incapable of chess reasoning—DeepMind's position-trained transformers achieve grandmaster-level play. Rather, the issue stems from training objectives: models trained on language lack the computational substrate for multi-step logical reasoning, while models trained on positions lack language capabilities. The architectural separation implemented by Play Magnus effectively combines these complementary capabilities without requiring a single model to excel at both tasks.

Furthermore, the autonomous agent feedback loop represents a promising approach to the quality maintenance challenge in production AI systems. Traditional approaches require manual investigation of each quality issue, creating a bottleneck that limits iteration speed. By automating the investigation and modification process while preserving human oversight through Slack-based interaction, the system achieves rapid iteration without sacrificing domain expert validation.

The latency-quality trade-off observed across models raises important questions about the deployment of reasoning models in consumer applications. While extended thinking and chain-of-thought approaches improve output quality, their unpredictable latency makes them unsuitable for contexts where users expect immediate responses. This suggests that different AI architectures may be appropriate for different interaction modalities within the same application—fast models for immediate feedback, reasoning models for asynchronous or chat-based interactions.

6. Conclusion

This analysis demonstrates that reliable AI systems in specialized domains require architectural patterns that separate domain reasoning from language generation. The Play Magnus chess coaching system achieves production-quality results by using classical chess engines and specialized detectors to establish ground truth, then constraining LLMs to translation tasks rather than independent reasoning. Evaluation shows Gemini Flash achieves approximately 75% accuracy on the 16 chess scenarios with an average end-to-end latency of about 3 seconds, with reasoning models reserved for contexts where users accept longer response times.

The key practical takeaways for AI system builders include: (1) separate data pipelines from language generation when domain reasoning requires precision that current LLMs cannot reliably provide; (2) implement automated evaluation with domain expert validation rather than relying solely on either approach; (3) design autonomous agent feedback loops to accelerate quality iteration while preserving human oversight; and (4) select models based on deployment context, recognizing that latency constraints often dictate architectural decisions in consumer applications.

Future work might explore the application of this architectural pattern to other specialized domains requiring precise reasoning, investigate methods for reducing the latency of reasoning models to enable their use in immediate-feedback contexts, and examine whether fine-tuning LLMs on domain-specific structured reasoning tasks can reduce the need for extensive context extraction pipelines. The success of the Play Magnus system suggests that hybrid architectures combining classical AI techniques with modern LLMs may prove more effective than pure LLM approaches for many production applications.

Sources

Building a Chess Coach — Anant Dole and Asbjorn Steinskog, Take Take Take - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
