Agents on the Canvas in tldraw — Steve Ruiz, tldraw
By Sean Weldon

Abstract
This synthesis examines the integration of AI agents into canvas-based collaborative environments, analyzing a paradigm shift from traditional sidebar assistants to spatially-aware, multi-agent systems. Through investigation of tldraw's evolutionary trajectory—from Make Real's single-agent structured generation (2023) to the Fairies multi-agent orchestration platform and desktop application integration—this work establishes how text-structured outputs, agentic loops, and leader-follower coordination patterns enable effective human-AI collaboration in visual environments. Key findings demonstrate the superiority of structured data generation over vision-based approaches for technical diagrams, the efficacy of canvas-embedded agents with spatial awareness for task coordination, and the viability of local-first architectures that maximize agent capabilities while containing security risks to individual user environments. These developments have significant implications for creative tooling, technical prototyping workflows, and the broader integration of autonomous agents into professional visual collaboration systems.
1. Introduction
The conventional paradigm for AI assistance positions language models as sidebar companions—separate interfaces that generate text or code for human review and subsequent manual integration into primary work environments. This architecture creates a fundamental disconnect between AI output and the spatial, visual contexts in which creative and technical work naturally occurs. Users experience a fragmented workflow, alternating between conversational interfaces and canvas-based tools, with limited capacity for AI systems to observe, understand, or directly manipulate visual artifacts.
The emergence of canvas-based AI integration represents a significant architectural departure from this model, enabling agents to function as collaborative entities with direct manipulation capabilities, spatial awareness, and real-time visual feedback mechanisms. Rather than operating as external consultants providing textual recommendations, these agents exist within the collaborative workspace itself, observing shared context and executing modifications directly on visual artifacts.
tldraw, a London-based startup offering both a free online whiteboard and a Software Development Kit (SDK), provides the technical foundation for this investigation. The platform's React-based component architecture enables runtime hackability and third-party integration, as demonstrated by implementations in Replit's agent canvas and Luma AI's canvas products. This synthesis examines the evolution of AI-canvas integration through three developmental stages: single-agent structured generation (Make Real, 2023), multi-agent orchestration (Fairies platform), and desktop application integration with expanded agent capabilities. The central thesis posits that canvas-embedded agents with spatial awareness and structured output capabilities enable more effective human-AI collaboration than traditional text-based interfaces, particularly for tasks requiring visual feedback, iterative refinement, and coordination among multiple autonomous systems.
2. Background and Related Work
2.1 Vision Model Constraints and Structured Data Generation
Vision models face inherent limitations when generating structured technical content for canvas-based applications. Training data for structured visual elements—including diagrams, charts, and wireframes—is substantially less abundant than text corpora, creating reliability challenges for image-based generation approaches. Furthermore, conflicting conventions within training data introduce systematic ambiguities: Cartesian coordinate systems position the Y-axis origin at the bottom with positive values ascending, while web coordinate systems place the origin at the top with positive values descending. These inconsistencies necessitate alternative approaches for reliable technical diagram generation.
The adoption of text-structured outputs—where models generate textual descriptions of canvas primitives (circles, rectangles, lines) rather than rendered images—addresses these limitations by leveraging the superior performance of language models on text generation tasks. This approach enables more predictable and consistent behavior, though it requires substantial prompt engineering to train models for reliable structured output generation.
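The text-structured approach can be sketched concretely. The shape format below is hypothetical (not the actual tldraw schema): the model emits JSON describing canvas primitives, and the client validates that output before instantiating anything, so a malformed response fails loudly rather than corrupting the canvas.

```typescript
// Hypothetical text-structured canvas output: the model emits a JSON array
// of primitive descriptions, which the client validates and would then
// instantiate through the canvas API.

type Primitive =
  | { kind: "rect"; x: number; y: number; w: number; h: number }
  | { kind: "circle"; x: number; y: number; r: number }
  | { kind: "text"; x: number; y: number; content: string };

// Parse a model response into primitives, rejecting anything malformed.
function parsePrimitives(modelOutput: string): Primitive[] {
  const raw: unknown = JSON.parse(modelOutput);
  if (!Array.isArray(raw)) throw new Error("expected a JSON array");
  return raw.map((item) => {
    const p = item as Record<string, unknown>;
    switch (p.kind) {
      case "rect":
        if ([p.x, p.y, p.w, p.h].every((n) => typeof n === "number"))
          return item as Primitive;
        break;
      case "circle":
        if ([p.x, p.y, p.r].every((n) => typeof n === "number"))
          return item as Primitive;
        break;
      case "text":
        if (typeof p.x === "number" && typeof p.y === "number" && typeof p.content === "string")
          return item as Primitive;
        break;
    }
    throw new Error(`malformed primitive: ${JSON.stringify(item)}`);
  });
}

// Example model output: a rectangle and a label.
const shapes = parsePrimitives(
  '[{"kind":"rect","x":0,"y":0,"w":120,"h":80},' +
  '{"kind":"text","x":10,"y":10,"content":"Login"}]'
);
```

Validation at the parse boundary is what makes structured output practical: the prompt engineering pushes the model toward valid descriptions, and the parser catches the remainder.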
2.2 Agentic Architecture Patterns
The agentic loop framework, adapted from coding agent conventions, structures AI behavior as iterative cycles: output generation → review → refinement until completion criteria are satisfied. This pattern enables progressive improvement and error correction, moving beyond single-shot generation toward more robust, production-ready outputs. In multi-agent contexts, leader-follower orchestration patterns enable task delegation, where an elected leader agent scouts canvas state, creates task allocation lists, and monitors completion and correctness across subordinate agents.
Local-first application architecture prioritizes file-based data storage and offline functionality over cloud-dependent systems. This design philosophy, previously considered idealistic, gains practical relevance when combined with agent integration, as it enables maximum agent capability while containing security risks to individual user environments rather than networked infrastructure.
3. Core Analysis
3.1 Make Real: Early Canvas-AI Integration and Vibe Coding
Make Real (2023) represented a foundational breakthrough in AI-canvas integration, described as "one of the first projects to break containment in AI" by enabling non-technical users to create functional prototypes from canvas drawings. The workflow enabled users to sketch interface designs on the canvas, transmit drawings to vision models, and receive working HTML prototypes with functional code. This process introduced the concept of vibe coding—technical creation without traditional code literacy requirements.
The system supported iterative refinement through annotation mechanisms, where users could draw modifications or add textual prompts directly atop generated outputs, constructing progressive enhancement cycles. However, the implementation revealed fundamental limitations in the keyboard-handoff model: users experienced the interaction as "handing my keyboard to some other AI rather than someone collaborating with me," highlighting the absence of genuine collaborative presence and shared spatial awareness.
3.2 Single-Agent Structured Output Generation
The evolution toward structured output generation addressed vision model limitations through architectural redesign. Rather than generating rendered images, AI agents produced text-structured descriptions that directly instantiated canvas primitives—geometric shapes, lines, text elements—through the platform's API. This approach solved multiple technical challenges simultaneously: it circumvented the scarcity of structured visual training data, eliminated coordinate system ambiguities, and enabled more predictable model behavior through prompt engineering.
Implementation required substantial prompt engineering to achieve consistent, reliable structured outputs. The system evolved from single-shot generation to agentic loops incorporating iterative review and refinement cycles, following established coding agent conventions. This iterative architecture enabled error detection and progressive improvement, producing more robust outputs than single-pass generation while maintaining the direct manipulation benefits of canvas-based interaction.
3.3 Multi-Agent Orchestration: The Fairies Platform
The Fairies platform represents a significant architectural advancement, implementing multi-agent orchestration directly within the canvas environment. Rather than positioning agents in sidebar interfaces, the system places multiple agents—termed "fairies"—directly on the canvas as visible, terminal-like windows. This embedding gives agents genuine spatial awareness: they observe canvas state, monitor each other's work, and coordinate task execution without overlapping efforts.
The implementation employs a leader-follower orchestration pattern for complex task coordination. An elected leader agent scouts the current canvas state, generates a task delegation list, assigns tasks to follower agents, and monitors both completion status and correctness of subordinate work. This architecture enables parallel execution on complex tasks such as wireframe generation from textual descriptions, where multiple agents simultaneously develop different interface sections while maintaining coherent overall design.
The platform demonstrates effective blind-spot prevention through mutual observation: agents monitor each other's activities and avoid duplicate work through shared state awareness. This capability emerges from the spatial embedding architecture, which provides agents with visual context about ongoing work across the canvas. The system is publicly accessible at fairies.tldraw.com for experimentation and evaluation.
3.4 Desktop Integration and Expanded Agent Capabilities
Desktop application integration through an Electron wrapper fundamentally expands agent capabilities by removing web environment security constraints. The architecture implements an HTTP endpoint that accepts JavaScript code from Claude for direct execution against the Document Object Model (DOM) and browser APIs. This design enables bidirectional workflows: agents can visualize code as diagrams, then update source code to match subsequent diagram modifications—a capability impossible within browser security sandboxes.
The expanded runtime access enables agents to add interactivity primitives to static designs, including hover states and click handlers, despite tldraw lacking native support for these features. Agents accomplish this by writing runtime code against the editor's API, inferring missing primitives through creative use of available functionality. The system demonstrates remarkable agent willingness to perform script injection and system file modification, including examples of agents modifying Spotify application files when instructed.
This architecture embodies a sharp tools philosophy: maximizing agent capability while accepting user responsibility for risk management. The local-first, file-based approach contains potential damage to individual user environments rather than networked systems, making previously unsafe operations viable in isolated contexts. As articulated in the source material: "If you really want to maximize the agency in order to maximize what it can do and take the risk and take on that risk, then you just need to hand that to the user and say, good luck."
4. Technical Insights
4.1 Structured Output Architecture
Text-structured outputs provide superior reliability for technical diagram generation compared to vision-based image synthesis. Vision models exhibit limited training data for structured visual content and face conflicting conventions in coordinate systems—Cartesian graphs position Y-axis origins at the bottom, while web coordinates place them at the top. Generating textual descriptions of canvas primitives (circles, rectangles, lines with explicit coordinates) leverages language models' superior text generation capabilities while bypassing vision model limitations.
Implementation requires extensive prompt engineering to achieve predictable structured output formatting. Models must be trained through prompt design to consistently generate valid primitive descriptions, maintain coordinate system conventions, and produce parseable output structures. The investment in prompt engineering yields substantial reliability improvements over direct image generation approaches.
4.2 Multi-Agent Coordination Mechanisms
Multi-agent systems require explicit coordination mechanisms to prevent task overlap and ensure coherent output. The leader-follower pattern implements this through centralized task allocation: a leader agent surveys canvas state, decomposes complex tasks into subtasks, assigns work to followers, and monitors completion. This architecture prevents duplicate work while enabling parallel execution.
Spatial awareness emerges as a critical capability for effective coordination. Canvas-embedded agents with visual access to ongoing work can observe each other's activities, identify coverage gaps, and avoid redundant efforts without explicit inter-agent communication protocols. This implicit coordination through shared visual context can prove more robust than explicit message-passing architectures in dynamic environments, because the canvas itself is always current while messages can go stale.
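Coordination through shared state can be sketched with a simple claim map. This is an illustration of the principle, not the Fairies mechanism: before starting work on a canvas region, an agent checks whether another agent has already claimed it, mirroring the mutual-observation behavior described above.

```typescript
// Shared state standing in for the visible canvas: agents consult it
// before starting work, so duplicate effort is avoided without any
// direct agent-to-agent messaging.
class SharedCanvasState {
  private claims = new Map<string, string>(); // region -> claiming agent

  // Returns true if the claim succeeded, false if the region is taken.
  claim(region: string, agent: string): boolean {
    if (this.claims.has(region)) return false;
    this.claims.set(region, agent);
    return true;
  }
}

const canvas = new SharedCanvasState();
const first = canvas.claim("header", "fairy-1");  // region is free
const second = canvas.claim("header", "fairy-2"); // observed as taken
```

In the canvas setting the "claim" is simply the visible work-in-progress itself, which is why no separate locking protocol is needed.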
4.3 Desktop Runtime Architecture
The HTTP endpoint approach for desktop integration enables arbitrary JavaScript execution against Electron application runtimes. Claude generates code snippets that execute with full DOM access, file system permissions, and browser API availability. This architecture removes web security constraints while containing risk through application-level isolation.
The design enables agents to infer and implement missing UI primitives by writing runtime code against available APIs. When native hover state or click handler support is unavailable, agents generate JavaScript that monitors events and modifies canvas state accordingly, effectively extending the platform's capabilities through runtime programming rather than framework modification.
4.4 Trade-offs and Limitations
The sharp tools philosophy accepts significant security trade-offs in exchange for maximum agent capability. Desktop applications permit script injection, system file modification, and arbitrary code execution—operations that would constitute critical vulnerabilities in web contexts. The local-first architecture contains these risks to individual user environments, but users bear full responsibility for managing potential damage from agent actions.
Structured output generation requires substantial upfront prompt engineering investment and remains sensitive to prompt design quality. Vision-based approaches, while less reliable for technical diagrams, may prove superior for artistic or photorealistic content where structured primitives cannot adequately represent desired outputs.
5. Discussion
The evolution from Make Real's single-agent generation to Fairies' multi-agent orchestration and desktop integration reveals a fundamental shift in human-AI collaboration paradigms. Traditional sidebar assistants position AI as external consultants providing recommendations for manual integration, while canvas-embedded agents function as collaborative peers with shared spatial context and direct manipulation capabilities. This architectural difference proves consequential for workflow effectiveness: agents with visual awareness of work-in-progress can provide contextual assistance, iterate on existing artifacts, and coordinate with other agents through implicit observation rather than explicit messaging.
The superiority of structured output generation over vision-based approaches for technical diagrams suggests broader implications for AI system design. Rather than pursuing general-purpose vision models for all visual generation tasks, specialized architectures leveraging structured intermediate representations may achieve superior reliability for specific domains. This finding aligns with broader trends toward task-specific model architectures rather than universal general-purpose systems.
The sharp tools philosophy embodied in desktop integration raises important questions about appropriate safety boundaries for agent systems. The local-first approach demonstrates one viable model: maximize capability while containing risk through isolation. However, this model assumes sophisticated users capable of managing potential damage—an assumption that may not hold for consumer applications. Future research should investigate graduated capability models that balance accessibility with safety, potentially through permission systems, sandboxing levels, or undo mechanisms that mitigate agent errors.
The practical viability of previously idealistic concepts—local-first architecture, file-over-app design—when combined with agent integration suggests that AI capabilities may resurrect abandoned design paradigms. Concepts dismissed as impractical in purely human-operated contexts may become viable when agents can automate complex coordination tasks that previously required centralized infrastructure.
6. Conclusion
This synthesis establishes canvas-based AI integration as a viable alternative to traditional sidebar assistant architectures, demonstrating superior effectiveness for visual collaboration tasks through spatial awareness, structured output generation, and multi-agent coordination capabilities. The evolution from Make Real's single-agent prototyping to Fairies' multi-agent orchestration and desktop integration with expanded capabilities reveals a progression toward genuine collaborative presence rather than external consultation.
Key technical contributions include the demonstration of structured output superiority over vision-based generation for technical diagrams, the efficacy of leader-follower orchestration patterns for multi-agent coordination, and the viability of local-first architectures that maximize agent capabilities while containing security risks through application isolation. These findings have immediate practical applications for creative tooling, technical prototyping workflows, and professional visual collaboration systems.
Future development should investigate graduated capability models that balance accessibility with safety, explore hybrid approaches combining structured and vision-based generation for different content types, and examine how canvas-based collaboration patterns extend to other spatial computing contexts including virtual and augmented reality environments. The demonstrated effectiveness of spatially-aware agents suggests broader applicability beyond two-dimensional canvases to any collaborative environment where visual context and direct manipulation prove central to human workflow.
Sources
- Agents on the Canvas in tldraw — Steve Ruiz, tldraw - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.