Systematic Evaluation and Implementation of Skills for AI Agent Enhancement
By Sean Weldon
Based on the talk "Skill Issue: How We Used AI to Make Agents Actually Good at Supabase" by Pedro Rodrigues, Supabase
Abstract
This paper examines skills as a structured mechanism for enhancing AI agent capabilities through progressive disclosure of context and systematic workflow definition. Skills represent a folder-based architecture containing instructions and reference files that enable agents to access domain-specific knowledge on demand rather than loading complete context upfront. The analysis distinguishes skills from Model Context Protocol (MCP) tools, establishes evaluation frameworks for testing nondeterministic agent behaviors, and demonstrates implementation through a PostgreSQL security use case. Key findings indicate that skills consume approximately 1,300 tokens when loaded—significantly less than comprehensive tool loading—while progressive disclosure enables efficient context management. The research emphasizes that production deployment requires treating skills as continuously maintained documentation artifacts with systematic evaluation cycles before release, following test-driven development principles adapted for nondeterministic systems.
1. Introduction
The deployment of AI agents in production environments has created fundamental challenges in balancing comprehensive contextual information against limited context window constraints. Agents require access to domain-specific knowledge, procedural workflows, and technical documentation to perform reliably, yet loading all potentially relevant information upfront quickly exhausts available context capacity and degrades performance.
Skills emerge as a structured solution to this challenge, representing folders containing instructions and files that deliver custom information or workflows to agents through progressive disclosure patterns. Unlike traditional documentation or tool descriptions that must be fully loaded into context, skills employ a two-tier architecture that loads minimal essential information initially while maintaining access to comprehensive reference materials on demand.
This synthesis examines the architecture, evaluation methodologies, and production implementation strategies for agent skills. The analysis addresses three central questions: How do skills structure information for efficient agent consumption while maintaining comprehensive coverage? What systematic evaluation frameworks ensure reliable skill performance before production deployment in nondeterministic environments? How should organizations maintain skills as living documentation artifacts that evolve with product changes? The following sections establish theoretical foundations for progressive disclosure, analyze the complementary relationship between skills and MCP tools, present evaluation frameworks for agent behaviors, and demonstrate practical implementation through a security-focused use case.
2. Background and Related Work
2.1 Progressive Disclosure Framework
Progressive disclosure constitutes a design pattern wherein agents initially load minimal essential information, subsequently requesting additional details on demand rather than consuming complete context upfront. This approach addresses the fundamental constraint of limited context windows in large language models while maintaining agent access to comprehensive information resources. As defined in the source material, "progressive disclosure is when not all the information about a subject is loaded straight to context. Instead you just load the exact amount of information that allows the agent to choose to load the rest of the information once it actually needs it."
The pattern manifests in skills through front matter containing name and description fields that load immediately, while detailed reference files remain accessible but unloaded until explicitly requested by the agent. This structure creates a graph-like information architecture analogous to a book with chapters that cross-reference each other, where skill.md acts as an index with links to reference files including additional markdown documents or executable scripts.
2.2 Agent Skills Open Standard and Evaluation Frameworks
The Agent Skills Open Standard defines structured approaches to skill creation and systematic testing. The standard recommends an eval.json structure containing prompts, expected outputs, LLM-as-judge criteria, and tool call assertions. This framework addresses the fundamental challenge that traditional unit testing approaches—which rely on deterministic outputs—prove inadequate for evaluating nondeterministic agent behaviors.
The LLM-as-judge technique employs a separate language model to assess whether outputs satisfy defined success criteria rather than requiring exact output matching. This pattern enables systematic evaluation of agent reasoning and behavior, though it introduces challenges including potential hallucination or misinterpretation of success criteria by the evaluating model.
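In practice, the pattern reduces to prompting a second model with the success criteria and the candidate output, then parsing a verdict. The sketch below is a minimal, provider-agnostic version; `call_judge_model` is a stand-in for any chat-completion call, not an API named in the talk.

```python
# Minimal LLM-as-judge sketch. `call_judge_model` is a caller-supplied
# function wrapping any chat-completion API; the prompt format and
# PASS/FAIL convention here are illustrative assumptions.
def build_judge_prompt(criteria: str, output: str) -> str:
    return (
        "You are grading an AI agent's output.\n"
        f"Success criteria: {criteria}\n"
        f"Agent output:\n{output}\n"
        "Answer PASS or FAIL on the first line, then a one-line reason."
    )

def grade(criteria: str, output: str, call_judge_model) -> bool:
    """Return True when the judge's first line begins with PASS."""
    verdict = call_judge_model(build_judge_prompt(criteria, output))
    first_line = verdict.strip().splitlines()[0].strip().upper()
    return first_line.startswith("PASS")
```

Because the judge is itself a language model, the verdict parsing is deliberately loose (first line only), which is one place the hallucination and misinterpretation risks noted above surface in practice.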
3. Core Analysis
3.1 Skills Architecture and Information Structure
Skills comprise a mandatory skill.md file functioning as an index, supplemented by optional reference files and executable scripts. This architecture enables three distinct information delivery modes: immediate loading of essential metadata through front matter, on-demand loading of reference documentation, and local execution of environment-specific scripts.
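A minimal layout consistent with this description might look like the following; the skill name, reference file, and script names are illustrative, not taken from the talk.

```
postgres-security/
├── skill.md              # mandatory index: front matter plus links to references
├── references/
│   └── rls-checklist.md  # loaded on demand, not upfront
└── scripts/
    └── check_views.sh    # runs locally, tied to the user's environment
```

The front matter of `skill.md` carries only the immediately loaded metadata:

```markdown
---
name: postgres-security
description: Use when creating PostgreSQL views or row-level security policies.
---
```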
The architectural distinction between skills and MCP tools proves significant for implementation decisions. MCP tools execute server-side without requiring local environment configuration, while skill scripts run locally and depend on the specific operating system environment (Linux, macOS, Windows). As the source material emphasizes, "the main difference is that tools don't need an environment to run. The agent can just call a tool... while scripts are loaded into your machine, they run on your local environment and they're tied to whatever environment that you have."
This architectural difference establishes skills and MCP as complementary rather than competitive technologies. Skills provide context and define workflows that exceed the scope of tool descriptions, while MCP tools handle integrations and service interactions. The appropriate framework becomes "MCP versus CLI, not MCP versus skills," with optimal implementations employing both technologies according to their respective strengths.
3.2 Context Efficiency and Progressive Disclosure Mechanics
Quantitative analysis reveals significant context efficiency advantages for skills compared to comprehensive tool loading. Skills consume approximately 1,300 tokens when loaded, substantially less than loading all MCP tools into context. The Supabase MCP server demonstrates this trade-off by exposing approximately 20-29 tools for database operations including table listing, SQL execution, migrations, and database advisor functionality—each requiring description text that accumulates in the context window.
Progressive disclosure enables accumulation of numerous skills in local development environments without severe context penalties, as only front matter (name and description fields) loads initially. However, production environments require different optimization strategies. The source material recommends treating skills as CI/CD artifacts: "keep only necessary skills, maintain them like documentation, and update them when features or workflows change."
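The arithmetic behind this trade-off can be sketched directly. The token counts below are illustrative assumptions (only the roughly 1,300-token figure for a loaded skill comes from the talk):

```python
# Back-of-envelope context accounting for progressive disclosure.
# frontmatter_tokens = name + description only; full_tokens = the whole
# skill body. All numbers here are illustrative assumptions.
skills = {
    "supabase-security":   {"frontmatter_tokens": 60, "full_tokens": 1300},
    "supabase-migrations": {"frontmatter_tokens": 55, "full_tokens": 2400},
    "supabase-branching":  {"frontmatter_tokens": 50, "full_tokens": 1800},
}

# Eager loading: every skill's full body enters the context window.
eager = sum(s["full_tokens"] for s in skills.values())

# Progressive disclosure: only front matter loads upfront; one skill is
# pulled in on demand when the agent decides it is relevant.
lazy = (sum(s["frontmatter_tokens"] for s in skills.values())
        + skills["supabase-security"]["full_tokens"])

print(f"eager: {eager} tokens, progressive: {lazy} tokens")
```

The gap widens as skill count grows, which is why local environments tolerate many installed skills while production favors a curated minimum.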
3.3 Agent Recognition and Loading Mechanisms
Empirical observations indicate that skill description language significantly affects agent recognition rates. Descriptions employing action verbs—particularly "use"—demonstrate higher recognition rates by Claude compared to passive descriptions, suggesting training data patterns around verb usage. The source material notes that "using verbs like 'use' in skill descriptions increases the likelihood of agent recognition and loading compared to passive descriptions."
Three mechanisms ensure skill loading with varying reliability levels: automatic detection via description matching (dependent on agent capability), explicit inclusion of the "use" keyword plus the skill name in prompts (higher reliability), and slash command invocation (100% guarantee). The skills npm package from Vercel facilitates installation through automatic detection of local versus remote repositories, creating symbolic links to the .claude/skills directory for agent discovery.
3.4 Systematic Evaluation Framework
The OpenAI-proposed evaluation framework establishes a systematic cycle: define metrics → create skill → run evaluations → grade results → iterate. This approach adapts test-driven development principles for nondeterministic systems, emphasizing metric definition before implementation.
Evaluations operate at multiple granularity levels including unit tests, integration tests, and end-to-end testing. The Agent Skills Open Standard's eval.json structure standardizes evaluation definition with four components: prompt specifications, expected output descriptions, LLM-as-judge criteria, and tool call assertions. Braintrust emerges as an observability platform that systematically runs evaluations and provides comprehensive visibility into agent behavior during controlled scenarios.
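A single entry in such a file might look like the following; the field names are a plausible rendering of the four components, not a verbatim schema from the standard, and the scenario content is illustrative.

```json
{
  "evals": [
    {
      "prompt": "Create a view over the orders table for the reporting dashboard.",
      "expected_output": "A view is created with security_invoker enabled",
      "judge_criteria": "The generated SQL sets security_invoker on the new view",
      "tool_assertions": [
        { "tool": "execute_sql", "called": true }
      ]
    }
  ]
}
```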
A critical evaluation design requirement involves deterministic setup mechanisms—such as database reset scripts—to ensure consistent starting states across multiple evaluation runs. The source material identifies scenario design as "the most difficult part to create evals," requiring deep understanding of expected agent behaviors and realistic use case coverage. Comparative evaluation (testing with and without skills) proves essential for measuring actual impact on agent behavior rather than assuming effectiveness.
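Putting the pieces together, a comparative run executes the same scenarios with and without the skill installed, resetting state before each attempt. The harness below is a sketch under stated assumptions: `reset_db`, `run_agent`, and `judge` are caller-supplied stand-ins for a real reset script, an agent invocation, and an LLM judge.

```python
# Comparative evaluation sketch. Each scenario is a dict with "prompt"
# and "criteria" keys; reset_db restores a deterministic starting state
# before every run so results across runs are comparable.
def run_eval(scenarios, run_agent, judge, reset_db, with_skill: bool) -> float:
    """Return the pass rate over the scenario set for one configuration."""
    passed = 0
    for scenario in scenarios:
        reset_db()  # deterministic setup, e.g. a database reset script
        output = run_agent(scenario["prompt"], with_skill=with_skill)
        if judge(scenario["criteria"], output):
            passed += 1
    return passed / len(scenarios)

def compare(scenarios, run_agent, judge, reset_db):
    """Pass rates as (without_skill, with_skill) for the same scenarios."""
    return (
        run_eval(scenarios, run_agent, judge, reset_db, with_skill=False),
        run_eval(scenarios, run_agent, judge, reset_db, with_skill=True),
    )
```

The paired numbers make the skill's actual impact visible; a skill whose with-skill pass rate matches the baseline is adding context cost without changing behavior.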
4. Technical Insights
4.1 PostgreSQL Security Implementation Case Study
The practical demonstration reveals a concrete security vulnerability: PostgreSQL views created without the security_invoker flag bypass row-level security (RLS) policies, potentially exposing unauthorized data access. This flag must be explicitly set to transfer RLS policies to views, a requirement introduced in PostgreSQL 15.
The implemented security skill documented three checkpoints: enable RLS on tables in the public schema, use the security_invoker flag when creating views, and run the database advisor to identify vulnerabilities. Live testing demonstrated measurable behavior change: the agent initially created a view without the security flag, but after skill installation, created the same view with the security_invoker flag properly enabled. This empirical result validates the skill's effectiveness in modifying agent behavior for security-critical operations.
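The behavioral difference observed in the live test corresponds to a small change in the generated SQL. The table and view names below are illustrative; the `security_invoker` option syntax is standard PostgreSQL 15+.

```sql
-- Without the flag, the view executes with the privileges of its owner,
-- so RLS policies on the underlying table are bypassed for callers.
CREATE VIEW public.orders_summary AS
  SELECT customer_id, count(*) AS order_count
  FROM public.orders
  GROUP BY customer_id;

-- With security_invoker (PostgreSQL 15+), the view executes with the
-- caller's privileges and the caller's RLS policies apply.
CREATE VIEW public.orders_summary
  WITH (security_invoker = true) AS
  SELECT customer_id, count(*) AS order_count
  FROM public.orders
  GROUP BY customer_id;
```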
4.2 Production Deployment Considerations
Production deployment requires treating skills as living documentation artifacts rather than static resources. The source material recommends creating monitoring jobs to track skill usage over time, removing skills that remain unloaded by users for extended periods. This maintenance approach prevents context accumulation while ensuring deployed skills remain current with product features and workflows.
Export strategies should integrate skills into documentation update workflows, including them in CLAUDE.md or AGENTS.md files as documentation artifacts. This integration ensures skills evolve synchronously with product changes, maintaining accuracy and relevance. The CI/CD artifact treatment model emphasizes minimal, curated production environments contrasted with more permissive local development configurations.
4.3 Tool Search and Progressive Disclosure Extensions
Claude Code's tool search feature extends progressive disclosure principles to MCP tools, enabling similar on-demand loading patterns. However, this capability remains Claude Code-specific rather than standardized across MCP clients, suggesting potential future standardization opportunities. This feature demonstrates convergence between skills and MCP approaches, with progressive disclosure patterns proving valuable across both architectures.
5. Discussion
The analysis reveals skills as a fundamental mechanism for managing the tension between comprehensive agent context and limited window capacity. Progressive disclosure patterns prove essential for scaling agent capabilities beyond initial context constraints, though this approach introduces complexity in ensuring agents correctly identify and load relevant skills when needed.
The complementary relationship between skills and MCP tools suggests a layered architecture for agent enhancement: MCP handles integrations and service interactions requiring server-side execution, while skills provide contextual knowledge and workflow definitions that guide agent reasoning. This separation of concerns enables optimization of each layer according to its specific requirements and constraints.
The systematic evaluation framework addresses a critical gap in agent development practices. Traditional software testing approaches prove inadequate for nondeterministic systems, necessitating LLM-as-judge patterns and comparative evaluation methodologies. However, the potential for evaluating models to hallucinate or misinterpret success criteria indicates ongoing challenges in establishing reliable quality assurance processes for agent behaviors.
Future research directions include standardization of progressive disclosure patterns across MCP clients, development of more sophisticated agent recognition mechanisms that reduce reliance on specific description language patterns, and establishment of industry-wide evaluation standards for agent skills. The treatment of skills as documentation artifacts that require continuous maintenance suggests organizational process implications beyond purely technical considerations.
6. Conclusion
This analysis establishes skills as a structured mechanism for enhancing agent capabilities through progressive disclosure while maintaining context efficiency. The folder-based architecture containing skill.md indexes and reference files enables agents to access comprehensive information on demand, consuming approximately 1,300 tokens when loaded compared to significantly higher costs for comprehensive tool loading.
The systematic evaluation framework proves essential for reliable production deployment, adapting test-driven development principles for nondeterministic systems through LLM-as-judge patterns and comparative testing methodologies. The PostgreSQL security case study demonstrates measurable behavior modification, validating skills' effectiveness for guiding agents through complex technical workflows.
Practical implementation requires treating skills as continuously maintained documentation artifacts integrated into CI/CD workflows, with production environments maintaining minimal curated skill sets while local development permits broader accumulation. Organizations adopting skills should establish evaluation cycles before production deployment, monitor usage patterns for maintenance decisions, and recognize the complementary relationship between skills and MCP tools rather than viewing them as competing approaches. The convergence of progressive disclosure patterns across both skills and emerging MCP client features suggests continued evolution toward standardized approaches for managing agent context and capabilities.
Sources
- Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.