Agentic Search for Context Engineering — Leonie Monigatti, Elastic


By Sean Weldon

Agentic Search for Context Engineering: A Systematic Analysis of Tool Selection and Retrieval Architectures

Abstract

This paper examines the evolution from fixed Retrieval-Augmented Generation (RAG) pipelines to agentic search architectures, where autonomous agents dynamically select appropriate search tools for context retrieval. The analysis establishes that effective context engineering consists primarily of intelligent search tool selection, representing approximately 80% of the engineering challenge. Through systematic examination of failure modes across diverse context sources—including databases, file systems, web resources, and memory stores—this work identifies three primary failure categories: tool selection errors, parameter generation failures, and inadequate tool descriptions. Key findings demonstrate that balanced tool stacks combining specialized tools (low floor) with general-purpose query execution capabilities (high ceiling) outperform single-solution approaches. The research provides actionable design principles for tool descriptions, progressive disclosure mechanisms, and hybrid verification strategies, offering practical guidance for implementing robust agentic search systems.

1. Introduction

The advancement of Large Language Models (LLMs) has necessitated sophisticated mechanisms for augmenting model capabilities with external knowledge sources. Retrieval-Augmented Generation (RAG) emerged as a foundational approach, enabling models to access information beyond their parametric knowledge. However, early RAG implementations employed fixed retrieval pipelines that mechanistically converted user queries to vector searches, executing retrieval operations regardless of contextual necessity. This architectural rigidity created systematic failures in both over-retrieval scenarios, where irrelevant context degraded model performance, and under-retrieval scenarios, where complex information needs required iterative, multi-hop retrieval operations.

Context engineering—defined as the disciplined practice of intelligently selecting and retrieving relevant information from diverse sources—has evolved to address these limitations through agentic search architectures. In these systems, autonomous agents determine whether retrieval is necessary, select appropriate search tools from available options, and execute iterative searches when initial results prove insufficient. This paradigm shift introduces substantial complexity: agents must navigate multiple context sources (local files, databases, web resources, working memory, skill repositories, and long-term storage), each requiring specialized search techniques ranging from vector similarity to SQL queries to shell commands.

This analysis proceeds through systematic examination of agentic search architectures, establishing that search tool selection constitutes the predominant challenge in context engineering. The investigation addresses four primary dimensions: the architectural evolution from fixed to agentic retrieval, the landscape of search tools across diverse context sources, systematic failure modes and mitigation strategies, and practical design principles for balanced tool stack construction. Through this examination, the work provides evidence-based guidance for implementing effective agentic search systems.

2. Background and Related Work

Traditional RAG architectures employed deterministic pipelines where user inputs automatically triggered vector database queries. This approach, while operationally straightforward, suffered from fundamental limitations. First, unnecessary retrieval operations introduced irrelevant context that confused LLMs, degrading response quality. Second, fixed pipelines could not accommodate multi-hop retrieval—scenarios requiring iterative searches based on intermediate results or follow-up queries necessitated by initial findings.

The transition to agentic RAG frameworks addresses these constraints by treating retrieval as a decision-making process within the agent's action space. Agents receive search capabilities as tools, enabling dynamic retrieval strategies where the agent determines if, when, and how to retrieve context. This architectural shift aligns with broader developments in tool-augmented language models, where LLMs orchestrate external capabilities rather than operating in isolation. Contemporary systems must support retrieval across heterogeneous context sources: structured databases amenable to SQL queries, unstructured file systems requiring semantic or keyword search, web resources accessed through search APIs, ephemeral working memory for conversation state, skill repositories containing procedural knowledge, and long-term memory stores for persistent information. Each context source presents unique search requirements, demanding specialized tools and techniques.

3. Core Analysis

3.1 Architectural Evolution and Failure Modes

The transition from fixed RAG pipelines to agentic search introduces three primary failure modes that systematically impact retrieval effectiveness. First, agents may fail to invoke any search tool, incorrectly concluding they possess sufficient parametric knowledge to respond without external context. This failure mode represents a fundamental judgment error in determining retrieval necessity. Second, agents may invoke inappropriate tools when multiple options exist—for instance, selecting web search capabilities when database queries would provide more accurate results. Third, agents may generate incorrect parameters for selected tools, particularly when tools require complex inputs such as complete SQL or ESQL queries.

These failure modes become increasingly prevalent as systems scale from single database context sources to multiple heterogeneous sources. The original fixed pipeline approach, while limited, avoided tool selection errors through its deterministic nature. Agentic systems trade this reliability for flexibility, requiring sophisticated mechanisms to guide tool selection and parameter generation. The complexity of search itself—evidenced by the proliferation of techniques including vector search, keyword search, dense and sparse embeddings, multi-vector embeddings, and various indexing strategies—underscores why tool selection represents such a substantial engineering challenge.

3.2 Tool Design Principles and Description Frameworks

Tool descriptions emerge as the most critical yet frequently neglected component of agentic search systems. Observations of production implementations reveal that tool descriptions typically consist of minimal single-sentence explanations, insufficient for guiding agents when multiple tools offer overlapping capabilities. Comprehensive tool descriptions should incorporate four essential elements: core purpose articulating the tool's primary function, trigger conditions specifying when the tool should and should not be used, relationships to other tools clarifying interactions and dependencies, and confirmation requirements indicating when additional verification is necessary.
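The four-element framework above can be sketched as a tool definition. This is a minimal illustration only: the `search_customers` tool, its wording, and its schema are hypothetical, not taken from the source.

```python
# Hypothetical tool definition applying the four-element description
# framework; the tool name, wording, and schema are illustrative.
SEARCH_CUSTOMERS_TOOL = {
    "name": "search_customers",
    "description": (
        # 1. Core purpose
        "Semantic search over customer support tickets. "
        # 2. Trigger conditions (when to use and when not to)
        "Use for open-ended questions about ticket content. "
        "Do NOT use when the user provides a customer ID; "
        "call get_customer_by_id instead. "
        # 3. Relationship to other tools
        "Results can be narrowed further with filter_by_date. "
        # 4. Confirmation requirements
        "No confirmation needed; this tool is read-only."
    ),
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
```

Packing all four elements into the description gives the agent the disambiguation signal that single-sentence descriptions omit.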

The distinction between specialized and general-purpose tools manifests in their parameter complexity and capability scope. Specialized tools such as get_customer_by_id or basic semantic search operations require minimal parameters and straightforward invocation, providing a low floor that enables reliable agent usage even with less capable models. Conversely, general-purpose tools allowing agents to construct complete queries in SQL or ESQL demand more sophisticated parameter generation and provide a high ceiling for complex operations. This dichotomy necessitates balanced tool stacks: specialized tools handle common patterns efficiently with minimal error rates, while general-purpose tools accommodate unexpected or complex queries that specialized tools cannot address.
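The low floor/high ceiling contrast can be made concrete with a sketch. The schema and tool names below are invented for illustration; an in-memory SQLite table stands in for a production database.

```python
import sqlite3

# Illustrative contrast between a specialized (low-floor) tool and a
# general-purpose (high-ceiling) one; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, name TEXT, plan TEXT)")
conn.execute("INSERT INTO customers VALUES ('c-42', 'Ada', 'pro')")

def get_customer_by_id(customer_id: str):
    """Specialized tool: one trivial parameter, near-zero error rate."""
    return conn.execute(
        "SELECT id, name, plan FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()

def run_sql_query(query: str) -> list:
    """General-purpose tool: the agent authors the full query,
    trading reliability for expressive power."""
    return conn.execute(query).fetchall()
```

The specialized tool succeeds even with weaker models because there is almost nothing to get wrong; the general-purpose tool covers everything the specialized one cannot, at the cost of depending on the model's query-writing ability.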

3.3 Semantic Search Implementation and Limitations

Semantic search tools, which embed text fields using models such as nomic-embed-text-v1.5 and perform similarity searches with configurable top-K limits, exhibit systematic limitations that constrain their applicability. First, semantic search fails for keyword-specific queries where exact term matching is required—searching for the acronym "GDPR" may return semantically similar but factually irrelevant results about "Gemma models" rather than the intended regulatory framework. Second, semantic search operates exclusively on embedded text fields; metadata fields remain unembedded and support only filtering operations, not semantic matching. Third, restrictive top-K configurations (e.g., top-K=3) limit agent flexibility, potentially requiring multiple iterative searches when initial results prove insufficient.
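The top-K behavior can be sketched with toy vectors. The hand-made three-dimensional vectors below stand in for real model embeddings; the point is only how a restrictive top-K silently truncates the result set.

```python
import math

# Toy top-K similarity search; hand-made vectors stand in for real
# embedding-model output. Documents and values are illustrative.
DOCS = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def semantic_search(query_vec, top_k=3):
    ranked = sorted(
        DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True
    )
    return ranked[:top_k]  # a restrictive top_k silently drops the rest
```

With `top_k=1` an agent sees only the single nearest document and must issue follow-up searches to recover anything the cutoff discarded.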

These limitations necessitate complementary search modalities. Keyword search via tools like grep provides exact matching capabilities essential for acronym searches, technical identifiers, and specific terminology. However, grep-based approaches lack true semantic understanding—agents attempting semantic search through shell tools resort to synonym chaining, sequentially searching for related terms like "regulate," "compliance," "GDPR," and "governance" rather than performing genuine semantic matching. Alternative semantic search tools for local files, including gina-grap and coal-grap, employ multi-vector embeddings to provide efficient semantic search without synonym enumeration, representing a middle ground between basic keyword matching and full vector database implementations.
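Synonym chaining can be illustrated in a few lines. The corpus and term list below are invented; regex matching over in-memory strings stands in for grep over files.

```python
import re

# Sketch of grep-style "synonym chaining": lacking semantic search, an
# agent sweeps related terms and unions the exact matches. The corpus
# and filenames are hypothetical.
CORPUS = {
    "policy.txt": "All processing must comply with GDPR requirements.",
    "notes.txt": "The governance board will regulate data retention.",
    "todo.txt": "Buy coffee beans.",
}

def keyword_search(term: str) -> set:
    """Exact, case-insensitive matching -- what grep provides."""
    return {name for name, text in CORPUS.items()
            if re.search(re.escape(term), text, re.IGNORECASE)}

def synonym_chain(terms: list) -> set:
    """Approximate semantic search by unioning several keyword passes."""
    hits = set()
    for term in terms:
        hits |= keyword_search(term)
    return hits
```

Each pass is exact-match only, so coverage depends entirely on how well the agent enumerates synonyms—the inefficiency that multi-vector semantic grep tools are meant to remove.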

3.4 General-Purpose Query Execution and Error Handling

General-purpose query execution tools that enable agents to write complete search queries in languages like SQL or ESQL provide maximum flexibility at the cost of increased complexity. These tools require more capable models to generate syntactically and semantically correct queries, and demand robust error handling mechanisms. Critical to their success is the implementation of try-except blocks that return error messages to agents rather than terminating execution—this enables self-correction where agents analyze error responses and reformulate queries.
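The error-return pattern can be sketched as follows. An in-memory SQLite database stands in for the real query backend; the key point is that failures come back as data the agent can read, not as exceptions that end the run.

```python
import sqlite3

# Error handling for a general-purpose query tool: failures are
# returned to the agent as messages, enabling self-correction.
# The table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (msg TEXT)")
conn.execute("INSERT INTO logs VALUES ('boot ok')")

def execute_query(query: str) -> dict:
    """Never raise to the caller: return rows on success, or the
    database's error message so the agent can reformulate the query."""
    try:
        rows = conn.execute(query).fetchall()
        return {"ok": True, "rows": rows}
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
```

A malformed query yields a structured error response; in practice agents read messages like `no such table: ...` and retry with a corrected query.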

The agent skills pattern implements progressive disclosure for complex query languages: skill names and brief descriptions appear in system prompts, while full documentation loads into the context window on-demand when agents request specific skills. Skills should incorporate syntax rules, practical examples, and usage guidelines that enable agents to construct correct queries. Common errors, such as agents using SQL wildcard syntax (%) in ESQL queries that require asterisk (*) wildcards, demonstrate the necessity of explicit syntax documentation. Furthermore, general-purpose query tools enable agents to perform aggregations and calculations within search operations, avoiding context window bloat and leveraging database capabilities rather than relying on LLM arithmetic, which exhibits known weaknesses in counting and numerical operations.
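Progressive disclosure can be sketched with a two-level skill store. The skill content below, including the ES|QL wildcard note, is a paraphrase of the pattern described above; the structure and function names are illustrative.

```python
# Progressive disclosure sketch: only names and one-line summaries
# live in the system prompt; full documentation loads on demand.
# The skill registry and its contents are illustrative.
SKILLS = {
    "esql_syntax": {
        "summary": "How to write ES|QL queries.",
        "doc": (
            "ES|QL uses * as the wildcard, not SQL's %. "
            'Example: FROM logs | WHERE msg LIKE "boot*"'
        ),
    },
}

def system_prompt_index() -> str:
    """Compact index that always fits in the system prompt."""
    return "\n".join(
        f"- {name}: {skill['summary']}" for name, skill in SKILLS.items()
    )

def load_skill(name: str) -> str:
    """Pull full documentation into the context window when requested."""
    return SKILLS[name]["doc"]
```

The system prompt stays small no matter how many skills exist; the expensive documentation, with its syntax rules and examples, enters the context window only when the agent asks for it.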

4. Technical Insights

The implementation of agentic search systems requires careful consideration of multiple technical dimensions. Vector embeddings for semantic search utilize models such as nomic-embed-text-v1.5 to transform text fields into high-dimensional representations, enabling similarity-based retrieval. However, the embedding process applies only to designated text fields; metadata remains in its original form and supports filtering but not semantic matching. This architectural constraint necessitates hybrid approaches where semantic search on embedded content combines with metadata filtering.
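The hybrid pattern—exact filtering on metadata, semantic ranking on embedded content—can be sketched with toy vectors. The index layout and the `year` field are invented for illustration.

```python
import math

# Hybrid retrieval sketch: exact filtering on unembedded metadata,
# then semantic ranking of the survivors. Vectors and metadata are
# illustrative stand-ins for a real index.
INDEX = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"year": 2024}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"year": 2021}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"year": 2024}},
]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def hybrid_search(query_vec, min_year, top_k=2):
    # 1) Filter on metadata (exact match, no embeddings involved).
    candidates = [d for d in INDEX if d["meta"]["year"] >= min_year]
    # 2) Rank only the survivors by semantic similarity.
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]
```

Filtering before ranking keeps semantic scoring confined to documents that already satisfy the exact metadata constraints, which semantic matching alone cannot enforce.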

Error handling architecture proves critical for general-purpose query tools. Rather than system termination upon query errors, implementations should return structured error messages to agents, enabling iterative refinement. Agents demonstrate surprising capability in error interpretation and query correction when provided with informative error responses, particularly when syntax errors or logical inconsistencies occur in SQL or ESQL queries.

Shell tools (bash, exec, terminal) provide exceptional versatility, enabling agents to navigate file systems, execute command-line interfaces, and interact with databases through terminal commands. However, these tools introduce substantial security risks—agents may inadvertently delete files or execute unintended commands. Shell tools must operate exclusively in sandboxed environments with appropriate permission restrictions. Despite these risks, shell tools enable sophisticated operations including semantic search approximations through grep synonym chaining, though true semantic search tools like gina-grap provide superior efficiency through multi-vector embeddings.
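A minimal guardrail for a shell tool can be sketched as an allowlist wrapper. This is not a complete security model—real deployments need containerized sandboxes and permission systems—and the allowed command set is an arbitrary example.

```python
import shlex
import subprocess

# Allowlist wrapper for a shell tool; the command set is illustrative
# and this sketch is NOT a substitute for a real sandbox.
ALLOWED = {"ls", "cat", "grep", "echo"}

def safe_shell(command: str, timeout: float = 5.0) -> str:
    parts = shlex.split(command)
    if not parts:
        return "error: empty command"
    if parts[0] not in ALLOWED:
        # Refuse destructive or unknown commands instead of running them.
        return f"error: command {parts[0]!r} not permitted"
    result = subprocess.run(
        parts, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout if result.returncode == 0 else result.stderr
```

Refusals come back as readable error strings, so the agent learns the boundary and can choose a permitted alternative rather than failing silently.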

Context window management emerges as a critical concern in long-running conversations utilizing multiple skills. Progressive disclosure mechanisms maintain compact system prompts by loading skill documentation only when needed. File stores can offload skill definitions and previous tool results, retrieving them into the context window on-demand. Context compaction strategies become necessary as conversations accumulate tool invocations and results, requiring selective retention of relevant information and strategic offloading of obsolete content.
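The offloading half of this strategy can be sketched as follows. An in-memory dict stands in for the file store, and the stub format is invented; the point is that bulky tool results leave the context window but remain retrievable on demand.

```python
# Context compaction sketch: older tool results are offloaded to a
# store and replaced by short stubs the agent can re-expand. The stub
# format and in-memory "file store" are illustrative.
class ContextManager:
    def __init__(self, keep_last: int = 2):
        self.keep_last = keep_last
        self.messages = []       # the conversation / context window
        self.offloaded = {}      # stands in for a file store

    def add_tool_result(self, content: str) -> None:
        self.messages.append({"role": "tool", "content": content})

    def compact(self) -> None:
        """Replace all but the most recent results with retrieval stubs."""
        for i, msg in enumerate(self.messages[: -self.keep_last]):
            if not msg["content"].startswith("[offloaded:"):
                key = f"result-{i}"
                self.offloaded[key] = msg["content"]
                msg["content"] = f"[offloaded:{key}]"

    def retrieve(self, key: str) -> str:
        """Load an offloaded result back into context on demand."""
        return self.offloaded[key]
```

Recent results stay verbatim while older ones shrink to a pointer, keeping long-running conversations within the context budget without permanently discarding information.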

The hybrid tool approach, combining multiple search modalities for verification and cross-validation, achieves higher accuracy than single-tool implementations. For instance, database queries verified through shell tool file system checks provide redundant validation that catches errors in either retrieval path. This approach trades computational efficiency for reliability, a worthwhile tradeoff in high-stakes applications where retrieval accuracy directly impacts system trustworthiness.
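The cross-validation idea can be sketched with two stubbed retrieval paths. The order-status scenario, both data stores, and the agreement rule are invented for illustration.

```python
# Hybrid verification sketch: two independent retrieval paths must
# agree before a result is trusted. Both "paths" are stubs standing in
# for a database query and a file-system check; data is hypothetical.
DB = {"order-7": "shipped"}
FILES = {"order-7.txt": "status: shipped"}

def db_lookup(order_id: str):
    """Path 1: structured database query."""
    return DB.get(order_id)

def file_check(order_id: str):
    """Path 2: shell-style file-system check."""
    text = FILES.get(f"{order_id}.txt")
    return text.split(": ", 1)[1] if text else None

def verified_status(order_id: str) -> dict:
    a, b = db_lookup(order_id), file_check(order_id)
    return {"status": a, "verified": a is not None and a == b}
```

Two retrievals per question doubles the cost, but a disagreement between paths surfaces an error that either path alone would have silently returned.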

5. Discussion

The findings establish that effective agentic search systems require careful balance between specialization and generality in tool design. The low floor/high ceiling pattern provides a principled framework: specialized tools with minimal parameters handle common cases reliably, while general-purpose tools accommodate edge cases and complex queries that exceed specialized tool capabilities. This architectural principle extends beyond search to general tool design for agentic systems.

The predominance of search in context engineering—estimated at approximately 80% of the engineering challenge—suggests that advances in search tool design and agent guidance mechanisms will yield substantial improvements in overall system performance. Tool descriptions emerge as a high-leverage intervention point: comprehensive descriptions incorporating trigger conditions and tool relationships significantly improve agent tool selection without requiring model improvements or architectural changes.

Several areas warrant further investigation. First, the optimal balance between specialized and general-purpose tools likely varies across application domains, suggesting domain-specific tool stack design principles. Second, the relationship between model capability and complex tool usage remains partially characterized—while more powerful models reduce parameter generation errors, the magnitude of improvement and cost-benefit tradeoffs require systematic evaluation. Third, the security implications of shell tools in production systems necessitate rigorous sandboxing strategies and permission models that preserve utility while preventing harmful operations.

The progressive disclosure pattern for skill management demonstrates broader applicability beyond search tools. As agent systems accumulate larger skill repositories, mechanisms for selective skill loading and context window management become increasingly critical. The tension between comprehensive tool documentation and context window constraints suggests opportunities for hierarchical skill organization and adaptive loading strategies based on conversation progression.

6. Conclusion

This analysis establishes that agentic search represents the central challenge in context engineering, requiring sophisticated tool selection mechanisms across diverse context sources. The research identifies three primary failure modes—tool selection errors, parameter generation failures, and inadequate tool descriptions—and provides evidence-based mitigation strategies including comprehensive tool descriptions, progressive disclosure patterns, and hybrid verification approaches.

Key practical contributions include the low floor/high ceiling design pattern for balanced tool stacks, the four-element framework for tool descriptions (core purpose, trigger conditions, tool relationships, confirmation requirements), and the progressive disclosure mechanism for managing complex skill repositories. These principles enable practitioners to design robust agentic search systems capable of handling diverse retrieval requirements across structured and unstructured context sources.

The findings suggest that no silver-bullet search solution exists; effective systems require curated tool sets combining specialized and general-purpose capabilities. Organizations implementing agentic search should begin with general-purpose tools to understand agent behavior patterns, then introduce specialized tools for common operations identified through behavioral logging. This empirical approach to tool stack design, coupled with comprehensive tool descriptions and robust error handling, provides a practical pathway to reliable agentic search implementations. Future work should systematically evaluate tool stack configurations across application domains, quantify the relationship between model capability and complex tool usage, and develop security frameworks for safely deploying versatile tools like shell access in production environments.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
