RAG is dead, right?? — Kuba Rogut, Turbopuffer

RAG is not dead but evolving—simple vector search RAG is being replaced by sophisticated agentic retrieval systems that iteratively search and reason over mu...

2026-06-14 By Sean Weldon

The Evolution of Retrieval Augmented Generation: From Simple Vector Search to Agentic Retrieval Systems

Abstract

This synthesis examines the architectural evolution of Retrieval Augmented Generation (RAG) systems from simple vector search implementations to sophisticated agentic retrieval frameworks. Through comparative analysis of production implementations, specifically Cursor's indexed semantic search approach and Claude Code's grep-based methodology, this work argues that embeddings function as cached compute, amortizing upfront indexing costs across multiple queries while reducing per-query token consumption. Empirical evidence from Cursor's deployment shows 12.5-13.5% average accuracy improvements across models, with the Composer model achieving 24% gains, alongside 2.6% code retention improvements in large codebases. The analysis finds that effective retrieval over trillion-token corpora requires staged mechanisms to identify relevant subsets rather than maximizing context window utilization, and that modern retrieval has evolved from one-time vector database queries to iterative, multi-modal discovery processes in which agents progressively reason over context.

1. Introduction

Retrieval Augmented Generation (RAG) has become a point of contention in the artificial intelligence community: some practitioners declare the approach obsolete, while others argue for its continued relevance. This apparent contradiction stems from terminological ambiguity and the rapid evolution of retrieval architectures between 2023 and 2024. This analysis argues that RAG is not obsolete but is being transformed from simple vector search implementations into sophisticated agentic retrieval systems that iteratively search and reason over multiple modalities.

Terminological precision is essential for understanding this evolution. Retrieval Augmented Generation (RAG) encompasses multiple retrieval modalities, including vector search, full-text search algorithms such as BM25, pattern matching through regular expressions and globbing, and basic filtering mechanisms, all integrated with large language model generation. In contrast, agentic search is an advanced paradigm in which autonomous agents use multiple tools to discover and reason over context progressively, across sequential steps. Importantly, agentic search is more than running grep over a file system: agents read files, assess relevance, judge whether they have enough information, and keep searching until they reach a satisfactory state.

This synthesis examines the technical evolution of RAG systems through production case studies, establishes the economic and performance implications of different architectural choices, and synthesizes broader implications for retrieval system design in the context of expanding language model capabilities.

2. Background and Related Work

The conceptual framework of RAG emerged as a solution to the knowledge limitations inherent in pre-trained language models. Traditional RAG implementations performed single-pass vector similarity searches to augment model context windows with relevant information. This approach, characterized as "simple RAG," proved effective for straightforward question-answering tasks during 2023 and early 2024 but demonstrated limitations when applied to complex, multi-step reasoning scenarios requiring deeper contextual understanding.

The concept of staged retrieval, articulated by Jeff Dean at Google, addresses the challenge of information discovery in massive corpora. Dean's formulation emphasizes that effective retrieval requires identifying "the right million" tokens from trillions available rather than maximizing absolute context window utilization. This principle underlies contemporary approaches to retrieval system design, particularly as context windows expand, with models such as Gemini offering context windows on the order of a million tokens. The staged retrieval framework recognizes that large context windows are ineffective without mechanisms to narrow down relevant information subsets. The critical capability is efficiently reducing trillions of available tokens to the right 100,000 or 10,000 tokens for a specific query.
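The staged idea can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a description of any production system: the function names, the term-overlap prefilter, the bag-of-words cosine reranker, and the whitespace token count are all hypothetical stand-ins chosen to keep the code self-contained. A cheap first stage cuts the corpus to a handful of candidates, a more expensive second stage reranks only those candidates, and the result is packed into a fixed token budget.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Whitespace tokenization is a stand-in for a real tokenizer.
    return text.lower().split()


def cheap_filter(query: str, corpus: dict[str, str], keep: int) -> list[str]:
    """Stage 1: cheap term-overlap count that keeps at most `keep` documents."""
    q_terms = set(tokenize(query))
    scored = [(len(q_terms & set(tokenize(text))), doc_id)
              for doc_id, text in corpus.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:keep]]


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def staged_retrieve(query: str, corpus: dict[str, str],
                    keep: int = 3, token_budget: int = 12) -> list[str]:
    """Stage 2: rerank the survivors, then pack them into the token budget."""
    candidates = cheap_filter(query, corpus, keep)
    q_vec = Counter(tokenize(query))
    ranked = sorted(
        candidates,
        key=lambda d: (-cosine(q_vec, Counter(tokenize(corpus[d]))), d),
    )
    selected, used = [], 0
    for doc_id in ranked:
        cost = len(tokenize(corpus[doc_id]))
        if used + cost <= token_budget:
            selected.append(doc_id)
            used += cost
    return selected
```

The design point is that the expensive scorer never sees the full corpus; only the budgeted output of the last stage reaches the model's context window.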

Merkle trees, cryptographic hash tree structures originally developed for data verification, have found novel application in code indexing systems. Because a parent node's hash changes whenever any of its descendants changes, two trees can be compared from the root down to locate exactly which files differ without scanning every file. This makes comparisons between codebases efficient and supports caching strategies that avoid redundant computation when team members work on identical or highly similar code repositories.

3. Core Analysis

3.1 Comparative Implementation Architectures

Two distinct architectural approaches to code retrieval systems illuminate the trade-offs between indexed and per-session discovery methods. Cursor's implementation exemplifies the indexed semantic search approach: the system chunks, parses, and embeds codebases to enable semantic search capabilities. Critically, Cursor employs Merkle trees to calculate similarities between codebases across team members, avoiding redundant re-chunking and re-embedding operations when codebases are identical or similar. The system only re-chunks and re-embeds files that have changed, substantially reducing computational overhead.
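To make the change-detection idea concrete, here is a minimal sketch of a Merkle tree over a directory structure. It is an illustration only and makes no claim about Cursor's actual implementation: the nested-dict representation of a codebase and the function names build and changed_files are assumptions introduced for this example. Directories hash the names and hashes of their children, so an unchanged subtree is skipped with a single comparison.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build(tree):
    """Build a Merkle node from a nested dict (directory) or a str (file content)."""
    if isinstance(tree, str):
        return {"hash": sha(tree.encode()), "children": None}
    children = {name: build(sub) for name, sub in sorted(tree.items())}
    combined = "".join(f"{name}:{node['hash']};" for name, node in children.items())
    return {"hash": sha(combined.encode()), "children": children}


def changed_files(old, new, prefix=""):
    """Return paths of files in `new` that are added or modified relative to `old`.

    Deleted files are not reported; a real indexer would also need to
    drop their embeddings.
    """
    if old is not None and old["hash"] == new["hash"]:
        return []  # identical subtree: nothing below needs re-embedding
    if new["children"] is None:
        return [prefix]  # a file whose content is new or changed
    old_children = old["children"] if old and old["children"] else {}
    out = []
    for name, node in new["children"].items():
        path = f"{prefix}/{name}" if prefix else name
        out.extend(changed_files(old_children.get(name), node, path))
    return out
```

Only the paths returned by changed_files would need to be re-chunked and re-embedded; everything under a matching hash can reuse existing embeddings, including those computed for a teammate's identical subtree.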

In contrast, Claude Code employs a grep-based approach that does not use vector search or embedding-based retrieval. Early iterations of Claude Code attempted RAG implementations with local vector databases but found them ineffective for their specific use case. The current architecture relies on file system grep and iterative file reading, requiring per-session discovery in which agents grep, read, assess, and repeat for each query. Notably, this approach incurs the same computational costs again even when multiple agents ask identical questions across different sessions.
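The grep, read, assess, repeat cycle can be sketched as a small loop. This is a toy illustration, not the tool's real implementation: the in-memory FILES dict stands in for a file system, and is_sufficient and next_pattern are hypothetical stand-ins for the judgment a language model would supply at each step.

```python
import re

# Hypothetical in-memory "repository" used only for this example.
FILES = {
    "api.py": "from auth import check_token\ndef handler(): check_token()",
    "auth.py": "def check_token(): return verify_signature()",
    "crypto.py": "def verify_signature(): return True",
    "docs.md": "unrelated notes",
}


def grep(files: dict[str, str], pattern: str) -> list[str]:
    rx = re.compile(pattern)
    return sorted(path for path, text in files.items() if rx.search(text))


def is_sufficient(context) -> bool:
    # Stand-in for model judgment: stop once the signature check is in view.
    return any("def verify_signature" in text for _, text in context)


def next_pattern(context):
    # Stand-in for model reasoning: chase the first called-but-undefined function.
    text = "\n".join(t for _, t in context)
    defined = set(re.findall(r"def (\w+)", text))
    for name in re.findall(r"(\w+)\(\)", text):
        if name not in defined:
            return rf"def {name}\b"
    return None


def agentic_grep(files, first_pattern, max_steps=5):
    """Grep, read, assess, repeat until the context is judged sufficient."""
    context, seen, pattern, steps = [], set(), first_pattern, 0
    while pattern and steps < max_steps:
        steps += 1
        for path in grep(files, pattern):
            if path not in seen:  # "read" each newly matched file once
                seen.add(path)
                context.append((path, files[path]))
        if is_sufficient(context):
            break
        pattern = next_pattern(context)
    return [path for path, _ in context], steps
```

Starting from the pattern "def handler", the loop follows the call chain through three files in three steps. Nothing is retained between calls, which is the per-session cost the paragraph above describes: a second session asking the same question repeats all three steps.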

3.2 Embeddings as Cached Compute

Treating embeddings as cached compute clarifies the economic trade-offs between these architectural approaches. Embeddings and semantic search function as cached computation: upfront indexing costs are amortized across many runtime queries. The Claude Code approach incurs repeated token costs, approximately 6,000 tokens per sub-step, every time a new agent session asks the same question. In contrast, Cursor's indexed approach pays a one-time parsing and embedding cost, after which lightweight retrieval at runtime saves tokens, time, and computational resources on every query.
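The amortization argument reduces to simple arithmetic. The sketch below assumes, as a simplification, that indexing and querying can be expressed in one token-equivalent unit. Only the figure of roughly 6,000 tokens per sub-step comes from the discussion above; the steps per session, indexing cost, and per-query retrieval cost used in the usage note are hypothetical values chosen for illustration.

```python
import math


def grep_cost(sessions: int, steps_per_session: int,
              tokens_per_step: int = 6_000) -> int:
    """Total tokens when every session rediscovers context from scratch."""
    return sessions * steps_per_session * tokens_per_step


def indexed_cost(sessions: int, index_tokens: int, tokens_per_query: int) -> int:
    """One-time indexing cost plus a lightweight retrieval cost per session."""
    return index_tokens + sessions * tokens_per_query


def break_even_sessions(steps_per_session: int, index_tokens: int,
                        tokens_per_query: int, tokens_per_step: int = 6_000):
    """Smallest session count at which the index costs no more than grep.

    Returns None if indexed retrieval never pays off under these inputs.
    """
    saving = steps_per_session * tokens_per_step - tokens_per_query
    if saving <= 0:
        return None
    return math.ceil(index_tokens / saving)
```

With hypothetical inputs of 4 grep steps per session, a 1,000,000 token-equivalent indexing cost, and 2,000 tokens per indexed query, break_even_sessions(4, 1_000_000, 2_000) returns 46: below that session count per-session discovery is cheaper, and above it the index is.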

This economic advantage translates to performance gains. Indexed semantic search enables faster agent performance and has driven internal adoption at Turbopuffer over grep-based approaches. The upfront investment in indexing infrastructure yields compounding returns as query volume increases, particularly in scenarios where similar questions are posed across multiple sessions or users.

3.3 Empirical Performance Evaluation

Quantitative evaluation of Cursor's semantic search integration provides empirical validation of the indexed approach. The integration achieved a 12.5-13.5% average increase in answer accuracy across models, with the Cursor Composer model demonstrating a 24% improvement. Online A/B testing revealed a 2.6% code retention improvement in large codebases and a 2.2% decrease in dissatisfied user requests.

Importantly, semantic search is not utilized in every query, which explains why aggregate metrics appear modest despite significant per-query improvements. This selective deployment suggests that effective retrieval systems must intelligently determine when different retrieval modalities are appropriate rather than uniformly applying a single approach across all queries.

3.4 From Simple RAG to Iterative Agentic Retrieval

The evolution from simple RAG to agentic retrieval represents a fundamental shift in retrieval system architecture. Simple RAG, characterized by a single vector search call whose results are placed directly into the context window, proved sufficient for basic use cases in 2023 and early 2024. However, contemporary sophisticated applications require multiple iterative calls, semantic and full-text search as needed, and selective fetching based on specific use cases.

Modern agentic retrieval transforms retrieval from a one-time operation into a continuous, iterative process where agents reason through several steps to understand what to search for next. Agents are "searching to understand more"—using retrieval as a mechanism for progressive context discovery rather than static information lookup. This iterative approach enables agents to refine their understanding of information needs based on intermediate retrieval results, creating a feedback loop that progressively narrows the search space toward maximally relevant context.

4. Technical Insights

Several technical considerations emerge from this analysis with direct implications for retrieval system implementation. First, the application of Merkle trees for codebase similarity calculation represents an efficient approach to avoiding redundant re-embedding operations in collaborative development environments. This technique is particularly valuable in scenarios where multiple users work on similar or identical codebases, as it enables intelligent sharing of indexing computation across users.

Second, the architecture of Turbopuffer as a full-text search and vector search database built on object storage demonstrates the feasibility of combining multiple retrieval modalities within a unified system. This multi-modal approach enables systems to select appropriate retrieval mechanisms based on query characteristics rather than forcing all queries through a single retrieval pathway.
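One common way to combine a full-text ranking with a vector ranking is reciprocal rank fusion (RRF). The source does not say which fusion method, if any, a particular database uses, so RRF is named here as a swapped-in technique for illustration, and the function name and the constant k = 60 (a conventional default) are assumptions of this sketch. Each document scores the sum of 1 / (k + rank) over the rankings in which it appears.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of document ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

RRF needs only rank positions, not comparable scores, which is why it suits mixing BM25 scores with vector distances that live on different scales.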

Third, the trade-off between upfront indexing costs and per-query token consumption must be evaluated in the context of expected query volume and similarity. For applications with high query volume or substantial query overlap across sessions, indexed approaches amortize costs effectively. Conversely, for applications with highly diverse, one-time queries, per-session discovery approaches may prove more economical despite higher per-query costs.

Fourth, the selective deployment of semantic search—where it is not utilized for every query—suggests that effective retrieval systems should incorporate decision logic to determine when different retrieval modalities are appropriate. This meta-level reasoning about retrieval strategy selection represents an important area for system optimization.
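As a rough illustration of such decision logic, the sketch below routes a query to a retrieval modality using hand-written heuristics. The rules and the function name choose_modality are hypothetical; in an agentic system this choice would more plausibly be made by the model itself when it selects a tool, and the source does not specify how any production system decides.

```python
import re


def choose_modality(query: str) -> str:
    """Pick a retrieval modality from surface features of the query."""
    q = query.strip()
    # Path-like or wildcard queries with no spaces: match file names.
    if " " not in q and ("*" in q or "/" in q):
        return "glob"
    # A single snake_case or camelCase identifier: exact pattern match.
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", q) and ("_" in q or q != q.lower()):
        return "regex"
    # A quoted phrase signals that exact wording matters.
    if re.search(r'"[^"]+"', q):
        return "full_text"
    # Everything else is treated as a natural-language question.
    return "vector"
```

For example, choose_modality("check_token") selects pattern matching, while choose_modality("where do we validate user sessions") falls through to semantic search.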

5. Discussion

The findings presented in this analysis reveal several broader implications for the development of retrieval systems in the context of increasingly capable language models. The evolution from simple RAG to agentic retrieval reflects a fundamental shift in how retrieval systems interface with language models—from passive information providers to active reasoning partners that iteratively refine context discovery. This shift aligns with broader trends toward agentic AI systems that employ tools to accomplish complex tasks through multi-step reasoning.

The principle of staged retrieval, particularly Jeff Dean's formulation regarding "the right million" tokens from trillions available, becomes increasingly critical as context windows expand. The empirical evidence from production systems demonstrates that maximizing context window utilization is not the objective; rather, effective systems must develop sophisticated mechanisms for identifying relevant subsets of available information. This insight challenges assumptions that larger context windows automatically improve system performance and suggests that retrieval mechanism sophistication may be more important than absolute context capacity.

Several areas warrant further investigation. The decision logic for selecting among retrieval modalities (vector search, full-text search, pattern matching) remains underspecified in current implementations. Research into meta-learning approaches that optimize retrieval strategy selection based on query characteristics could yield substantial performance improvements. Additionally, the interaction between retrieval iteration depth and answer quality presents opportunities for optimization—determining when agents have retrieved sufficient context versus when additional retrieval steps would improve outcomes represents a challenging trade-off.

6. Conclusion

This analysis argues that RAG is not obsolete but rather evolving from simple vector search implementations to sophisticated agentic retrieval systems. The comparative examination of Cursor's indexed semantic search approach and Claude Code's grep-based methodology shows that embeddings function as cached compute, with upfront indexing costs amortized across multiple queries. Empirical evidence from production deployments supports this approach, with Cursor achieving 12.5-13.5% average accuracy improvements and 2.6% code retention gains in large codebases.

The practical implications for retrieval system design are clear: effective systems must support multiple retrieval modalities, implement intelligent caching strategies to avoid redundant computation, and enable iterative, multi-step retrieval processes where agents progressively refine context discovery. As language model context windows continue to expand, the critical capability is not maximizing context utilization but rather developing staged retrieval mechanisms that efficiently identify relevant information subsets. The evolution from simple RAG to agentic retrieval represents not the death of RAG but its maturation into a more sophisticated, iterative, and contextually aware paradigm for augmenting language model capabilities with external information.

Sources

RAG is dead, right?? — Kuba Rogut, Turbopuffer - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
