Wiki: Retrieval-augmented generation

Retrieval-augmented generation pairs a language model with a retrieval step that pulls relevant context from an external corpus before generation. The basic pattern is well-established, but several sources here push on its assumptions and identify cases where it underperforms or can be replaced.

The most direct challenge comes from PageIndex, which replaces vector similarity search with hierarchical tree indexes over long documents and uses LLM reasoning, rather than embedding distance, to navigate the index. On FinanceBench it reaches 98.7% accuracy. The authors frame this as “vectorless RAG”: retrieval that avoids the bottleneck created by chunking documents into semantically lossy fragments.

A complementary critique appears in the Karpathy LLM Wiki threads. One Reddit implementation found that for curated research, having the model synthesize documents into structured Markdown at ingest time produces better cross-document reasoning than querying a vector store at runtime. The accompanying how-to notes that this approach avoids RAG entirely at query time, but at a cost: unless the ingest step is linted, any hallucination it introduces is baked into the structure of the compiled pages.
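The sources here do not say what the lint step checks, so the following is only one plausible check, sketched under stated assumptions: every bullet in a compiled page is treated as a claim, and each claim must cite a source that was actually ingested. The `[src:ID]` citation format, the `lint_page` function, and the sample page are all hypothetical.

```python
import re
from typing import List, Set

# Hypothetical citation marker, e.g. "[src:doc-1]".
CITE = re.compile(r"\[src:([\w-]+)\]")


def lint_page(markdown: str, known_sources: Set[str]) -> List[str]:
    """Return one message per bullet that lacks a citation or cites an
    unknown source. Non-bullet lines are ignored."""
    problems = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("- "):
            continue  # only bullet lines are treated as claims
        cited = CITE.findall(stripped)
        if not cited:
            problems.append(f"line {lineno}: claim has no source citation")
        for src in cited:
            if src not in known_sources:
                problems.append(f"line {lineno}: unknown source '{src}'")
    return problems


# Placeholder page produced by a hypothetical ingest step.
page = "\n".join([
    "# Topic",
    "- Claim A [src:doc-1]",
    "- Claim B",
    "- Claim C [src:doc-9]",
])

print(lint_page(page, {"doc-1", "doc-2"}))
# → ['line 3: claim has no source citation', "line 4: unknown source 'doc-9'"]
```

A check like this catches unsupported claims, not incorrect ones. Confirming that a cited source really says what the page claims would need a separate pass.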

On the infrastructure side, RAG workloads are a primary driver of KV cache optimization. Pure KVA’s granular prompt caching segments prompts into reusable chunks so only changed tokens are reprocessed, cutting time-to-first-token and GPU cost for repeated RAG queries. The companion post reports up to 20x faster inference by persisting attention states across sessions on NFS and S3.
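To show why segmenting a prompt helps repeated RAG queries, here is a minimal sketch of chunk-level caching. It is not Pure KVA's implementation: the `ChunkCache` class is hypothetical, the stored strings stand in for real attention (KV) tensors, and the prefix-chained key is an assumption reflecting that a chunk's attention state depends on the tokens before it.

```python
import hashlib
from typing import Dict, List, Tuple


class ChunkCache:
    """Toy cache keyed by a hash of each chunk plus everything before it."""

    def __init__(self) -> None:
        # key -> placeholder for a persisted attention state
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(prev_key: str, chunk: str) -> str:
        # Chaining the previous key means a chunk only hits the cache when
        # the whole prefix leading up to it is unchanged.
        data = (prev_key + "\x00" + chunk).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def process(self, chunks: List[str]) -> Tuple[int, int]:
        """Return (reused, recomputed) chunk counts for one prompt."""
        reused = recomputed = 0
        key = ""
        for chunk in chunks:
            key = self._key(key, chunk)
            if key in self._store:
                reused += 1
            else:
                # Stand-in for running the model and saving its KV state.
                self._store[key] = f"state-for-{key[:8]}"
                recomputed += 1
        return reused, recomputed


cache = ChunkCache()
system = "You answer questions from the provided context."
context = "Context: (retrieved passage)"

print(cache.process([system, context, "Question: first question"]))
# → (0, 3)
print(cache.process([system, context, "Question: second question"]))
# → (2, 1)
```

On the second query only the new question is recomputed, which is where the time-to-first-token saving comes from when the same system prompt and retrieved context recur.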

RAG also appears as a standard component in agentic pipelines. AgentSwarms includes RAG alongside ReAct and multi-agent patterns as a foundational building block. OpenAI’s internal data agent uses layered context (schema metadata, annotations, institutional docs) that functions as a structured form of RAG over 600+ petabytes. Headroom addresses a downstream problem, RAG chunks inflating context windows, by compressing those chunks by 60-95% before they reach the model.
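The sources here do not describe how Headroom compresses chunks, so the sketch below swaps in a deliberately simple technique to make the problem concrete: it drops sentences already seen in earlier chunks, then stops once a character budget is spent. The function names, the budget parameter, and the sample chunks are all hypothetical.

```python
from typing import List


def compress_chunks(chunks: List[str], max_chars: int) -> List[str]:
    """Drop sentences already seen in earlier chunks, then stop at the budget.

    Assumes chunks arrive ordered by relevance, so anything past the budget
    is the least useful material.
    """
    seen = set()
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        fresh = []
        for sentence in chunk.split(". "):
            # Normalize case and whitespace so trivial variants match.
            norm = " ".join(sentence.lower().split()).rstrip(".")
            if norm and norm not in seen:
                seen.add(norm)
                fresh.append(sentence.strip().rstrip("."))
        if not fresh:
            continue  # chunk was entirely redundant
        text = ". ".join(fresh) + "."
        if used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    return kept


def reduction(original: List[str], compressed: List[str]) -> float:
    """Fraction of characters removed, between 0.0 and 1.0."""
    before = sum(len(c) for c in original)
    after = sum(len(c) for c in compressed)
    return 1 - after / before if before else 0.0


# Placeholder chunks with overlapping sentences.
chunks = [
    "Alpha is defined here. Beta follows alpha.",
    "Alpha is defined here. Gamma is new.",
    "Beta follows alpha. Alpha is defined here.",
]

small = compress_chunks(chunks, max_chars=200)
print(small)
# → ['Alpha is defined here. Beta follows alpha.', 'Gamma is new.']
```

Overlapping retrieved chunks are one common source of context bloat. The reduction achieved by a scheme like this depends entirely on how redundant the retrieved chunks are.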

The thread connecting these sources is that RAG is not a single technique but a design space. Vector similarity search, hierarchical indexing, compiled knowledge bases, and layered metadata retrieval are all answers to the same question — how to give a model the right information at the right time — and the best answer depends on document structure, query patterns, and latency constraints.