TECHNOLOGIES

RAG

Retrieval-Augmented Generation

Inject relevant context into an LLM prompt at runtime, retrieved from your own data, so the model answers from your knowledge instead of its training data.

Last reviewed: 2026-05-24 by Kevin Riedl

RAG is the architecture that lets an LLM answer questions about data it was not trained on. The mechanism is straightforward: take the user's question, retrieve the most relevant chunks from your corpus (via vector search, keyword search, or a hybrid), stuff them into the prompt, and let the model answer.
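The retrieve-then-prompt loop described above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words `vectorize` function is a toy stand-in for a real embedding model, and `retrieve`, `build_prompt`, and the sample `CHUNKS` are hypothetical names and data invented for the example. The call to the LLM itself is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text: str) -> Counter:
    # Assumption: raw term counts stand in for an embedding model.
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the question and keep the top k.
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    # Stuff the retrieved chunks into the prompt ahead of the question.
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Support is available by email on weekdays.",
]

question = "How many days do refunds take?"
top_chunks = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top_chunks)
```

The resulting `prompt` string is what would be sent to the model. In a real system the scoring function would typically be replaced by an embedding model and a vector index, while the overall shape stays the same.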

The reason RAG exists: training a model on your private data is expensive and slow, and the result is obsolete the moment your data changes. RAG sidesteps that by treating the data as runtime context. The trade-off is that retrieval quality becomes the bottleneck: a model with bad context produces confidently wrong answers.

The unsexy truth about RAG: 80% of the work is making retrieval good (chunking strategy, embedding choice, reranking, hybrid search) and 20% is the model itself. Vendors that pitch RAG as a one-click feature are pitching the easy part.

FAQs

When a RAG system gives wrong answers, the cause is almost always retrieval, not generation: bad chunking, weak embeddings, no reranking, or a corpus the model cannot disambiguate. Swap the model and the answers stay wrong; fix retrieval and the same model suddenly looks smart.

Hybrid retrieval (vector + keyword + reranker) beats vector-only in almost every production benchmark we have run. Pure vector search misses on exact matches, acronyms, and rare terms. The extra plumbing is worth it.

RAG is a poor fit when the answer requires synthesising across many documents, when the corpus has heavy contradictions, or when the user query is more conceptual than lookup-shaped. At that point you want either fine-tuning, an agentic pipeline, or a smaller, better-curated corpus.
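One common way to combine a vector ranking with a keyword ranking, as in the hybrid setup described above, is reciprocal rank fusion (RRF). The source does not say which fusion method is used, so RRF is an assumption here, chosen because it only needs rank positions and not comparable scores. The function name, the document IDs, and the constant `k = 60` (a conventional default) are illustrative, and the reranker stage is omitted.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every ranking it appears in;
    # documents ranked well by several retrievers accumulate the highest score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a vector retriever and a keyword retriever.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Here `doc_a` and `doc_c` rise to the top because both retrievers returned them, while `doc_d`, an exact-match hit that vector search missed, still makes the candidate list. A reranker would then reorder the fused candidates before they go into the prompt.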