TECHNOLOGIES

RAG

Retrieval-Augmented Generation

Inject relevant context into an LLM prompt at runtime, retrieved from your own data, so the model answers from your knowledge instead of its training data.

Last reviewed by Kevin Riedl

RAG is the architecture that lets an LLM answer questions about data it was not trained on. The mechanism is straightforward: take the user’s question, retrieve the most relevant chunks from your corpus (via vector search, keyword search, or a hybrid), stuff them into the prompt, and let the model answer. It is the grounding layer underneath most useful AI agents.
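The retrieve-then-prompt loop described above can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes a toy bag-of-words cosine score in place of a real embedding model, and the names `retrieve`, `build_prompt`, and `CHUNKS` are hypothetical, as is the sample corpus. The final LLM call is left out on purpose, since it depends on your provider.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (toy scoring)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Put the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(
        f"[{i + 1}] {c}" for i, c in enumerate(retrieve(question, chunks, k))
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of purchase for unused licences.",
    "Returns of hardware require the original packaging.",
    "Support is available on weekdays from 9:00 to 17:00.",
]

prompt = build_prompt("When are refunds issued?", CHUNKS, k=1)
print(prompt)  # this string is what you would send to the model
```

Everything the model sees about your data is whatever `retrieve` returns, so the quality of the answer is bounded by the quality of that function.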

The reason RAG exists: training a model on your private data is expensive and slow, and the result is out of date the moment your data changes. RAG sidesteps that by treating the data as runtime context. The trade-off is that retrieval quality becomes the bottleneck: a model given bad context produces confidently wrong answers.

Worked example of the classic failure: a company builds a support bot over its help docs, the demo is impressive, and then in production it confidently cites the wrong refund policy. The instinct is to blame the model and try a bigger one. The answers stay wrong, because the problem is upstream: the docs were chunked mid-sentence, the embeddings cannot tell “refund” from “return”, and there is no reranker to push the right passage to the top. Swap in better chunking, hybrid search, and a reranker and the same model suddenly looks smart. The lesson generalises: when RAG is wrong, suspect retrieval before generation almost every time.
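One of the upstream fixes named above, not chunking mid-sentence, can be sketched as follows. This is an assumption-laden illustration: the function name `chunk_by_sentence`, the `max_chars` budget, and the regex-based sentence splitter are all hypothetical choices, and a real pipeline would likely use a proper sentence tokenizer and respect headings and lists as well.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A chunk boundary only ever falls between sentences. A single sentence
    longer than max_chars is kept intact as its own oversized chunk rather
    than being cut in the middle.
    """
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical help-doc text.
text = (
    "Refunds are issued within 14 days. "
    "Returns need the original packaging. "
    "Contact support for exceptions."
)
for chunk in chunk_by_sentence(text, max_chars=80):
    print(repr(chunk))
```

With a fixed-width splitter the refund sentence could be cut in half and embedded as two meaningless fragments; here every chunk is a run of complete sentences.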

The honest trade-off, and where RAG breaks down: it excels at lookup-shaped questions answerable from a few passages, and degrades when the answer requires synthesising across many documents or reconciling contradictions in the corpus. At that point you want a smaller curated corpus, an agentic pipeline that reasons in steps, or fine-tuning, not a bigger retriever. The unsexy truth is that roughly 80% of the work is making retrieval good (chunking, embeddings, reranking, hybrid search) and 20% is the model. Wrapping your data sources as MCP servers makes the retrieval layer portable, but it does not make it good. Vendors that pitch RAG as a one-click feature are pitching the easy part.

FAQs

When a RAG system gives wrong answers, the cause is almost always retrieval, not generation: bad chunking, weak embeddings, no reranking, or a corpus the model cannot disambiguate. Swap the model and the answers stay wrong; fix retrieval and the same model suddenly looks smart.
On vector versus hybrid search: hybrid (vector + keyword + reranker) beats vector-only in almost every production benchmark we have run. Pure vector search misses exact matches, acronyms, and rare terms. The extra plumbing is worth it.
RAG is the wrong fit when the answer requires synthesising across many documents, when the corpus has heavy contradictions, or when the user query is more conceptual than lookup-shaped. At that point you want fine-tuning, an agentic pipeline, or a smaller, better-curated corpus.
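As a rough sketch of the hybrid retrieval recommended above, the snippet below merges a vector ranking and a keyword ranking with reciprocal rank fusion (RRF). RRF is one common way to combine rankings and is our choice for illustration; the text does not say which fusion method the benchmarks used. The document IDs and both input rankings are hypothetical, and the reranker stage is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. k = 60 is a conventional default that damps
    the influence of any single list's top position.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the query "refund policy".
vector_hits = ["returns-policy", "refund-policy", "shipping"]   # semantic neighbours
keyword_hits = ["refund-policy", "billing-faq", "returns-policy"]  # exact-term matches

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

In this toy case the vector ranking alone puts the returns document first; fusing it with the keyword ranking moves `refund-policy` to the top, and the fused list is what a reranker would then reorder.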