RAG
Retrieval-Augmented Generation
Inject relevant context into an LLM prompt at runtime, retrieved from your own data, so the model answers from your knowledge instead of its training data.
RAG is the architecture that lets an LLM answer questions about data it was not trained on. The mechanism is straightforward: take the user’s question, retrieve the most relevant chunks from your corpus (via vector search, keyword search, or a hybrid), stuff them into the prompt, and let the model answer.
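As a rough illustration of that retrieve-then-prompt loop, here is a minimal sketch in plain Python. It is not a production design: the word-count cosine similarity stands in for a real embedding model, and the names `retrieve`, `build_prompt`, and `CHUNKS`, along with the sample corpus, are hypothetical. The final call to an LLM is left out because it depends on the provider you use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    # Toy retriever: rank chunks by word overlap with the question.
    # A real system would likely use embeddings, keyword search, or both.
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    # Put the retrieved chunks into the prompt ahead of the question.
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Support is available by email from 9am to 5pm on weekdays.",
]

question = "How many days do refunds take?"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the model
```

The model never sees the whole corpus, only the chunks the retriever selected, so everything downstream depends on that selection being right.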
The reason RAG exists: training a model on your private data is expensive and slow, and the result is out of date the moment your data changes. RAG sidesteps that by treating the data as runtime context. The trade-off is that retrieval quality becomes the bottleneck: a model given bad context produces confidently wrong answers.
The unsexy truth about RAG: roughly 80% of the work is making retrieval good (chunking strategy, choice of embedding model, reranking, hybrid search) and 20% is the model itself. Vendors that pitch RAG as a one-click feature are pitching the easy part.
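To make one piece of that retrieval work concrete, the sketch below shows one common way to do hybrid search: merging a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). This is an illustrative choice, not the only way to combine retrievers. The document IDs and the two input rankings are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of document IDs, best first.
    # A document earns 1 / (k + rank) from every ranking it appears in.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two separate retrievers for the same query.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever stays in the list with a lower score. RRF uses only rank positions, so the two retrievers' raw scores do not need to be on the same scale.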