Context Window
The maximum amount of text an LLM can consider at once, measured in tokens, and the reason you cannot just paste your entire knowledge base into every prompt.
The context window is the model’s working memory for a single request, measured in tokens (a token is roughly three-quarters of an English word). Everything has to fit inside it: your system prompt, the conversation history, any documents you paste in, and the answer the model generates. Exceed the window and the overflow is rejected or truncated before the model ever sees it, so the model literally cannot use it.
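As a rough illustration of that budget check, here is a minimal sketch that estimates token counts from the three-quarters-of-a-word rule of thumb above and tests whether a request fits. The function names, the reserved answer size, and the window size are assumptions made for this example; a real tokenizer will give different counts, so treat the numbers as estimates only.

```python
import math

# Rule of thumb from the text: one token is roughly 0.75 words.
# This is an approximation; actual tokenizers vary by model and language.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a text from its whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_window(parts: list[str], reserved_for_answer: int, window: int) -> bool:
    """Return True if all input parts plus the space reserved for the
    generated answer fit inside the context window (all sizes in tokens)."""
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_for_answer <= window


if __name__ == "__main__":
    system_prompt = "You are a helpful support assistant."
    history = "User asked about refunds. Assistant explained the policy."
    document = "word " * 6000  # stand-in for a pasted document
    # Hypothetical sizes: an 8,000-token window with 1,000 tokens kept for the answer.
    print(fits_in_window([system_prompt, history, document], 1000, 8000))
```

Note that the answer is budgeted up front: the generated output shares the window with the input, so the check reserves room for it rather than counting only what you send.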
“Just put everything in the prompt” fails for three reasons even when the window is large. First, cost: most providers charge per token, so stuffing a huge document into every call multiplies the bill. Second, latency: more tokens mean a slower response. Third, and least obvious, quality: models attend less reliably to information buried in the middle of a very long context, so more is not always better. A focused prompt often beats a bloated one.
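The cost argument is simple arithmetic, sketched below. The price, document size, and call volume are made-up figures chosen for illustration, not any provider's actual rates; substitute your own numbers.

```python
def input_cost(tokens_per_call: int, calls: int, price_per_million: float) -> float:
    """Total input cost when the same number of tokens is sent on every call.

    price_per_million is the price per one million input tokens.
    """
    return tokens_per_call * calls / 1_000_000 * price_per_million


if __name__ == "__main__":
    PRICE = 3.0    # hypothetical price per million input tokens
    CALLS = 1_000  # hypothetical number of requests per day

    stuffed = input_cost(50_000, CALLS, PRICE)  # whole document in every prompt
    focused = input_cost(2_000, CALLS, PRICE)   # only the relevant excerpts

    print(f"stuffed prompt: {stuffed:.2f} per day")
    print(f"focused prompt: {focused:.2f} per day")
```

Under these assumed figures the stuffed prompt costs 25 times as much as the focused one, because the cost scales linearly with the tokens sent on each call.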
This is exactly why RAG (retrieval-augmented generation) exists. Instead of dumping your whole corpus into the window, you retrieve only the handful of relevant chunks for each question and send just those. You get the benefit of a large knowledge base without paying to process all of it on every request. The context window is the budget; retrieval is how you spend it wisely.
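The idea of spending a budget on retrieved chunks can be sketched as follows. This is a toy under stated assumptions: relevance is scored by plain word overlap as a stand-in for the embedding similarity a production RAG system would typically use, token counts come from the rough three-quarters-of-a-word estimate, and the names rough_tokens, overlap_score, and select_chunks are invented for this example.

```python
import math


def rough_tokens(text: str) -> int:
    """Approximate token count: about 0.75 words per token (an estimate only)."""
    return math.ceil(len(text.split()) / 0.75)


def overlap_score(question: str, chunk: str) -> int:
    """Toy relevance score: number of distinct question words found in the chunk.
    A real system would more likely compare embedding vectors."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def select_chunks(question: str, chunks: list[str], token_budget: int) -> list[str]:
    """Pick the most relevant chunks, best first, without exceeding the budget."""
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        if overlap_score(question, chunk) == 0:
            break  # the remaining chunks share no words with the question
        cost = rough_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


if __name__ == "__main__":
    corpus = [
        "refunds are processed within five days",
        "our office is in Berlin",
        "refunds require a receipt",
    ]
    for chunk in select_chunks("how are refunds processed", corpus, token_budget=20):
        print(chunk)
```

The greedy loop skips a chunk that does not fit and keeps trying smaller ones, which is one reasonable design choice among several; stopping at the first chunk that does not fit is simpler and preserves strict relevance order.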
The practical takeaway: treat the context window as a scarce resource with a price tag, not free space. We design around that budget deliberately in our Artificial Intelligence work.