
Context Window

The maximum amount of text an LLM can consider at once, measured in tokens, and the reason you cannot just paste your entire knowledge base into every prompt.

Last reviewed: 2026-06-02 by Kevin Riedl

The context window is the LLM’s working memory for a single request, measured in tokens (a token is roughly three-quarters of a word). Everything has to fit inside it: your system prompt, the conversation history, any documents you paste in, and the answer the model generates. Exceed the window and the model literally cannot see the overflow.
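To make the "everything has to fit" rule concrete, here is a minimal sketch of a budget check. It uses the three-quarters-of-a-word rule of thumb from above rather than a real tokenizer, so the counts are estimates only; the function names and the window size in the usage lines are illustrative assumptions, not any provider's API.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate only: one token is about three-quarters of a word,
    so tokens are about words * 4 / 3, rounded up. Real tokenizers differ."""
    words = len(text.split())
    return (words * 4 + 2) // 3  # integer ceiling of words * 4 / 3


def fits_in_window(parts: list[str], window_tokens: int, reserved_for_answer: int) -> bool:
    """Check whether all prompt parts plus room for the answer fit in the window."""
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_for_answer <= window_tokens


# Hypothetical request: system prompt, history, and one pasted document.
parts = [
    "You are a helpful support assistant.",
    "User asked about refunds earlier.",
    "Refunds are issued within 14 days of purchase.",
]
print(fits_in_window(parts, window_tokens=8000, reserved_for_answer=1000))
```

The point of `reserved_for_answer` is that the generated answer counts against the same window, so a prompt that exactly fills the window leaves no room to reply.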

“Just put everything in the prompt” fails for three reasons even when the window is large. First, cost: most providers charge per token, so stuffing a huge document into every call multiplies the bill. Second, latency: more tokens mean a slower response. Third, and least obvious, quality: models attend less reliably to information buried in the middle of a very long context, so more is not always better. A focused prompt often beats a bloated one.
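The cost argument is simple arithmetic, sketched below. The price and traffic figures are made-up assumptions chosen for round numbers, not any provider's actual rates, and the sketch counts input tokens only.

```python
def monthly_cost(tokens_per_call: int, calls_per_month: int, price_per_million_tokens: float) -> float:
    """Input-token cost per month under simple per-token pricing."""
    total_tokens = tokens_per_call * calls_per_month
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical numbers: 10,000 calls a month at 3.00 per million input tokens.
stuffed = monthly_cost(tokens_per_call=50_000, calls_per_month=10_000, price_per_million_tokens=3.0)
focused = monthly_cost(tokens_per_call=2_000, calls_per_month=10_000, price_per_million_tokens=3.0)

print(f"stuffed prompt: {stuffed:.2f} per month")  # 1500.00
print(f"focused prompt: {focused:.2f} per month")  # 60.00
```

Under these assumed figures the stuffed prompt costs 25 times as much as the focused one for the same number of questions.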

This is exactly why RAG exists. Instead of dumping your whole corpus into the window, you retrieve only the handful of relevant chunks for each question and send just those. You get the benefit of a large knowledge base without paying to process all of it on every request. The context window is the budget; retrieval and good prompt engineering are how you spend it wisely.
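The retrieval step can be sketched in a few lines. This toy version scores chunks by keyword overlap with the question, which is a deliberately simple stand-in for the embedding-based search that production RAG systems typically use; the function names and sample chunks are hypothetical.

```python
import re


def words_in(text: str) -> set[str]:
    """Lowercase word set, used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    question_words = words_in(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & words_in(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
top = retrieve("How many days do refunds take?", chunks, k=1)
print(top)  # ['Refunds are issued within 14 days of purchase.']
```

Only the chunks returned by `retrieve` go into the prompt; the rest of the corpus never touches the context window for that request.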

A worked example of the “lost in the middle” effect, which often surprises teams: a company pastes a 40-page policy document into the prompt and asks a question whose answer sits on page 20. The model, with the whole document technically inside its window, still gets it wrong, because attention degrades for material buried in the middle of a long context. The same model, handed only the two relevant paragraphs that retrieval pulled out, answers correctly. A bigger window did not fix the problem; better-targeted context did. This is the counter-intuitive part founders miss when a new model ships with a headline-grabbing window size: more capacity is not more reliability.

The practical takeaway: treat the context window as a scarce resource with a price tag, not free space. Bigger windows lower the pressure but do not remove it, and cost and latency still scale with what you put in. We design around that budget deliberately under Artificial Intelligence.
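Treating the window as a budget can be as simple as the greedy sketch below: take chunks in relevance order and add each one only if it still fits. The token estimate reuses the three-quarters-of-a-word rule of thumb, and the names and budget values are illustrative assumptions.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: tokens are about words * 4 / 3, rounded up."""
    words = len(text.split())
    return (words * 4 + 2) // 3


def pack_chunks(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep chunks, best first, while staying within the token budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; a later, smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c", "d e f g h i", "j k l"]  # estimated at 4, 8, and 4 tokens
print(pack_chunks(ranked, budget_tokens=8))  # ['a b c', 'j k l']
```

Greedy packing is not optimal in general, but it keeps the highest-ranked material in the prompt and makes the budget an explicit parameter rather than an afterthought.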

What is a context window?

The maximum amount of text an LLM can process in one request, measured in tokens. The system prompt, conversation history, pasted documents, and the generated answer all have to fit inside it.

Why not just put everything in the prompt?

Cost, latency, and quality. More tokens cost more and respond slower, and models attend less reliably to information buried in a very long context. A focused prompt usually beats a bloated one.

How does the context window relate to RAG?

RAG exists to manage it. Instead of loading your whole corpus into the window, you retrieve only the relevant chunks per question, getting the benefit of a large knowledge base without paying to process all of it every time.

FAQs