Context Window

The maximum amount of text an LLM can consider at once, measured in tokens, and the reason you cannot just paste your entire knowledge base into every prompt.

Last reviewed: 2026-06-02 by Kevin Riedl

The context window is the model's working memory for a single request, measured in tokens (a token is roughly three-quarters of a word). Everything has to fit inside it: your system prompt, the conversation history, any documents you paste in, and the answer the model generates. Exceed the window and the model literally cannot see the overflow.
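As a rough illustration of that budget, the sketch below estimates whether a request fits. It relies only on the three-quarters-of-a-word rule of thumb above; real tokenizers count differently, and the function names and the window size in the usage lines are hypothetical, not any provider's API.

```python
import math

# Rule of thumb from the text: one token is roughly 0.75 words.
# Real tokenizers vary by model and language, so treat this as an estimate.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_window(parts: dict[str, str], max_output_tokens: int,
                   window_tokens: int) -> tuple[bool, int]:
    """Check whether all input parts plus the reserved answer fit the window.

    `parts` maps a label (system prompt, history, documents) to its text.
    The generated answer shares the same window, so its reserve is counted too.
    """
    used = sum(estimate_tokens(text) for text in parts.values())
    used += max_output_tokens
    return used <= window_tokens, used


if __name__ == "__main__":
    request = {
        "system": "You answer questions about our refund policy.",
        "history": "User asked about shipping times earlier today.",
        "documents": "Refunds are issued within 14 days of a valid return.",
    }
    # 8000 is a hypothetical window size chosen for illustration.
    ok, used = fits_in_window(request, max_output_tokens=500, window_tokens=8000)
    print(f"estimated tokens used: {used}, fits: {ok}")
```

Reserving space for the answer up front matters because the output counts against the same window as the input.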

“Just put everything in the prompt” fails for three reasons even when the window is large. First, cost: most providers charge per token, so stuffing a huge document into every call multiplies the bill. Second, latency: more tokens means a slower response. Third, and least obvious, quality, models attend less reliably to information buried in the middle of a very long context, so more is not always better. A focused prompt often beats a bloated one.

This is exactly why RAG exists. Instead of dumping your whole corpus into the window, you retrieve only the handful of relevant chunks for each question and send just those. You get the benefit of a large knowledge base without paying to process all of it on every request. The context window is the budget; retrieval is how you spend it wisely.
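A minimal sketch of spending that budget: rank chunks by relevance to the question, then pack the best ones until the token budget runs out. The scoring here is plain word overlap, a deliberately simple stand-in; production RAG systems typically rank by embedding similarity instead. The function names, sample chunks, and budget are all hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough rule of thumb; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def overlap_score(question: str, chunk: str) -> int:
    """Count distinct words shared by question and chunk (toy relevance score)."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def retrieve(question: str, chunks: list[str], token_budget: int) -> list[str]:
    """Greedily select the most relevant chunks that fit within the budget."""
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        if overlap_score(question, chunk) == 0:
            break  # everything after this point is irrelevant to the question
        cost = estimate_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


if __name__ == "__main__":
    corpus = [
        "refunds are issued within 14 days",
        "our office is in Vienna",
        "refunds require a receipt",
    ]
    for chunk in retrieve("how are refunds issued", corpus, token_budget=20):
        print(chunk)
```

Only the selected chunks go into the prompt, so the size of the corpus no longer determines the size of the request.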

The practical takeaway: treat the context window as a scarce resource with a price tag, not free space. We design around that budget deliberately in our Artificial Intelligence work.

// FAQ

FAQs

What is a context window?
The maximum amount of text an LLM can process in one request, measured in tokens. The system prompt, conversation history, pasted documents, and the generated answer all have to fit inside it.

Why not just put everything in the prompt?
Cost, latency, and quality. More tokens cost more and respond slower, and models attend less reliably to information buried in a very long context. A focused prompt usually beats a bloated one.

How does RAG relate to the context window?
RAG exists to manage it. Instead of loading your whole corpus into the window, you retrieve only the relevant chunks per question, getting the benefit of a large knowledge base without paying to process all of it every time.