Kevin Riedl

9 min read · 15 Jun 2026

How to Cut LLM Token Costs in 2026: Routing, Caching, Compression, and the Right Model

Token prices collapsed, yet plenty of teams are paying more for LLM usage than they did a year ago. The reason is simple. Per-token prices dropped, but agentic products now make dozens to hundreds of model calls per task, and most of those tokens are context the model never needed. Cheap tokens times high call volume is still a large bill. This is the playbook we apply to bring it down without dropping quality, in the order we apply it.

Engineering perspective, not a vendor pitch. The price and benchmark points below are directional, drawn from public 2026 pricing trends, not vendor-specific quotes. Reference points come from Wavect's AI product work.

Why is your LLM bill still high when token prices collapsed?

Three things hide inside a large bill, and none of them are the headline per-token price:

  • Call volume. An agent loop that makes 50 to 200 model calls per task turns a cheap per-token price into an expensive per-task price. The unit you pay for is the task, not the token.
  • Wasted context. A large share of input tokens on a typical call is context the model does not need for that step. Industry write-ups put the waste in the 40 to 60 percent range for unoptimised agentic workflows. You pay for every one of those tokens on every call.
  • The wrong model on the wrong task. Routing every request to a frontier model "to be safe" is the single most common way teams overpay. Most requests do not need your most expensive model.

Fix these in order. The cheapest wins come first, and they require no model retraining and no architecture rewrite.
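To make the call-volume and wasted-context points concrete, here is a minimal sketch of the per-task arithmetic. The call count, token counts, and prices are hypothetical placeholders, not measurements. Swap in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    """Average shape of one model call inside an agent loop (illustrative)."""
    input_tokens: int
    output_tokens: int


def cost_per_task(calls: int, profile: CallProfile,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one task that makes `calls` model calls."""
    per_call = (profile.input_tokens * input_price_per_m
                + profile.output_tokens * output_price_per_m) / 1_000_000
    return calls * per_call


# Hypothetical task: 100 calls, 20k input and 500 output tokens per call,
# priced at $0.60 / $3.00 per 1M tokens.
profile = CallProfile(input_tokens=20_000, output_tokens=500)
task_cost = cost_per_task(100, profile, 0.60, 3.00)

# Same task if half of the input tokens were unneeded context and got trimmed.
trimmed = CallProfile(input_tokens=10_000, output_tokens=500)
trimmed_cost = cost_per_task(100, trimmed, 0.60, 3.00)

print(f"per task: ${task_cost:.2f}, with trimmed context: ${trimmed_cost:.2f}")
```

With these placeholder inputs the task costs $1.35, and trimming half of the input context brings it to $0.75. The per-token price never changed.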

What is the fastest win? Prompt caching and batching.

Before you touch your architecture, take the two discounts the providers hand you for free.

  • Prompt caching. When consecutive calls share a stable prefix (system prompt, instructions, retrieved context), the provider can skip reprocessing it. The discount varies by provider and model: cached input is roughly 90 percent cheaper on Anthropic, around half price on OpenAI, and about 10 percent of the base rate on Google cache hits. The engineering move is ordering: put the stable content first and the volatile user input last, so the cache prefix stays intact across calls.
  • Batch processing. Every major provider offers a batch endpoint at roughly half the live rate in exchange for an asynchronous completion window, typically up to 24 hours. Anything that can wait for that window (evaluations, enrichment, classification, summarisation jobs) should run in batch.
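The ordering rule can be sketched without any provider SDK. The function names below are hypothetical, and character counts are only a rough proxy: providers match on token prefixes and may require a minimum prefix length before a cache applies.

```python
import os


def build_prompt(system: str, instructions: str, context: str, user_input: str) -> str:
    """Stable content first, volatile user input last."""
    return "\n\n".join([system, instructions, context, user_input])


def build_prompt_volatile_first(system: str, instructions: str,
                                context: str, user_input: str) -> str:
    """Anti-pattern: the changing part leads, so no prefix is shared."""
    return "\n\n".join([user_input, system, instructions, context])


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


SYSTEM = "You are a support assistant."
INSTRUCTIONS = "Answer in two sentences."
CONTEXT = "Policy: refunds are accepted within 30 days."
stable = "\n\n".join([SYSTEM, INSTRUCTIONS, CONTEXT])

q1, q2 = "Where is my order?", "Can I get a refund?"

good = shared_prefix_len(build_prompt(SYSTEM, INSTRUCTIONS, CONTEXT, q1),
                         build_prompt(SYSTEM, INSTRUCTIONS, CONTEXT, q2))
bad = shared_prefix_len(build_prompt_volatile_first(SYSTEM, INSTRUCTIONS, CONTEXT, q1),
                        build_prompt_volatile_first(SYSTEM, INSTRUCTIONS, CONTEXT, q2))

print(f"stable-first shares {good} chars, volatile-first shares {bad}")
```

A check like `shared_prefix_len` is cheap to run over a sample of logged prompts. If the shared prefix is short, something volatile (a timestamp, a request ID, the user message) is sitting too early in the prompt.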

These discounts stack. Cache hit plus batch on the same workload can land cached input around 95 percent below the standard rate. A team processing hundreds of thousands of documents a month can cut a four-figure monthly bill to a fraction by changing nothing but the endpoint and the prompt order.
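The stacking arithmetic, as a small sketch. It assumes the provider applies both multipliers to the same tokens, which you should confirm for your provider and model. The base rate and the cache hit fraction are hypothetical.

```python
def effective_input_rate(base_rate: float, cache_multiplier: float,
                         batch_multiplier: float) -> float:
    """Rate paid for a cached input token sent through the batch endpoint."""
    return base_rate * cache_multiplier * batch_multiplier


def blended_input_rate(base_rate: float, cache_hit_fraction: float,
                       cache_multiplier: float, batch_multiplier: float) -> float:
    """Average batch input rate when only part of the input hits the cache."""
    cached = base_rate * cache_multiplier * batch_multiplier
    uncached = base_rate * batch_multiplier
    return cache_hit_fraction * cached + (1 - cache_hit_fraction) * uncached


base = 3.00  # USD per 1M input tokens, illustrative

# Cache hit billed at 10 percent of base, batch at 50 percent of live rate.
cached_rate = effective_input_rate(base, cache_multiplier=0.10, batch_multiplier=0.50)
cached_saving = 1 - cached_rate / base

# Hypothetical workload where 80 percent of input tokens hit the cache.
blended_rate = blended_input_rate(base, 0.80, 0.10, 0.50)
blended_saving = 1 - blended_rate / base

print(f"cached tokens: {cached_saving:.0%} saving, blended: {blended_saving:.0%} saving")
```

The 95 percent figure applies only to the tokens that actually hit the cache. The blended saving depends on your hit fraction, which is why prompt ordering matters as much as the endpoint.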

"Most teams reach for a model swap when the cheapest win is reordering their prompt so the cache actually hits." (Kevin Riedl)

How does model routing cut cost without hurting quality?

Routing means a cheap model handles the easy majority and an expensive model handles the hard minority. Done blindly it degrades quality. Done with a confidence check it does not.

  • Cheap default plus escalation. Run a mid-tier or small model first. If a structured confidence check fails (the answer is low-confidence, schema-invalid, or flagged by a verifier), escalate to a frontier model. Track the escalation rate as a product KPI. A rising rate tells you the cheap model is being asked to do too much.
  • Routers and gateways. Open frameworks like RouteLLM publish benchmark numbers: roughly 95 percent of frontier-model quality while sending only 14 to 26 percent of calls to the strong model, which lands as a 75 to 85 percent cost reduction on the routed traffic. Those are benchmark results, so confirm them on your own traffic. An LLM gateway in front of multiple providers also gives you one place to set caching, fallback, and spend limits.
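A minimal sketch of the cheap-default-plus-escalation pattern. Everything here is an assumption for illustration: the two model callables are stand-ins for real API clients, and the gate expects a JSON object with `answer` and `confidence` fields. In practice the confidence signal might come from a verifier or a schema validator rather than a self-reported number.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Router:
    cheap_model: Callable[[str], str]      # hypothetical: prompt -> raw JSON text
    frontier_model: Callable[[str], str]   # hypothetical: prompt -> raw JSON text
    min_confidence: float = 0.7
    total: int = 0
    escalated: int = 0

    def _check(self, raw: str) -> Optional[dict]:
        """Return the parsed answer if it passes the gate, else None."""
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return None  # schema-invalid
        if not isinstance(data, dict) or "answer" not in data:
            return None  # schema-invalid
        conf = data.get("confidence")
        if not isinstance(conf, (int, float)) or conf < self.min_confidence:
            return None  # low-confidence
        return data

    def run(self, prompt: str) -> dict:
        """Try the cheap model first and escalate only when the gate fails."""
        self.total += 1
        result = self._check(self.cheap_model(prompt))
        if result is not None:
            return result
        self.escalated += 1
        return json.loads(self.frontier_model(prompt))

    @property
    def escalation_rate(self) -> float:
        return self.escalated / self.total if self.total else 0.0


# Stub models so the sketch runs without network access.
def cheap(prompt: str) -> str:
    if "2+2" in prompt:
        return '{"answer": "4", "confidence": 0.95}'
    return "sorry, not sure"  # not valid JSON, so the gate rejects it


def frontier(prompt: str) -> str:
    return '{"answer": "frontier answer", "confidence": 0.99}'


router = Router(cheap, frontier)
easy = router.run("What is 2+2?")
hard = router.run("Prove this lemma.")
print(easy["answer"], hard["answer"], router.escalation_rate)
```

The `escalation_rate` property is the KPI from the first bullet. Export it to your metrics system and alert when it drifts upward.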

We use the escalation pattern in production AI work, including engagements like Twinsoft AI. The discipline that makes it safe is the same one that makes everything else here safe: an eval harness that tells you whether the cheap path actually held quality.

Which frontier models should you actually use in 2026?

There is no single best model. There is a best model per task, and the price-to-performance spread is now wide enough that model choice is one of your largest cost levers. The 2026 landscape splits into two camps.

  • Western frontier. Claude, GPT, and Gemini still lead on the hardest reasoning and coding work and on the deepest agent loops. When a wrong answer is expensive, the frontier model usually wins on total cost once you count the developer time spent fixing bad output.
  • Chinese open-weight frontier. DeepSeek, Qwen, Kimi, and GLM have closed most of the quality gap on real-world coding and reasoning, at prices that are commonly 15 to 30 times lower per token than the Western frontier. For high-volume, cost-sensitive workloads, they change the math.

Directional pricing by class, normalised per 1M tokens. Treat as a snapshot of public trends, not a quote, and re-check before you commit.

| Class | Example tier | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Best for |
| --- | --- | --- | --- | --- |
| Western frontier reasoning | Top Claude / GPT / Gemini tier | ~$2 to $3 | ~$10 to $15 | Hardest reasoning, deep agents |
| Western frontier general | Mid Claude / GPT / Gemini tier | ~$0.60 | ~$3 | Quality-sensitive default |
| Chinese open-weight frontier | Kimi / Qwen Max class | ~$0.95 to $1.25 | ~$2 to $5 | Strong coding at lower cost |
| Chinese budget / flash | DeepSeek flash class | ~$0.14 | ~$0.28 | High-volume, cost-sensitive |

The catch for an EU team is not quality but governance. Where the inference runs and where the data lands both matter for data residency and compliance. Use a Chinese open-weight model self-hosted on EU infrastructure and you keep the price advantage without sending data abroad. Use it through a non-EU API and you have a compliance question to answer first. Either way, run your own eval before you swap. A cheaper model that fails 1 in 10 tasks is not cheaper.
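The "fails 1 in 10" point is easy to quantify once you put a price on a failure. A rough sketch follows. Every number in it is hypothetical, and the cost of a failure (developer time, retries, a support ticket) is something only you can estimate.

```python
def expected_cost_per_task(api_cost: float, failure_rate: float,
                           failure_cost: float) -> float:
    """API spend plus the expected cost of cleaning up failed tasks."""
    return api_cost + failure_rate * failure_cost


def break_even_failure_cost(api_cost_a: float, fail_rate_a: float,
                            api_cost_b: float, fail_rate_b: float) -> float:
    """Failure cost at which models A and B cost the same per task."""
    return (api_cost_a - api_cost_b) / (fail_rate_b - fail_rate_a)


FAILURE_COST = 10.00  # USD per failed task, hypothetical

# Hypothetical frontier model: $0.50 per task, fails 2 percent of tasks.
frontier_cost = expected_cost_per_task(0.50, 0.02, FAILURE_COST)
# Hypothetical budget model: $0.03 per task, fails 10 percent of tasks.
budget_cost = expected_cost_per_task(0.03, 0.10, FAILURE_COST)

threshold = break_even_failure_cost(0.50, 0.02, 0.03, 0.10)

print(f"frontier ${frontier_cost:.2f}, budget ${budget_cost:.2f}, "
      f"break-even failure cost ${threshold:.2f}")
```

With these inputs the budget model is the more expensive choice per task. Below the break-even failure cost the budget model wins, which is why the failure rate has to come from your own eval rather than a public benchmark.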

Hybrid local plus frontier: when does self-hosting open weights pay off?

The hybrid pattern is a small or open-weight model for the bulk of the volume, a frontier API for the hard tail. The question is when to bring the bulk in-house. The honest answer in 2026: later than most teams think.

  • The break-even is governed by engineer time, not GPU rack rate. The model is cheap to run. The ops, the eval discipline, and the upgrade cadence are not.
  • For most products, hosted APIs stay cheaper until you are sustaining serious volume, often quoted around 50 million tokens per day or more, or until a data-residency requirement forces local hosting regardless of cost.
  • When you do self-host, an inference engine like vLLM plus quantised open weights (Llama, Qwen, DeepSeek, Mistral class) is the standard production stack.
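A back-of-the-envelope break-even, to show why engineer time dominates. The GPU cost, hours, hourly rate, and blended API price are all hypothetical inputs, and the model ignores utilisation, redundancy, and upgrade work.

```python
def monthly_api_cost(tokens_per_day: float, blended_price_per_m: float,
                     days: int = 30) -> float:
    """Hosted API spend per month at a blended USD price per 1M tokens."""
    return tokens_per_day * days * blended_price_per_m / 1_000_000


def monthly_self_host_cost(gpu_cost: float, engineer_hours: float,
                           hourly_rate: float) -> float:
    """GPU rental plus the engineer time spent on ops, evals, and upgrades."""
    return gpu_cost + engineer_hours * hourly_rate


def break_even_tokens_per_day(self_host_monthly: float, blended_price_per_m: float,
                              days: int = 30) -> float:
    """Daily volume at which hosted API spend equals the self-hosting cost."""
    return self_host_monthly / (blended_price_per_m * days) * 1_000_000


# Hypothetical inputs: $2,500 per month of GPU, 35 engineer hours at $100 per hour,
# and a blended hosted price of $2.00 per 1M tokens.
self_host = monthly_self_host_cost(gpu_cost=2_500, engineer_hours=35, hourly_rate=100)
api_at_50m = monthly_api_cost(tokens_per_day=50_000_000, blended_price_per_m=2.00)
break_even = break_even_tokens_per_day(self_host, blended_price_per_m=2.00)

print(f"self-host ${self_host:,.0f}/mo, API at 50M tokens/day ${api_at_50m:,.0f}/mo, "
      f"break-even {break_even / 1e6:,.0f}M tokens/day")
```

In this example the engineer time is the larger share of the self-hosting cost, and the break-even volume moves directly with the hours you assume. A compliance requirement overrides the calculation entirely.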

Default to hosted APIs for early products. Revisit self-hosting once your volume or your compliance posture forces the question. We go deeper on the architecture implications in "What cheap tokens change in your AI architecture".

How do you stop paying for tokens the model does not need?

This is where the wasted-context problem gets solved, and where the biggest structural savings live after caching.

  • Semantic caching. Store request and response pairs and return a cached answer for a semantically similar query. On a hit, you skip the model call entirely. Tools like GPTCache and Redis-backed caches report cost reductions around 70 percent on high-repetition workloads.
  • Context compression. Agentic and coding workflows re-send the same files, logs, and history on every call. A compression layer strips that down to what the step needs. Open tools in this space, for example lean-ctx and RTK (Rust Token Killer), sit between your agent and the model and cut input tokens before you pay for them. The principle matters more than the specific tool: send the model the smallest correct context, not your whole workspace.
  • Inference-layer KV-cache compression. If you self-host, KV-cache eviction and quantisation techniques cut the memory and compute cost of long contexts. This is a knob for self-hosting teams, not for API consumers.
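The semantic caching idea from the first bullet can be sketched in a few lines. This is not how GPTCache or any specific tool works internally. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the threshold is an arbitrary placeholder.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str) -> Optional[str]:
        """Return the stored response of the most similar query, if close enough."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def answer(self, query: str, model: Callable[[str], str]) -> str:
        """Serve from cache on a hit; otherwise call the model and store the result."""
        cached = self.lookup(query)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        response = model(query)
        self.entries.append((embed(query), response))
        return response


calls: List[str] = []


def model(query: str) -> str:
    """Stub model that records every paid call."""
    calls.append(query)
    return f"answer to: {query}"


cache = SemanticCache(threshold=0.8)
a1 = cache.answer("How do I reset my password?", model)
a2 = cache.answer("How can I reset my password?", model)  # similar wording
a3 = cache.answer("What is your refund policy?", model)   # unrelated

print(cache.hits, cache.misses, len(calls))
```

The threshold is a quality knob, not just a cost knob. Set it too low and the cache returns a confident answer to a different question, so tune it against the same eval set you use for routing.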

What order should you do this in?

The priority list we work through, cheapest and least risky first:

  1. Prompt caching. Reorder prompts stable-prefix-first. No quality risk, large saving.
  2. Batch the async work. Move anything latency-tolerant to the batch endpoint at half price.
  3. Routing with escalation. Cheap default, confidence-gated escalation to frontier. Track the escalation rate.
  4. Right-size the model. Evaluate open-weight and Chinese frontier models against your task. Swap on a proven eval, not on a benchmark headline.
  5. Compress context. Semantic cache the repeats, compress the per-call context.
  6. Self-host only at volume. Bring the bulk in-house when volume or compliance forces it, not before.
  7. Build the eval harness. None of the above is safe to ship without one, so build it alongside the earlier steps rather than after them. It is what tells you a cheaper path kept the quality bar. See SDLC.

Steps one through three usually deliver the majority of the saving in the first week, with no architecture change. Steps four through six are where you compound it.
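The eval harness can start very small. A minimal sketch of a swap gate follows, assuming exact-match grading and a hypothetical tolerance of two percentage points. Real harnesses usually need task-specific graders and far more cases.

```python
from typing import Callable, Sequence, Tuple

EvalCase = Tuple[str, str]  # (prompt, expected answer)


def pass_rate(model: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Share of cases where the model output exactly matches the expected answer."""
    if not cases:
        raise ValueError("eval set must not be empty")
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def safe_to_swap(baseline: Callable[[str], str], candidate: Callable[[str], str],
                 cases: Sequence[EvalCase], max_drop: float = 0.02) -> bool:
    """Accept the cheaper candidate only if it stays within `max_drop` of baseline."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - max_drop


# Tiny illustrative eval set and stub models.
CASES = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
ANSWERS = dict(CASES)


def baseline_model(prompt: str) -> str:
    return ANSWERS[prompt]


def weak_candidate(prompt: str) -> str:
    return "5" if prompt == "2+2" else ANSWERS[prompt]


def strong_candidate(prompt: str) -> str:
    return ANSWERS[prompt] + "\n"  # trailing whitespace is stripped by the grader


weak_rate = pass_rate(weak_candidate, CASES)
weak_ok = safe_to_swap(baseline_model, weak_candidate, CASES)
strong_ok = safe_to_swap(baseline_model, strong_candidate, CASES)

print(weak_rate, weak_ok, strong_ok)
```

Run the same gate before each change in steps three through six: a new router threshold, a new model, a new cache threshold, or a new compression layer.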

Final thoughts

Cutting LLM cost in 2026 is not about finding the one cheap model. It is a stack of compounding moves applied in the right order: cache what repeats, batch what can wait, route the easy majority to a cheap model, right-size the model per task including the open-weight and Chinese frontier options, compress the context you actually send, and self-host only when volume or compliance forces it.

The honest part: every one of these is only safe on top of an eval harness. Without evals you cannot tell whether the cheaper path held quality, and a cheaper path that quietly drops quality is the most expensive mistake of all. Start with caching and batching this week, prove the routing with an eval, and revisit the model mix every few months. The price curve has not stopped moving, and neither should your stack.
