If you architected an AI product in 2024, you spent half your engineering time hiding the price of tokens: aggressive retrieval, brittle summarisation, model routing for every call. In 2026, frontier-class model pricing per million tokens is roughly one fifth of what it was two years ago. That changes the math on almost every design decision we made. This post covers what we actually rewire in client architectures now, with a side-by-side cost table and a list of concrete moves.
This is written from Wavect's engagement history across AI product builds. Numbers in the table are illustrative, based on public pricing trends, and are not vendor-specific commitments.
For frontier-class models on the major providers, the per-token list price in 2026 is roughly 70 to 85% lower than the equivalent class in 2024, depending on tier. Mid-tier models dropped further. Cached input pricing dropped harder still. What did not drop: latency at high concurrency, egress, vector database hosting, and the human cost of building evals. So your bill went down, your architecture leverage went up, but your engineering judgement matters more, not less.
Rough illustrative numbers, normalised per 1M tokens, frontier and mid-tier classes. Treat as directional, not as a quote.
| Model class | 2024 input | 2026 input | 2024 output | 2026 output |
|---|---|---|---|---|
| Frontier reasoning | $15 | $3 | $75 | $15 |
| Frontier general | $3 | $0.60 | $15 | $3 |
| Mid-tier general | $0.50 | $0.10 | $1.50 | $0.30 |
| Small / fast | $0.15 | $0.03 | $0.60 | $0.10 |
| Cached input | n/a | $0.30 | n/a | n/a |
The interesting line is "frontier reasoning". A deep agent loop that cost $0.40 per task in 2024 costs closer to $0.08 today. That changes which products are viable.
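To make the per-task arithmetic concrete, here is a minimal sketch that prices one agent task from the frontier reasoning row of the table. The token counts (20,000 input, 1,300 output) are hypothetical, picked only so the result lands near the $0.40 and $0.08 figures above; real agent loops vary widely.

```python
# Illustrative list prices from the table above, in USD per 1M tokens.
FRONTIER_REASONING = {
    2024: {"input": 15.00, "output": 75.00},
    2026: {"input": 3.00, "output": 15.00},
}


def task_cost(year: int, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at that year's illustrative prices."""
    price = FRONTIER_REASONING[year]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical agent task: token totals summed across every step of the loop.
cost_2024 = task_cost(2024, input_tokens=20_000, output_tokens=1_300)
cost_2026 = task_cost(2026, input_tokens=20_000, output_tokens=1_300)
print(f"2024: ${cost_2024:.2f}  2026: ${cost_2026:.2f}")
# prints "2024: $0.40  2026: $0.08"
```

Because both input and output prices fall by the same factor in this row, the ratio holds for any token mix; only the absolute figure depends on the assumed counts.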
We stopped over-engineering retrieval for small corpora. We stopped routing every call through a "cheap default" when the quality gap mattered. We stopped writing custom summarisers to fit tiny context windows.
Eight concrete moves we apply in 2026 client work.

"Your AI architecture should track the price curve, not freeze at the day you started building."
RAG still has a place, but the threshold moved. RAG is still the right answer when the corpus is large (multi-million tokens), when freshness matters (knowledge that changes daily), when access control needs row-level enforcement, or when you need a clear citation trail. For everything else, long context is usually simpler. We rebuilt a knowledge product in 2026 by deleting most of the retrieval layer and moving to structured long-context prompts. The eval scores improved and the maintenance burden dropped. Engagements like PromptID and Quivr shaped how we draw that line.
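The four criteria above can be written down as a simple checklist. This is a sketch, not a rule engine we ship: the `Corpus` fields, the `choose_strategy` name, and the 2,000,000-token cutoff standing in for "multi-million" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    total_tokens: int
    changes_daily: bool        # freshness matters
    needs_row_level_acl: bool  # per-row access control
    needs_citations: bool      # clear citation trail required


def choose_strategy(corpus: Corpus, large_corpus_tokens: int = 2_000_000):
    """Return ("rag", reasons) if any criterion applies, else ("long_context", [])."""
    reasons = []
    if corpus.total_tokens >= large_corpus_tokens:  # hypothetical cutoff
        reasons.append("corpus too large for a single context window")
    if corpus.changes_daily:
        reasons.append("knowledge changes daily")
    if corpus.needs_row_level_acl:
        reasons.append("row-level access control")
    if corpus.needs_citations:
        reasons.append("citation trail required")
    return ("rag", reasons) if reasons else ("long_context", [])


handbook = Corpus(300_000, False, False, False)
ticket_archive = Corpus(40_000_000, True, True, False)
print(choose_strategy(handbook)[0])        # long_context
print(choose_strategy(ticket_archive)[0])  # rag
```

Returning the reasons alongside the decision keeps the choice reviewable when the corpus or the requirements change.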
In 2024 the bill was dominated by inference. In 2026 it splits more evenly across inference, hosted vector or search infrastructure, observability and eval runs, and a non-trivial line for human review on agent products. A typical mid-size AI product we work on has inference at 30 to 45% of total run cost, down from 70 to 80% two years ago. The implication: optimising inference further has diminishing returns. Optimise the eval loop and the tool surface instead.
Open-weight models closed a lot of the quality gap in 2026. For high-volume, latency-sensitive, or data-residency-sensitive workloads, self-hosted open weights are now genuinely competitive. The catch: you take on the ops burden, the eval burden, and the upgrade cadence. We default to hosted APIs for early products and revisit self-hosting once volume justifies it, usually north of 50 million tokens per day.
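A rough way to sanity-check a volume threshold like this is a break-even calculation. The sketch below assumes a flat monthly cost for self-hosting (hardware plus ops) and a blended hosted-API price per 1M tokens. Both input figures are placeholders picked so the result lands near the 50 million tokens per day mentioned above; they are not measured costs.

```python
def break_even_tokens_per_day(
    fixed_monthly_usd: float,
    api_usd_per_million: float,
    self_hosted_usd_per_million: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """Daily token volume at which self-hosting costs the same as the hosted API."""
    saving_per_million = api_usd_per_million - self_hosted_usd_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    monthly_tokens_in_millions = fixed_monthly_usd / saving_per_million
    return monthly_tokens_in_millions * 1_000_000 / days_per_month


# Placeholder inputs: $4,500 per month fixed cost, $3.00 blended API price per 1M tokens.
threshold = break_even_tokens_per_day(fixed_monthly_usd=4_500, api_usd_per_million=3.00)
print(f"{threshold:,.0f} tokens per day")  # 50,000,000 tokens per day
```

The calculation ignores the eval burden and upgrade cadence, which are real costs but harder to express per token, so treat the result as a lower bound on the volume you need.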
We still use agile fixed price for scoped deliverables. What changed is the run-cost forecast. We model expected token volume, cache hit ratio, escalation rate, and batch share. A modern AI feature for a mid-market client typically runs at 30 to 60% of the inference cost we would have quoted in 2024 for the same quality bar. The engineering effort moved from cost hiding to quality engineering.
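A run-cost forecast built on those four inputs can be sketched in a few lines. Everything here is an assumption for illustration: the `Tier` and `monthly_forecast` names, pairing the table's $0.30 cached-input price with the frontier general tier, no cache discount on the escalated tier, a 50% batch discount, and the traffic figures in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    input: float         # USD per 1M uncached input tokens
    cached_input: float  # USD per 1M cached input tokens
    output: float        # USD per 1M output tokens


def cost_per_request(tier: Tier, in_tokens: int, out_tokens: int,
                     cache_hit_ratio: float) -> float:
    """USD for one request, splitting input tokens into cached and fresh."""
    cached = in_tokens * cache_hit_ratio
    fresh = in_tokens - cached
    total = fresh * tier.input + cached * tier.cached_input + out_tokens * tier.output
    return total / 1_000_000


def monthly_forecast(requests: int, in_tokens: int, out_tokens: int,
                     cache_hit_ratio: float, escalation_rate: float,
                     batch_share: float, default: Tier, escalated: Tier,
                     batch_discount: float = 0.5) -> float:
    base = cost_per_request(default, in_tokens, out_tokens, cache_hit_ratio)
    retry = cost_per_request(escalated, in_tokens, out_tokens, cache_hit_ratio)
    # Assumption: an escalated request pays for the default attempt and the retry.
    blended = base + escalation_rate * retry
    # Assumption: the batch discount applies uniformly to the batched share.
    return requests * blended * (1 - batch_share * batch_discount)


frontier_general = Tier(input=0.60, cached_input=0.30, output=3.00)
frontier_reasoning = Tier(input=3.00, cached_input=3.00, output=15.00)  # no cache discount assumed

forecast = monthly_forecast(
    requests=1_000_000, in_tokens=4_000, out_tokens=500,
    cache_hit_ratio=0.5, escalation_rate=0.1, batch_share=0.2,
    default=frontier_general, escalated=frontier_reasoning,
)
print(f"${forecast:,.0f} per month")  # $4,725 per month
```

Keeping the four drivers as explicit parameters makes it easy to show a client how the forecast moves when, say, the cache hit ratio turns out lower than planned.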
Tokens got cheap. That is not a tactical change; it is a structural one. The teams that win in 2026 are the ones that stop optimising for the 2024 bill and start optimising for product depth: deeper agent loops, longer context, richer tool surfaces, and a serious eval discipline. The teams that lose are the ones that still treat the frontier model as a luxury good and route everything through a mid-tier model just to feel safe. If you built your AI architecture before mid-2025, it is worth a structural review. Most of the clever workarounds you wrote are now liabilities. The good news: cleaning them up usually shrinks the codebase, drops the bill, and lifts the eval scores at the same time. That is the rare three-way win in software, and it is on the table for the next 12 months while the rest of the market is still arguing about it.