Cheaper Per Token. More Expensive Per Answer.
Claude Sonnet 5 launched at a lower price per token than Opus 4.8. Then Artificial Analysis ran the full Intelligence Index benchmark suite, and Sonnet 5 finished the run at a higher total cost per task than Opus, roughly 2.29 US dollars against 1.99, before promotional pricing (1).
Read that again. The cheaper model produced the bigger invoice.
Almost nobody reads the per-million-token number and asks the question that actually decides the bill: how many tokens does this model burn to get to the right answer?
A model that reasons in circles is not cheap. It is cheap to start.
The teams optimising spend right now are watching total cost per completed task, not the sticker price per token. Everyone else is about to open a very confusing invoice.
The number nobody prices: tokens-to-answer
Per-million-token pricing is the sticker on the window. It tells you the rate. It tells you nothing about how far the model drives to reach the destination.
Two costs hide behind a single API call:
- The unit price. Dollars per million input and output tokens. This is what every pricing page advertises.
- The token count. How many tokens the model actually consumes to finish your task, including reasoning, retries, and tool calls you never see.
Your bill is the product of the two, not the first one alone. Artificial Analysis calls the honest number cost per task: the weighted-average cost to complete one benchmark task, which prices the tokens a model actually consumes rather than a standardised rate. As they put it, models that produce longer answers or more reasoning tokens have a higher cost per task even at identical per-token prices (2).
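Cut the unit price by 40 percent and let token consumption rise by more than about 67 percent, and you have made the model look cheaper while making it more expensive. The break-even is 67 percent rather than 40 because the two factors multiply: 0.6 times 1.67 is roughly 1.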
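For readers who want to check the break-even themselves, here is a minimal sketch of the arithmetic. It assumes the bill is simply unit price times tokens consumed and that nothing else changes; the function name is ours, for illustration only.

```python
def break_even_token_growth(price_cut: float) -> float:
    """Fractional rise in tokens per task that exactly cancels a fractional price cut.

    Assumes cost per task = unit price * tokens per task, with nothing else changing.
    """
    if not 0 <= price_cut < 1:
        raise ValueError("price_cut must be in [0, 1)")
    # (1 - price_cut) * (1 + growth) = 1  =>  growth = price_cut / (1 - price_cut)
    return price_cut / (1 - price_cut)


growth = break_even_token_growth(0.40)
print(f"A 40% price cut is cancelled by {growth:.0%} more tokens per task")
```

Under these assumptions, a 40 percent cut is wiped out once the model uses about two thirds more tokens per task. Past that point the cheaper rate produces the bigger bill.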
What actually happened with Sonnet 5
The Sonnet 5 launch is the clean case study, because Anthropic did cut the price and the model still cost more to run.
On paper, Sonnet 5 is the bargain. Standard rates are 3 US dollars per million input tokens and 15 per million output, with an introductory 2 and 10 running through 31 August 2026. Opus 4.8 sits at 5 and 25 (3). By the sticker, Sonnet is roughly 40 percent cheaper per token, and around 60 percent cheaper during the intro window.
Then you run it. Artificial Analysis found that at maximum reasoning effort, Sonnet 5 used about 40 percent more output tokens per Intelligence Index task than Sonnet 4.6, and roughly three times the agentic turns. On knowledge-work evaluations it burned around six times more turns at max effort than at low effort. The performance gains came through longer reasoning chains and more tool calls, not through efficiency (1).
Stack that on a second, quieter change: Sonnet 5 ships an updated tokenizer that maps the same text to roughly 1.0 to 1.35 times as many tokens as the previous generation (4). So the same prompt is counted as more tokens before the model has reasoned about anything.
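One rough way to read that range: if the same text becomes up to 1.35 times as many tokens, the rate per unit of text rises by the same factor. The sketch below applies the multiplier range to the Sonnet 5 input rates quoted above. The function name is illustrative, and the inflation on your own text may sit anywhere in the range.

```python
def effective_rate(rate_per_million: float, token_multiplier: float) -> float:
    """Rate re-expressed per million previous-generation tokens of the same text."""
    return rate_per_million * token_multiplier


# Input rates from the pricing above; 1.0 and 1.35 are the ends of the quoted range.
for label, rate in [("standard input", 3.00), ("introductory input", 2.00)]:
    low = effective_rate(rate, 1.0)
    high = effective_rate(rate, 1.35)
    print(f"{label}: ${low:.2f} to ${high:.2f} per million old-tokenizer tokens")
```

At the top of the range, the 3-dollar standard input rate behaves like roughly 4.05 dollars per million tokens of the old count, before any extra reasoning is added.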
Lower rate, more tokens per answer, more tokens per unit of text. The three combine into the result nobody put on a slide: on the full suite, Sonnet 5 came out more expensive per completed task than the model it was meant to undercut.

"A cheaper unit price on a model that reasons in circles is not a discount. It is a deferred invoice. The teams that win read the whole receipt, not the price on the shelf."
Figures here are a 2026 snapshot from public benchmarks and vendor pricing. Rates, tokenizers, and model behaviour move fast, and your workload is not the benchmark suite. Re-check the numbers and, more importantly, measure your own before you commit.
Why reasoning models break the sticker price
This is not a Sonnet problem. It is a reasoning-model problem, and it is structural.
Reasoning models earn their scores by thinking before they answer. That thinking is tokens: internal reasoning, self-verification, tool calls, and retries, most of which you pay for and never read. The token efficiency of a model, the number of tokens it needs to actually complete a task, is a more decisive cost factor than its headline price (5).
The gap between models can be enormous. In one public reasoning benchmark, a small reasoning model generated roughly ten times more completion tokens than a comparable non-reasoning model on the same problems (6). Same task, same answer expected, an order of magnitude more tokens spent getting there.
So a model can be:
- Cheaper per token and more expensive per task, because it thinks longer.
- More expensive per token and cheaper per task, because it reaches the answer in one pass instead of five.
The sticker price and the real cost are not just different numbers. They can point in opposite directions.
Cost per task, defined
If you take one metric from this article, take this one.
Cost per completed task is the total spend, across every token and every turn, to get one real task done to your quality bar. Not per token. Not per request. Per finished, acceptable answer.
It captures what the sticker price hides:
- Reasoning tokens. The thinking the model does before it answers.
- Output length. A verbose model bills more even at the same rate.
- Agentic turns. Every tool call and follow-up is another priced round trip.
- Retries. Wrong answers you have to run again are not free.
- Tokenizer drift. The same text can count as more tokens on a newer model.
A model that is cheap to start and expensive to finish fails this measure. That is the whole point of using it.
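As a minimal sketch of how those components add up for a single attempt, the snippet below prices every turn and sums them. The `Turn` fields are illustrative rather than any provider's schema, the token counts are made up, and it assumes reasoning tokens are billed at the output rate, which you should confirm for your provider. The rates are the Sonnet 5 standard rates quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One billed round trip. Field names are illustrative, not a provider schema."""
    input_tokens: int
    output_tokens: int  # assumed to include reasoning tokens billed as output


def turn_cost(turn: Turn, input_rate: float, output_rate: float) -> float:
    """Dollar cost of one turn, with rates given in dollars per million tokens."""
    return (turn.input_tokens * input_rate + turn.output_tokens * output_rate) / 1_000_000


def attempt_cost(turns: list[Turn], input_rate: float, output_rate: float) -> float:
    """Cost of one attempt at a task: every agentic turn is priced and summed."""
    return sum(turn_cost(t, input_rate, output_rate) for t in turns)


# Hypothetical attempt: three turns, with input growing as tool results are appended.
turns = [Turn(2_000, 1_500), Turn(4_000, 800), Turn(5_500, 1_200)]
print(f"${attempt_cost(turns, input_rate=3.0, output_rate=15.0):.4f}")
```

Retries are further attempts priced the same way, and tokenizer drift shows up in the token counts themselves, so both land in the total without special handling.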
How to measure cost per completed task
You do not need a research lab. You need your own tasks and a scale. Here is the process we run before we recommend a model to a client.
- Define the task and the quality bar. Not "summarise this," but "produce a summary that passes this rubric." A task is only complete when it meets the bar. If it does not, the retry belongs in the cost.
- Build a small eval set from real work. Twenty to fifty representative tasks from your actual product beat any public benchmark, because the benchmark is not your workload.
- Run each candidate model to completion. Same tasks, same settings you would ship. Let it reason, call tools, and retry the way it will in production.
- Count every token to done. Input, output, reasoning, and each agentic turn. Use the provider's token counting rather than an estimate, because tokenizers differ between models.
- Price the whole path, including failures. Multiply tokens by the real rate, then add the cost of retries on tasks the model got wrong the first time. That total, divided by tasks completed, is your cost per completed task.
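The final step can be sketched in a few lines. This is one possible shape under stated assumptions, not a prescribed implementation: the `Attempt` record and its fields are hypothetical, and each attempt's cost is assumed to be already priced from the provider's token counts.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One run of one eval task. Names are illustrative placeholders."""
    task_id: str
    cost_usd: float  # all tokens and turns for this attempt, already priced
    passed: bool     # did the output meet the quality bar?


def cost_per_completed_task(attempts: list[Attempt]) -> float:
    """Total spend, failed attempts included, divided by tasks that passed at least once."""
    total_spend = sum(a.cost_usd for a in attempts)
    completed = {a.task_id for a in attempts if a.passed}
    if not completed:
        raise ValueError("no task met the quality bar; cost per completed task is undefined")
    return total_spend / len(completed)


# Hypothetical log: task "b" fails once and passes on retry, task "c" never passes.
log = [
    Attempt("a", 0.05, True),
    Attempt("b", 0.07, False),
    Attempt("b", 0.08, True),
    Attempt("c", 0.10, False),
]
print(f"${cost_per_completed_task(log):.3f} per completed task")
```

Note that the spend on task "c" still counts even though it never completes. Dividing by completed tasks rather than attempted ones is what makes an unreliable model look as expensive as it really is.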
Do this once and the ranking often flips. The model with the scary per-token rate can be the cheapest to finish, and the cheap-looking model can be the one quietly running up the bill.
What this means for model choice
The lesson is not "always pick the expensive model." It is "stop choosing on the sticker."
A few rules we work by:
- Match the model to the task, not the price list. A capable model that answers in one pass can be cheaper per task than a weaker one that loops. Route simple, high-volume work to cheap models and hard, ambiguous work to strong ones. We wrote the full routing playbook in "How to cut LLM token costs in 2026."
- Tune the effort dial. On reasoning models, maximum effort is where cost per task explodes. Use high effort where correctness is worth it and lower effort for routine work, then measure the difference on your own eval.
- Watch the agentic turn count, not just the tokens. Every extra tool call and retry is another billed round trip. A model that finishes in three turns can beat one that finishes in ten even at a higher rate.
- Re-run the numbers when a model updates. A new version can change the tokenizer and the reasoning behaviour at once, as Sonnet 5 did. Last quarter's cost ranking is not this quarter's.
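One way to turn these rules into a repeatable decision is to pick, from your own eval results, the cheapest model and effort setting that still clears the quality bar. The sketch below assumes you already have a pass rate and a measured cost per completed task for each configuration. The model names, effort labels, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Summary of one (model, effort) configuration on your own eval set."""
    model: str
    effort: str
    pass_rate: float                # share of tasks that met the quality bar
    cost_per_completed_task: float  # dollars, measured on your eval


def cheapest_acceptable(results: list[EvalResult], min_pass_rate: float) -> EvalResult:
    """Pick the lowest cost per completed task among configurations that clear the bar."""
    acceptable = [r for r in results if r.pass_rate >= min_pass_rate]
    if not acceptable:
        raise ValueError("no configuration meets the required pass rate")
    return min(acceptable, key=lambda r: r.cost_per_completed_task)


# Hypothetical results for three configurations.
results = [
    EvalResult("model-a", "low", 0.78, 0.04),
    EvalResult("model-a", "max", 0.93, 0.21),
    EvalResult("model-b", "medium", 0.92, 0.12),
]
choice = cheapest_acceptable(results, min_pass_rate=0.90)
print(choice.model, choice.effort)
```

Re-running this selection whenever a model or tokenizer changes keeps the choice tied to measured cost rather than to the price list.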
Price per token is the marketing number. Cost per completed task is the number that lands on your invoice. Optimise the one you actually pay.
Final thoughts
Sonnet 5 launched cheaper and ran more expensive. That is not a fluke. It is what happens when a reasoning model thinks longer to score higher and you price it on the sticker. The fix is not a different model. It is a different number: total cost per completed task, measured on your own work, including reasoning, turns, and retries.
Read the whole receipt. The teams that do are already paying less for better answers. The teams that do not are about to get a very confusing invoice.
References
1. Artificial Analysis (2026) ‘Claude Sonnet 5: strong agentic performance at a higher cost per task.’ Cost per Intelligence Index task (~$2.29 vs ~$1.99 for Opus 4.8, ~$1.15 for Sonnet 4.6); ~40% more output tokens and ~3x agentic turns vs Sonnet 4.6 at max effort. Available at: https://artificialanalysis.ai/articles/claude-sonnet-5-agentic-cost (Accessed: 2 July 2026).
2. Artificial Analysis (2026) ‘Language Model Benchmarking Methodology.’ Definition of Cost per Task as the weighted-average cost to complete one Intelligence Index task; longer answers and more reasoning tokens raise cost per task at identical per-token prices. Available at: https://artificialanalysis.ai/methodology (Accessed: 2 July 2026).
3. Anthropic (2026) ‘Models overview and pricing.’ Claude Sonnet 5 at $3/$15 per million tokens ($2/$10 introductory through 31 August 2026); Claude Opus 4.8 at $5/$25. Available at: https://platform.claude.com/docs/en/about-claude/models/overview (Accessed: 2 July 2026).
4. Anthropic (2026) ‘Model migration guide.’ Claude Sonnet 5 uses an updated tokenizer that maps the same text to roughly 1.0 to 1.35 times as many tokens as the previous generation; re-baseline with token counting. Available at: https://platform.claude.com/docs/en/about-claude/models/migration-guide (Accessed: 2 July 2026).
5. CloudZero (2026) ‘LLM API pricing comparison.’ Token efficiency, the number of tokens a model needs to complete a task, is a more critical cost factor than headline per-token price. Available at: https://www.cloudzero.com/blog/llm-api-pricing-comparison/ (Accessed: 2 July 2026).
6. Wang, L. et al. (2025) ‘NPPC: an ever-scaling reasoning benchmark for LLMs.’ A small reasoning model generated roughly an order of magnitude more completion tokens than a comparable non-reasoning model on the same tasks. Available at: https://arxiv.org/pdf/2504.11239 (Accessed: 2 July 2026).