Rendering your prompt as an image to cut LLM costs 60%: genius, or just absurd?
A trick is going around: run Claude Fable 5 roughly 60% cheaper by taking the heavy parts of your request, the system prompt, tool docs, old history, pasted code, and turning them into an image before the request reaches the model.
The reasoning sounds absurd, which is exactly why it went viral. An image costs the model a fixed number of tokens based on its pixel size, not on how much text is crammed inside it. So you can pack a lot of characters into one dense picture and pay for the pixels, not the prose.
The tool making the rounds is pxpipe, an open-source (MIT) local proxy. It sits between your machine and the API, and its pipeline renders the bulky, mostly-static context into dense PNG "pages" before the request leaves your laptop. The repository's own demo shows one multi-step task costing $42.21 as plain text versus $6.06 with pxpipe. It claims a 59 to 70% lower end-to-end bill on Fable 5 at current list prices.
So is this genius or nonsense? The honest answer: the physics is real, the research behind it is serious, and the failure mode is bad enough that for most production work you should not turn it on by default. Here is the whole picture, with the caveats the viral version leaves out.
Want a straight answer on where your LLM spend is actually going?
Book Free ConsultationWhy this actually works: the pricing physics
Text and images are billed at the same per-token rate, but they are counted very differently.
Text is tokenized by content. More characters, more tokens, more cost. Anthropic prices an image instead by its pixel area, using a rough formula of (width x height) / 750 tokens, and it caps the count (images are resized so the long edge stays around 1568px, which lands near 1,600 tokens per image at the top end). The key point: that number does not care whether the image is a blank rectangle or a wall of text.
On real Claude Code traffic, pxpipe measures dense content like code and JSON packing around 3.1 characters per image-token, against roughly 1.9 characters per text-token. Once your text is denser than about 19 characters per token, imaging it starts to pay. So a block that would cost 25k text tokens can come back as roughly 2.7k image tokens. That is where the 60% headline comes from.
This is what the model actually receives instead of your text:

One nuance the viral version glosses over: because each image is capped near 1,600 tokens, you cannot pour an unlimited context onto a single giant canvas. The tool renders many pages, not one poster. The savings are real, but they come from tiling dense text across several capped images, not from magic.
This is not a hack. It is a research direction.
The counterintuitive part, that pictures of text can be cheaper than text, is not a proxy-tool gimmick. It is an active area of research.
In October 2025, DeepSeek published DeepSeek-OCR: Contexts Optical Compression, showing a vision model can decode text from a small set of visual tokens at roughly 10x compression while holding around 97% OCR precision when the compression ratio stays under 10x. Andrej Karpathy picked it up to argue that text tokens might be wasteful "historical baggage," and that feeding models images of text could turn out more efficient. Follow-up papers, such as Text or Pixels? It Takes Half, report similar token savings on visual text inputs.
So the idea is legitimate and the long-context economics are genuinely interesting. pxpipe is just an early, aggressive attempt to cash that in on today's commercial APIs, before the models are trained to do it well. And "before the models are trained to do it well" is where the trouble starts.
The catch that makes it absurd for most work
Rendering text as an image is lossy, and the loss is silent.
When the model misreads an imaged character, it does not throw an error or flag low confidence. It confidently makes something up. pxpipe's own README documents the failure honestly: on a needle-in-a-haystack test asking the model to recall exact 12-character hex strings buried in dense imaged content, Fable 5 scored 13 out of 15, and Opus 0 out of 15. The README describes one real case where the model recalled a person's name from imaged chat history and got it confidently wrong.
That is the whole risk in one sentence: anything you need back byte-exact must stay as text. IDs, hashes, secrets, exact numbers, precise names. pxpipe keeps recent turns and exact identifiers as text alongside the images for exactly this reason.
A few more things the headline skips:
- It is model-dependent. pxpipe defaults to Fable 5 and GPT-5.6, the models that read dense imaged text best. Opus 4.8 and GPT-5.5 are opt-in only, because they misread imaged context more. The trick that saves you 60% on one model can quietly corrupt context on another.
- It adds latency. Encoding large requests to PNG takes time before the request even leaves your machine.
- It interacts with prompt caching. Your biggest, most static context, the system prompt and tool docs, is also the ideal candidate for prompt caching, which already discounts repeated tokens heavily. On the GPT path pxpipe forgoes native cache markers. Imaging context and caching context both target the same tokens, so the real comparison is against a properly cached baseline, not a naive one.
When it is worth it, and when it will burn you
This is not a yes or no. It is a routing decision, the same discipline we apply to model selection. Match the technique to the payload.
| Good fit for imaging | Do not image this |
|---|---|
| Large, static system prompts and tool docs | Anything byte-exact: IDs, hashes, secrets, keys |
| Read-only reference context and long docs | Exact numbers you will compute or quote |
| Collapsed, older conversation history | Recent turns the model must reason over precisely |
| Fable 5 or other strong image readers | Opus-routed or vision-weaker workloads |
| Bulk context where gist is enough | Anything where a silent misread is unacceptable |
If your workload is a huge, stable instruction block feeding a Fable 5 agent that mostly needs the gist, imaging can be a real win. If it is a compliance workflow moving exact figures and identifiers, the same trick is a quiet liability.
Where this fits in a real cost stack
Imaging context is one lever, and not the first one we would pull. Before reaching for a lossy trick, the boring levers usually win, and they do not risk your data:
- Prompt caching for the static prefix, which is lossless and already large.
- Model routing: cheap models for mechanical work, strong models for judgment. See how we route work across Fable, Opus, Sonnet and Haiku.
- Measuring cost per completed task, not price per token, which is the number that lands on your invoice. See cheaper per token, more expensive per answer.
- A gateway to centralise fallback, caching and spend limits. See our LLM gateway comparison.
- Self-hosting or open weights when volume and data residency justify it, covered in the real cost of self-hosting LLMs in the EU.
Imaging context sits at the aggressive end of that list: high potential savings, real correctness risk, worth piloting on the right payload once the safer levers are in place.

"The pricing physics is real and the research is serious. But a 60% saving that occasionally invents a hash or a name is not a saving, it is deferred debugging. Image the bulk context that only needs the gist, keep every exact value as text, and never point it at a model that reads images poorly."
Frequently Asked Questions
Is rendering context as images safe for production?
Does imaging context break prompt caching?
Why does Opus do worse than Fable at reading imaged text?
Is this the same as DeepSeek-OCR?
How much can it actually save?
Final thoughts
So, genius or absurd? Both. The mechanism is real, image tokens are priced by pixels, and serious research points the same way. But bolting it onto models that were not trained for it trades money for silent errors, and silent errors are the most expensive kind.
Use it the way you would use any aggressive optimisation: deliberately, on the payload that fits, with the exact values kept as text and the safer levers, caching, routing, measurement, already in place. Do that and imaging bulk context is a sharp tool. Turn it on everywhere and it will eventually hand you a confidently wrong answer you never see coming.
Want help wiring caching, routing and cost measurement into your stack?
Book Free Consultation