Kevin Riedl

10 min read · 03 Jul 2026

Rendering your prompt as an image to cut LLM costs 60%: genius, or just absurd?

A trick is going around: run Claude Fable 5 roughly 60% cheaper by taking the heavy parts of your request (the system prompt, tool docs, old history, pasted code) and turning them into an image before the request reaches the model.

The reasoning sounds absurd, which is exactly why it went viral. An image costs the model a fixed number of tokens based on its pixel size, not on how much text is crammed inside it. So you can pack a lot of characters into one dense picture and pay for the pixels, not the prose.

The tool making the rounds is pxpipe, an open-source (MIT) local proxy. It sits between your machine and the API, and its pipeline renders the bulky, mostly-static context into dense PNG "pages" before the request leaves your laptop. The repository's own demo shows one multi-step task costing $42.21 as plain text versus $6.06 with pxpipe. The repository also claims a 59 to 70% lower end-to-end bill on Fable 5 at current list prices.

So is this genius or nonsense? The honest answer: the physics is real, the research behind it is serious, and the failure mode is bad enough that for most production work you should not turn it on by default. Here is the whole picture, with the caveats the viral version leaves out.

Why this actually works: the pricing physics

Text and images are billed at the same per-token rate, but they are counted very differently.

Text is tokenized by content. More characters, more tokens, more cost. Anthropic prices an image instead by its pixel area, using a rough formula of (width x height) / 750 tokens, and it caps the count (images are resized so the long edge stays around 1568px, which lands near 1,600 tokens per image at the top end). The key point: that number does not care whether the image is a blank rectangle or a wall of text.
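The counting rule above can be sketched in a few lines. This is a rough estimate under the assumptions stated in this article (the 750 divisor, the 1568px long edge, and a ceiling near 1,600 tokens), not the provider's exact resizing logic. The function name and constants are illustrative.

```python
import math

MAX_LONG_EDGE_PX = 1568   # resize threshold quoted in the text above
PIXELS_PER_TOKEN = 750    # divisor from the rough formula above
APPROX_TOKEN_CAP = 1600   # approximate per-image ceiling quoted above (assumption)


def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Rough token estimate for one image. Note it never looks at the content."""
    long_edge = max(width_px, height_px)
    if long_edge > MAX_LONG_EDGE_PX:
        # Scale both sides down so the long edge fits the threshold.
        scale = MAX_LONG_EDGE_PX / long_edge
        width_px = round(width_px * scale)
        height_px = round(height_px * scale)
    tokens = math.ceil(width_px * height_px / PIXELS_PER_TOKEN)
    return min(tokens, APPROX_TOKEN_CAP)


if __name__ == "__main__":
    print(estimate_image_tokens(1092, 1092))
    print(estimate_image_tokens(4000, 4000))
```

The function takes only dimensions, which is the whole point: a blank page and a page packed with text come out at the same estimate.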

On real Claude Code traffic, pxpipe measures dense content like code and JSON packing around 3.1 characters per image-token, against roughly 1.9 characters per text-token. Imaging starts to pay once a block packs more characters into each image-token than it gets out of each text-token. In the repository's example, a block that would cost 25k text tokens comes back as roughly 2.7k image tokens. That is where the 60% headline comes from.
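As a quick check on that arithmetic, here is a minimal sketch of the per-block reduction using the 25k and 2.7k figures quoted above. The helper names are illustrative. The per-block number comes out higher than the end-to-end headline, presumably because parts of every request stay as text and output tokens are not affected.

```python
def token_reduction(text_tokens: int, image_tokens: int) -> float:
    """Fraction of input tokens saved on one block by imaging it."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 1 - image_tokens / text_tokens


def imaging_pays(text_tokens: int, image_tokens: int) -> bool:
    """Break-even check: imaging only helps if it costs fewer tokens."""
    return image_tokens < text_tokens


if __name__ == "__main__":
    saving = token_reduction(text_tokens=25_000, image_tokens=2_700)
    print(f"{saving:.0%} fewer input tokens on this block")  # prints "89% fewer ..."
```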

This is what the model actually receives instead of your text:

[Image: a dense wall of whitespace-minified text rendered as a single page, with an OCR instruction banner across the top.]
Caption: around 48k characters of system prompt and tool docs, about 25k tokens as text, rendered as roughly 2.7k image tokens on one page. Source: the pxpipe repository (MIT), used for illustration.

One nuance the viral version glosses over: because each image is capped near 1,600 tokens, you cannot pour an unlimited context onto a single giant canvas. The tool renders many pages, not one poster. The savings are real, but they come from tiling dense text across several capped images, not from magic.
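The tiling idea can be sketched as plain pagination. This is not pxpipe's renderer: `chars_per_page` and `tokens_per_page` are hypothetical inputs you would have to measure for your own font size and page dimensions, and the sketch assumes every page is rendered at the same pixel size.

```python
import math


def paginate(text: str, chars_per_page: int) -> list[str]:
    """Split text into consecutive page-sized chunks, preserving order."""
    if chars_per_page <= 0:
        raise ValueError("chars_per_page must be positive")
    return [text[i:i + chars_per_page] for i in range(0, len(text), chars_per_page)]


def total_image_tokens(total_chars: int, chars_per_page: int, tokens_per_page: int) -> int:
    """Pages are billed by pixel size, so a half-empty last page costs a full page."""
    pages = math.ceil(total_chars / chars_per_page)
    return pages * tokens_per_page


if __name__ == "__main__":
    pages = paginate("x" * 12_000, chars_per_page=5_000)   # hypothetical page budget
    print(len(pages), total_image_tokens(12_000, 5_000, tokens_per_page=1_600))
```

One consequence worth noticing: because the last page is billed in full, short blocks that barely spill onto a new page are the worst case for this technique.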

This is not a hack. It is a research direction.

The counterintuitive part, that pictures of text can be cheaper than text, is not a proxy-tool gimmick. It is an active area of research.

In October 2025, DeepSeek published DeepSeek-OCR: Contexts Optical Compression, showing that a vision model can decode text from a small set of visual tokens while holding around 97% OCR precision, as long as the compression ratio stays under 10x. Andrej Karpathy picked it up to argue that text tokens might be wasteful "historical baggage," and that feeding models images of text could turn out more efficient. Follow-up papers, such as Text or Pixels? It Takes Half, report similar token savings on visual text inputs.

So the idea is legitimate and the long-context economics are genuinely interesting. pxpipe is just an early, aggressive attempt to cash that in on today's commercial APIs, before the models are trained to do it well. And "before the models are trained to do it well" is where the trouble starts.

The catch that makes it absurd for most work

Rendering text as an image is lossy, and the loss is silent.

When the model misreads an imaged character, it does not throw an error or flag low confidence. It confidently makes something up. pxpipe's own README documents the failure honestly: on a needle-in-a-haystack test asking the model to recall exact 12-character hex strings buried in dense imaged content, Fable 5 scored 13 out of 15, and Opus 0 out of 15. The README describes one real case where the model recalled a person's name from imaged chat history and got it confidently wrong.
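If you want to run a similar recall check on your own model and payload, the scoring side is simple. The sketch below is not pxpipe's test harness. It only generates 12-character hex needles and counts exact matches, and the step where you image the haystack and query the model is left out on purpose.

```python
import secrets


def make_needles(count: int, hex_chars: int = 12) -> list[str]:
    """Random lowercase hex strings; token_hex(n) returns 2 * n characters."""
    if hex_chars % 2:
        raise ValueError("hex_chars must be even")
    return [secrets.token_hex(hex_chars // 2) for _ in range(count)]


def score_exact_recall(expected: list[str], recalled: list[str]) -> int:
    """Count answers that match byte for byte after trimming and lowercasing."""
    return sum(
        1 for want, got in zip(expected, recalled)
        if got.strip().lower() == want
    )


if __name__ == "__main__":
    needles = make_needles(15)
    # Stand-in for model output: pretend every needle came back correctly.
    print(f"{score_exact_recall(needles, needles)} of {len(needles)} recalled exactly")
```

Scoring must be exact-match. A fuzzy or "close enough" metric would hide precisely the failure described above, where one wrong character turns a valid hash into a plausible-looking invalid one.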

That is the whole risk in one sentence: anything you need back byte-exact must stay as text. IDs, hashes, secrets, exact numbers, precise names. pxpipe keeps recent turns and exact identifiers as text alongside the images for exactly this reason.
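One way to enforce that rule is a pre-flight scan that refuses to image any block containing something that looks byte-exact. The patterns below are illustrative heuristics chosen for this sketch, not pxpipe's actual rules. They will miss some exact values (names, short codes) and flag some harmless ones, so treat them as a starting point.

```python
import re

# Hypothetical heuristics for "looks byte-exact"; tune for your own data.
EXACT_VALUE_PATTERNS = [
    # UUIDs such as 123e4567-e89b-12d3-a456-426614174000
    re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"),
    # Long hex runs: hashes, commit ids, tokens
    re.compile(r"\b[0-9a-fA-F]{12,}\b"),
    # Long digit runs: order numbers, account ids
    re.compile(r"\b\d{6,}\b"),
]


def contains_exact_values(text: str) -> bool:
    """True if any pattern suggests the block must be read back verbatim."""
    return any(pattern.search(text) for pattern in EXACT_VALUE_PATTERNS)


def split_for_imaging(blocks: list[str]) -> tuple[list[str], list[str]]:
    """Return (safe_to_image, keep_as_text), preserving order within each list."""
    safe, keep = [], []
    for block in blocks:
        (keep if contains_exact_values(block) else safe).append(block)
    return safe, keep


if __name__ == "__main__":
    safe, keep = split_for_imaging([
        "Always answer politely and cite the docs.",
        "Revert commit 3f9a1c0b7e2d before deploying.",
    ])
    print(len(safe), "imageable,", len(keep), "kept as text")
```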

A few more things the headline skips:

  • It is model-dependent. pxpipe defaults to Fable 5 and GPT-5.6, the models that read dense imaged text best. Opus 4.8 and GPT-5.5 are opt-in only, because they misread imaged context more. The trick that saves you 60% on one model can quietly corrupt context on another.
  • It adds latency. Encoding large requests to PNG takes time before the request even leaves your machine.
  • It interacts with prompt caching. Your biggest, most static context, the system prompt and tool docs, is also the ideal candidate for prompt caching, which already discounts repeated tokens heavily. On the GPT path pxpipe forgoes native cache markers. Imaging context and caching context both target the same tokens, so the real comparison is against a properly cached baseline, not a naive one.
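The caching point is easiest to see with numbers. The sketch below compares the same 25k-token block three ways: uncached text, cached text, and imaged. The price and the cache-read multiplier are placeholders chosen for illustration, not quoted rates for any model named here, and the sketch ignores cache-write surcharges, cache misses, and whether imaged pages can themselves be cached.

```python
PRICE_PER_MTOK = 3.00          # hypothetical input price, USD per million tokens
CACHE_READ_MULTIPLIER = 0.1    # assumption: cached reads billed at 10% of list


def input_cost(tokens: int, price_per_mtok: float, multiplier: float = 1.0) -> float:
    """Input cost in USD for one request."""
    return tokens / 1_000_000 * price_per_mtok * multiplier


def relative_saving(baseline: float, candidate: float) -> float:
    """Positive means the candidate is cheaper than the baseline."""
    return 1 - candidate / baseline


if __name__ == "__main__":
    uncached_text = input_cost(25_000, PRICE_PER_MTOK)
    cached_text = input_cost(25_000, PRICE_PER_MTOK, CACHE_READ_MULTIPLIER)
    imaged = input_cost(2_700, PRICE_PER_MTOK)
    print(f"vs uncached text: {relative_saving(uncached_text, imaged):+.0%}")
    print(f"vs cached text:   {relative_saving(cached_text, imaged):+.0%}")
```

Under these placeholder numbers the imaged block looks like a large win against uncached text and roughly a wash against cached text, which is why the choice of baseline matters more than the headline percentage.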

When it is worth it, and when it will burn you

This is not a yes or no. It is a routing decision, the same discipline we apply to model selection. Match the technique to the payload.

| Good fit for imaging | Do not image this |
| --- | --- |
| Large, static system prompts and tool docs | Anything byte-exact: IDs, hashes, secrets, keys |
| Read-only reference context and long docs | Exact numbers you will compute or quote |
| Collapsed, older conversation history | Recent turns the model must reason over precisely |
| Fable 5 or other strong image readers | Opus-routed or vision-weaker workloads |
| Bulk context where gist is enough | Anything where a silent misread is unacceptable |

If your workload is a huge, stable instruction block feeding a Fable 5 agent that mostly needs the gist, imaging can be a real win. If it is a compliance workflow moving exact figures and identifiers, the same trick is a quiet liability.
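Written as code, that routing decision is a short policy function. This is a sketch of the discipline described here, not pxpipe's configuration: the model identifiers, the `Payload` fields, and the rule order are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the models this article calls strong image readers.
STRONG_IMAGE_READERS = {"fable-5", "gpt-5.6"}


@dataclass(frozen=True)
class Payload:
    name: str                  # label only, e.g. "tool_docs" or "recent_turn"
    is_static: bool            # stable across requests (system prompt, old history)
    needs_exact_recall: bool   # contains values that must come back byte-exact


def should_image(payload: Payload, model: str) -> bool:
    """Image only static, gist-only context, and only for strong image readers."""
    if model not in STRONG_IMAGE_READERS:
        return False
    if payload.needs_exact_recall:
        return False
    return payload.is_static


if __name__ == "__main__":
    tool_docs = Payload("tool_docs", is_static=True, needs_exact_recall=False)
    invoice = Payload("invoice_figures", is_static=True, needs_exact_recall=True)
    print(should_image(tool_docs, "fable-5"), should_image(invoice, "fable-5"))
```

The default is deliberately `False`: a payload has to earn imaging by passing every check, so an unknown model or an unclassified block stays as text.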

Where this fits in a real cost stack

Imaging context is one lever, and not the first one we would pull. Before reaching for a lossy trick, the boring levers usually win, and they do not risk your data: prompt caching, model routing, and measuring where the spend actually goes.

Imaging context sits at the aggressive end of that list: high potential savings, real correctness risk, worth piloting on the right payload once the safer levers are in place.

Kevin Riedl

"The pricing physics is real and the research is serious. But a 60% saving that occasionally invents a hash or a name is not a saving, it is deferred debugging. Image the bulk context that only needs the gist, keep every exact value as text, and never point it at a model that reads images poorly."

Frequently Asked Questions

Is rendering context as images safe for production?
Only for context where a silent misread is acceptable, such as large static instructions or read-only reference material fed to a model that reads images well. It is lossy, so keep anything byte-exact (IDs, hashes, secrets, exact numbers) as text. Treat it as a targeted optimisation on the right payload, not a default.

Does imaging context break prompt caching?
It competes with it. Your biggest static context is also the best candidate for prompt caching, which is lossless. On the GPT path pxpipe forgoes native cache markers. So compare imaging against a properly cached baseline, not an uncached one, or you will overstate the win.

Why does Opus do worse than Fable at reading imaged text?
It is model-dependent. On pxpipe's own hex-recall test Fable 5 scored 13 of 15 and Opus 0 of 15, so pxpipe defaults to Fable 5 and GPT-5.6 and makes Opus opt-in. The same trick that saves money on a strong image reader can corrupt context on a weaker one.

Is this the same as DeepSeek-OCR?
It is the same underlying idea, optical context compression, applied differently. DeepSeek-OCR is a model trained to decode text from a small set of visual tokens at about 10x compression. pxpipe is a proxy that images your context for existing commercial APIs that were not trained specifically for it, which is why the loss shows up.

How much can it actually save?
The pxpipe repository claims a 59 to 70% lower end-to-end bill on Fable 5 and shows a demo task at $42.21 text versus $6.06 imaged. Treat that as the tool author's own figure on their own workload and re-measure on yours, against a cached baseline.

Final thoughts

So, genius or absurd? Both. The mechanism is real, image tokens are priced by pixels, and serious research points the same way. But bolting it onto models that were not trained for it trades money for silent errors, and silent errors are the most expensive kind.

Use it the way you would use any aggressive optimisation: deliberately, on the payload that fits, with the exact values kept as text and the safer levers (caching, routing, measurement) already in place. Do that and imaging bulk context is a sharp tool. Turn it on everywhere and it will eventually hand you a confidently wrong answer you never see coming.
