Rendering your prompt as an image to cut LLM costs 60%: genius, or just absurd?

TL;DR

A viral tool called pxpipe renders the bulky, mostly static parts of a request, the system prompt, tool docs, old history and pasted code, into dense PNG images before they reach the model, because an image is priced by its pixel size, not by how much text it holds. On dense content that packs more characters per token, this can cut a Claude Fable 5 bill by a claimed 59 to 70 percent end to end, with the repository showing one task at 42.21 dollars as text versus 6.06 dollars imaged. The idea is grounded in real research, DeepSeek-OCR optical context compression and follow-up work, not just a hack. The catch is that it is lossy and fails silently: the model confidently misreads exact strings, scoring 13 of 15 on Fable and 0 of 15 on Opus for exact hex recall on the tool's own test, so IDs, hashes, secrets and exact numbers must stay as text. It is also model-dependent, adds encoding latency, and competes with prompt caching. Treat it as a targeted optimisation for large, static, gist-only context on a strong image-reading model, not a default, and always keep exact values as text and measure against a cached baseline. Figures are the tool author's own, from mid-2026, so re-check before you commit.

Related service: AI Enablement

A trick is going around: run Claude Fable 5 roughly 60% cheaper by taking the heavy parts of your request, the system prompt, tool docs, old history, pasted code, and turning them into an image before the request reaches the model.

The reasoning sounds absurd, which is exactly why it went viral. An image costs the model a fixed number of tokens based on its pixel size, not on how much text is crammed inside it. So you can pack a lot of characters into one dense picture and pay for the pixels, not the prose.

The tool making the rounds is pxpipe, an open-source (MIT) local proxy. It sits between your machine and the API, and its pipeline renders the bulky, mostly-static context into dense PNG "pages" before the request leaves your laptop. The repository's own demo shows one multi-step task costing $42.21 as plain text versus $6.06 with pxpipe. It claims a 59 to 70% lower end-to-end bill on Fable 5 at current list prices.

So is this genius or nonsense? The honest answer: the physics is real, the research behind it is serious, and the failure mode is bad enough that for most production work you should not turn it on by default. Here is the whole picture, with the caveats the viral version leaves out.

Want a straight answer on where your LLM spend is actually going?

Book Free Consultation

Why this actually works: the pricing physics

Text and images are billed at the same per-token rate, but they are counted very differently.

Text is tokenized by content. More characters, more tokens, more cost. Anthropic prices an image instead by its pixel area, using a rough formula of (width x height) / 750 tokens, and it caps the count (images are resized so the long edge stays around 1568px, which lands near 1,600 tokens per image at the top end). The key point: that number does not care whether the image is a blank rectangle or a wall of text.

On real Claude Code traffic, pxpipe measures dense content like code and JSON packing around 3.1 characters per image-token, against roughly 1.9 characters per text-token. Once your text is denser than about 19 characters per token, imaging it starts to pay. So a block that would cost 25k text tokens can come back as roughly 2.7k image tokens. That is where the 60% headline comes from.

This is what the model actually receives instead of your text:

A dense wall of whitespace-minified text rendered as a single image, roughly 48k characters packed into about 2.7k image tokens, with an OCR instruction banner across the top — Around 48k characters of system prompt and tool docs, about 25k tokens as text, rendered as roughly 2.7k image tokens on one page. Source: the pxpipe repository (MIT), used for illustration.

One nuance the viral version glosses over: because each image is capped near 1,600 tokens, you cannot pour an unlimited context onto a single giant canvas. The tool renders many pages, not one poster. The savings are real, but they come from tiling dense text across several capped images, not from magic.

This is not a hack. It is a research direction.

The counterintuitive part, that pictures of text can be cheaper than text, is not a proxy-tool gimmick. It is an active area of research.

In October 2025, DeepSeek published DeepSeek-OCR: Contexts Optical Compression, showing a vision model can decode text from a small set of visual tokens at roughly 10x compression while holding around 97% OCR precision when the compression ratio stays under 10x. Andrej Karpathy picked it up to argue that text tokens might be wasteful "historical baggage," and that feeding models images of text could turn out more efficient. Follow-up papers, such as Text or Pixels? It Takes Half, report similar token savings on visual text inputs.

So the idea is legitimate and the long-context economics are genuinely interesting. pxpipe is just an early, aggressive attempt to cash that in on today's commercial APIs, before the models are trained to do it well. And "before the models are trained to do it well" is where the trouble starts.

The catch that makes it absurd for most work

Rendering text as an image is lossy, and the loss is silent.

When the model misreads an imaged character, it does not throw an error or flag low confidence. It confidently makes something up. pxpipe's own README documents the failure honestly: on a needle-in-a-haystack test asking the model to recall exact 12-character hex strings buried in dense imaged content, Fable 5 scored 13 out of 15, and Opus 0 out of 15. The README describes one real case where the model recalled a person's name from imaged chat history and got it confidently wrong.

That is the whole risk in one sentence: anything you need back byte-exact must stay as text. IDs, hashes, secrets, exact numbers, precise names. pxpipe keeps recent turns and exact identifiers as text alongside the images for exactly this reason.

A few more things the headline skips:

It is model-dependent. pxpipe defaults to Fable 5 and GPT-5.6, the models that read dense imaged text best. Opus 4.8 and GPT-5.5 are opt-in only, because they misread imaged context more. The trick that saves you 60% on one model can quietly corrupt context on another.
It adds latency. Encoding large requests to PNG takes time before the request even leaves your machine.
It interacts with prompt caching. Your biggest, most static context, the system prompt and tool docs, is also the ideal candidate for prompt caching, which already discounts repeated tokens heavily. On the GPT path pxpipe forgoes native cache markers. Imaging context and caching context both target the same tokens, so the real comparison is against a properly cached baseline, not a naive one.

When it is worth it, and when it will burn you

This is not a yes or no. It is a routing decision, the same discipline we apply to model selection. Match the technique to the payload.

Good fit for imaging	Do not image this
Large, static system prompts and tool docs	Anything byte-exact: IDs, hashes, secrets, keys
Read-only reference context and long docs	Exact numbers you will compute or quote
Collapsed, older conversation history	Recent turns the model must reason over precisely
Fable 5 or other strong image readers	Opus-routed or vision-weaker workloads
Bulk context where gist is enough	Anything where a silent misread is unacceptable

If your workload is a huge, stable instruction block feeding a Fable 5 agent that mostly needs the gist, imaging can be a real win. If it is a compliance workflow moving exact figures and identifiers, the same trick is a quiet liability.

Where this fits in a real cost stack

Imaging context is one lever, and not the first one we would pull. Before reaching for a lossy trick, the boring levers usually win, and they do not risk your data:

Prompt caching for the static prefix, which is lossless and already large.
Model routing: cheap models for mechanical work, strong models for judgment. See how we route work across Fable, Opus, Sonnet and Haiku.
Measuring cost per completed task, not price per token, which is the number that lands on your invoice. See cheaper per token, more expensive per answer.
A gateway to centralise fallback, caching and spend limits. See our LLM gateway comparison.
Self-hosting or open weights when volume and data residency justify it, covered in the real cost of self-hosting LLMs in the EU.

Imaging context sits at the aggressive end of that list: high potential savings, real correctness risk, worth piloting on the right payload once the safer levers are in place.

"The pricing physics is real and the research is serious. But a 60% saving that occasionally invents a hash or a name is not a saving, it is deferred debugging. Image the bulk context that only needs the gist, keep every exact value as text, and never point it at a model that reads images poorly."

Frequently Asked Questions

Is rendering context as images safe for production?

Only for context where a silent misread is acceptable, such as large static instructions or read-only reference material fed to a model that reads images well. It is lossy, so keep anything byte-exact (IDs, hashes, secrets, exact numbers) as text. Treat it as a targeted optimisation on the right payload, not a default.

Does imaging context break prompt caching?

It competes with it. Your biggest static context is also the best candidate for prompt caching, which is lossless. On the GPT path pxpipe forgoes native cache markers. So compare imaging against a properly cached baseline, not an uncached one, or you will overstate the win.

Why does Opus do worse than Fable at reading imaged text?

It is model-dependent. On pxpipe's own hex-recall test Fable 5 scored 13 of 15 and Opus 0 of 15, so pxpipe defaults to Fable 5 and GPT-5.6 and makes Opus opt-in. The same trick that saves money on a strong image reader can corrupt context on a weaker one.

Is this the same as DeepSeek-OCR?

It is the same underlying idea, optical context compression, applied differently. DeepSeek-OCR is a model trained to decode text from a small set of visual tokens at about 10x compression. pxpipe is a proxy that images your context for existing commercial APIs that were not trained specifically for it, which is why the loss shows up.

How much can it actually save?

The pxpipe repository claims a 59 to 70% lower end-to-end bill on Fable 5 and shows a demo task at $42.21 text versus $6.06 imaged. Treat that as the tool author's own figure on their own workload and re-measure on yours, against a cached baseline.

Final thoughts

So, genius or absurd? Both. The mechanism is real, image tokens are priced by pixels, and serious research points the same way. But bolting it onto models that were not trained for it trades money for silent errors, and silent errors are the most expensive kind.

Use it the way you would use any aggressive optimisation: deliberately, on the payload that fits, with the exact values kept as text and the safer levers, caching, routing, measurement, already in place. Do that and imaging bulk context is a sharp tool. Turn it on everywhere and it will eventually hand you a confidently wrong answer you never see coming.