Self-Hosting LLMs in the EU: When Open Weights Actually Pay Off
On a spreadsheet, self-hosting an open-weight LLM looks like an easy win. Rent a GPU for a few dollars an hour, run a free model, stop paying per token. The bill goes flat and you own the stack. That is the pitch, and it is half true. The half it leaves out is the part that decides whether you save money or quietly lose it: the GPU is the cheap part. The expensive part is the engineer who keeps the inference server alive, the eval harness that proves a quantised model still answers correctly, and the upgrade cadence that does not stop when a better open model ships two months later.
This is an engineering and process perspective, not a vendor pitch. The numbers below are directional, drawn from public 2026 trends, and you should re-check them against live quotes before you commit, because GPU rates and token prices both move month to month. We set up internal AI on client infrastructure under our AI Enablement work, so the trade-offs here are the ones we actually weigh with clients, not a theoretical model. This post extends our token-cost playbook, which only touches on self-hosting, by going deep on the one question that decides it: when does bringing inference in-house beat a hosted API?
What does self-hosting an LLM actually cost, beyond the GPU?
The GPU line item is the one everyone quotes, and it is genuinely cheap now. An NVIDIA H100 rents for roughly $2 to $4 per hour on specialised EU clouds, with hyperscalers two to three times higher and EU-sovereign providers landing around the $2 mark (directional, re-check before you commit). Run one reserved H100 around the clock and you are looking at something like $1,500 to $3,000 a month for the hardware alone.
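The monthly range above is plain arithmetic on the hourly rate. A minimal sketch, assuming an average month of 730 hours and the directional $2 to $4 rates quoted above (`monthly_gpu_cost` is a hypothetical helper, not part of any library):

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months


def monthly_gpu_cost(hourly_rate_usd: float, gpus: int = 1,
                     hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping `gpus` cards reserved around the clock for one month."""
    return hourly_rate_usd * hours * gpus


low = monthly_gpu_cost(2.0)   # directional lower rate from the text
high = monthly_gpu_cost(4.0)  # directional upper rate from the text
print(f"${low:,.0f} to ${high:,.0f} per month")  # prints "$1,460 to $2,920 per month"
```

Swap in a live quote for `hourly_rate_usd` before relying on the result.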
That number is real, and it is also the smallest part of the total. The costs that decide the math are the ones the spreadsheet skips:
- Engineer time. Someone has to stand up the inference server, tune batch sizes, manage GPU drivers and CUDA versions, handle model loading and rollback, and keep it running. This is not a one-time setup. It is ongoing operations, and in the EU a competent ML-ops engineer costs far more per month than the GPU does.
- Redundancy. One GPU is a single point of failure. Production usually means at least two, plus a failover plan, which roughly doubles the hardware line and adds load-balancing work.
- Eval and quality assurance. A self-hosted model is your responsibility when it regresses. You need an eval harness that proves the model holds quality after every quantisation choice and every version bump. Without it you are flying blind.
- Upgrade cadence. Open weights move fast. A model that is competitive today is mid-tier in a quarter. Staying current is recurring engineering work, not a fire-and-forget install.
Add it up and the honest framing is the one from our token-cost work: the break-even is governed by engineer time, not GPU rack rate. The model is cheap to run. The discipline around it is not.
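To make that framing concrete, here is a rough sketch of an all-in monthly total. Every figure in it is a hypothetical placeholder (the engineer cost, the share of their time, the GPU rate), and the `SelfHostCosts` class is invented for illustration. Replace all of them with your own numbers.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average hours in a month


@dataclass
class SelfHostCosts:
    gpu_hourly_usd: float        # quoted GPU rate
    gpu_count: int               # 2 or more once you add redundancy
    engineer_monthly_usd: float  # fully loaded engineer cost (placeholder)
    engineer_fraction: float     # share of that engineer spent on ops, evals, upgrades

    def hardware(self) -> float:
        return self.gpu_hourly_usd * HOURS_PER_MONTH * self.gpu_count

    def people(self) -> float:
        return self.engineer_monthly_usd * self.engineer_fraction

    def total(self) -> float:
        return self.hardware() + self.people()


# Hypothetical inputs: two GPUs for failover, half an engineer.
costs = SelfHostCosts(gpu_hourly_usd=2.5, gpu_count=2,
                      engineer_monthly_usd=9000.0, engineer_fraction=0.5)
print(f"hardware: ${costs.hardware():,.0f}")  # prints "hardware: $3,650"
print(f"people:   ${costs.people():,.0f}")    # prints "people:   $4,500"
print(f"total:    ${costs.total():,.0f}")     # prints "total:    $8,150"
```

Even with only half an engineer assumed, the people line exceeds the hardware line in this example, which is the point of the list above.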
Where is the break-even against a hosted API?
There is no single break-even number, because it depends entirely on which API you are replacing. The spread is wide enough that getting this wrong is the most common self-hosting mistake.
- Against an expensive frontier API (top Claude, GPT, or Gemini tier), self-hosting on a reserved GPU can break even at a few million tokens per day, sometimes quoted around the 2 to 5 million range. Frontier output tokens are pricey, so the bar to beat is low.
- Against a cheap open-weight API provider (Together, Fireworks, DeepInfra and similar), the break-even jumps dramatically, commonly cited around 50 million tokens per day or more. Those providers already run optimised infrastructure at scale on thin margins, so you are competing with their economy of scale, not their list price.
The practical read for most teams: if your alternative is a cheap hosted open-weight endpoint, hosted stays cheaper until you are sustaining serious, steady volume. Bursty traffic makes self-hosting worse, because you pay for the GPU whether it is busy or idle, while an API charges only for what you use. The break-even assumes a GPU you can keep meaningfully loaded. Treat the figures above as directional and plug in your own token volume, GPU rate, and engineer cost before you decide. We cover the broader cost-curve picture in what cheap tokens change in your AI architecture.
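A back-of-the-envelope version of that calculation, as a sketch: the break-even volume is the daily self-hosting cost divided by the API price per token. The monthly cost and both API prices below are hypothetical round numbers chosen for easy arithmetic, not quotes from any provider, and the function names are invented for illustration.

```python
DAYS_PER_MONTH = 30  # assumption: simplified month length


def break_even_tokens_per_day(monthly_self_host_usd: float,
                              api_usd_per_million_tokens: float) -> float:
    """Daily token volume at which the API bill equals the flat self-host cost."""
    daily_cost = monthly_self_host_usd / DAYS_PER_MONTH
    return daily_cost / api_usd_per_million_tokens * 1_000_000


def cheaper_option(tokens_per_day: float, monthly_self_host_usd: float,
                   api_usd_per_million_tokens: float) -> str:
    threshold = break_even_tokens_per_day(monthly_self_host_usd,
                                          api_usd_per_million_tokens)
    return "self-hosted" if tokens_per_day > threshold else "hosted API"


monthly_cost = 3000.0  # hypothetical all-in monthly self-host cost
pricey_api = 20.0      # hypothetical blended USD per million tokens
cheap_api = 2.0        # hypothetical blended USD per million tokens

print(break_even_tokens_per_day(monthly_cost, pricey_api))  # 5000000.0
print(break_even_tokens_per_day(monthly_cost, cheap_api))   # 50000000.0
print(cheaper_option(10_000_000, monthly_cost, cheap_api))  # hosted API
```

This ignores whether the GPU can physically serve the break-even volume, and it treats the self-host cost as fully fixed, so it flatters self-hosting whenever traffic is bursty.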
| Cost driver | Hosted API | Self-hosted | Who wins, and when |
|---|---|---|---|
| Per-token usage | Pay per token, scales linearly | Flat GPU rate regardless of usage | Self-hosted at high steady volume; API at low or bursty volume |
| Idle time | $0 when not in use | You pay for the GPU 24/7 | API, unless the GPU stays loaded |
| Ops and maintenance | Provider's problem | Your engineers, ongoing | API, almost always |
| Model upgrades | Provider ships them | You re-deploy and re-eval | API |
| Data residency control | Depends on region and contract | Full control on your infra | Self-hosted |
| Latency tuning | Fixed by provider | You own it end to end | Self-hosted, if you have the skill |

"Most teams self-host too early. They price the GPU, not the engineer who has to keep it alive, and a few months in the hosted API would have been cheaper and less work."
When does data residency force self-hosting regardless of cost?
Cost is only one axis. The other is governance, and for some EU workloads it overrides the math entirely. If you process personal or regulated data and you cannot send it to a non-EU endpoint, then the cost comparison is moot. The question stops being "is self-hosting cheaper" and becomes "what is the cheapest compliant option."
Under GDPR, what matters is where inference runs and where the data lands, not just where the data is stored at rest. A signed data processing agreement, EU data residency, purpose limitation, and a documented architecture showing exactly where personal data is processed are the building blocks of a compliant setup. Self-hosting open weights on your own EU infrastructure gives you the lowest external subprocessor exposure, because no third party touches the data at inference time. It also carries the highest operational burden, since serving, logging, access control, and rollback all become your responsibility. We go deeper on the residency options in EU data residency for AI apps.
There is a regulatory layer on top. The EU AI Act is phasing in, and the timeline is provisional and still moving, so treat any specific date as subject to change. As of mid-2026, general-purpose AI model obligations are already in application, broader high-risk obligations have been pushed back under a political agreement reached in May 2026, and enforcement powers are scheduled to ramp through 2026 and beyond. The practical takeaway for an EU team is not a date. It is that controlling where your model runs is becoming a governance asset, not just a cost line, and self-hosting is one way to keep that control in your own hands. Confirm the current obligations against the official sources before you build a compliance claim on them.
What is the production stack if you do self-host?
If volume or compliance has decided it for you, the 2026 production stack is well established, and it is not the same thing you used to prototype on your laptop. A local single-stream runner is fine for experiments and wrong for production, because it leaves most of the GPU idle.
- Inference engine: vLLM. vLLM's continuous batching and PagedAttention let a single GPU serve many concurrent requests at far higher aggregate throughput than a naive single-stream setup. Public 2026 benchmarks put it on the order of eight to nine times the aggregate tokens of a simple runner on the same H100 at the same quantisation. Throughput, not single-request speed, is what makes the GPU economics work, because it is what keeps the card loaded. SGLang and TensorRT-LLM are credible alternatives in the same class.
- Quantised open weights. You almost always run a quantised model, not full precision, to fit a useful model on one GPU and serve more concurrency. On H100-class hardware, FP8 keeps close to full-precision quality with roughly half the memory. Where you need to go smaller, AWQ 4-bit is generally the strongest of the 4-bit formats on real tasks (public comparisons put it around the high-90s in quality retention), with GPTQ close behind. The catch: 4-bit quantisation can degrade noticeably on hard reasoning and math while staying fine for summarisation, classification, and extraction, which is why the eval described below is not optional.
- The model itself. Llama, Qwen, DeepSeek, and Mistral-class open weights are the usual candidates. Picking among them is its own decision, and we work through it in our open-weight model comparison.
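
The memory side of the quantisation choice can be estimated with simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below assumes a hypothetical 70-billion-parameter model, an 80 GB card, and a 20% reserve for KV cache and activations. All three are illustrative assumptions, and real 4-bit formats add a small overhead for scales that is ignored here.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def fits_on_gpu(params_billion: float, bits_per_weight: int,
                gpu_memory_gb: float = 80.0,
                reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving headroom for KV cache and activations."""
    budget = gpu_memory_gb * (1 - reserve_fraction)
    return weight_memory_gb(params_billion, bits_per_weight) <= budget


# Hypothetical 70B model at three precisions.
for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    size = weight_memory_gb(70, bits)
    print(f"{label}: {size:.0f} GB, fits on one 80 GB card: {fits_on_gpu(70, bits)}")
```

Under these assumptions only the 4-bit variant leaves room to serve on a single card. FP8 halves the FP16 footprint but still needs a second GPU or a smaller model.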
How do you know the cheaper path actually held quality?
This is the part teams skip and then regret. Every choice in a self-hosted stack (the model, the quantisation, the batch settings) can quietly drop quality, and a cheaper path that silently answers worse is the most expensive outcome of all. The only honest defence is an eval harness that scores your actual tasks, not a public benchmark.
We are deliberately humble about this. Building and maintaining good evals is harder than standing up the inference server, and most of the ongoing engineering cost of self-hosting lives here rather than in the GPU. You need a representative task set, a scoring method you trust, and a gate that runs before every model or quantisation change. Without it you cannot tell whether your FP8 swap cost you two points of accuracy on the cases that matter. This is the same discipline we apply on production AI engagements such as Twinsoft AI: the model choice is only safe on top of an eval that proves it held the bar.
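The shape of such a gate can be sketched in a few lines. This is a deliberately minimal illustration, not a production harness: the `Case` type, exact-match scoring, the 2-point `max_drop` threshold, and the stub models are all assumptions made for the example. Real task sets usually need a richer scoring method.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str


def exact_match(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()


def accuracy(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases the model answers correctly under exact match."""
    hits = sum(exact_match(model(c.prompt), c.expected) for c in cases)
    return hits / len(cases)


def passes_gate(baseline: Callable[[str], str], candidate: Callable[[str], str],
                cases: list[Case], max_drop: float = 0.02) -> bool:
    """Allow the change only if accuracy drops by at most `max_drop`."""
    return accuracy(baseline, cases) - accuracy(candidate, cases) <= max_drop


# Stub "models" standing in for real inference calls.
cases = [Case("2+2", "4"), Case("capital of France", "Paris"),
         Case("3*3", "9"), Case("opposite of hot", "cold")]
answers = {c.prompt: c.expected for c in cases}
baseline = lambda p: answers[p]
quantised = lambda p: "8" if p == "3*3" else answers[p]  # regresses on one case

print(passes_gate(baseline, baseline, cases))   # True
print(passes_gate(baseline, quantised, cases))  # False
```

In practice the callables would wrap your current and candidate deployments, and the gate would run in CI before any model or quantisation change ships.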
So should you self-host? A decision checklist.
Run these questions in order. If the honest answer to the first two is no, default to a hosted API and revisit later.
- Is data residency forcing your hand? If you cannot legally send the data to a non-EU endpoint and no compliant EU-hosted API fits, self-hosting may be the cheapest compliant option regardless of token math. This alone can decide it.
- Is your volume high and steady? Against a cheap open-weight API, you generally need to be sustaining tens of millions of tokens per day on a GPU you can keep loaded before self-hosting wins. Bursty or low volume favours the API.
- Do you have the ops capacity? Self-hosting is recurring engineering work: drivers, redundancy, upgrades, monitoring. If your team cannot own that without dropping product work, the GPU savings evaporate.
- Can you prove quality with evals? If you cannot measure whether a quantised open model holds your quality bar, you are not ready to depend on one in production.
- Have you priced the full picture? GPU plus redundancy plus engineer time plus eval upkeep, against the all-in API cost. Compare totals, not the GPU line against the token line.
For most early and mid-stage products, the answer is still: stay on a hosted API, and pick an EU-resident one if residency matters. Self-host when volume or compliance forces the question, not before. If you want help running that comparison on your own numbers and infrastructure, that is exactly what our AI Enablement work is for.
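The checklist can be written down as a small function, which is sometimes useful for forcing explicit yes or no answers in a planning doc. This is one possible reading of the checklist above, with hypothetical names and return strings. It is a thinking aid, not a substitute for running the numbers.

```python
def recommend(residency_forces_it: bool, volume_high_and_steady: bool,
              has_ops_capacity: bool, can_prove_quality: bool,
              priced_full_picture: bool) -> str:
    # First two questions: if neither applies, default to a hosted API.
    if not (residency_forces_it or volume_high_and_steady):
        return "hosted API, revisit later"

    # Remaining questions: each "no" is a gap to close first.
    gaps = [
        name
        for name, ok in [
            ("ops capacity", has_ops_capacity),
            ("eval harness", can_prove_quality),
            ("all-in cost comparison", priced_full_picture),
        ]
        if not ok
    ]
    if gaps:
        return "close gaps before self-hosting: " + ", ".join(gaps)
    return "self-hosting is a reasonable candidate"


print(recommend(False, False, True, True, True))
# prints "hosted API, revisit later"
print(recommend(True, False, True, False, True))
# prints "close gaps before self-hosting: eval harness"
```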
Final thoughts
Self-hosting open-weight LLMs in the EU pays off in two situations, and you should be honest about which one you are in. The first is sustained high volume on a GPU you can keep loaded, where the flat hardware cost beats per-token pricing once you count the full operational picture, not just the rack rate. The second is data residency, where keeping inference on your own EU infrastructure is a governance decision that can override cost entirely.
Outside those two cases, a hosted API, ideally an EU-resident one, is usually cheaper and far less work, and most teams reach for self-hosting too early because they price the GPU and forget the engineer. The numbers here are directional and the GPU and token markets both move monthly, so re-check before you commit, prove every model and quantisation choice with an eval, and treat self-hosting as a decision you grow into, not one you start with.