Kevin Riedl

8 min read · 29 Jun 2026

What an Internal AI Assistant Actually Costs in the DACH Region (2026)

A leader asks a simple question: what does an internal AI assistant over our company documents actually cost per person per month? The answer they get back is usually one of two unhelpful things. Either a scary five-figure number that assumes a self-hosted GPU cluster nobody needs, or a hand-wave that says "almost nothing, tokens are cheap now." Both are wrong, because the real cost of a RAG assistant over SharePoint, Confluence, and Google Drive is not one big line. It is several small ones, plus the part everybody forgets: keeping it running.

This is an engineering and process perspective, not a vendor pitch. It is about rolling AI out internally, which is a different job from building an AI product for your customers; we do the internal-setup side under AI Enablement. The numbers below are directional, drawn from public 2026 pricing, and yours will differ. Re-check current pricing before you budget.


What actually drives the cost of an internal AI assistant?

Almost everyone fixates on the LLM bill, and for an internal assistant that is rarely the biggest number. The cost is shaped by three things you control before a single token is spent:

  • How many people use it, and how hard. Ten power users who query it 30 times a day cost more than 200 occasional users who open it twice a week. The unit that matters is queries per day, not headcount.
  • How much it has to read to answer. Every answer pulls retrieved chunks of your documents into the prompt. Stuff ten pages of context into each call and your input-token bill multiplies, even though the question was one line.
  • How fresh the index has to be. Re-embedding documents the moment they change costs more than a nightly sync. Most internal knowledge does not change minute to minute, which is a saving most teams leave on the table.
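The first two drivers multiply into the LLM line. Below is a minimal Python cost model. Every input is a hypothetical placeholder, not a quote: $1 per 1M input tokens, $4 per 1M output tokens, a 300-token prompt overhead, 400-token chunks, 400-token answers, and 22 working days a month. It compares the two user groups from the first bullet, then the same power users with a tight versus a bloated context.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    users: int
    queries_per_user_per_day: float
    working_days: int = 22  # assumption: about 22 working days per month

    @property
    def queries_per_month(self) -> float:
        return self.users * self.queries_per_user_per_day * self.working_days


def monthly_llm_cost(
    usage: Usage,
    chunks_per_answer: int,
    tokens_per_chunk: int = 400,        # assumption: average chunk size
    output_tokens: int = 400,           # assumption: average answer length
    price_in_per_m: float = 1.0,        # hypothetical $ per 1M input tokens
    price_out_per_m: float = 4.0,       # hypothetical $ per 1M output tokens
    prompt_overhead_tokens: int = 300,  # assumption: system prompt + question
) -> float:
    """Directional monthly LLM spend in dollars: queries x tokens x price."""
    input_tokens = prompt_overhead_tokens + chunks_per_answer * tokens_per_chunk
    cost_per_query = (
        input_tokens * price_in_per_m + output_tokens * price_out_per_m
    ) / 1_000_000
    return usage.queries_per_month * cost_per_query


power_users = Usage(users=10, queries_per_user_per_day=30)
occasional = Usage(users=200, queries_per_user_per_day=2 / 5)  # twice a week

power_cost = monthly_llm_cost(power_users, chunks_per_answer=5)
occasional_cost = monthly_llm_cost(occasional, chunks_per_answer=5)
bloated_cost = monthly_llm_cost(power_users, chunks_per_answer=25)  # bloated context

print(f"10 power users, tight context:   ${power_cost:,.2f}/month")
print(f"200 occasional users:            ${occasional_cost:,.2f}/month")
print(f"10 power users, bloated context: ${bloated_cost:,.2f}/month")
```

With these placeholders, the ten power users cost 3.75 times as much as the 200 occasional users. That is exactly their ratio of monthly queries (6,600 vs 1,760). Moving from 5 to 25 chunks per answer roughly triples the bill without changing a single question.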

Get those three assumptions on paper first. They move the monthly bill more than any vendor choice you will make later.

What are the cost components, line by line?

Here is the full set of line items for a self-managed RAG assistant, with directional 2026 ranges. Treat the numbers as a snapshot of public pricing, not a quote, and re-check before you commit.

| Component | What it is | Rough monthly cost | Notes |
| --- | --- | --- | --- |
| Embeddings (initial + updates) | Turning your docs into vectors so they can be searched | ~$0 to $30 | One-off bulk embed is cheap. OpenAI text-embedding-3-small is about $0.02 per 1M tokens; the large model about $0.13. With the small model, a mid-size corpus costs single-digit dollars to embed once, then near-zero for daily deltas. |
| Vector database | Storing and searching those vectors | ~$0 to $150+ | Free tiers cover a prototype. Managed production tiers (Pinecone, Qdrant Cloud, Weaviate Cloud) commonly start around $50 to $150/month at modest scale; a self-hosted Qdrant on your own VM can be cheaper at the cost of ops. |
| LLM answer tokens | The model that writes each answer from retrieved context | ~$20 to a few hundred | The variable line. Driven by queries/day times context size times model price. A mid-tier model plus tight context keeps this small; routing every query to a frontier model with bloated context is how it explodes. |
| Retrieval + orchestration | The glue: query handling, reranking, permission filtering | ~$0 to $40 | Mostly your own compute. An optional reranker adds a small per-query cost; permission-aware retrieval adds latency, not much spend. |
| Hosting | App server, API gateway, logs, monitoring | ~$20 to $100 | A small container plus a managed gateway. Modest and flat until you scale users. |
| Maintenance | Keeping it correct: connector upkeep, eval runs, model upgrades | The real number | Not a SaaS line. It is engineer time, and over a year it usually dwarfs every row above. The honest budget puts a recurring number here. |
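You can sanity-check the embedding row yourself. Here is a minimal sketch, assuming a hypothetical corpus of 300,000 chunks at about 500 tokens each, and using the per-1M-token prices quoted in the table. The second function shows one common way to keep daily deltas near zero: re-embed only the documents whose content hash changed since the last sync. `nightly_delta` is an illustrative helper, not a connector API, and it does not handle deletions.

```python
import hashlib

# $ per 1M tokens, as quoted in the table; re-check before budgeting.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embed_cost(num_chunks: int, tokens_per_chunk: int, model: str) -> float:
    """Dollar cost to embed num_chunks chunks once with the given model."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def nightly_delta(
    docs: dict[str, str], last_hashes: dict[str, str]
) -> tuple[list[str], dict[str, str]]:
    """Return IDs of new or changed docs, plus the hashes to store for next run.

    Deleted documents are not handled here; a real sync would also remove
    their vectors from the index.
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in docs.items()}
    changed = [d for d, h in new_hashes.items() if last_hashes.get(d) != h]
    return changed, new_hashes


bulk_small = embed_cost(300_000, 500, "text-embedding-3-small")
bulk_large = embed_cost(300_000, 500, "text-embedding-3-large")
# Hypothetical: 1% of chunks change per night, 30 nightly syncs a month.
delta_month = 30 * embed_cost(3_000, 500, "text-embedding-3-small")

print(f"one-off bulk embed (small): ${bulk_small:.2f}")
print(f"one-off bulk embed (large): ${bulk_large:.2f}")
print(f"nightly deltas per month:   ${delta_month:.2f}")
```

Under these assumptions, the one-off bulk embed costs about $3 with the small model and about $19.50 with the large one. A month of nightly deltas stays under a dollar, which is why the row sits near zero.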

Notice the pattern. The infrastructure rows are surprisingly cheap in 2026. The cost that decides whether the project is worth it sits in the last row, and it is the one no vendor quote includes.

Kevin Riedl

"The vector DB and the tokens are the cheap part. The expensive part is the engineer who keeps the answers correct after the documents change. Budget for that or the project rots."

Build or buy: which is actually cheaper?

The packaged "AI over your knowledge base" products quote a per-seat price, often somewhere in the range of a paid productivity seat. That is clean and predictable, and for a small team with generic documents it can be the right call. The trade-off shows up in two places: you pay per seat whether a user queries it once a month or fifty times a day, and your data routing and retrieval logic are whatever the vendor decided.

A self-managed setup inverts that. The per-query cost is low and you only pay for what runs, but you carry the build and the maintenance. The break-even is not about seat count alone, it is about control. The moment you need permission-aware retrieval that mirrors your SharePoint and Confluence access rules, or the data cannot leave your infrastructure, the off-the-shelf seat price stops being the whole story. We go deeper on the rollout decision in how to roll out AI internally without creating shelfware.
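The money side of that decision reduces to a break-even seat count. Here is a minimal sketch in which every input is a hypothetical placeholder:

- a $25 per-seat vendor price
- $200 a month of self-managed infrastructure
- $0.004 of tokens per query at 220 queries per seat per month
- 16 engineer hours a month of maintenance at $100 an hour

It deliberately ignores the control questions above, which can settle the decision whatever the arithmetic says.

```python
import math
from typing import Optional


def vendor_monthly(seats: int, seat_price: float) -> float:
    return seats * seat_price


def self_managed_monthly(
    seats: int,
    fixed_infra: float,
    queries_per_seat: float,
    cost_per_query: float,
    maintenance: float,
) -> float:
    # Flat infrastructure + usage-driven tokens + engineer time for upkeep.
    return fixed_infra + seats * queries_per_seat * cost_per_query + maintenance


def break_even_seats(
    seat_price: float,
    fixed_infra: float,
    queries_per_seat: float,
    cost_per_query: float,
    maintenance: float,
) -> Optional[int]:
    """Smallest seat count at which self-managed is no more expensive."""
    marginal_saving = seat_price - queries_per_seat * cost_per_query
    if marginal_saving <= 0:
        return None  # each seat costs more self-managed: never breaks even
    return math.ceil((fixed_infra + maintenance) / marginal_saving)


# Hypothetical inputs; replace with your own quotes and hours.
SEAT_PRICE = 25.0
ASSUMPTIONS = dict(
    fixed_infra=200.0,
    queries_per_seat=220,
    cost_per_query=0.004,
    maintenance=16 * 100.0,  # engineer hours x hourly rate
)

seats = break_even_seats(SEAT_PRICE, **ASSUMPTIONS)
print(f"self-managed breaks even at {seats} seats")
```

Change the maintenance hours and the break-even point moves with them. In most setups that input, not the token price, dominates the result.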

Where does DACH data residency add cost?

For a DACH company, the question is rarely capability and almost always where the data goes. GDPR transfer rules and internal policy usually mean that personal data, customer records, and internal documents should not be sent to a model endpoint that processes them outside the EU. The EU AI Act adds governance expectations on top. That constraint adds cost in a few concrete places, none of them ruinous if you plan for them:

  • EU-region model endpoints. The major providers offer EU data-residency options (for example Azure OpenAI Data Zone deployments in Sweden Central or Germany West Central). The processing cost is broadly comparable to the standard rate; the surcharge for an EU region, where one applies, is typically modest rather than a multiple.
  • EU-hosted vector DB and app. Pinning your vector database and app server to an EU region is a configuration choice, not a price tier. It mostly removes the cheapest global options from the table, which nudges hosting up a little.
  • The compliance work itself. The real residency cost is the review: the data-processing agreement, the records of processing, and confirming no maintenance access reaches the data from outside the EU. That is one-time effort plus a smaller recurring review, and it is far cheaper done at design time than retrofitted.
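One cheap way to stop the residency design from drifting is to declare each component's region in config and check it in CI. Here is a minimal sketch with a hypothetical component list. The allowlist uses Azure region identifiers (`swedencentral`, `germanywestcentral`, `westeurope`, `francecentral`) purely as an example; substitute whatever regions your data-processing agreement and records of processing actually cover. A passing check catches configuration drift. It does not replace the compliance review.

```python
from dataclasses import dataclass

# Example allowlist only; align it with the regions your DPA actually covers.
ALLOWED_EU_REGIONS = frozenset(
    {"swedencentral", "germanywestcentral", "westeurope", "francecentral"}
)


@dataclass(frozen=True)
class Component:
    name: str
    region: str
    touches_document_data: bool = True  # prompts, chunks, embeddings, logs


def residency_violations(components, allowed=ALLOWED_EU_REGIONS):
    """Names of components that handle document data outside allowed regions."""
    return [
        c.name
        for c in components
        if c.touches_document_data and c.region not in allowed
    ]


# Hypothetical stack: EU model endpoint, EU vector DB and app, US log sink.
stack = [
    Component("llm-endpoint", "swedencentral"),
    Component("vector-db", "germanywestcentral"),
    Component("app-server", "germanywestcentral"),
    Component("log-analytics", "eastus"),
    Component("status-page", "eastus", touches_document_data=False),
]

violations = residency_violations(stack)
if violations:
    print("Outside allowed regions:", ", ".join(violations))
```

Log sinks are a common gap, because prompts and retrieved chunks often end up in logs. That is why the example flags `log-analytics` and ignores the status page.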

Done from the start, residency is a design choice that barely moves the per-seat number. Done as an afterthought, it is a rebuild, which is the expensive version.

What does it cost per seat, per month, in a worked example?

Illustrative only. Your numbers will differ, and you should re-check current pricing before you trust this. The point is the shape of the bill, not the exact figure.

Assume a DACH company with 50 active users, each running roughly 10 queries per working day (about 11,000 queries a month over 22 working days). The corpus is a few hundred thousand document chunks, answered by a mid-tier EU-region model with tight retrieval (a handful of chunks per answer), nightly re-indexing, and a managed vector DB.

| Line | Directional monthly cost |
| --- | --- |
| Embeddings (nightly deltas after the one-off bulk embed) | ~$5 to $20 |
| Managed vector DB (production tier, EU region) | ~$50 to $150 |
| LLM answer tokens (mid-tier model, tight context) | ~$60 to $250 |
| Hosting, gateway, monitoring | ~$30 to $100 |
| Infrastructure subtotal | ~$150 to $520 / month |
| Divided across 50 seats | ~$3 to $10 per seat / month |
| Maintenance (engineer time, amortised) | The dominant line over a year |
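You can reproduce the table's arithmetic in a few lines, which also makes it easy to swap in your own ranges. This sketch uses only the example's figures plus the 22-working-day assumption behind the 11,000 queries. Maintenance is left out because it is engineer time, not an infrastructure line.

```python
USERS = 50
QUERIES_PER_USER_PER_DAY = 10
WORKING_DAYS = 22  # assumption behind "about 11,000 queries a month"

# (low, high) directional monthly ranges in $, from the worked example.
LINES = {
    "embeddings (nightly deltas)": (5, 20),
    "managed vector DB (EU region)": (50, 150),
    "LLM answer tokens": (60, 250),
    "hosting, gateway, monitoring": (30, 100),
}

queries_per_month = USERS * QUERIES_PER_USER_PER_DAY * WORKING_DAYS
low = sum(lo for lo, _ in LINES.values())
high = sum(hi for _, hi in LINES.values())
llm_low, llm_high = LINES["LLM answer tokens"]

print(f"queries per month:       {queries_per_month:,}")
print(f"infrastructure subtotal: ${low} to ${high} / month")
print(f"per seat:                ${low / USERS:.2f} to ${high / USERS:.2f}")
print(
    "implied LLM spend per query: "
    f"${llm_low / queries_per_month:.4f} to ${llm_high / queries_per_month:.4f}"
)
```

Unrounded, the subtotal is $145 to $520 and the per-seat figure is $2.90 to $10.40; the table rounds these to ~$150 to $520 and ~$3 to $10. The LLM range implies roughly half a cent to a little over two cents per query.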

So the running infrastructure of a 50-seat internal assistant often lands at roughly $3 to $10 per seat per month (a similar range in euros). That number surprises people who expected a four-figure monthly bill. The catch is the line we left for last: maintenance is what turns a cheap-looking setup into a real annual cost, and it decides whether the assistant stays trustworthy.

How do you keep it cheap without letting it rot?

The same discipline that keeps a production AI build affordable applies here. The cost levers, in the order they pay off:

  • Route to the cheapest capable model. Most internal questions do not need your most expensive model. Reserve the frontier model for the hard minority and the per-query cost drops sharply.
  • Retrieve less, more precisely. The single biggest token waste is stuffing too many chunks into each answer. Good retrieval plus a reranker sends the model a few relevant chunks, not ten pages. This is the lever with the largest effect on the LLM line.
  • Cache the repeats. Internal teams ask the same handful of questions far more than customer-facing users do, which makes caching unusually effective. The deeper token mechanics are in how to cut LLM token costs in 2026.
  • Re-index on a schedule, not on every keystroke. A nightly sync is enough for most internal knowledge, and re-embedding only the documents that changed keeps embedding spend near zero.
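Routing and caching can share one entry point. The following is a minimal sketch under several assumptions:

- `call_model(model, question)` is a hypothetical function you supply; retrieval itself is omitted.
- The model names are hypothetical.
- The routing rule is deliberately crude: a weak retrieval score or a very long question goes to the frontier model. A real router would be tuned against your eval set.

The cache key includes the user's permission group, so an answer built from documents one group can see is never served to another. Clear the cache after each re-index.

```python
from collections import OrderedDict
from typing import Callable, Optional

MID_TIER_MODEL = "mid-tier-model"  # hypothetical model identifiers
FRONTIER_MODEL = "frontier-model"


def normalise(question: str) -> str:
    # Collapse case and whitespace so trivially different repeats share a key.
    return " ".join(question.lower().split())


def route(question: str, retrieval_score: float) -> str:
    # Crude placeholder rule: escalate only when the cheap path looks risky.
    if retrieval_score < 0.5 or len(question.split()) > 60:
        return FRONTIER_MODEL
    return MID_TIER_MODEL


class AnswerCache:
    """Small LRU cache keyed by (permission group, normalised question)."""

    def __init__(self, max_entries: int = 1000) -> None:
        self._store = OrderedDict()
        self.max_entries = max_entries

    def get(self, group: str, question: str) -> Optional[str]:
        key = (group, normalise(question))
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, group: str, question: str, answer: str) -> None:
        key = (group, normalise(question))
        self._store[key] = answer
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

    def clear(self) -> None:
        self._store.clear()  # call after each nightly re-index


def answer(
    question: str,
    group: str,
    retrieval_score: float,
    cache: AnswerCache,
    call_model: Callable[[str, str], str],
) -> str:
    cached = cache.get(group, question)
    if cached is not None:
        return cached  # repeat question: no tokens spent
    model = route(question, retrieval_score)
    result = call_model(model, question)
    cache.put(group, question, result)
    return result


calls = []


def fake_call_model(model: str, question: str) -> str:
    calls.append(model)  # record which model would have been billed
    return f"[{model}] answer to: {question}"


cache = AnswerCache()
answer("How do I book travel?", "staff", 0.8, cache, fake_call_model)
answer("how do I  book travel?", "staff", 0.8, cache, fake_call_model)  # hit
answer("How do I book travel?", "contractors", 0.8, cache, fake_call_model)
answer("What changed in the bonus policy?", "staff", 0.3, cache, fake_call_model)
print(calls)  # ['mid-tier-model', 'mid-tier-model', 'frontier-model']
```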

One humility note that no cost table captures: a cheap assistant that quietly gives wrong answers is the most expensive outcome of all. You need an eval set, a way to measure answer quality after the documents change, and someone whose job is to watch it. That maintenance line is not optional padding. It is the difference between a tool the team trusts and one they stop opening. We saw the same discipline pay off in production AI work like Twinsoft AI, where the eval harness is what made cost optimisation safe.
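The eval set does not have to start big. Here is a minimal sketch of the retrieval half, assuming a hypothetical `retrieve(question)` function that returns ranked document IDs. It pairs a handful of real questions with the document that should answer each one, scores a hit rate at k, and compares it against the last known-good run. It measures whether the right source is found, not whether the generated answer is correct; grading answers is a separate step. The toy index, document IDs, and the 0.05 tolerance are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_doc: str  # ID of the document that should be retrieved


def hit_rate(
    cases: Sequence[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Share of cases whose expected document appears in the top-k results."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(1 for c in cases if c.expected_doc in retrieve(c.question)[:k])
    return hits / len(cases)


def passes_regression(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    # Fail the nightly run if quality drops more than `tolerance` below baseline.
    return current >= baseline - tolerance


# Toy stand-in for your real retriever (hypothetical document IDs).
TOY_INDEX = {
    "how do i book travel": ["travel-policy", "expenses"],
    "who approves overtime": ["hr-handbook"],
    "where is the vpn guide": ["it-onboarding"],
}


def toy_retrieve(question: str) -> list[str]:
    return TOY_INDEX.get(question.lower().rstrip("?"), [])


cases = [
    EvalCase("How do I book travel?", "travel-policy"),
    EvalCase("Who approves overtime?", "payroll-faq"),  # simulated miss
    EvalCase("Where is the VPN guide?", "it-onboarding"),
]

rate = hit_rate(cases, toy_retrieve)
print(f"hit@5: {rate:.2f}, passes: {passes_regression(0.9, rate)}")
```

Run the same cases after every re-index or model upgrade. A failing check is the signal that documents moved or retrieval drifted, before users notice.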

Final thoughts

The honest answer to what an internal AI assistant costs in the DACH region in 2026: the running infrastructure is cheaper than most leaders expect. At modest scale it is often around $3 to $10 (a similar range in euros) per seat per month, because embeddings, vector storage, and tokens have all collapsed in price. The cost that actually decides the project is maintenance, the engineer time to keep answers correct as documents change, and over a year that line dwarfs the infrastructure.

Data residency for a DACH company adds modest cost when designed in from the start and becomes an expensive rebuild when bolted on later. Get your three assumptions on paper first: who uses it and how many queries a day, how much context each answer needs, and how fresh the index must be. Then route to the cheapest capable model, retrieve precisely, and budget honestly for the upkeep. A trustworthy assistant your team owns is worth far more than a cheap one that quietly drifts.
