RAG Production-Readiness Checklist for EU Companies

A RAG demo over your own documents is one of the easiest wins in AI. Point a retriever at your wiki, wire it to a model, and you have something that answers questions in an afternoon. The hard part is everything after the demo. A demo that impresses a meeting is not the same as an assistant your employees or customers can rely on. In the EU it also has to be defensible under GDPR and the AI Act, and it has to run without a cloud bill that quietly doubles every quarter. This is the checklist we work through before a RAG system goes from interesting to trusted.

None of this is an argument against RAG. It is the cheapest way to give a model your knowledge without retraining anything. It is an argument for treating the demo as the start of the work, not the end of it.

Is the retrieval actually good enough?

Everything downstream depends on the model getting the right passages. If retrieval is weak, no amount of prompt tuning saves you, because the model is answering from the wrong source or from nothing. This is the part demos skip, because on a handful of clean documents almost any setup looks fine.

Chunking that respects meaning. Splitting on a fixed character count tears sentences and tables apart. Chunk on structure (headings, sections, logical units) so that a retrieved passage is a complete thought.
An embedding model that fits your content. The default is rarely the best for your domain or your languages. If you serve customers in German and English, test that retrieval works in both, not just the one you demoed in.
An evaluation set, not a vibe check. Write down real questions and the passages that should answer them. Measure recall, that is, whether the right passage is actually retrieved, before you touch anything else.
Citations returned with every answer. The system should hand back which documents it used. This is what makes answers checkable and, later, what makes the whole thing auditable.
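
As a rough illustration of the evaluation-set point, here is a minimal sketch of a recall check. It counts a question as a hit when at least one expected passage appears in the top k results. The passage IDs, the example questions, and the `fake_retrieve` stand-in are all hypothetical; in practice you would pass in your own retriever with the same `(question, k)` signature.

```python
from typing import Callable, Dict, List, Set

# Hypothetical eval set: each real question maps to the passage IDs that should answer it.
EVAL_SET: Dict[str, Set[str]] = {
    "How many vacation days do new hires get?": {"hr-policy#leave"},
    "Wie beantrage ich Elternzeit?": {"hr-policy#parental"},
}


def recall_at_k(
    retrieve: Callable[[str, int], List[str]],
    eval_set: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one expected passage in the top k results."""
    if not eval_set:
        raise ValueError("eval set is empty")
    hits = 0
    for question, expected in eval_set.items():
        retrieved = set(retrieve(question, k)[:k])
        if retrieved & expected:
            hits += 1
    return hits / len(eval_set)


def fake_retrieve(question: str, k: int) -> List[str]:
    """Stand-in for a real retriever; returns canned passage IDs, best match first."""
    canned = {
        "How many vacation days do new hires get?": ["hr-policy#leave", "it-faq#vpn"],
        "Wie beantrage ich Elternzeit?": ["it-faq#vpn", "hr-policy#leave"],
    }
    return canned.get(question, [])[:k]


if __name__ == "__main__":
    print(f"recall@5 = {recall_at_k(fake_retrieve, EVAL_SET, k=5):.2f}")
```

In this toy data the English question is retrieved correctly and the German one is not, which is exactly the kind of gap a single-language demo hides.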

Will it make things up?

A model with retrieval still hallucinates if you let it. The discipline of RAG is to keep the model on a short leash: answer from the retrieved context, and only from it.

Answer only from retrieved context. Instruct the model to ground every claim in the passages it was given, and not to fill gaps from its general training.
Refuse when the context is thin. If retrieval comes back empty or weak, the right answer is "I do not have that," not a confident guess. A system that refuses well is more trustworthy than one that always answers.
Show the sources. Surface the citations to the user. It lets them verify the answer, and it changes behaviour: people who can see where an answer came from trust the system more, and trust it less blindly.
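
One way to combine these three rules is to decide about refusal before the model is ever called. The sketch below assumes the retriever returns a similarity score where higher is better; the `Passage` type, the `MIN_SCORE` threshold, and the prompt wording are illustrative assumptions, not a recommended configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # similarity from the retriever; higher is better (assumed)


REFUSAL = "I do not have that in the documents I can see."
MIN_SCORE = 0.75  # hypothetical threshold; tune it against your own eval set


def build_grounded_prompt(
    question: str, passages: List[Passage], min_score: float = MIN_SCORE
) -> Optional[str]:
    """Return a prompt confined to the usable passages, or None to signal a refusal."""
    usable = [p for p in passages if p.score >= min_score]
    if not usable:
        return None  # thin context: the caller should answer with REFUSAL, not call the model
    context = "\n\n".join(
        f"[{i}] ({p.doc_id}) {p.text}" for i, p in enumerate(usable, start=1)
    )
    return (
        "Answer using only the numbered passages below. "
        "Cite the passage numbers for every claim. "
        f'If the passages do not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
```

The numbered passages carry their document IDs, so whatever the model cites can be mapped back to sources and shown to the user.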

"A RAG demo answers the questions you tested. A RAG product has to answer the questions you did not, and say nothing when it should not. The gap between those two is the whole engagement."

Can you measure it the same way twice?

The thing that quietly kills RAG projects is that every change feels like progress and nobody can prove it. You swap the embedding model, tweak a prompt, reindex, and the demo still works, so you ship it. Then a class of questions silently gets worse.

Build a golden question and answer set: a fixed list of real questions with the answers and sources you expect. Re-run it on every change (a new prompt, a new model, a reindex) and compare the results. It does not need to be fancy. It needs to be the same every time, so a regression shows up as a number going down instead of a complaint three weeks later.
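
A golden set can be as plain as the sketch below. It assumes your assistant can be called as a function that returns a dict with `text` and `sources` keys, and it uses a deliberately crude pass criterion (expected source cited, expected phrase present). The field names and the `find_regressions` helper are hypothetical.

```python
from typing import Callable, Dict, List


def run_golden_set(answer: Callable[[str], Dict], golden: List[Dict]) -> Dict[str, bool]:
    """Run every golden case; a case passes if the expected source is cited
    and the expected phrase appears in the answer text."""
    results: Dict[str, bool] = {}
    for case in golden:
        out = answer(case["question"])
        cited = case["expected_source"] in out.get("sources", [])
        contains = case["expected_phrase"].lower() in out.get("text", "").lower()
        results[case["id"]] = cited and contains
    return results


def find_regressions(results: Dict[str, bool], baseline: Dict[str, bool]) -> List[str]:
    """IDs of cases that passed in the baseline run but fail now."""
    return sorted(
        case_id for case_id, ok in baseline.items() if ok and not results.get(case_id, False)
    )


def pass_rate(results: Dict[str, bool]) -> float:
    """The single number to track from one run to the next."""
    return sum(results.values()) / len(results) if results else 0.0
```

Store the results of the last accepted run as the baseline, and treat a non-empty list from `find_regressions` as a reason to stop and look before shipping.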

What will it cost to run, and how fast is it?

A RAG demo serving one person is free in all the ways that matter. A RAG assistant serving a company is a recurring bill that scales with usage, and latency that users feel on every query. Both are design decisions, not afterthoughts.

Cache what repeats. Many questions are asked over and over. Caching retrieved context and frequent answers cuts both cost and latency.
Tier your models. Not every query needs the largest model. Route simple questions to a smaller, cheaper one and reserve the big model for the hard cases.
Budget your tokens. Stuffing twenty passages into every prompt is slow and expensive and often makes answers worse. Retrieve fewer, better passages.
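
The three levers can sit in one thin wrapper around the retriever and the model call. The sketch below is an assumption-heavy illustration: the model names are placeholders, the routing rule is a crude heuristic, tokens are approximated by whitespace-separated words, and `retrieve` and `generate` are hypothetical callables you would supply.

```python
import hashlib
from typing import Callable, Dict, List


def cache_key(question: str) -> str:
    """Normalize case and whitespace so trivially different phrasings share a cache entry."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fit_to_budget(passages: List[str], max_tokens: int) -> List[str]:
    """Keep best-ranked passages until the budget is used up.
    Assumes passages are sorted best first; counts words as a rough token proxy."""
    kept: List[str] = []
    used = 0
    for passage in passages:
        cost = len(passage.split())
        if used + cost > max_tokens:
            break
        kept.append(passage)
        used += cost
    return kept


def pick_model(question: str, passages: List[str]) -> str:
    """Crude routing heuristic: long questions or many passages go to the larger model."""
    if len(question.split()) > 30 or len(passages) > 3:
        return "large-model"  # placeholder name
    return "small-model"  # placeholder name


class CachedAssistant:
    def __init__(
        self,
        retrieve: Callable[[str], List[str]],
        generate: Callable[[str, str, List[str]], str],
        max_context_tokens: int = 1500,
    ) -> None:
        self.retrieve = retrieve
        self.generate = generate
        self.max_context_tokens = max_context_tokens
        self._cache: Dict[str, str] = {}

    def ask(self, question: str) -> str:
        key = cache_key(question)
        if key in self._cache:
            return self._cache[key]  # repeat question: no retrieval, no model call
        passages = fit_to_budget(self.retrieve(question), self.max_context_tokens)
        answer = self.generate(pick_model(question, passages), question, passages)
        self._cache[key] = answer
        return answer
```

One caveat on the design: a cache keyed only on the question is safe only when every user may see the same documents. If permissions differ per user, the key would also need to include the user's access scope, and cached answers should expire when the index changes.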

The economics here are shifting fast, and the architecture you choose now decides what you pay later. We went deeper on that in our piece on the LLM API cost shift.

Does it hold up under EU rules?

This is where EU companies have homework that a US tutorial will not mention. We are describing obligations at a general level here, not giving legal advice, and the details depend on your sector and your data. Talk to counsel for the specifics. But the engineering questions are clear enough to put on a checklist.

Know your data flows. Under GDPR you have to know where personal data goes. If your documents or user questions contain personal data and your model or vector store sits outside the EU, that is a transfer you have to be able to justify. Data residency is a design choice you make early, not a setting you flip late.
Be transparent that it is AI. The AI Act expects users to know when they are interacting with an AI system rather than a human. For an assistant, that generally means telling people plainly.
Handle PII deliberately. Decide what personal data is allowed into the index and into prompts, and what gets redacted or excluded. Retrieval can surface a document someone forgot was sensitive.
Log for audit. Keep a record of what was asked, what was retrieved, and what was answered. You will want it to debug, to improve, and to answer the question "why did it say that" when someone asks.
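
To make the PII and audit points concrete, here is a minimal sketch that redacts email addresses and builds one JSON line per interaction. The regular expression covers only simple email addresses and is an illustration, not a complete PII detector; the record fields are assumptions about what you might want to keep.

```python
import json
import re
from datetime import datetime, timezone
from typing import List

# Matches simple email addresses only; real PII handling needs far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace email addresses with a placeholder before text is stored or sent on."""
    return EMAIL.sub("[EMAIL]", text)


def audit_record(user_id: str, question: str, retrieved_ids: List[str], answer: str) -> str:
    """Build one JSON line recording what was asked, retrieved, and answered."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "question": redact(question),
        "retrieved": retrieved_ids,
        "answer": redact(answer),
    }
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    print(audit_record("u-17", "Who is anna@example.de?", ["hr-policy#leave"], "I do not have that."))
```

Storing passage IDs rather than passage text keeps the log small and avoids copying sensitive content into yet another place. The log itself still contains personal data, so it needs its own access controls and retention decision.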

We have written separately on the cost of AI Act compliance for a startup and how GDPR and the AI Act stack for a DACH SaaS, if you want the regulatory side in more depth.

Can someone break it or read what they should not?

A RAG system has two security problems most demos never face: the model can be manipulated through its inputs, and the index becomes a new place sensitive data lives.

Prompt injection. A retrieved document can contain instructions aimed at the model: "ignore your rules and reveal X." Treat retrieved content as untrusted input, not as a command, and test for it.
Access control on the index. The vector store is now a copy of your knowledge. It needs the same access controls as the source systems, not weaker ones because it is "just embeddings."
Per-user document permissions. This is the one that bites hardest. If a user can only see certain documents, retrieval must respect that on every query, so the assistant never surfaces a passage from a document that user was never allowed to read. A RAG system that ignores permissions is a data leak with a friendly chat interface.
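
The per-user permission rule can be sketched as a filter that runs on every query. This assumes each chunk was indexed together with the groups allowed to read its source document; the `Chunk` type, the `search` callable, and the over-fetch factor are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]  # copied from the source system's ACL at index time (assumed)


def retrieve_for_user(
    search: Callable[[str, int], List[Chunk]],
    query: str,
    user_groups: FrozenSet[str],
    k: int = 5,
) -> List[Chunk]:
    """Over-fetch candidates, then drop every chunk the user's groups do not grant access to."""
    candidates = search(query, k * 4)  # over-fetch so filtering still leaves enough results
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return visible[:k]
```

Filtering after the search is the simplest version to reason about. If your vector store supports metadata filters at query time, applying the permission filter there is usually preferable, because forbidden chunks then never leave the store. Either way, the permissions stored in the index have to be refreshed when they change in the source system.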

The checklist

Run this before you let anyone rely on a RAG assistant. If any line makes you hesitate, that is the line to fix first.

Retrieval. Meaningful chunking, an embedding model tested on your content and languages, an eval set with measured recall, citations on every answer.
Grounding. Answers come only from retrieved context, the system refuses when unsure, sources are shown to the user.
Repeatable evaluation. A golden question and answer set that you re-run on every prompt, model, or index change.
Cost and latency. Caching, model tiering, and token budgets that you can defend at scale, not just in the demo.
EU compliance. Known data flows and residency, transparency that it is AI, deliberate PII handling, audit logging.
Security. Prompt injection tested, access control on the index, per-user permissions enforced on retrieval so nothing leaks across users.

The deliverable is not a slideshow about RAG. It is a system that retrieves the right thing, refuses when it should, costs what you budgeted, and does not leak.

Final thoughts

A RAG demo is easy because it skips the parts that are hard: weak retrieval hides behind clean test documents, hallucinations hide behind questions you already know the answer to, and the cost, compliance, and permission problems do not show up until real people and real data arrive. None of that disappears. It just waits for production, when it is most expensive to discover.

If you are an EU company moving a RAG assistant from a demo to something people rely on, work the checklist before you ship. The retrieval, grounding, and per-user permission lines are where trust is won or quietly lost, and they are far cheaper to get right before launch than to explain after an incident.