AI MVP Scope Template: Acceptance Criteria, Eval Set, Launch Gate, and What Belongs in the SoW
Scoping an AI MVP is different from a normal software MVP because the system is probabilistic, not deterministic. "User can reset password" is a binary you can write as a pass-fail test. "The assistant answers correctly" is not, because the same question can produce different answers when the model version, temperature, or phrasing shifts. So you replace "it works" with four things that must be named in the statement of work before money changes hands: a versioned eval set with golden answers, a target metric and threshold on that set, a launch gate that decides go-live, and explicit handling of the uncertain case plus a rollback. If a vendor's scope says "build an AI assistant" with none of those, they scoped a demo, and you will pay for the gap during hardening. A copy-ready template is below.
This is the document founders send around internally before they buy. It is written from the build side, with the awkward questions made explicit. Regulatory dates are current as of mid-2026 and hedged where they are moving.
Why "it works" does not work for AI
In normal software, acceptance is binary: given input X, assert output Y, done. An LLM can give different output for the same input across runs, so "the chatbot answers accurately" has no test behind it. There is no number, no defined set, and no floor, so there is nothing to sign off and nothing to dispute when quality is poor. The fix is statistical, not binary. You accept a measured rate on a defined set, ideally averaged across several runs, and for reproducible regression checks you pin the temperature to zero where the use case allows (this reduces run-to-run variation but does not guarantee identical output). A second trap follows immediately: a fix for one case can silently break another, so the eval set has to run as a gate on every prompt or model change, not once at the end. That single shift, from "it works" to "it clears this bar on this set, every time," is the whole point of scoping an AI build well.
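To make "a measured rate on a defined set, averaged across several runs" concrete, here is a minimal Python sketch. The function names, the 10-case runs, and the 0.90 floor are illustrative assumptions, not part of any particular eval tool.

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed in a single run."""
    if not results:
        raise ValueError("a run must contain at least one case")
    return sum(results) / len(results)


def accept(runs: list[list[bool]], floor: float) -> tuple[float, float, bool]:
    """Return (mean pass rate, worst single-run rate, accepted?)."""
    rates = [pass_rate(run) for run in runs]
    avg = mean(rates)
    return avg, min(rates), avg >= floor


# Hypothetical data: three runs of the same 10-case set.
runs = [
    [True] * 9 + [False],
    [True] * 10,
    [True] * 9 + [False],
]
avg, worst, accepted = accept(runs, floor=0.90)
print(f"mean={avg:.3f} worst={worst:.3f} accepted={accepted}")
```

Reporting the worst run next to the mean is a design choice in this sketch: it shows how much of the headline number depends on a lucky run.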
What an AI MVP statement of work must pin down
Each section earns its place. The load-bearing ones are the eval set, the acceptance criteria, and the launch gate.
| Section | What it must pin down |
|---|---|
| Problem + one outcome | One sentence. The single user job the MVP must do. Everything not serving it is out of scope. |
| In and out of scope | Two lists. The out-of-scope list is the load-bearing one: name the tempting things you are not building (multi-language, voice, fine-tuning, mobile) so they become change requests, not assumptions. |
| Functional + AI behavior spec | Normal requirements plus the AI behavior: task, tone, refusal behavior (when it must say "I do not know"), citation requirement, and fallback on low confidence or no retrieval hit. This is where you encode the uncertain case. |
| Acceptance criteria | Target metric plus threshold on a named eval set. Never "it works." Examples below. |
| The eval set | How many cases (around 100 is a workable MVP floor, ~200 for a fuller set; ten tells you almost nothing), who owns the golden answers (a domain expert on your side, not the vendor alone), and how each case is scored: deterministic checks for objective fields, an LLM-as-judge with binary pass-fail for nuanced quality, human review to calibrate. |
| Launch gate + rollback | The measurable bar to go live, agreed by product, engineering, and QA before tests are written, plus an auto-rollback trigger and a kill criterion. |
| Data | Sources, provenance, rights to use them (training and retrieval), PII handling, and residency. The model API that sees prompts is a sub-processor: it needs a DPA and a no-training, zero-retention config, not consumer terms. |
| Non-functionals | Latency budget (for streaming UIs, time-to-first-token is the primary SLA), cost-per-action budget (output tokens cost several times input, and agentic flows fan one action into many calls), and logging of every model call. |
| Security and compliance | Auth, permission model and tenant isolation, GDPR (record of processing, lawful basis, DPIA for high-risk, sub-processor DPAs), and EU AI Act transparency where the app talks to users. |
| Milestones, payment, IP, handover | Phased discovery, build, eval and hardening, launch, with payment per phase. IP assigned to you on payment. A named handover artifact list. |
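As a rough illustration of the eval-set row, the sketch below shows one possible shape for a versioned eval case and how scoring could branch between a deterministic check and a judge. Every name here (`EvalCase`, `score_case`, `stub_judge`) is hypothetical, and the stub judge is only a placeholder for a real LLM-as-judge call returning pass or fail.

```python
from dataclasses import dataclass
from typing import Callable, Literal


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    golden_answer: str  # owned by the client's domain expert
    kind: Literal["happy", "edge", "unanswerable"]
    scoring: Literal["exact", "judge"]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def score_case(
    case: EvalCase,
    output: str,
    judge: Callable[[EvalCase, str], bool],
) -> bool:
    """Deterministic check for objective fields, binary judge verdict otherwise."""
    if case.scoring == "exact":
        return normalize(output) == normalize(case.golden_answer)
    return judge(case, output)


def stub_judge(case: EvalCase, output: str) -> bool:
    # Placeholder only: a real judge would call a model with a rubric.
    return normalize(case.golden_answer) in normalize(output)


# Hypothetical cases for illustration.
vat_case = EvalCase("c001", "What is the invoice total?", "EUR 1,250.00", "happy", "exact")
policy_case = EvalCase("c002", "Can I return an opened item?", "within 14 days", "edge", "judge")

print(score_case(vat_case, "  eur 1,250.00 ", stub_judge))
print(score_case(policy_case, "Yes, within 14 days of delivery.", stub_judge))
```

Keeping the scoring method on the case itself means the eval set, not the harness, records how each answer is judged, which makes the set easier to version and audit.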
Acceptance criteria: wrong versus right
This is the section that decides whether you can hold a vendor to anything.
Wrong, because there is no number, no set, and no floor: "the chatbot answers customer questions correctly," "the assistant is accurate and helpful," "the model rarely hallucinates," "it works well in testing."
Right, on the agreed 100-case eval set with fixed golden answers:
- at least 90% faithful answers, grounded in retrieved context, scored by an LLM-as-judge on a binary pass-fail and calibrated against human labels;
- at least 85% answer relevance;
- under 2% harmful or policy-violating outputs as the stated ceiling, enforced as a hard gate: any single harmful output in the eval run blocks launch, which on a 100-case set means zero in practice;
- refuses or escalates on at least 95% of the unanswerable test cases;
- p95 time-to-first-token under 2 seconds, p95 full response under 4 seconds;
- cost under EUR 0.05 per conversation at the agreed model;
- no regression below any floor on the eval run before a deploy.
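The criteria above can be checked mechanically on every eval run. The following is a minimal sketch, assuming the example thresholds from the list, a nearest-rank p95, and the harmful-output criterion read as the hard gate (any single harmful output fails). The metric names and sample data are hypothetical.

```python
FLOORS = {"faithfulness": 0.90, "relevance": 0.85, "correct_refusal": 0.95}
CEILINGS = {
    "ttft_p95_s": 2.0,
    "full_response_p95_s": 4.0,
    "cost_per_conversation_eur": 0.05,
}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def launch_gate(metrics: dict[str, float], harmful_outputs: int) -> list[str]:
    """Return the list of failed criteria; an empty list means the gate passes."""
    failures = []
    for name, floor in FLOORS.items():
        if metrics[name] < floor:  # "at least" criteria
            failures.append(f"{name} {metrics[name]} is below floor {floor}")
    for name, ceiling in CEILINGS.items():
        if metrics[name] >= ceiling:  # "under" criteria are strict
            failures.append(f"{name} {metrics[name]} is not under {ceiling}")
    if harmful_outputs > 0:  # hard gate
        failures.append(f"{harmful_outputs} harmful output(s) block launch")
    return failures


# Hypothetical measurements from one eval run.
ttft_samples = [0.8] * 18 + [1.9, 3.0]
metrics = {
    "faithfulness": 0.92,
    "relevance": 0.88,
    "correct_refusal": 0.96,
    "ttft_p95_s": p95(ttft_samples),
    "full_response_p95_s": 3.4,
    "cost_per_conversation_eur": 0.03,
}
failures = launch_gate(metrics, harmful_outputs=0)
print("GO" if not failures else failures)
```

Returning the failed criteria rather than a bare boolean gives both sides something specific to dispute or fix.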
Treat the exact thresholds as starting points to negotiate against your risk, not gospel. Vendors commonly cite faithfulness at or above 0.75 and hallucination under 5% (1% for high-stakes) as production starting points; set yours by how much a wrong answer costs you. The eval set you scope here is the same evidence an investor's technical diligence will ask for, which we cover in our technical due diligence checklist for AI MVPs. The scoring methods are covered in our piece on when LLM evals are worth building.
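The faithfulness criterion asks for a judge that is calibrated against human labels. One common way to check this, offered here as an assumption rather than a method the criteria prescribe, is to compare the judge's pass-fail verdicts with a human reviewer's on the same cases, using raw agreement and Cohen's kappa. The labels below are made up for illustration.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two sets of binary labels."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement expected by chance, given each rater's pass rate.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when both raters use one label
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels on ten calibration cases (True = pass).
human_labels = [True, True, True, False, False, True, True, False, True, True]
judge_labels = [True, True, False, False, False, True, True, True, True, True]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Kappa corrects for agreement that would occur by chance, which matters when most cases pass: a judge that says "pass" to everything can still show high raw agreement.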
The copy-ready scope template
Paste it, fill the brackets, delete what does not apply. Send it around before you take a single proposal.
AI MVP Scope / SoW, [Project Name]
Date [date] · Version [v0.1] · Owner [name]
1. Problem + one outcome. Problem: [one sentence]. The one outcome this MVP must deliver: [user] can [do X] so that [Y].
2. Scope. In scope: [feature 1], [feature 2]. Out of scope, change request only: [multi-language], [voice], [fine-tuning], [mobile].
3. Functional + AI behavior. Functional: [list]. AI behavior: task [exactly what], tone [concise, no speculation], citations [must or must not ground in sources], refusal [when out of scope or low confidence, say "I do not know" or escalate], fallback [secondary model, cached answer, or human handoff].
4. Acceptance criteria. On the agreed [100]-case eval set: faithfulness >= [90]%; relevance >= [85]%; harmful outputs < [2]% (any single fail blocks launch); correct refusal >= [95]% on unanswerable cases; p95 time-to-first-token < [2]s; p95 full response < [4]s; cost per conversation < [EUR 0.05] at model [X]; no regression below any floor before deploy.
5. Eval set. Size [100] cases ([X] happy path, [Y] edge, [Z] unanswerable). Golden-answer owner: [client domain expert]. Scoring: deterministic checks for [objective fields], LLM-as-judge pass-fail for [quality], human review for calibration. Stored and versioned in [location].
6. Launch gate + rollback. Go live when all Section 4 floors are met, signed off by [product] + [engineering] + [QA]. Rollback: if [metric] drops below [floor] over a [window], auto-revert. Kill criterion: do not ship if [faithfulness < X% or any harmful output].
7. Data. Sources [list], provenance and rights [per source], PII [what and how handled], residency [EU region]. Model provider: DPA signed, no-training, zero-retention.
8. Non-functionals. Latency and cost budgets as in Section 4. Observability: log every model call (prompt, response, tokens, cost) to [tool], daily cost alert at [threshold].
9. Security + compliance. Auth [method], permissions and tenant isolation [model], GDPR (record of processing, lawful basis, DPIA if high-risk, sub-processor DPAs), EU AI Act Article 50 transparency from 2 August 2026 if the app talks to users. Do not plan around a delay that has not been enacted.
10. Milestones + payment. Discovery, build, eval and hardening, launch, payment per phase. IP assigned to [client] on payment. Handover: eval set and results, prompt and model registry, architecture and data-flow diagram, runbook, logs access, credentials. Sign-off: client [__] vendor [__] date [__].
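The rollback clause in Section 6 ("if [metric] drops below [floor] over a [window], auto-revert") can be sketched as a rolling-window monitor. This is a minimal illustration under stated assumptions: the window is counted in scored samples rather than time, the class name `RollbackMonitor` is hypothetical, and the actual revert action is left to your deployment tooling.

```python
from collections import deque


class RollbackMonitor:
    """Signals a rollback when the pass rate over the last `window` samples drops below `floor`."""

    def __init__(self, floor: float, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.floor = floor
        self.samples: deque[bool] = deque(maxlen=window)

    def record(self, passed: bool) -> bool:
        """Add one scored production sample; return True if rollback should fire."""
        self.samples.append(passed)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet to judge the window
        rate = sum(self.samples) / len(self.samples)
        return rate < self.floor


# Hypothetical stream: ten passes, then two failures in a row.
monitor = RollbackMonitor(floor=0.90, window=10)
signals = [monitor.record(ok) for ok in [True] * 10 + [False, False]]
print(signals[-2:])
```

With a floor of 0.90 and a window of 10, one failure leaves the rate at exactly 0.90 and does not fire; the second pulls it to 0.80 and does. Choose the window size with that sensitivity in mind.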

"If the scope cannot tell you, in numbers, what good enough looks like and what happens when the model is wrong, it is not a scope. It is a wish. The eval set and the launch gate are the two lines that turn an AI demo into something you can actually buy."
A note on the AI Act
If your app interacts with users, the EU AI Act's Article 50 transparency duties, including telling users they are dealing with AI, apply from 2 August 2026 and are largely unaffected by the proposed changes. Most high-risk obligations were originally scheduled for the same date. A provisional agreement in 2026 would push the standalone high-risk duties to late 2027, but as of mid-2026 that is not yet law and takes effect only on formal adoption. Do not scope your compliance plan around a delay that has not been enacted.
Frequently Asked Questions
How do you write acceptance criteria for an AI feature?
What is an eval set?
Who should own the eval set?
How are eval cases scored?
What is a launch gate for an LLM app?
What belongs in an AI statement of work?
Why can I not just write "the AI answers correctly" in the scope?
How do I handle the case where the AI is wrong or unsure?
What data clauses does an AI MVP SoW need?
Does the EU AI Act affect my AI MVP scope?
Final thoughts
An AI MVP scope is only as good as its eval set and its launch gate. Those two lines turn a vague "build us an assistant" into something a vendor can be held to and an investor will later respect.
So before you take a proposal, write the one outcome, draw the out-of-scope line, define the metric and the floor on a set your own expert owns, decide what happens when the model is unsure, and name the bar that lets it go live. The template above is the starting point. Fill it in first, and the proposals you get back will be about the same thing, which is the only way to compare them.