AI MVP Scope Template: Acceptance Criteria, Eval Set, Launch Gate, and What Belongs in the SoW

Scoping an AI MVP is different from a normal software MVP because the system is probabilistic, not deterministic. "User can reset password" is a binary requirement you can write as a pass-fail test. "The assistant answers correctly" is not, because the same question can produce different answers when the model version, temperature, or phrasing shifts. So you replace "it works" with four things that must be named in the statement of work before money changes hands: a versioned eval set with golden answers, a target metric and threshold on that set, a launch gate that decides go-live, and explicit handling of the uncertain case plus a rollback. If a vendor's scope says "build an AI assistant" with none of those, they have scoped a demo, and you will pay for the gap during hardening. A copy-ready template is below.

This is the document founders send around internally before they buy. It is written from the build side, with the awkward questions made explicit. Regulatory dates are current as of mid-2026 and hedged where they are moving.

Why "it works" does not work for AI

In normal software, acceptance is binary: given input X, assert output Y, done. An LLM can give different output for the same input across runs, so "the chatbot answers accurately" has no test behind it. There is no number, no defined set, no floor, nothing to sign off and nothing to dispute when quality is poor. The fix is statistical, not binary. You accept a measured rate on a defined set, ideally averaged across several runs, and for reproducible regression checks you pin the temperature to zero where the use case allows, which reduces run-to-run variation even if it does not always remove it. The second trap follows immediately: a fix for one case can silently break another, so the eval set has to run as a gate on every prompt or model change, not once at the end. That single shift, from "it works" to "it clears this bar on this set, every time," is the whole point of scoping an AI build well.
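To make "a measured rate on a defined set, averaged across several runs" concrete, here is a minimal Python sketch. The function names, the 10-case runs, and the 0.85 floor are illustrative assumptions, not part of any particular eval framework.

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed in a single run."""
    if not results:
        raise ValueError("a run must contain at least one scored case")
    return sum(results) / len(results)


def evaluate_runs(runs: list[list[bool]], floor: float) -> dict:
    """Summarise several runs of the same eval set against an agreed floor."""
    rates = [pass_rate(run) for run in runs]
    mean_rate = mean(rates)
    return {
        "rates": rates,
        "mean_rate": mean_rate,
        "worst_rate": min(rates),       # useful to see how much runs vary
        "accepted": mean_rate >= floor,  # acceptance is a rate, not one pass
    }


# Three hypothetical runs over the same 10-case set (True = case passed).
runs = [
    [True] * 9 + [False],
    [True] * 10,
    [True] * 8 + [False] * 2,
]
summary = evaluate_runs(runs, floor=0.85)
print(summary)
```

Reporting the worst run next to the mean is a design choice: a set that averages above the floor but swings widely between runs may still be worth discussing before sign-off.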

What an AI MVP statement of work must pin down

Each section earns its place. The load-bearing ones are the eval set, the acceptance criteria, and the launch gate.

Section	What it must pin down
Problem + one outcome	One sentence. The single user job the MVP must do. Everything not serving it is out of scope.
In and out of scope	Two lists. The out-of-scope list is the load-bearing one: name the tempting things you are not building (multi-language, voice, fine-tuning, mobile) so they become change requests, not assumptions.
Functional + AI behavior spec	Normal requirements plus the AI behavior: task, tone, refusal behavior (when it must say "I do not know"), citation requirement, and fallback on low confidence or no retrieval hit. This is where you encode the uncertain case.
Acceptance criteria	Target metric plus threshold on a named eval set. Never "it works." Examples below.
The eval set	How many cases (around 100 is a workable MVP floor, ~200 for a fuller set; ten tells you almost nothing), who owns the golden answers (a domain expert on your side, not the vendor alone), and how each case is scored: deterministic checks for objective fields, an LLM-as-judge with binary pass-fail for nuanced quality, human review to calibrate.
Launch gate + rollback	The measurable bar to go live, agreed by product, engineering, and QA before tests are written, plus an auto-rollback trigger and a kill criterion.
Data	Sources, provenance, rights to use them (training and retrieval), PII handling, and residency. The model API that sees prompts is a sub-processor: it needs a DPA and a no-training, zero-retention config, not consumer terms.
Non-functionals	Latency budget (for streaming UIs, time-to-first-token is the primary SLA), cost-per-action budget (output tokens cost several times input, and agentic flows fan one action into many calls), and logging of every model call.
Security and compliance	Auth, permission model and tenant isolation, GDPR (record of processing, lawful basis, DPIA for high-risk, sub-processor DPAs), and EU AI Act transparency where the app talks to users.
Milestones, payment, IP, handover	Phased discovery, build, eval and hardening, launch, with payment per phase. IP assigned to you on payment. A named handover artifact list.
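The cost-per-action budget in the Non-functionals row is easy to underestimate when one user action fans out into several model calls. The sketch below shows the arithmetic in Python. The prices (1.00 and 4.00 EUR per million tokens), the token counts, and the five-call fan-out are made-up numbers for illustration, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in EUR."""
    input_per_million: float
    output_per_million: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call: input and output tokens are priced separately."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def action_cost(pricing: Pricing, calls: list[tuple[int, int]]) -> float:
    """Cost of one user action, which may fan out into many model calls."""
    return sum(call_cost(pricing, i, o) for i, o in calls)


# Assumed prices: output tokens cost four times as much as input tokens.
pricing = Pricing(input_per_million=1.00, output_per_million=4.00)

# One plain call versus an agentic flow that makes five calls per action.
single = action_cost(pricing, [(2000, 500)])
agentic = action_cost(pricing, [(2000, 500)] * 5)

budget_eur = 0.05  # the per-conversation budget used in this article
print(f"single call: {single:.4f} EUR, agentic: {agentic:.4f} EUR")
print("within budget:", agentic <= budget_eur)
```

Running the same calculation on logged token counts from real conversations is what turns the budget line in the SoW into something you can check.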

Acceptance criteria: wrong versus right

This is the section that decides whether you can hold a vendor to anything.

Wrong, because there is no number, no set, and no floor: "the chatbot answers customer questions correctly," "the assistant is accurate and helpful," "the model rarely hallucinates," "it works well in testing."

Right, on the agreed 100-case eval set with fixed golden answers:

at least 90% faithful answers, grounded in retrieved context, scored by an LLM-as-judge on a binary pass-fail and calibrated against human labels;
at least 85% answer relevance;
under 2% harmful or policy-violating outputs, as a hard gate where any single harmful output blocks launch;
refuses or escalates on at least 95% of the unanswerable test cases;
p95 time-to-first-token under 2 seconds, p95 full response under 4 seconds;
cost under EUR 0.05 per conversation at the agreed model;
no regression below any floor on the eval run before a deploy.

Treat the exact thresholds as starting points to negotiate against your risk, not gospel. Vendors commonly cite faithfulness at or above 0.75 and hallucination under 5% (1% for high-stakes) as production starting points; set yours by how much a wrong answer costs you. The eval set you scope here is the same evidence an investor's technical diligence will ask for, which we cover in the technical due diligence checklist for AI MVPs, and the scoring methods are in when LLM evals are worth building.
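One way to keep criteria like these enforceable is to encode them as data and check every eval run against them. The following Python sketch does that with the example thresholds above. The metric names, the `Criterion` class, and the `launch_blockers` function are hypothetical names chosen for this illustration, and the measured values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float
    higher_is_better: bool

    def met(self, value: float) -> bool:
        # "at least X" for quality metrics, "under X" for rates, latency, cost.
        if self.higher_is_better:
            return value >= self.threshold
        return value < self.threshold


# Example floors from this article; negotiate your own.
CRITERIA = [
    Criterion("faithfulness", 0.90, True),
    Criterion("relevance", 0.85, True),
    Criterion("harmful_rate", 0.02, False),
    Criterion("correct_refusal", 0.95, True),
    Criterion("p95_ttft_s", 2.0, False),
    Criterion("p95_full_response_s", 4.0, False),
    Criterion("cost_per_conversation_eur", 0.05, False),
]


def launch_blockers(measured: dict[str, float], harmful_count: int) -> list[str]:
    """Return every criterion that is missing or not met; empty means go."""
    blockers = [
        c.name
        for c in CRITERIA
        if c.name not in measured or not c.met(measured[c.name])
    ]
    # Hard gate from the criteria list: any single harmful output blocks launch.
    if harmful_count > 0:
        blockers.append("harmful_output_seen")
    return blockers


# Invented results from one eval run.
measured = {
    "faithfulness": 0.92,
    "relevance": 0.83,
    "harmful_rate": 0.0,
    "correct_refusal": 0.96,
    "p95_ttft_s": 1.4,
    "p95_full_response_s": 4.2,
    "cost_per_conversation_eur": 0.03,
}
blockers = launch_blockers(measured, harmful_count=0)
print(blockers)
```

Treating a missing metric as a blocker is deliberate in this sketch: a run that did not measure something should not pass by omission.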

The copy-ready scope template

Paste it, fill the brackets, delete what does not apply. Send it around before you take a single proposal.

AI MVP Scope / SoW, [Project Name]
Date [date] · Version [v0.1] · Owner [name]

1. Problem + one outcome. Problem: [one sentence]. The one outcome this MVP must deliver: [user] can [do X] so that [Y].

2. Scope. In scope: [feature 1], [feature 2]. Out of scope, change request only: [multi-language], [voice], [fine-tuning], [mobile].

3. Functional + AI behavior. Functional: [list]. AI behavior: task [exactly what], tone [concise, no speculation], citations [must or must not ground in sources], refusal [when out of scope or low confidence, say "I do not know" or escalate], fallback [secondary model, cached answer, or human handoff].

4. Acceptance criteria. On the agreed [100]-case eval set: faithfulness >= [90]%; relevance >= [85]%; harmful outputs < [2]% (any single fail blocks launch); correct refusal >= [95]% on unanswerable cases; p95 time-to-first-token < [2]s; p95 full response < [4]s; cost per conversation < [EUR 0.05] at model [X]; no regression below any floor before deploy.

5. Eval set. Size [100] cases ([X] happy path, [Y] edge, [Z] unanswerable). Golden-answer owner: [client domain expert]. Scoring: deterministic checks for [objective fields], LLM-as-judge pass-fail for [quality], human review for calibration. Stored and versioned in [location].

6. Launch gate + rollback. Go live when all Section 4 floors are met, signed off by [product] + [engineering] + [QA]. Rollback: if [metric] drops below [floor] over a [window], auto-revert. Kill criterion: do not ship if [faithfulness < X% or any harmful output].

7. Data. Sources [list], provenance and rights [per source], PII [what and how handled], residency [EU region]. Model provider: DPA signed, no-training, zero-retention.

8. Non-functionals. Latency and cost budgets as in Section 4. Observability: log every model call (prompt, response, tokens, cost) to [tool], daily cost alert at [threshold].

9. Security + compliance. Auth [method], permissions and tenant isolation [model], GDPR (record of processing, lawful basis, DPIA if high-risk, sub-processor DPAs), EU AI Act Article 50 transparency from 2 August 2026 if the app talks to users, with no plan that assumes an unenacted delay.

10. Milestones + payment. Discovery, build, eval and hardening, launch, payment per phase. IP assigned to [client] on payment. Handover: eval set and results, prompt and model registry, architecture and data-flow diagram, runbook, logs access, credentials. Sign-off: client [__] vendor [__] date [__].

"If the scope cannot tell you, in numbers, what good enough looks like and what happens when the model is wrong, it is not a scope. It is a wish. The eval set and the launch gate are the two lines that turn an AI demo into something you can actually buy."

A note on the AI Act

If your app interacts with users, the EU AI Act's Article 50 transparency duties, including telling users they are dealing with AI, apply from 2 August 2026 and are largely unaffected by the proposed changes. Most high-risk obligations were also due then. A provisional agreement in 2026 would push the standalone high-risk duties to late 2027, but as of mid-2026 that is not yet law and takes effect only on formal adoption. Do not scope your compliance plan around a delay that has not been enacted.

Frequently Asked Questions

How do you write acceptance criteria for an AI feature?

Not as "it works." You define a target metric and a threshold on a named eval set, for example at least 90% faithful answers on a 100-case set, under 2% harmful outputs, p95 latency under 4 seconds, and cost under EUR 0.05 per conversation. Because the system is probabilistic, you accept a measured rate across runs, not a single pass-fail.

What is an eval set?

A versioned, owned set of representative input cases with agreed golden answers, used to measure quality objectively and to catch regressions on every prompt or model change. Around 100 cases is a workable MVP floor, about 200 for a fuller set. Ten cases tell you almost nothing.

Who should own the eval set?

A domain expert on your side, not the vendor alone. The person who knows what a correct answer looks like must define the golden answers, or you are letting the builder grade their own homework.

How are eval cases scored?

Three ways, often combined: deterministic code checks for objective fields like dates, IDs, and JSON shape; an LLM-as-judge for nuanced quality, using binary pass-fail rather than a 1 to 5 scale; and human review to build and calibrate the judge.
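As an illustration of the first method, a deterministic check for JSON shape, an ID, and a date can be a few lines of standard-library Python. The field names (`order_id`, `delivery_date`) and the `ORD-` ID format are assumptions invented for this sketch. A real eval set would use the fields of your own domain.

```python
import json
import re
from datetime import date

# Hypothetical ID format for illustration: "ORD-" followed by six digits.
ORDER_ID = re.compile(r"ORD-\d{6}")


def check_output(raw: str, golden: dict) -> list[str]:
    """Return a list of deterministic failures; an empty list means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    errors = []

    # ID: must exist, match the expected shape, and equal the golden answer.
    order_id = data.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        errors.append("order_id missing or malformed")
    elif order_id != golden["order_id"]:
        errors.append("order_id does not match golden answer")

    # Date: must parse as an ISO date and equal the golden answer.
    try:
        parsed = date.fromisoformat(data.get("delivery_date"))
    except (TypeError, ValueError):
        errors.append("delivery_date missing or not an ISO date")
    else:
        if parsed != date.fromisoformat(golden["delivery_date"]):
            errors.append("delivery_date does not match golden answer")

    return errors


golden = {"order_id": "ORD-004217", "delivery_date": "2026-03-14"}
good = '{"order_id": "ORD-004217", "delivery_date": "2026-03-14"}'
print(check_output(good, golden))
print(check_output("Sure! Your order ships soon.", golden))
```

Checks like this are cheap and exact, which is why they suit objective fields. The judge and human review cover what code cannot assert.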

What is a launch gate for an LLM app?

The measurable bar that decides go-live: the thresholds the system must hit on the eval set, agreed by product, engineering, and QA before any tests are written. Below the bar you do not ship. It also defines a rollback trigger and a kill criterion.
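A rollback trigger of the form "if the metric drops below the floor over a window, revert" can be sketched as a rolling pass rate. The class name, the 10-sample window, and the 0.9 floor below are illustrative assumptions. How production samples get scored, and what the revert actually does, are left out.

```python
from collections import deque


class RollbackMonitor:
    """Signals rollback when the rolling pass rate falls below a floor."""

    def __init__(self, floor: float, window: int) -> None:
        self.floor = floor
        # Keeps only the most recent `window` results.
        self.results: deque[bool] = deque(maxlen=window)

    def record(self, passed: bool) -> bool:
        """Record one scored sample; return True if rollback should trigger."""
        self.results.append(passed)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge the window
        rate = sum(self.results) / len(self.results)
        return rate < self.floor


monitor = RollbackMonitor(floor=0.9, window=10)

# Ten passing samples, then two failures in a row.
signals = [monitor.record(p) for p in [True] * 10 + [False, False]]
print(signals)
```

In this run the first failure leaves the window at exactly the floor, so nothing fires. The second failure pushes it below and the last signal is the trigger. Waiting for a full window avoids reverting on the first bad sample, at the price of reacting more slowly.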

What belongs in an AI statement of work?

Problem and one outcome, in and out of scope, functional requirements plus an AI behavior spec (tone, refusal, citations, fallback), acceptance criteria as metric plus threshold on an eval set, the eval set itself, the launch gate and rollback, data rights and PII and residency, non-functionals (latency, cost-per-action, logging), security and compliance, and milestones, payment, IP, and handover artifacts.

Why can I not just write "the AI answers correctly" in the scope?

Because the model is non-deterministic. The same question can produce different answers, so "correctly" has no test behind it. You need a defined set, a metric, and a floor, or there is nothing to sign off and nothing to dispute when quality is poor.

How do I handle the case where the AI is wrong or unsure?

Specify it in the scope: a refusal or "I do not know" path, escalation to a human, or a fallback. Then test it by including unanswerable cases in the eval set and requiring the system to refuse or escalate on, say, at least 95% of them.
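The refusal criterion is measured only on the unanswerable slice of the eval set, which a short Python sketch makes explicit. The `CaseResult` fields and the case counts are invented for illustration, and deciding whether an output counts as a refusal or escalation is assumed to have happened upstream (by a code check, a judge, or a reviewer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    answerable: bool            # False for deliberately unanswerable cases
    refused_or_escalated: bool  # what the system actually did


def refusal_rate(results: list[CaseResult]) -> float:
    """Share of unanswerable cases where the system refused or escalated."""
    unanswerable = [r for r in results if not r.answerable]
    if not unanswerable:
        raise ValueError("eval set contains no unanswerable cases")
    return sum(r.refused_or_escalated for r in unanswerable) / len(unanswerable)


# Invented results: 20 unanswerable cases (one answered anyway), 5 answerable.
results = (
    [CaseResult(f"u{i}", answerable=False, refused_or_escalated=True) for i in range(19)]
    + [CaseResult("u19", answerable=False, refused_or_escalated=False)]
    + [CaseResult(f"a{i}", answerable=True, refused_or_escalated=False) for i in range(5)]
)
rate = refusal_rate(results)
print(f"refusal rate on unanswerable cases: {rate:.2%}")
```

Raising an error when the set has no unanswerable cases is intentional: a refusal floor that was never exercised should not read as met.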

What data clauses does an AI MVP SoW need?

Provenance and usage rights for training and retrieval data, PII handling, EU data residency where relevant, and a DPA with the model provider configured for no training and zero retention. The model API that sees user prompts is a sub-processor.

Does the EU AI Act affect my AI MVP scope?

If your app interacts with users, Article 50 transparency duties apply from 2 August 2026. Most high-risk obligations were also due then; a proposed delay would move standalone high-risk duties to late 2027, but that is not yet law, so do not scope around it.

Final thoughts

An AI MVP scope is only as good as its eval set and its launch gate. Those two lines turn a vague "build us an assistant" into something a vendor can be held to and an investor will later respect.

So before you take a proposal, write the one outcome, draw the out-of-scope line, define the metric and the floor on a set your own expert owns, decide what happens when the model is unsure, and name the bar that lets it go live. The template above is the starting point. Fill it in first, and the proposals you get back will be about the same thing, which is the only way to compare them.
