Technical Due Diligence Checklist for AI MVPs Before Funding
Technical due diligence on an AI MVP examines the same layers as any software review (code, infrastructure, security, team) plus a set of AI-specific checks a generalist misses: do you have an evaluation set and regression evals, are prompts and models versioned, do you log every model call, what happens when the model fails, what does an inference actually cost, and do you have the rights to the data you train or retrieve on. The single thing that separates a fundable AI MVP from a demo is evidence. Investors increasingly treat a private, versioned eval suite as the proof your AI works. "We test it by hand" fails that bar. This is the checklist to run on yourself before they run it on you.
This is an engineering view aimed at founders, with the investor's questions made explicit. Regulatory dates are current as of mid-2026. One of them, flagged below, is a trap if you plan around a delay that has not happened.
Why evidence, not a demo
Two independent findings set the bar. A Stanford study of purpose-built legal AI tools, the kind sold as accurate, found hallucinations on more than 17 percent of benchmark queries for some products and more than 34 percent for others. And an MIT-affiliated report widely cited in 2025 found that around 95 percent of enterprise generative-AI pilots delivered no measurable bottom-line impact. The lesson for a founder raising money is blunt: a working demo proves almost nothing, and the investor knows it. What moves a round is measured evidence that your system works, does not regress, and is economically and legally sound at scale.
The AI-specific checks a generalist misses
This is the core of the post and the part a generic software review skips. For each: what to check, why it matters, and the red flag.
- An evaluation set. A versioned golden dataset plus a scoring rubric. Unit tests tell you green or red; they cannot tell you whether an answer was correct or faithful. Red flag: "we eyeball outputs," no golden set, no numbers.
- Regression evals as a CI gate. The eval suite runs on every prompt or model change before deploy. The same prompt gives different output when the model version or input shifts, and a fix for one case silently breaks another. Red flag: prompt changes ship straight to production.
- Model-call observability. Tracing of every model call, with token and cost accounting and the prompt and response captured. You cannot debug a bad answer you cannot reconstruct. Red flag: "we use the provider dashboard" as the whole story.
- Prompt and model versioning. Prompts are versioned artifacts, and the model is pinned to a specific version rather than called through a "latest" alias, which upgrades underneath you. Red flag: prompts hardcoded inline, model aliased to latest.
- A fallback when the model fails. Retries, a secondary model or provider, graceful degradation. Your uptime is now bounded by a third-party API. Red flag: one provider, one model, no timeout or degraded path, so one vendor outage is a full outage.
- Unit economics per inference. Cost modeled per call, then per action, then into gross margin. Agentic flows fan one action into hundreds of calls. Red flag: no cost-per-action metric and a margin assumed to be "SaaS-like."
- Rights to the training and retrieval data. Documented provenance and a license or permission per source. The question is no longer "is it fair use" but "can you prove where every datum came from and that it was lawfully obtained." Red flag: scraped data of unknown origin, a RAG corpus with no usage rights.
- A measured hallucination rate plus guardrails. An error rate on a domain benchmark, plus retrieval grounding and output validation. Red flag: no measured rate and "RAG fixes hallucinations" stated as if solved.
- Model choice and lock-in. A rationale for proprietary API versus open weights, and an abstraction layer that lets you swap providers. Red flag: hard-coupled to one provider's SDK with economics that only work at today's subsidized price.
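The first two checks in the list above (a golden set, and regression evals as a CI gate) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended harness: `GoldenCase`, `score_case`, `run_eval`, and `gate` are hypothetical names, and the "required phrases" rubric stands in for whatever scoring your domain actually needs (exact match, a graded rubric, or a judge model).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    prompt: str
    must_contain: list[str]  # hypothetical rubric: phrases a correct answer includes


def score_case(case: GoldenCase, answer: str) -> bool:
    """Pass only if every required phrase appears in the answer (case-insensitive)."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.must_contain)


def run_eval(cases: list[GoldenCase], generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every golden case through the system under test and record pass/fail."""
    return {case.case_id: score_case(case, generate(case.prompt)) for case in cases}


def gate(results: dict[str, bool], baseline: dict[str, bool], min_pass_rate: float) -> list[str]:
    """Return the reasons to block a deploy; an empty list means the gate passes."""
    if not results:
        return ["no eval results"]
    reasons = []
    pass_rate = sum(results.values()) / len(results)
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} below threshold {min_pass_rate:.2f}")
    # A case that passed on the baseline and fails now is a regression,
    # even when the overall pass rate still clears the threshold.
    regressed = sorted(
        cid for cid, ok in baseline.items() if ok and not results.get(cid, False)
    )
    if regressed:
        reasons.append("regressed cases: " + ", ".join(regressed))
    return reasons
```

In CI, `generate` would wrap the candidate prompt and pinned model, `baseline` would be the stored results of the last released version, and a non-empty return from `gate` would fail the build. Checking per-case regressions separately from the aggregate pass rate is the point: it catches the fix that silently breaks another case.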
The handover artifacts a fundable AI MVP has ready
If these exist, diligence is fast and your valuation holds. If they live only in a founder's head, every gap becomes a discount.
| Artifact | Why diligence cares | Red flag if missing |
|---|---|---|
| Architecture diagram (dated, names external deps) | Tests whether it handles 10x and reveals key-person risk | Architecture lives only in a founder's head |
| Data-flow map (follows the data, not the services) | Shows which third parties touch what data; GDPR exposure | Unknown privacy exposure the investor inherits |
| Eval reports (versioned harness, results per model and prompt) | How a claimed AI moat is verified instead of taken on faith | No objective evidence the model works or will not regress |
| Model and prompt registry | Reproducibility and rollback of any output | Production behavior cannot be reproduced |
| Runbook and incident response | Lowers key-person dependency; baseline compliance evidence | Unmeasured downtime risk |
| SBOM (SPDX or CycloneDX, regenerated in CI) | Surfaces copyleft contamination and unpatched CVEs | Unknown license and vulnerability exposure |
| IP chain of title (founder and contractor assignments) | The classic deal-killer; paying an invoice does not transfer IP | A departed contributor who never assigned a core module |
| Security report (recent pen test, SOC 2 or ISO 27001 if applicable) | Baseline in 2026, and it unblocks enterprise sales | Unknown breach exposure |
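To make the "model and prompt registry" row concrete, here is one possible shape for a registry entry and the trace line written for each model call. It is a sketch under assumptions: `PromptVersion`, `reject_floating_alias`, and `trace_record` are hypothetical names, the alias check is a naive string test, and a real system would also decide how to redact personal data before logging prompts and responses.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str
    model: str  # a pinned model identifier, never a floating alias

    @property
    def version(self) -> str:
        """Content hash: any edit to the template or the model yields a new version."""
        digest = hashlib.sha256(f"{self.model}\n{self.template}".encode("utf-8"))
        return digest.hexdigest()[:12]


def reject_floating_alias(model: str) -> None:
    """Hypothetical policy check: refuse identifiers that look like moving aliases."""
    if model.endswith("latest") or model == "default":
        raise ValueError(f"model {model!r} is not pinned")


def trace_record(prompt: PromptVersion, rendered: str, response: str,
                 input_tokens: int, output_tokens: int) -> str:
    """Build one JSON log line with enough detail to reconstruct the call later."""
    reject_floating_alias(prompt.model)
    record = {
        "ts": time.time(),
        "prompt_name": prompt.name,
        "prompt_version": prompt.version,
        "model": prompt.model,
        "rendered_prompt": rendered,
        "response": response,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    return json.dumps(record, sort_keys=True)
```

Because the version is derived from the template and the model together, a production output can be traced back to the exact pair that produced it, which is what rollback and reproducibility depend on.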
Data, privacy, and provenance
For an EU AI MVP, this is where deals get repriced. Diligence checks your record of processing activities (GDPR Article 30), a lawful basis for training on personal data (Articles 6 and 9, with a legitimate-interest assessment on file), a data protection impact assessment before high-risk processing (Article 35), and data processing agreements with sub-processors. One thing founders miss: a model API that ingests your users' prompts is a sub-processor, so it needs a DPA and a no-training, zero-retention configuration, not consumer terms. The EDPB's Opinion 28/2024 also warns that a model trained on personal data is not automatically anonymous, so unlawful training data can taint the deployed product. On the EU AI Act, the live binding date for most high-risk and transparency obligations is 2 August 2026. A proposal to delay it was circulating in 2026 but has not been enacted, and a compliance plan that banks on the delay is itself a red flag.
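A provenance record is easier to defend when it is machine-checkable. The sketch below assumes a simple per-source manifest for a training or retrieval corpus. The `DataSource` fields and the `provenance_gaps` function are illustrative assumptions, not a legal standard, and passing this check is not a substitute for legal review.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    origin: str              # where it was obtained, e.g. a URL or a contract reference
    license: str             # license or permission covering this use; "" if unknown
    has_personal_data: bool
    lawful_basis: str = ""   # recorded basis for processing personal data; "" if none


def provenance_gaps(sources: list[DataSource]) -> dict[str, list[str]]:
    """Map each source name to its documentation gaps; clean sources are omitted."""
    gaps: dict[str, list[str]] = {}
    for source in sources:
        problems = []
        if not source.origin.strip():
            problems.append("origin not documented")
        if not source.license.strip():
            problems.append("no license or permission recorded")
        if source.has_personal_data and not source.lawful_basis.strip():
            problems.append("personal data without a recorded lawful basis")
        if problems:
            gaps[source.name] = problems
    return gaps
```

An empty result means every source has an origin, a license or permission, and a recorded lawful basis where personal data is involved. A non-empty result is the list of gaps to close before a data room opens.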
What investors actually flag
The recurring flags from the investor and acquirer side (these sources are interested parties, so weigh them accordingly) are: a thin wrapper on a single model with no workflow depth; a weak moat (the durable ones now are proprietary or permissioned data, integrations, and persistent context, not the base model); gross margin after inference cost, since inference is a real variable cost that breaks the SaaS-margin assumption; fragile retention when switching costs are low; and, increasingly, the absence of private continuous evals. On acquisition specifically, expect retention covenants on key AI engineers and indemnities tied to data-provenance representations. The vibe-coded angle (security, IP ownership, and what an acquirer checks in AI-built code) has its own checklist in our Lovable, Bolt, and Replit due diligence post, and the eval discipline that underpins items one and two is covered in when LLM evals are worth building.
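The gross-margin flag comes down to arithmetic an investor can redo in a minute: cost per call, then per user-facing action, then margin. The functions below are a minimal sketch. The function names are hypothetical, and any token counts or per-million-token prices you feed them should come from your own traces and your provider's current price list, not from this example.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Inference cost of one model call, given per-million-token prices in USD."""
    return (input_tokens / 1_000_000) * usd_per_m_input + \
           (output_tokens / 1_000_000) * usd_per_m_output


def cost_per_action(calls_per_action: int, call_cost: float) -> float:
    """One user-facing action may fan out into many model calls."""
    return calls_per_action * call_cost


def gross_margin(price_per_action: float, inference_cost: float,
                 other_cogs: float = 0.0) -> float:
    """Gross margin as a fraction of revenue; negative means each action loses money."""
    if price_per_action <= 0:
        raise ValueError("price_per_action must be positive")
    return (price_per_action - inference_cost - other_cogs) / price_per_action
```

As an illustration with made-up numbers: a call with 2,000 input and 500 output tokens at 3 and 15 USD per million tokens costs 0.0135 USD. An agentic action that fans out into 40 such calls costs 0.54 USD, so at a price of 2.00 USD per action the gross margin is 73 percent before any other cost of goods sold, and at 0.50 USD per action it is negative.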

"A demo proves you can get a good answer once. An eval set proves you get good answers consistently and will notice when you stop. Investors stopped being impressed by the first and started asking for the second. That shift is the whole game in AI due diligence."
Frequently Asked Questions
What is technical due diligence for an AI startup?
What do investors check in an AI MVP?
What eval evidence do I need before a raise?
How is AI due diligence different from normal software due diligence?
Do I need an SBOM for due diligence?
What is IP chain of title and why does it kill deals?
How does GDPR affect AI due diligence in the EU?
Does the EU AI Act apply to my MVP yet?
Is technical due diligence any different in Austria?
How do I prove my AI product is not just a GPT wrapper?
Final thoughts
Technical due diligence on an AI MVP is not a generic code review with the word AI added. The layers that decide your round are the AI-specific ones: evals that prove the thing works and will not regress, versioning that makes any output reproducible, honest inference economics, and clean rights to your data.
The good news is that all of it is cheaper to fix before diligence than to explain during it. Build the eval set, pin the models, log the calls, get the IP chain of title signed, and have the artifacts ready in a folder. Do that and diligence becomes a formality. Skip it and every gap turns into a discount on your valuation.