QA for AI-Generated Code: What Breaks Before Launch and How to Catch It

An AI prototype from Lovable, Cursor, Claude Code, or Replit gets you to a working demo in a weekend. It does not get you to production. The gap between "it works on my screen" and "it survives real users, real load, and a security review" is where AI-generated code quietly fails. This is the QA process we run on AI-assisted builds before they go live, and the failure modes we see most.

None of this is an argument against building with AI. We build with it too. It is an argument for testing the output the same way you would test any code that is about to touch real money, real data, and real users.

Why does AI-generated code break in production?

AI coding tools optimise for one thing: producing something that runs and matches the prompt. They do not optimise for the things that decide whether software survives contact with users. The model has no view of your threat model, your data volumes, your edge cases, or your compliance obligations. It writes the happy path well and skips almost everything else, because nobody asked.

The result is code that demos cleanly and breaks predictably. The breakages are not random. They cluster in the same places every time, which is what makes them testable.

What actually breaks in AI-generated code?

Here is the list we work through on every AI-assisted build, ordered by how often each item bites.

Authentication and authorization gaps. The login screen works. The check that stops user A from reading user B's data is missing or applied on the frontend only. This is the single most common serious defect we find.
Input that is never validated. Forms accept anything. No length limits, no type checks, no sanitisation. The demo data was clean, so the gap never showed.
Secrets in the wrong place. API keys, database URLs, and tokens hardcoded in client-side code or committed to the repo. AI tools paste them inline because it makes the example run.
No error handling. The happy path is covered. A failed network call, a timeout, or an empty result throws an unhandled exception and the screen goes blank.
Queries that do not scale. Code that loops a database call inside a render, or pulls a whole table to count rows. Fine with 10 records, fatal with 100,000.
Race conditions and double submits. Two clicks create two orders. Two parallel requests both pass a balance check and both withdraw.
Dependencies nobody vetted. The model pulls in packages that are outdated, abandoned, or carry known vulnerabilities.
State that lies. The UI says the payment succeeded; the backend never recorded it. Optimistic updates with no reconciliation.
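
To make the first item concrete, here is a minimal sketch of the server-side ownership check that is usually missing. Everything in it is hypothetical and for illustration only: the `Invoice` record, the in-memory `INVOICES` table standing in for a database, and the `get_invoice` function. The one assumption that matters is that `current_user_id` comes from a verified session or token on the server, not from anything the client can set.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the caller is not allowed to see the requested record."""


class NotFound(Exception):
    """Raised when the requested record does not exist."""


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory stand-in for a database table.
INVOICES = {
    1: Invoice(id=1, owner_id=100, amount_cents=5_000),
    2: Invoice(id=2, owner_id=200, amount_cents=9_900),
}


def get_invoice(current_user_id: int, invoice_id: int) -> Invoice:
    """Return the invoice only if it belongs to the authenticated caller.

    current_user_id is assumed to come from the server-side session or a
    verified token, never from a request field the client controls.
    """
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        raise NotFound(invoice_id)
    if invoice.owner_id != current_user_id:
        # This is the check a frontend-only guard silently skips.
        raise Forbidden(invoice_id)
    return invoice
```

User 100 asking for invoice 1 gets it back; the same user asking for invoice 2 gets `Forbidden`, regardless of what the frontend chose to render. Some teams prefer to return "not found" in both cases so that record IDs cannot be probed; that is a design choice, not something this sketch settles.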

"AI does not write insecure code on purpose. It writes the code you asked for and nothing you forgot to ask for. Production is the sum of everything you forgot to ask for."

The production-readiness checklist for AI-assisted builds

This is the structure of a Wavect review. You can run a first pass yourself before you call anyone.

Authorization audit. For every endpoint and every data read, confirm the server checks who is asking and whether they are allowed. Frontend checks do not count.
Input boundary test. Throw malformed, oversized, and hostile input at every entry point. Confirm it is rejected cleanly, not absorbed.
Secret sweep. Scan the repo and the client bundle for keys, tokens, and credentials. Rotate anything that leaked and move it server-side.
Failure-path coverage. Force every external call to fail and confirm the app degrades gracefully instead of crashing.
Load and query review. Profile the database under realistic data volumes. Kill N+1 queries and unbounded reads before they kill you.
Concurrency test. Fire parallel and duplicate requests at anything that writes money or state. Add idempotency where it is missing.
Dependency and licence scan. Check every package for known vulnerabilities and incompatible licences.
Regression suite. Write the tests the prototype never had, so the next AI-assisted change does not silently break what already works. See test-driven development for why this matters more, not less, when AI is writing the code.
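
The concurrency test and the idempotency fix can be sketched in a few lines. This is an illustration under stated assumptions, not a production implementation: `OrderService`, `create_order`, and `fire_duplicates` are hypothetical names, the store is an in-memory dictionary, and a thread lock stands in for what would typically be a unique constraint on the idempotency key in a real database.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class OrderService:
    """Hypothetical in-memory order service keyed by idempotency key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._orders_by_key: dict[str, int] = {}
        self._next_order_id = 1

    def create_order(self, idempotency_key: str) -> int:
        """Create an order once per key; repeats return the original order id."""
        # The lock makes the check-then-insert step atomic. Without it, two
        # parallel requests could both see "no order yet" and both create one.
        with self._lock:
            existing = self._orders_by_key.get(idempotency_key)
            if existing is not None:
                return existing
            order_id = self._next_order_id
            self._next_order_id += 1
            self._orders_by_key[idempotency_key] = order_id
            return order_id

    def order_count(self) -> int:
        with self._lock:
            return len(self._orders_by_key)


def fire_duplicates(service: OrderService, key: str, attempts: int) -> set[int]:
    """Send the same request many times in parallel and collect the order ids."""
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        return set(pool.map(lambda _: service.create_order(key), range(attempts)))
```

Calling `fire_duplicates(OrderService(), "checkout-abc", 20)` should return a set containing a single order id. If the same test against your real endpoint returns more than one, the double-submit bug is there.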

This is the core of our software QA service. The deliverable is not a PDF of complaints. It is a fixed, tested codebase and the test suite that keeps it fixed.
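
As a rough idea of what one entry in such a test suite might look like, here is a sketch of an input boundary test. The `validate_signup` function, its field names, and the length limits are assumptions made up for this example; the point is the shape: a list of hostile payloads, and a check that every one is rejected with a clean, expected error rather than absorbed or allowed to crash.

```python
import re

MAX_NAME_LENGTH = 80    # assumed limit, for illustration only
MAX_EMAIL_LENGTH = 254  # assumed limit, for illustration only
# Deliberately simple pattern: something@something.something, no whitespace.
_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


class ValidationError(ValueError):
    """The one error type callers are expected to handle."""


def validate_signup(payload: object) -> dict[str, str]:
    """Return a cleaned payload or raise ValidationError; never anything else."""
    if not isinstance(payload, dict):
        raise ValidationError("payload must be an object")
    name = payload.get("name")
    email = payload.get("email")
    if not isinstance(name, str) or not name.strip():
        raise ValidationError("name is required")
    if len(name) > MAX_NAME_LENGTH:
        raise ValidationError("name is too long")
    if (
        not isinstance(email, str)
        or len(email) > MAX_EMAIL_LENGTH
        or not _EMAIL_RE.fullmatch(email)
    ):
        raise ValidationError("email is invalid")
    return {"name": name.strip(), "email": email.lower()}


# Malformed, oversized, and wrongly typed input.
HOSTILE_PAYLOADS = [
    None,
    [],
    {},
    {"name": "", "email": "a@example.com"},
    {"name": "A" * 10_000, "email": "a@example.com"},
    {"name": "Ada", "email": "not-an-email"},
    {"name": "Ada", "email": ["a@example.com"]},
    {"name": 42, "email": "a@example.com"},
]


def rejected_cleanly(payload: object) -> bool:
    """True if the validator refuses the payload with ValidationError."""
    try:
        validate_signup(payload)
    except ValidationError:
        return True
    return False
```

A regression test then asserts `all(rejected_cleanly(p) for p in HOSTILE_PAYLOADS)` and that one known-good payload still passes. Each new bug report adds a line to the list, so the next AI-assisted change cannot quietly reopen the gap.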

Can I just ask the AI to fix its own code?

Partly. An AI tool will happily add a validation check or wrap a call in error handling once you point at the spot. What it cannot do is decide where to look. It has no model of your technical debt, no memory of the order in which things were built, and no instinct for the edge case a real user will hit on day two. Finding the gaps is human work. Closing them, increasingly, is shared work. That split is exactly how we run these engagements.

How long does it take to make AI-generated code production-ready?

For a typical vibe-coded MVP, a focused review and hardening pass runs one to three weeks. The variance is driven by two things: how much real money or sensitive data the product touches, and how far the AI ran without supervision. A weekend prototype that handles payments and personal data needs more than a weekend of QA. A read-only internal tool needs far less. We scope it after a first look, not before.

When is the code beyond saving?

Rarely, but it happens. If the data model is fundamentally wrong, or the same broken pattern is copied across a hundred files, rebuilding the core is cheaper than patching it. We will tell you that on the first call rather than bill you for a month of patching a foundation that needs to be poured again. Honesty here is cheaper for everyone.

Final thoughts

AI-generated code is not worse code. It is unreviewed code. The prototype that took a weekend skipped the same weeks of hardening that every production system needs, and the bill for those weeks does not disappear because a model wrote the first draft. It just moves to launch day, when it is most expensive.

Run the checklist before you put real users in front of an AI-assisted build. If the authorization, input, and failure-path sections make you nervous, that is the signal to get a second set of eyes on it before launch, not after the incident.