AI Agent Pilot in 30/60/90 Days: The Production Rollout Plan for Austrian SMEs

TL;DR

A realistic AI agent rollout takes about 90 days. In days 0 to 30 you scope and de-risk: pick one bounded, high-volume workflow, map it, define the success metric and a kill criterion up front, inventory the systems and permissions the agent needs, and stand up logging and an eval set from real cases. In days 31 to 60 you build against a sandbox and run in shadow mode, where the agent proposes and a human approves, while measuring against the eval set and tuning permissions to least privilege. In days 61 to 90 you roll out to a slice of real volume with approval gates, watch cost-per-action and error rate, write a runbook and rollback, hand ownership to the team, and decide whether to expand, iterate, or kill. The hard parts are permissions, approval design, evals, logging, and a clean handover, not the model. GDPR applies now; EU AI Act transparency and most high-risk duties apply from 2 August 2026.

This is the how, written from doing it. For why these projects die, see our companion piece on why AI agent projects get cancelled, which covers the failure clusters; this one is the plan that avoids them. Regulatory dates are as of mid-2026 and are hedged where they are still moving.

First, is an agent even the right tool?

An AI agent is a system where the model decides its own steps and takes actions against your systems through tools, working multi-step toward a goal with limited human input. Take away any one of those and it collapses into something simpler and usually better. Most failed "agent" projects should have been a RAG assistant or a coded workflow. Pick the cheapest tool that does the job.

What you need	Right tool
Single-turn answers from a knowledge base, cost-predictable, easy to audit	RAG assistant, not an agent
Fixed, rule-based, predictable steps on structured data	RPA or a coded workflow, not an agent
Conversational Q&A with no actions against systems	A chatbot, not an agent
Open-ended goal, unpredictable step count, must take actions across systems via tools	An AI agent, with guardrails, and accept the higher cost and compounding-error risk

The rule of thumb: reserve agents for workflows where extra reasoning changes the business outcome. If the steps never vary, an agent is the more expensive, less auditable wrong tool.
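The table above can be read as a small decision function. This is a simplified sketch, not a complete selection method: the `Workflow` fields and the order of the checks are assumptions made for illustration, and real cases will need more nuance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    # Hypothetical yes/no questions distilled from the table above.
    takes_actions: bool         # must it act on your systems through tools?
    steps_vary: bool            # is the number of steps unpredictable?
    needs_knowledge_base: bool  # does it answer from your documents?


def right_tool(w: Workflow) -> str:
    """Return the cheapest tool from the table that fits the workflow."""
    if w.takes_actions and w.steps_vary:
        return "AI agent with guardrails"
    if w.takes_actions:
        # Fixed, predictable steps do not need a model choosing them.
        return "RPA or coded workflow"
    if w.needs_knowledge_base:
        return "RAG assistant"
    return "chatbot"


print(right_tool(Workflow(takes_actions=True, steps_vary=False,
                          needs_knowledge_base=False)))
```

The agent branch is checked first only because it is the narrowest condition; every other path falls through to something cheaper.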

Days 0 to 30: scope and de-risk

The whole pilot is won or lost here. Pick one workflow with high volume and clear boundaries, and write down, before anything is built: the single number you are trying to move, the pre-tool baseline (start measuring now), and a kill criterion (for example, stop if adoption is below a set bar by week four, if the data is too dirty, or if the impact is too small). Inventory every system and permission the agent will touch and plan least privilege from the start. Stand up logging and observability, and build a small eval set from real cases; 20 to 50 tasks drawn from real failures is a good start. Decide which actions need a human-approval gate, especially anything irreversible.
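One way to make the day-0 commitments hard to fudge later is to write them down as data. The sketch below is an illustration only: the `PilotCharter` name, its fields, and the example numbers are all hypothetical, and the kill criterion shown covers just the adoption case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotCharter:
    workflow: str
    metric: str                # the single number you are trying to move
    baseline: float            # measured before the agent exists
    target: float              # the value that counts as success
    min_adoption_week4: float  # kill criterion: minimum share of cases using the agent

    def should_kill(self, adoption_week4: float) -> bool:
        """True if adoption at week four is below the pre-agreed bar."""
        return adoption_week4 < self.min_adoption_week4

    def relative_change(self, current: float) -> float:
        """Change in the metric relative to the baseline (negative means lower)."""
        return (current - self.baseline) / self.baseline


# Hypothetical example values, not figures from a real pilot.
charter = PilotCharter(
    workflow="supplier invoice triage",
    metric="median handling time in minutes",
    baseline=12.0,
    target=8.0,
    min_adoption_week4=0.30,
)

print(charter.should_kill(adoption_week4=0.22))  # True: below the 30 percent bar
print(charter.relative_change(current=9.0))      # -0.25
```

Freezing the dataclass is a deliberate choice here: the bar should not move once the pilot is running.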

Days 31 to 60: build and run in shadow

Build against a sandbox, never live systems. Then run in shadow mode: the agent processes the same real inputs as your team and logs what it would do, but humans stay the final decision-makers, so you measure its judgment before it touches anything. Use a ladder of autonomy: supervised first, then exception-only or sampled approvals once the metrics earn it. Score lightly at day 30 and day 60 against the eval set so the day-90 decision is a confirmation, not a surprise. Tune permissions down to least privilege, and red-team the failure modes deliberately: prompt injection, unsafe tool calls, and the ambiguous real-world request that never appears in a demo. Designing for messy input, not the happy path, is usually what separates a pilot that ships from one that does not.
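Shadow mode produces pairs of "what the agent would have done" and "what the human did", and scoring them can be very simple. The following is a minimal sketch under assumptions: the `ShadowRecord` shape, the action labels, and the 0.95 promotion threshold are hypothetical, and exact string match is a stand-in for whatever comparison fits your workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    case_id: str
    agent_proposal: str  # logged only, never executed
    human_decision: str  # what the team actually did


def agreement_rate(records: list[ShadowRecord]) -> float:
    """Share of cases where the agent's proposal matched the human decision."""
    if not records:
        raise ValueError("no shadow records to score")
    matches = sum(r.agent_proposal == r.human_decision for r in records)
    return matches / len(records)


def disagreements(records: list[ShadowRecord]) -> list[ShadowRecord]:
    """The cases worth reading by hand; they often become new eval tasks."""
    return [r for r in records if r.agent_proposal != r.human_decision]


def autonomy_level(rate: float, promote_at: float = 0.95) -> str:
    """One rung of the autonomy ladder; the threshold is an assumed example."""
    return "sampled approvals" if rate >= promote_at else "supervised"


records = [
    ShadowRecord("c1", "approve", "approve"),
    ShadowRecord("c2", "escalate", "escalate"),
    ShadowRecord("c3", "approve", "reject"),
    ShadowRecord("c4", "reject", "reject"),
]
rate = agreement_rate(records)
print(rate, autonomy_level(rate))  # 0.75 supervised
```

A disagreement is not automatically an agent error; sometimes the human was wrong, which is why the mismatches are returned for review rather than just counted.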

Days 61 to 90: limited production and handover

Roll out to a slice of real volume with the approval gates still on, and start with an audit-first posture: observe behavior first, then tighten controls. Monitor cost-per-action and error rate, and enforce hard token and cost budgets at the infrastructure layer before each call, not in a report afterwards. Write the runbook and the rollback: define the trigger that auto-reverts to the previous version if a metric regresses. Then do the part most teams skip: hand ownership to the team. Decision authority, meaning who can change the agent and who is accountable, must be defined before wider rollout, and the team has to be able to read the traces and run the runbook without the people who built it. Finally, make the call against your day-0 metric and kill criterion: expand, iterate, or stop.
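Enforcing a budget "before each call" and defining a rollback trigger can be sketched in a few lines. This is an illustration under assumptions, not a production guard: `TaskBudget`, `BudgetExceeded`, `should_roll_back`, the cent amounts, and the two-percentage-point tolerance are all hypothetical, and the cost estimate per call is assumed to come from your own metering.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push a task over its hard budget."""


class TaskBudget:
    """Hard per-task limits, checked before a call is made, not after."""

    def __init__(self, max_cents: int, max_steps: int) -> None:
        self.max_cents = max_cents
        self.max_steps = max_steps
        self.spent_cents = 0
        self.steps = 0

    def reserve(self, estimated_cents: int) -> None:
        """Call this before every model or tool call; it raises instead of overspending."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.spent_cents + estimated_cents > self.max_cents:
            raise BudgetExceeded("cost budget exhausted")
        self.steps += 1
        self.spent_cents += estimated_cents


def should_roll_back(error_rate: float, baseline_error_rate: float,
                     tolerance: float = 0.02) -> bool:
    """Rollback trigger: the error rate regressed beyond an agreed tolerance."""
    return error_rate > baseline_error_rate + tolerance


budget = TaskBudget(max_cents=50, max_steps=10)
budget.reserve(20)
budget.reserve(20)
try:
    budget.reserve(20)  # would total 60 cents, over the 50-cent limit
except BudgetExceeded as exc:
    print("stopped:", exc)

print(should_roll_back(error_rate=0.08, baseline_error_rate=0.04))  # True
```

Amounts are kept in integer cents so the limit check is exact; the failed reservation leaves the counters untouched, so the trace still shows what was actually spent.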

The hard parts, and how to get them right

Permissions and least privilege. OWASP's "excessive agency" risk traces to excessive functionality, permissions, and autonomy. Give the agent task-scoped, time-bound, least-privilege access and its own identity, so you can enforce least privilege and reconstruct what happened after an incident.
Human approval design. The pattern is propose then approve: the agent pauses on a high-impact or irreversible action and a human approves, edits, or rejects it with full context. You do not need to approve every action, but you do need to gate the ones that can cause damage.
Evals and regression. Three layers: deterministic per-step checks, production sampling to catch drift, and periodic human review to calibrate. Testing an agent means testing its judgment, not just one output.
Logging and audit trail. Trace every model call, tool invocation, and decision. Without it you cannot debug, improve, or prove what the agent did, and under GDPR you have to be able to prove it.
Cost-per-action and fallback. Agentic flows can cost several times more per task than a chatbot because context is re-sent on every step. Track cost per outcome from day one, route cheap steps to small models, and define what happens when a tool or the model fails.
Clean handover. An agent only your vendor understands is a liability, not a win. The team must own it.
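The first, second, and fourth points above fit together: a scoped, expiring grant decides whether an action is possible at all, an approval gate holds the irreversible ones, and every decision lands in a trace. Here is a minimal sketch of that combination. All names (`Grant`, `decide`, the action labels, `invoice-agent-01`) are hypothetical, and a real system would back this with your identity provider and a durable log rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical set of actions that always need a human decision.
IRREVERSIBLE = frozenset({"send_payment", "delete_record"})


@dataclass(frozen=True)
class Grant:
    agent_id: str                    # the agent has its own identity
    allowed_actions: frozenset[str]  # task-scoped
    expires_at: datetime             # time-bound

    def permits(self, action: str, now: datetime) -> bool:
        return action in self.allowed_actions and now < self.expires_at


def decide(grant: Grant, action: str, now: datetime, trace: list[dict],
           approved_by: Optional[str] = None) -> str:
    """Return 'denied', 'pending_approval', or 'allowed', and record the decision."""
    if not grant.permits(action, now):
        outcome = "denied"
    elif action in IRREVERSIBLE and approved_by is None:
        outcome = "pending_approval"
    else:
        outcome = "allowed"
    trace.append({"agent": grant.agent_id, "action": action, "outcome": outcome,
                  "approved_by": approved_by, "at": now.isoformat()})
    return outcome


now = datetime(2026, 1, 15, 9, 0, tzinfo=timezone.utc)
grant = Grant("invoice-agent-01",
              frozenset({"read_invoice", "send_payment"}),
              expires_at=now + timedelta(hours=1))
trace: list[dict] = []

print(decide(grant, "read_invoice", now, trace))                      # allowed
print(decide(grant, "send_payment", now, trace))                      # pending_approval
print(decide(grant, "send_payment", now, trace, approved_by="anna"))  # allowed
print(decide(grant, "delete_record", now, trace))                     # denied
```

Note the order of the checks: permission is tested before approval, so a human cannot approve the agent into an action it was never granted.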

"The model is the easy part now. The 90 days are about permissions, approval gates, evals, and a clean handover. Shadow mode is the single highest-leverage step: let the agent prove its judgment on real inputs while a human still holds the wheel, and the go-live decision makes itself."

Why so many agent projects fail

Gartner forecasts that over 40 percent of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The failures cluster into recognisable shapes: hallucination, latency, eval debt, runaway cost-per-action, missing handoff, dirty data, scope greed, and audit gaps. The 90-day plan above is built to surface each of those early, in the first 30 days where they are cheap to fix, instead of in the sixth month where they kill the project. We break the clusters down in our piece on why AI agent projects get cancelled, and we cover the orchestration skills behind running several agents well in "focus is the bottleneck".

The EU and Austria part

An agent that acts on personal data lands squarely in GDPR. You must keep an audit trail (the accountability principle means you have to demonstrate what happened), apply data minimisation and least privilege, and provide meaningful human oversight for any significant automated decision, not a symbolic rubber stamp. You also need a signed data processing agreement with every model and cloud provider before personal data flows to them, and US providers carry a residual transfer risk even with EU residency. The Austrian data protection authority treats you, the deploying company, as the controller, so the responsibility is yours. On the EU AI Act, transparency duties under Article 50, including telling people they are dealing with an AI, apply from 2 August 2026, as do most high-risk obligations. A proposed Digital Omnibus that would defer some high-risk deadlines was provisionally agreed in 2026 but is not yet law, so plan against the 2 August 2026 date.

Frequently Asked Questions

How long does it take to roll out an AI agent?

Plan about 90 days: 30 to scope and de-risk one workflow, 30 to build and run in shadow mode, and 30 for limited production and handover. Score at days 30 and 60 so the day-90 expand, iterate, or kill decision is no surprise.

What is human-in-the-loop approval?

The agent proposes an action and a human approves, edits, or rejects it before any side effect. Modern agent frameworks pause the run and surface full context for high-impact or irreversible actions, so you gate the dangerous ones without approving every step.

How do I stop an AI agent from doing damage?

Least-privilege, task-scoped, time-bound permissions; human-approval gates on irreversible actions; a sandbox plus red-teaming before production; hard cost and step budgets enforced before each call; and a defined rollback and runbook.

Do I even need an agent?

Only if the workflow has an open-ended goal, an unpredictable number of steps, and must take actions across systems. Otherwise use RAG for question answering or a coded workflow for fixed steps, both cheaper and more auditable.

What does an AI agent cost to run?

More than a chatbot. Agentic flows re-send context on every step, so per-task cost can be several times higher. Track cost-per-action from day one and enforce budgets at the infrastructure layer rather than discovering the bill later.
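The arithmetic behind "re-sent on every step" is easy to check. The sketch below assumes the simplest possible model: every step sends the full context so far, and each step adds a fixed number of tokens. The token counts are made-up illustration values, and real runs vary with caching, truncation, and tool output size.

```python
def total_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Input tokens summed over a run where the whole context is re-sent each step."""
    return sum(base_context + i * added_per_step for i in range(steps))


# Hypothetical run: 5 steps, 2,000 tokens of starting context,
# 500 tokens of tool output and reasoning added per step.
agent_tokens = total_input_tokens(steps=5, base_context=2000, added_per_step=500)
chatbot_tokens = total_input_tokens(steps=1, base_context=2000, added_per_step=0)

print(agent_tokens)                   # 15000
print(agent_tokens / chatbot_tokens)  # 7.5
```

Because each step carries everything before it, input tokens grow roughly with the square of the step count, which is why a step budget matters as much as a cost budget.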

What is shadow mode?

The agent runs in parallel with the human process on the same inputs and logs what it would do, while humans stay the final decision-makers. You measure its accuracy and judgment before granting it any real control.

What is a kill criterion and why set it first?

A pre-agreed threshold, such as adoption below a set bar by week four, that triggers stopping. Defining it on day 0 prevents sunk-cost drift, which matters given that a large share of agent projects are forecast to be cancelled.

What are evals and why build them before the agent?

A set of real-case tasks you score the agent against, starting with 20 to 50 drawn from real failures. Writing the evals first, then building to pass them, is how you detect regressions instead of shipping them.
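A pass rate alone can hide a regression, which is the reason to compare versions case by case. The sketch below uses stub agents and three invented cases; `pass_rate`, `regressions`, and the routing labels are hypothetical, and exact-match scoring is an assumption that only suits tasks with one correct answer.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input, expected output)
Agent = Callable[[str], str]


def pass_rate(agent: Agent, cases: list[EvalCase]) -> float:
    """Share of eval cases where the agent's output equals the expected output."""
    passed = sum(agent(text) == expected for text, expected in cases)
    return passed / len(cases)


def regressions(old: Agent, new: Agent, cases: list[EvalCase]) -> list[str]:
    """Inputs the old version got right and the new version gets wrong."""
    return [text for text, expected in cases
            if old(text) == expected and new(text) != expected]


# Invented cases standing in for tasks drawn from real failures.
cases: list[EvalCase] = [
    ("refund request", "route_to_billing"),
    ("password reset", "route_to_it"),
    ("angry complaint", "escalate"),
]

# Stub agents: lookup tables in place of real model calls.
OLD = {"refund request": "route_to_billing", "password reset": "route_to_it",
       "angry complaint": "route_to_billing"}
NEW = {"refund request": "route_to_billing", "password reset": "route_to_billing",
       "angry complaint": "escalate"}


def old_agent(text: str) -> str:
    return OLD[text]


def new_agent(text: str) -> str:
    return NEW[text]


print(regressions(old_agent, new_agent, cases))  # ['password reset']
```

Both stub versions pass two of the three cases, so the headline number is unchanged, yet the new version breaks a case the old one handled.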

What is the legal position for AI agents in Austria?

GDPR applies in full now: audit trail, data minimisation, meaningful human oversight for significant automated decisions, and a data processing agreement with your model provider. EU AI Act transparency and most high-risk duties apply from 2 August 2026; a proposed delay is not yet law.

How do I hand the agent over so my team owns it?

Define decision authority and accountability before wider rollout, document a runbook, and make sure the team can read the traces and operate the agent without the people who built it. An agent only the vendor understands is a liability.

Final thoughts

An AI agent rollout is not a model problem; it is an operations problem with a model in it. The 90 days that work are the ones spent on one bounded workflow, least-privilege permissions, an eval set built from real failures, shadow mode before any real control, and a handover that leaves your team owning the thing.

Pick the smallest workflow where reasoning actually changes the outcome, set the metric and the kill criterion on day 0, and let shadow mode earn the agent its autonomy. Do that and you land among the projects that ship, instead of among the more than 40 percent forecast to be cancelled.
