Kevin Riedl

7 min read · 26 May 2026

Focus Is the New Bottleneck: Why You Can Only Run as Many AI Agents as You Can Control

I had an interesting call on Sunday. One line stuck: now that LLMs are good, focus is the new bottleneck. Not writing code. Not fixing bugs. Controlling orchestration. Past a certain agent count, output quality drops, and the work shifts from typing to deciding which agent runs, with what context, against which checks. The new skill set is focus, context management, and QA. After a year of running real client engagements where most of the keyboard time is now agent supervision, I agree.

This post is about why the ceiling exists, what it looks like in practice, and how we keep the agent count below the ceiling on actual Wavect builds.

What actually changed when LLMs got good?

For most of software history, the bottleneck was throughput: how fast can a senior engineer turn a spec into working code? Tools were measured by how much they removed from that loop: autocomplete, snippets, IDE refactoring, then Copilot, then full coding agents.

By 2026, that loop is mostly gone for everything except the hardest 10% of work. A competent operator running a coding agent can produce more code in a day than a team of three could a few years back. The constraint moved.

The new constraint is not "can the agent write the code". It is "do I have enough attention left to verify what the agent produced, route the next task to the right agent, and keep the context window of each agent clean enough that the output stays honest". That is a focus problem, not a typing problem.

What does the orchestration ceiling actually look like?

Anyone who has run more than two agents in parallel has hit a moment that feels like this. Agent A produces code that looks fine. Agent B produces a refactor that conflicts with Agent A. Agent C suggests a test that catches neither. You spend more time reconciling them than you saved by parallelizing. You add a fourth agent to triage the conflicts. It introduces new ones.

That is the ceiling. It is not a hard number. It moves based on three variables.

  • Task coupling. Independent tasks scale almost linearly. Coupled tasks, where each agent's output depends on another agent's output, collapse fast.
  • Verification cost. If each agent's output takes 30 seconds to validate, 10 agents are fine. If it takes 15 minutes, 3 is your ceiling.
  • Context hygiene. The longer the conversation, the worse the output. Past a threshold, every agent silently degrades. You do not notice the drift until a regression ships.
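
The verification-cost bullet is plain arithmetic, and a rough sketch makes it concrete. The function names and the 60-minute attention budget below are illustrative assumptions, not a formula taken from real engagements.

```python
import math


def review_minutes_per_round(agents: int, verify_minutes: float) -> float:
    """Operator time needed to validate one output from every agent."""
    return agents * verify_minutes


def max_agents(attention_budget_minutes: float, verify_minutes: float) -> int:
    """Largest agent count whose review round fits the budget (hypothetical helper)."""
    if verify_minutes <= 0:
        raise ValueError("verify_minutes must be positive")
    return math.floor(attention_budget_minutes / verify_minutes)


# The two cases from the list above.
fast = review_minutes_per_round(agents=10, verify_minutes=0.5)   # 5.0 minutes
slow = review_minutes_per_round(agents=3, verify_minutes=15.0)   # 45.0 minutes

# Assumed budget: 60 minutes of careful review per round.
cap = max_agents(attention_budget_minutes=60, verify_minutes=15.0)  # 4
```

Ten fast checks cost five minutes of attention per round, while three slow ones already cost forty-five. The sketch ignores task coupling and context hygiene, which would lower the number further.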

Why does output quality degrade past N agents?

From engagements where we have run more than two coding agents in parallel, the failure modes cluster into seven categories, listed here in order of frequency.

  1. Attention split. The operator switches between agents fast enough that none get a full read. Subtle errors land in production. The cheap fix is a hard cap on concurrent agents, usually two for new operators and three to four for experienced ones.
  2. Context bleed. An agent that ran for an hour on Task A now picks up Task B with stale assumptions baked in. The output looks confident and is wrong. Cheap fix: hard reset between tasks. Treat the context window as a tool, not a memory.
  3. Eval debt. The operator runs more agents than the eval suite can keep up with. Quality drops without anyone noticing. The fix is to invest in TDD and continuous evals before scaling agent count, not after.
  4. Conflict reconciliation tax. Two agents touch the same module. Reconciling their outputs costs more than the work either produced alone. Cheap fix: partition the codebase by agent, not by task.
  5. Tool-use latency stack-up. Each agent waits on tools. Three agents waiting on the same database fixture serialize on the slowest. Cheap fix: per-agent fixtures, or fewer agents.
  6. Hallucinated coordination. One agent assumes another agent already did the work, because it can see a message about it in the shared history. Nothing was done. Cheap fix: explicit handoff state, not assumed coordination.
  7. Reviewer fatigue. The operator stops reading carefully past 90 minutes of agent supervision. The fix is shorter sessions, not more agents.
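
One way to picture the "explicit handoff state" fix from item 6 is a small board that agents must write to before anything counts as done. This is a minimal sketch under assumptions: `HandoffBoard`, the task and agent names, and the idea of using a commit hash as evidence are all hypothetical, not a description of any particular orchestration tool.

```python
from typing import Optional


class HandoffBoard:
    """Single source of truth for task ownership and completion (illustrative)."""

    def __init__(self) -> None:
        # task id -> (owning agent, evidence of completion or None)
        self._tasks: dict[str, tuple[str, Optional[str]]] = {}

    def claim(self, task: str, agent: str) -> None:
        """Record that an agent owns a task; a task can be claimed only once."""
        if task in self._tasks:
            raise ValueError(f"{task} is already claimed by {self._tasks[task][0]}")
        self._tasks[task] = (agent, None)

    def complete(self, task: str, agent: str, evidence: str) -> None:
        """Mark a task done; only the owner can do so, and only with evidence."""
        owner, _ = self._tasks[task]  # KeyError if the task was never claimed
        if owner != agent:
            raise PermissionError(f"{task} belongs to {owner}, not {agent}")
        if not evidence:
            raise ValueError("completion needs evidence, e.g. a commit hash")
        self._tasks[task] = (owner, evidence)

    def is_done(self, task: str) -> bool:
        """True only if the owner explicitly completed the task."""
        entry = self._tasks.get(task)
        return entry is not None and entry[1] is not None


board = HandoffBoard()
board.claim("migrate-users-table", "coding-agent")
# A "done" message in the shared history changes nothing on the board.
before = board.is_done("migrate-users-table")
board.complete("migrate-users-table", "coding-agent", evidence="abc1234")  # placeholder hash
after = board.is_done("migrate-users-table")
```

The point of the design is that other agents query `is_done` instead of inferring state from conversation history, so a confident message cannot stand in for finished work.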

What are the three new skills the orchestrator needs?

If the typing loop is mostly automated, the bottleneck is the operator's attention. Three skills decide whether the operator scales or stalls.

1. Focus discipline. Knowing how many agents you can supervise without a drop in quality. Treating that number as a hard constraint, not a target to beat. Most operators we have hired hit their ceiling at three concurrent agents. A few sit comfortably at five. None we have seen run ten without quality dropping.

2. Context management. Knowing when to reset an agent, when to summarize a long conversation, when to spin a fresh context for the same task, and when to keep history. This is the part of the job that did not exist three years ago. The MCP ecosystem is starting to ship structured context handoffs, which helps, but the operator still has to choose what to keep.

3. Quality assurance design. If the operator cannot read every output, the eval suite has to. QA stops being a phase and becomes the loop. Tests, snapshot checks, regression suites, behavioral evals, smoke runs after every agent commit. The more agents you run, the more your QA stack carries the weight.

Kevin Riedl

"The ceiling is not a tool problem. It is an attention problem. Buying a better agent does not raise it. Investing in evals does."

How do we ration agent count on actual engagements?

On every AI engagement we run at Wavect, the operator-to-agent ratio is set up front and adjusted weekly. The rules we keep coming back to.

  • Two concurrent agents per operator, default. Three if the operator has shipped at least one AI-heavy build before. Never more than five without a co-operator handling reviews.
  • One specialist agent per role, not per task. A coding agent, a testing agent, a planning agent. Each one keeps a stable role rather than churning through every task type.
  • Hard context reset between epics. When the work moves to a new feature, the agents start fresh, even if their previous context "almost fits".
  • Evals before parallelism. Until the test suite catches regressions reliably, we run sequentially. Parallelism without evals is a sandcastle.
  • Reviewer hours capped. No operator runs agent supervision more than 4 hours per day. The remaining time goes to architecture, eval design, and review.
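
The numeric rules above are simple enough to encode as a check. The sketch below is an illustration under assumptions: the names are hypothetical, "run sequentially" is modeled as one agent at a time, and no upper bound is set once a co-operator is present because the rules do not state one.

```python
from dataclasses import dataclass

MAX_WITHOUT_CO_OPERATOR = 5   # "never more than five" without a co-operator
MAX_SUPERVISION_HOURS = 4     # daily cap on agent supervision


@dataclass(frozen=True)
class Operator:
    shipped_ai_build: bool         # has shipped at least one AI-heavy build
    has_co_operator: bool = False  # someone else handles reviews


def starting_agent_count(op: Operator, evals_reliable: bool) -> int:
    """Default concurrency for an operator at the start of an engagement."""
    if not evals_reliable:
        return 1  # evals before parallelism (assumed: one agent at a time)
    return 3 if op.shipped_ai_build else 2


def plan_is_allowed(op: Operator, agents: int, supervision_hours: float,
                    evals_reliable: bool) -> bool:
    """Check a proposed plan against the hard limits only."""
    if agents < 1 or supervision_hours > MAX_SUPERVISION_HOURS:
        return False
    if not evals_reliable and agents > 1:
        return False
    if agents > MAX_WITHOUT_CO_OPERATOR and not op.has_co_operator:
        return False
    return True


new_hire = Operator(shipped_ai_build=False)
veteran = Operator(shipped_ai_build=True)

default_new = starting_agent_count(new_hire, evals_reliable=True)    # 2
default_vet = starting_agent_count(veteran, evals_reliable=True)     # 3
no_evals = starting_agent_count(veteran, evals_reliable=False)       # 1
six_alone = plan_is_allowed(veteran, agents=6, supervision_hours=3,
                            evals_reliable=True)                     # False
```

Anything between the default and the hard limit stays a judgment call for the weekly adjustment, so the sketch leaves that range unconstrained on purpose.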

None of this is groundbreaking. It mirrors how teams operate without AI. The interesting bit is that it now applies to a single person running a swarm of agents.

How does this change what we hire for?

It moves the hiring bar. The most valuable engineer in 2026 is not the fastest typist. It is the one who can keep five agents productive without their output quality drifting. That is a focus discipline, a context-management discipline, and a QA discipline rolled together. We can train each of those. We cannot fully train the ability to notice that an agent has started to lie about what it did. That part comes from experience.

This is also why the fractional CTO role is changing. A few years back the value was technical judgement and shipping speed. Today, half of the value is calibrating an early team's agent-supervision capacity and building the eval scaffolding that lets that capacity scale safely. We see this in nearly every AI-heavy engagement.

Founder Q&A

Does this mean small teams beat big teams now? Small teams with strong eval scaffolding beat big teams with weak ones. Headcount stops being the proxy. Supervision capacity does.

Are coding agents going to make this easier? Better coding agents raise individual output quality but not supervision capacity. The ceiling moves slowly because attention is the binding constraint.

What about agent-on-agent supervision? Some teams ship "reviewer agents" that check coding agents. They help at the margins. They do not solve the focus problem because someone still has to supervise the reviewer. Layers do not remove the demand on the operator's attention budget; they spend it differently.

How do you know an agent has started to drift? Three tells. The output is more confident than the context warrants. The agent stops asking clarifying questions. Tests that used to fail now pass without code changes. Any one of those is the cue to reset context.
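
The third tell is the one that can be checked mechanically. A minimal sketch, assuming test results are available as a mapping from test name to pass/fail and that the caller already knows whether the source changed between runs (for example by comparing commit hashes). The function name and data shapes are hypothetical.

```python
def suspicious_passes(previous: dict[str, bool], current: dict[str, bool],
                      code_changed: bool) -> list[str]:
    """Return tests that flipped from failing to passing with no code change."""
    if code_changed:
        return []  # a fail-to-pass flip is expected when the code was edited
    return sorted(
        name
        for name, passed in current.items()
        if passed and previous.get(name) is False
    )


last_run = {"test_login": False, "test_signup": True, "test_export": False}
this_run = {"test_login": True, "test_signup": True, "test_export": False}

flagged = suspicious_passes(last_run, this_run, code_changed=False)
# flagged == ["test_login"]
```

A flagged test is not proof of drift, since a flaky test or an edited assertion produces the same signal. It is a cheap prompt to read that agent's last few steps before trusting the green run.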

Is this just engineering, or does it apply to non-code work too? It applies everywhere agents touch. We see the same pattern in support automation, in research workflows, and in RAG pipelines that route between specialist agents. The numbers shift, the shape is the same.

Final thoughts

The Sunday-call observation is right. The bottleneck moved from typing to focus, and most teams are still measuring themselves on the old constraint. The number of agents your team can supervise without a drop in quality is now a leading indicator. Investing in evals and context discipline raises the ceiling. Buying more agents does not.

If you are scaling an AI-heavy team in 2026 and your output quality is fraying, the move is not more agents. It is fewer agents per operator, stronger evals, and shorter supervision sessions. Boring. Effective.

What do you think the ceiling is for your team? Tell us. We want to compare notes.
