The Bottleneck in Software Was Never Intelligence. It Was Context.

TL;DR

Rakuten pointed a coding agent at vLLM and had it implement a specific method. It ran for about seven hours with occasional human guidance and hit 99.9% numerical accuracy against the reference. The viral "12.5 million lines of code" figure is roughly twenty times too high; the real point is that the agent navigated a large, unfamiliar, multi-language codebase and landed the change in the right place. The lesson is that the bottleneck was never intelligence. It was context: humans tire and make mistakes holding a big system in their heads, and agents do not get that context fatigue. This does not erase engineers; it moves the work. The ones scaling are good at directing agents, feeding them the right context, gating output with tests and guardrails, and reviewing like a senior to catch the confident wrong answer. Delegation became the new deep work.

One story keeps getting passed around: Rakuten handed an AI coding agent a single task inside a large open-source project, walked away, and came back hours later to working code. It is a good story. It is mostly true. And the part everyone repeats, the hours and the accuracy number, is the least interesting part of it. The interesting part is why it worked at all, and what it says about the job of engineering now.

This is an engineering perspective, not a vendor pitch. We direct coding agents on real client work every week, so this is less a hot take than a description of how the work actually changed.

What actually happened at Rakuten?

Here are the facts, stated carefully, because the viral version inflates one of them. At its Claude Opus 4 launch in May 2025, Anthropic said Rakuten had validated the model with a demanding open-source refactor that ran independently for about seven hours. The detail came later, in Rakuten's own customer story: an engineer pointed Claude Code at vLLM, the open-source inference engine, and asked it to implement a specific activation vector extraction method. The result reportedly hit 99.9% numerical accuracy against the reference implementation.

Two honest corrections, because we would rather be right than dramatic:

It was not unattended. The engineer described providing occasional guidance across those hours. "Ran independently for seven hours" is real and impressive. "Fully autonomous, zero humans" is not what happened.
99.9% is a narrow metric. It is the numerical accuracy of one method's output against a reference, not a general statement that the agent is 99.9% correct at everything. Useful, specific, and easy to misread as a headline.
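
To make that narrowness concrete, here is a minimal sketch of one way such a number could be computed. Rakuten has not published how its 99.9% was measured, so the function name, the tolerances, and the toy data below are all our assumptions, not their method.

```python
import math


def numerical_accuracy(candidate, reference, rel_tol=1e-3, abs_tol=1e-6):
    """Fraction of candidate values that match the reference within tolerance.

    Hypothetical metric for illustration; the tolerances are arbitrary.
    """
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    matches = sum(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )
    return matches / len(reference)


# Toy data: 1,000 reference values, with exactly one wrong value in the candidate.
reference = [0.10 * i for i in range(1000)]
candidate = list(reference)
candidate[500] += 1.0

score = numerical_accuracy(candidate, reference)
print(f"{score:.1%}")  # 99.9%
```

A score like this says one output vector matched one reference. It says nothing about whether the surrounding change was placed well, tested well, or is safe to maintain.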

And the number the internet loves most, that vLLM is "12.5 million lines of code," is the one to be most skeptical of. It appears in Rakuten's own copy, so people quote it in good faith, but it is too high by a factor of roughly twenty. The vLLM core is on the order of 84,000 lines of Python, and the whole repository is under 600,000 lines across all languages. You only reach 12.5 million by counting things nobody reasons about: full git history, every related repository, vendored and generated kernels. The real point survives the correction cleanly: Claude navigated a large, multi-language codebase it had never seen, in Python, CUDA, and C++, and landed a change in the right place. That is the impressive thing. It was never the line count.
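
The arithmetic behind "roughly twenty", using only the figures quoted above. The 600,000 is an upper bound, so the real ratio is at least this large; the variable names are ours.

```python
claimed_lines = 12_500_000
all_languages = 600_000   # "under 600,000", so treat as an upper bound
core_python = 84_000      # "on the order of", so approximate

ratio_all = claimed_lines / all_languages
ratio_core = claimed_lines / core_python

print(round(ratio_all, 1))  # 20.8
print(round(ratio_core))    # 149
```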

Why was context the real bottleneck, not intelligence?

Strip the demo down and the lesson is almost boring. A senior engineer who already knew vLLM could have written that method too. A new hire could have, eventually, after weeks of wrong turns. The difference between the senior and the new hire was never raw intelligence. It was how much of the codebase each one could hold in their head and reason about at once.

That is the bottleneck in most software work. Not "can a smart person figure this out," but "can anyone hold enough of this system in working memory to make the right change in the right place without breaking three others." On a large codebase that is genuinely hard, and it is hard in a way that has nothing to do with how clever you are.

What does "AI doesn't get context fatigue" actually mean?

Humans are bad at holding large context for long stretches, and not because we are dim. We tire. We forget the thing we read forty files ago. We take a break and lose the thread. We make small mistakes late in a long session that we would never make in the first hour. Holding a sprawling mental model of a system is exhausting, and exhaustion is where the bugs come from.

A coding agent does not get tired in that way. It can keep the relevant slice of a large codebase in front of it and reason about it at hour seven as consistently as it did at minute ten. That is the actual unlock in the Rakuten story. Not that the model is smarter than a senior engineer, but that it does not degrade across a long, context-heavy task the way a person does. Sustained context, not superior intelligence.

The catch, and it is a real one, is that the agent only reasons well over the context it is actually given. Point it at the wrong files, or starve it of the constraints that matter, and it will build the wrong thing with the same tireless confidence. Feeding it the right context is now the skill. We wrote about the cost side of that discipline in how to cut LLM token costs in 2026: managing context, not just spending tokens, is most of the game.

For a concrete architecture that separates durable memory, shared state, bounded reasoning and intent routing, see our fact-checked Meterless context-layer review.

When cross-file relationships are the missing layer, our Graphify buyer review compares a codebase knowledge graph with repository search and RAG, then gives CTOs a measured two-week adoption test.

When parallel agents make repository setup and change integration the next constraint, use our Git worktrees vs Jujutsu decision guide to choose a cached Git baseline or a measured Jujutsu pilot.

Does this make engineers disappear?

No, and the people predicting that are usually selling something. What it does is move the work. The engineers who are scaling right now are mostly not the ones writing every line by hand. They are the ones who got good at directing agents, reviewing output critically, and knowing the moment an agent is about to do something stupid.

That last skill is underrated. An agent will produce confident, well-formatted, plausible code that is subtly wrong, and a junior reviewer will wave it through because it looks right. Catching that requires exactly the judgment that years of writing code by hand builds. The experience does not become worthless. It changes from "I type the solution" to "I recognize the wrong solution before it ships." If you want a concrete version of what that review looks like, our vibe-code production-readiness checklist is the list we actually run agent output against.

What does good engineering look like when agents write the code?

Delegation became the new deep work. The deep, valuable hours used to be the ones spent heads-down writing the hard function. Increasingly they are the ones spent scoping a task precisely, assembling the right context, and reviewing what comes back with a sharp eye. That is a genuinely different muscle, and a lot of strong engineers have not built it yet because for their whole career the bottleneck was typing the solution, not specifying it.

Good engineering in this mode looks like: a tight, well-defined task; the right context handed to the agent up front; tests and guardrails that catch the wrong answer automatically; and a human who reviews like a senior, not a rubber stamp. The agent is fast and tireless. The engineer is the one who decides what "correct" means and verifies it actually got there.

"The agent does not get tired holding the whole codebase in its head. You do. That is the entire shift. Your job moved from writing every line to deciding what correct means and catching it when the agent gets it wrong."

How should you restructure your workflow around coding agents?

If you want the Rakuten outcome on your own work, the moves that matter, in order:

Scope the task tightly. "Implement this specific method, matching this reference" beats "improve the inference layer." A precise task is what made seven lightly supervised hours possible. A vague one produces seven hours of confident wrong turns.
Give it the right context, not all of it. Point the agent at the files, interfaces, and constraints that actually matter. More context is not better; the correct context is. This is where most of the skill now lives.
Gate with tests and guardrails. The reason Rakuten could trust the output was a reference to check against. Reproduce that: a test suite, a reference, a guardrail that fails loudly when the answer is wrong.
Review like a senior, not a rubber stamp. Read the diff for the subtle, plausible-looking mistakes, the ones that compile and pass a shallow glance. This is the highest-leverage hour you will spend.
Keep ownership and knowledge in-house. An agent that ships code nobody on your team understands is a dependency, not a win. Make sure a human owns and can explain what shipped.
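
The moves above can be sketched as a small hand-off record plus a gate. This is an illustration under our own assumptions, not a tool Rakuten or anyone else is known to use: `TaskBrief`, `missing_fields`, `gate`, the file paths, and the 0.999 threshold are all hypothetical. The senior review step is deliberately absent, because that one stays human.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical hand-off record for one agent task."""
    goal: str                                               # the scoped task
    context_files: list[str] = field(default_factory=list)  # what the agent should read
    reference: str = ""                                     # what the output is checked against
    min_accuracy: float = 0.999                             # pass threshold (assumed value)
    owner: str = ""                                         # the human who can explain what shipped


def missing_fields(brief: TaskBrief) -> list[str]:
    """Names of required fields that were left empty."""
    required = {
        "context_files": brief.context_files,
        "reference": brief.reference,
        "owner": brief.owner,
    }
    return [name for name, value in required.items() if not value]


def gate(measured_accuracy: float, brief: TaskBrief) -> None:
    """Fail loudly when the measured result is below the brief's threshold."""
    if measured_accuracy < brief.min_accuracy:
        raise AssertionError(
            f"accuracy {measured_accuracy:.4f} is below {brief.min_accuracy:.4f}"
        )


vague = TaskBrief(goal="improve the inference layer")
print(missing_fields(vague))  # ['context_files', 'reference', 'owner']

scoped = TaskBrief(
    goal="Implement this specific method, matching this reference",
    context_files=["src/engine/layers.py"],    # hypothetical path
    reference="tests/reference_outputs.json",  # hypothetical path
    owner="engineer-on-call",                  # hypothetical owner
)
print(missing_fields(scoped))  # []
gate(0.9995, scoped)           # passes silently; a lower score would raise
```

The design choice worth copying is that the brief is rejected before the agent starts, not after it has spent hours on an underspecified task.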

None of this is exotic. It is the same discipline good engineering always needed, reweighted: less time producing the code, much more time specifying and verifying it.

Final thoughts

The Rakuten story is not evidence that AI is smarter than your engineers. It is evidence that the bottleneck was never intelligence. It was context, and the human cost of holding a large system in your head without making tired mistakes. Agents do not get context fatigue, and that is the real shift.

So the job moves. The engineers who scale are the ones who got good at directing agents, feeding them the right context, and catching the confident wrong answer before it ships. Delegation became the new deep work. The number to ignore is 12.5 million lines. The skill to build is knowing exactly what you are asking for, and recognizing when the agent is about to do something stupid.