It’s not a reasoning problem. New research points to a surprising culprit in why AI gets lost during multi-step tasks.
Have you ever given a smart assistant a multi-step task, only to watch it confidently mess up halfway through? You ask it to summarize an email, then draft a response based on the summary, and finally, add a calendar invite. It nails the summary, but the draft is weird, and the calendar invite is for the wrong day.
It’s a common frustration. These models can write poetry and code, yet they sometimes stumble on what feels like a simple sequence of steps. This leads to a big question: are we hitting a wall with AI development? A fascinating new paper from September 2025 suggests the answer is no, but we’ve been looking at the problem all wrong. The real issue isn’t about reasoning; it’s about LLM task execution.
Small Wins, Huge Gains
First, the paper points out something that feels backward but makes perfect sense when you think about it. Even a tiny improvement in a model’s accuracy on a single step can lead to massive improvements in its ability to complete a long task.
Think of it like building a Lego tower. If you have a 99% chance of placing each brick correctly, you can expect to stack roughly 100 bricks before a mistake brings the tower down. Raise that to 99.9% and the expected height jumps to roughly 1,000 bricks. A gain that looks negligible on any single step compounds into a dramatically taller tower.
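To put rough numbers on the analogy, here is a minimal sketch in Python. It assumes every step succeeds independently with a fixed per-step accuracy, which is a deliberate simplification for illustration rather than the paper's exact model:

```python
import math

def expected_run_length(p: float) -> float:
    """Expected number of consecutive correct steps before the first error,
    assuming every step succeeds independently with probability p."""
    return p / (1.0 - p)

def horizon_length(p: float, target_success: float = 0.5) -> float:
    """Longest task (in steps) completed end-to-end with at least
    `target_success` probability, under the same independence assumption."""
    return math.log(target_success) / math.log(p)

for p in (0.99, 0.999):
    print(f"per-step accuracy {p}: "
          f"~{expected_run_length(p):.0f} correct steps on average, "
          f"~{horizon_length(p):.0f}-step tasks finished half the time")
```

Under these assumptions, 99% per-step accuracy gets you through roughly 69-step tasks half the time; 99.9% stretches that horizon to nearly 700 steps. A fraction of a point gained on one step buys a tenfold longer task.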
This is a big deal because it means that the continued effort to make models slightly more accurate isn’t a waste. Those small, marginal gains are the key to unlocking the ability to handle much more complex, multi-step problems.
The Real Bottleneck: LLM Task Execution
So, if the models are smart enough, why do they still fail? The researchers argue that we need to separate a model’s ability to reason (to know the plan) from its ability to execute (to follow the plan perfectly).
To test this, they did something clever. They gave the models the complete plan and all the knowledge they needed to solve a long task. They essentially said, “Here are the exact instructions. You don’t have to think, just do.”
The results were revealing. Larger, more advanced models kept executing correctly for far more steps, even when the smaller models were perfectly accurate whenever they were asked to do just a single step. This shows that LLM task execution is a distinct skill that improves as models scale up, independent of their raw reasoning power. It’s the difference between knowing the recipe and actually baking the cake without burning it.
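To make the setup concrete, here is a sketch of what an execution-only evaluation loop can look like. The toy summing task, the prompt wording, and the `call_model` helper are assumptions for illustration, not the paper's benchmark or code:

```python
# Execution-only evaluation: the model is handed the full plan (the lookup
# table) and the exact step sequence, so any failure is a failure to execute,
# not a failure to reason or to recall knowledge.

PLAN = {"apple": 3, "banana": 5, "cherry": 2, "date": 7}   # all required knowledge
STEPS = ["apple", "cherry", "banana", "date", "apple"]     # the exact plan to follow

def call_model(prompt: str) -> str:
    raise NotImplementedError  # stub: wire this to your own model API

def execution_accuracy(plan: dict, steps: list[str]) -> float:
    """Fraction of steps carried out correctly before the first slip."""
    total, correct, log = 0, 0, ""
    for step in steps:
        total += plan[step]
        prompt = (
            f"Lookup table: {plan}\n"
            f"Running log so far:\n{log}\n"
            f"Add the value for '{step}' and reply with the new total only."
        )
        if call_model(prompt).strip() != str(total):
            break                       # one wrong brick ends the tower
        correct += 1
        log += f"{step} -> {total}\n"   # the model's own outputs become its context
    return correct / len(steps)
```

Nothing here requires cleverness: the table and the order of operations are given. What gets measured is purely how long the model can keep turning that plan into correct outputs.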
The Self-Conditioning Trap
Here’s where it gets really interesting. The researchers discovered a strange phenomenon they call “self-conditioning.” As a model works through a long task, its own outputs become part of the context for the next step. If it makes a small mistake, it sees that mistake in the context and gets… flustered.
It becomes more likely to make another mistake simply because it’s aware of its prior error.
Imagine you’re assembling furniture and you put one screw in the wrong place. That single mistake can throw you off, making you doubt your next steps and causing you to misread the next instruction. The AI is doing the same thing. It’s not that it forgot the plan; it’s that its own error is now part of the problem it’s trying to solve, which leads it down the wrong path.
Worse, simply making the model bigger doesn’t seem to fix this. It’s a fundamental quirk in how these models operate.
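One way to probe this effect, in the spirit of the paper's analysis, is to hold the next step fixed and vary how many mistakes appear in the history the model is shown. The error-injection scheme and helper names below are illustrative assumptions, not the paper's code, and `call_model` is the same stub as in the earlier sketch:

```python
import random

def call_model(prompt: str) -> str:
    raise NotImplementedError  # stub: wire this to your own model API

def corrupted_history(plan: dict, steps: list[str], error_rate: float, seed: int = 0):
    """Build a log of past steps, corrupting each line with probability `error_rate`,
    as if the model had slipped earlier. Returns the log lines and the last shown total."""
    rng = random.Random(seed)
    lines, shown = [], 0
    for step in steps:
        shown += plan[step]
        if rng.random() < error_rate:
            shown += rng.choice([-2, -1, 1, 2])   # inject a plausible arithmetic slip
        lines.append(f"{step} -> {shown}")
    return lines, shown

def next_step_correct(plan: dict, log_lines: list[str], last_shown: int, step: str) -> bool:
    """Score the next step. The right answer continues from the last total shown in the
    log, so the only variable is whether earlier mistakes are visible in the context."""
    prompt = (
        f"Lookup table: {plan}\nRunning log:\n" + "\n".join(log_lines) + "\n"
        f"Add the value for '{step}' and reply with the new total only."
    )
    return call_model(prompt).strip() == str(last_shown + plan[step])

# Sweep error_rate from 0.0 to 0.5 across many seeds and plot next-step accuracy.
# A downward slope is self-conditioning at work: the model does worse only
# because its context says it was wrong before.
```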
A New Way of “Thinking”
So, is there a way out of this trap? Yes. The paper highlights that newer models designed to “think” sequentially, reasoning step by step with techniques like Chain of Thought before committing to an answer, don’t suffer from this self-conditioning problem.
Instead of trying to generate a perfect, long answer in one go, these models work step-by-step, almost like a person showing their work on a math problem. By focusing on one correct step at a time, they build a clean, error-free context. This prevents them from getting tripped up by their own mistakes, allowing them to complete much longer and more complex tasks successfully.
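As a rough external analogy, here is a sketch contrasting the two styles on the toy task from earlier. A reasoning model does the stepwise version internally, inside its own chain of thought and within a single turn; the loop below only makes the idea concrete. The prompts and the `call_model` stub are assumptions, not the paper's protocol:

```python
def call_model(prompt: str) -> str:
    raise NotImplementedError  # stub: wire this to your own model API

def run_single_shot(plan: dict, steps: list[str]) -> str:
    """One long generation: every intermediate step must come out right inside a single output."""
    prompt = (
        f"Lookup table: {plan}\n"
        f"Add up the values for {steps} in order and reply with the final total only."
    )
    return call_model(prompt)

def run_stepwise(plan: dict, steps: list[str]) -> int:
    """One settled step at a time: each call sees only the current total, not a long
    trail of its own earlier outputs that might contain a mistake."""
    total = 0
    for step in steps:
        prompt = (
            f"Lookup table: {plan}\nCurrent total: {total}\n"
            f"Add the value for '{step}' and reply with the new total only."
        )
        total = int(call_model(prompt).strip())   # commit one small, checkable step
    return total
```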
This research, available on arXiv, helps settle the debate about why these incredibly powerful models sometimes fail in simple ways. It tells us that the path forward isn’t just about making models that “know” more. It’s about building models that are better doers: models whose LLM task execution is reliable enough to stick to the plan, no matter how long it is. And that’s a crucial step toward creating AI that can handle the complex, real-world challenges we want it to solve.