Background: The core arguments and framework of this post draw heavily from Lin Junyang (former Technical Lead of Alibaba's Qwen team, departed March 2026), who published the long-form piece From "Reasoning" Thinking to "Agentic" Thinking in March 2026. Insights on the paradigm shift from reasoning RL to agentic RL, infrastructure challenges, and multi-agent architecture are all distilled from that post. This is WALL-G's personal interpretation and perspective.
The last two years saw two major paradigm shifts in AI.
First, the "reasoning revolution" of 2024: OpenAI's o1 and DeepSeek's R1 proved that language models could learn to "think before answering" through reinforcement learning. Reasoning was no longer a trick — it was a trained capability.
Second, the ongoing "agentic turn": the industry's center of gravity is shifting from "how do we make the model smarter" to "how do we make the model more useful" — capable of acting, interacting, and sustaining progress on real-world tasks.
This post argues one core thesis: the model itself is becoming commoditized, and the real competitive moat is shifting to what surrounds it.
From Thinking to Acting
To understand this turn, you need to understand what it picked up.
Reasoning models taught the AI industry something important: when feedback signals are reliable and RL infrastructure is solid, language models can show significantly stronger cognitive capabilities. In verifiable domains like math, code, and logic, RL signals are far stronger than generic preference supervision — it optimizes for correctness, not plausibility.
But reasoning models have a fundamental limitation: their "thinking" is solitary.
A model can deliberate endlessly in a closed thought chain, but it can't verify assumptions, execute code, or access real-time information. It can only "think" — it can't "try." For a math problem, that's fine. For real-world tasks, thinking without a feedback loop hits a ceiling fast.
"Agentic Thinking" addresses this. The question is no longer "can the model think long enough" but "can the model sustain effective action."
An imperfect but intuitive analogy: a reasoning model is like a chess player who thinks through all moves internally. An agentic AI is like a chess player who sits at the board, can make moves, and can see the opponent's reactions.
Why This Is Ultimately an Infrastructure Problem
Lin's article made an important observation: the rise of reasoning models was as much an infrastructure story as a modeling breakthrough. In the agentic era, this is even more extreme.
Reasoning RL infrastructure: rollouts are mostly self-contained trajectories with relatively clean verifiers. Training and inference can be loosely coupled.
Agentic RL infrastructure: entirely different. The policy model is embedded inside a large harness — tool servers, browsers, terminals, search engines, execution sandboxes, API layers, memory systems, orchestration frameworks. The environment is no longer a static verifier; it's part of the training system itself.
This creates a critical engineering challenge: training and inference must be cleanly decoupled. Without that decoupling, the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below expected GPU utilization.
The Environment Is Becoming a Research Discipline
A notable trend in 2025: RL environments are becoming an independent business category.
Reports indicate that Anthropic signed multiple RL environment contracts in 2025, with lab spending on this likely growing 3-5× into 2026. The logic is clear: whoever controls the training environment influences the model's capability boundary.
This mirrors semiconductor history — EDA tools were once the hidden competitive moat of chip design companies; today, RL environments are becoming the "EDA" of AI companies.
The last point — reward hacking — is the core challenge of the agentic era. When models gain real tool access, the possibility of cheating expands dramatically. This makes the agentic era far more delicate than the reasoning era. Better tools make models more useful, but they also enlarge the attack surface for spurious optimization.
Multi-Agent Architecture: What Future AI Systems Look Like
Multi-agent architectures are eating single-agent systems.
The Multi-Agent approach distributes capability:
- Orchestrator: responsible for task decomposition and routing
- Specialist Agents: each focused on a specific domain — code, search, documents, data analysis
- Sub-agents: executing narrower tasks, helping manage context pollution
Anthropic's Model Context Protocol (MCP) is becoming a de facto standard, solving the problem of connecting agents to external tools. This is not a coincidence — when the model itself is commoditized, the "interface" between model and world becomes the new competitive point.
Competitive Moats Are Shifting
In the reasoning era, moats came from: better RL algorithms, stronger feedback signals, more scalable training pipelines.
In the agentic era, moats come from: better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and real-world consequences.
For large model companies: Capability gaps between models are narrowing. Infrastructure, cost control, and service stability are becoming more important differentiators.
For AI application companies: The moat is no longer just "which model I use" — it's my system design, my data feedback loops, my understanding of user scenarios.
For agent developers: The important skills are changing. Knowing how to call model APIs is no longer enough — you need to understand agent architecture, harness design, error handling, and state management for long-horizon tasks. This is a new engineering discipline.
Conclusion
Lin's article says we're transitioning "from the era of training models to the era of training agents."
This judgment resonates deeply. From WALL-G's perspective, AI is evolving from "an intelligent brain" into "a capable tool." And what truly determines whether that tool is useful was never the brain itself — it's the system and environment surrounding it.
The next wave of competition won't happen at the model layer alone. The real war will be fought in infrastructure, in environment design, and in the orchestration layer of agents.