Former DeepSeek researcher and collaborators release new method for training reliable AI agents: RAGEN

2025 was, by many expert accounts, supposed to be the year of AI agents — task-specific AI implementations powered by leading large language models (LLMs) and multimodal models, like those offered by OpenAI, Anthropic, Google, and DeepSeek.

But so far, most AI agents remain stuck as experimental pilots in a kind of corporate purgatory, according to a recent poll conducted by VentureBeat on the social network X.

Help may be on the way: a collaborative team from Northwestern University, Microsoft, Stanford, and the University of Washington — including a former DeepSeek researcher named Zihan Wang, currently completing a computer science PhD at Northwestern — has introduced RAGEN, a new system for training and evaluating AI agents that they hope makes them more reliable and less brittle for real-world, enterprise-grade usage.

Unlike static tasks like math solving or code generation, RAGEN focuses on multi-turn, interactive settings where agents must adapt, remember, and reason in the face of uncertainty.

Built on a custom reinforcement learning (RL) framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), the system explores how LLMs can learn through experience rather than memorization. The focus is on entire decision-making trajectories, not just one-step responses.

StarPO operates in two interleaved phases: a rollout stage where the LLM generates complete interaction sequences guided by reasoning, and an update stage where the model is optimized using normalized cumulative rewards. This structure supports a more stable and interpretable learning loop compared to standard policy optimization approaches.
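
The paper describes this at the level of the training loop rather than code, but in rough terms the two phases might fit together as in the sketch below; `policy.rollout`, `policy.update`, and the environment interface are placeholder names for this illustration, not RAGEN's actual API.

```python
# Illustrative sketch of a StarPO-style two-phase loop, not RAGEN's actual API.
# `policy` and `env` are assumed objects: the policy wraps an LLM that can roll
# out a full multi-turn trajectory, and the environment supplies initial states.

def normalize(returns):
    """Scale cumulative trajectory rewards to zero mean and unit variance."""
    mean = sum(returns) / len(returns)
    std = max((sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5, 1e-8)
    return [(r - mean) / std for r in returns]

def starpo_step(policy, env, num_rollouts=8):
    # Rollout stage: generate complete interaction sequences, interleaving
    # reasoning, actions, and environment feedback until each episode ends.
    trajectories = [policy.rollout(env.reset()) for _ in range(num_rollouts)]

    # Update stage: optimize the policy over whole trajectories, weighted by
    # their normalized cumulative rewards rather than single-step responses.
    weights = normalize([t.total_reward for t in trajectories])
    policy.update(trajectories, weights)
```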

The authors implemented and tested the framework using fine-tuned variants of Alibaba’s Qwen models, including Qwen 1.5 and Qwen 2.5. These models served as the base LLMs for all experiments and were chosen for their open weights and robust instruction-following capabilities. This decision enabled reproducibility and consistent baseline comparisons across symbolic tasks.

Here’s how they did it and what they found:

The Echo Trap: how reinforcement learning rewards lead to LLM reasoning loss

Wang summarized the core challenge in a widely shared X thread: "Why does your RL training always collapse?"

According to the team, LLM agents initially generate symbolic, well-reasoned responses. But over time, RL systems tend to reward shortcuts, leading to repetitive behaviors that degrade overall performance—a pattern they call the “Echo Trap.”

This regression is driven by feedback loops where certain phrases or strategies earn high rewards early on, encouraging overuse and stifling exploration.

Wang notes that the symptoms are measurable: reward variance cliffs, gradient spikes, and disappearing reasoning traces.
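
The paper's own monitoring code isn't reproduced here, but a rough illustration of tracking those three symptoms during training could look like the following; the thresholds and input lists are invented for this sketch.

```python
# Hypothetical health check for the collapse symptoms described above.
# Thresholds are arbitrary placeholders, not values from the RAGEN paper.

def detect_echo_trap(rewards, grad_norms, trace_lengths, window=50):
    recent = rewards[-window:]
    mean = sum(recent) / len(recent)
    reward_variance = sum((r - mean) ** 2 for r in recent) / len(recent)

    variance_cliff = reward_variance < 1e-3                        # rewards nearly identical
    avg_grad = sum(grad_norms[-window:]) / len(grad_norms[-window:])
    gradient_spike = max(grad_norms[-window:]) > 10 * avg_grad     # sudden optimization jolt
    shrinking_traces = trace_lengths[-1] < 0.2 * trace_lengths[0]  # reasoning disappearing

    return variance_cliff or gradient_spike or shrinking_traces
```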

RAGEN test environments aren’t exactly enterprise-grade

To study these behaviors in a controlled setting, RAGEN evaluates agents across three symbolic environments:

  • Bandit: A single-turn, stochastic task that tests symbolic risk-reward reasoning.
  • Sokoban: A multi-turn, deterministic puzzle involving irreversible decisions.
  • Frozen Lake: A stochastic, multi-turn task requiring adaptive planning.

Each environment is designed to minimize real-world priors and focus solely on decision-making strategies developed during training.

In the Bandit environment, for instance, agents are told that Dragon and Phoenix arms represent different reward distributions.

Rather than being told the probabilities directly, they must reason symbolically—e.g., interpreting Dragon as “strength” and Phoenix as “hope”—to predict outcomes. This kind of setup pressures the model to generate explainable, analogical reasoning.
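
The paper's environment code isn't reproduced here; the snippet below is a simplified guess at what a symbolic two-arm bandit could look like, with made-up reward distributions standing in for the real ones.

```python
import random

# Simplified, hypothetical symbolic bandit; the arm names match the paper,
# but these reward distributions are invented for illustration.
ARMS = {
    "Dragon":  lambda: random.gauss(1.0, 2.0),   # riskier, higher-variance payoff
    "Phoenix": lambda: random.gauss(0.7, 0.5),   # steadier, lower-variance payoff
}

def pull(arm_name: str) -> float:
    """The agent names an arm in natural language; the environment returns a reward."""
    return ARMS[arm_name]()

# The agent never sees these distributions. It only sees the symbolic names and
# must reason analogically about which arm is more likely to pay off.
```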

Stabilizing reinforcement learning with StarPO-S

To address training collapse, the researchers introduced StarPO-S, a stabilized version of the original framework. StarPO-S incorporates three key interventions:

  1. Uncertainty-based rollout filtering: Prioritizing rollouts where the agent shows outcome uncertainty.
  2. KL penalty removal: Allowing the model to deviate more freely from its original policy and explore new behaviors.
  3. Asymmetric PPO clipping: Amplifying high-reward trajectories more than low-reward ones to boost learning.

These changes delay or eliminate training collapse and improve performance across all three tasks. As Wang put it: “StarPO-S… works across all 3 tasks. Relieves collapse. Better reward.”
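
To make the third intervention, asymmetric PPO clipping, concrete: a minimal sketch of what an asymmetrically clipped PPO surrogate objective can look like in principle is shown below. The specific clip bounds are illustrative and not taken from the paper.

```python
import torch

def asymmetric_ppo_loss(log_probs, old_log_probs, advantages,
                        clip_low=0.2, clip_high=0.3):
    """PPO-style surrogate loss with a wider upper clip bound, so that
    positive-advantage (high-reward) trajectories can move the policy further
    than negative ones. Bounds are illustrative, not the paper's values."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```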

What makes for a good agentic AI model?

The success of RL training hinges not just on architecture, but on the quality of the data generated by the agents themselves. The team identified three dimensions that significantly impact training:

  • Task diversity: Exposing the model to a wide range of initial scenarios improves generalization.
  • Interaction granularity: Allowing multiple actions per turn enables more meaningful planning.
  • Rollout freshness: Keeping training data aligned with the current model policy avoids outdated learning signals.

Together, these factors make the training process more stable and effective.
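
In practice, these three dimensions show up as knobs on how training data is generated. The field names below are invented for illustration and do not correspond to RAGEN's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical rollout configuration illustrating the three dimensions above;
# field names are invented and do not match RAGEN's real config.
@dataclass
class RolloutConfig:
    num_initial_states: int = 64   # task diversity: varied starting scenarios per batch
    actions_per_turn: int = 5      # interaction granularity: allow multi-action turns
    rollout_staleness: int = 1     # rollout freshness: regenerate data every N policy updates
```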

An interactive demo site published by the researchers on GitHub makes this explicit, visualizing agent rollouts as full dialogue turns—including not just actions, but the step-by-step thought process that preceded them.

For example, in solving a math problem, an agent may first ‘think’ about isolating a variable, then submit an answer like ‘x = 5’. These intermediate thoughts are visible and traceable, making it easier to see how agents arrive at their decisions.
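
A mocked-up turn in that spirit is shown below; the exact fields and markup RAGEN's visualizer displays may differ.

```python
# Mocked-up rollout turn, loosely in the spirit of the demo's visualization;
# the actual structure RAGEN displays may differ.
example_turn = {
    "think": "Isolate x: subtract 3 from both sides, then divide by 2.",
    "action": "x = 5",
    "env_feedback": {"correct": True, "reward": 1.0},
}
```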

When reasoning runs out

While explicit reasoning improves performance in simple, single-turn tasks like Bandit, it tends to decay during multi-turn training. Despite the use of structured prompts and <think> tokens, reasoning traces often shrink or vanish unless directly rewarded.

This points to a limitation in how rewards are typically designed: focusing on task completion may neglect the quality of the process behind it. The team experimented with format-based penalties to encourage better-structured reasoning, but acknowledges that more refined reward shaping is likely needed.
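
One simple shape such a format-based penalty could take is sketched below; the tag check and penalty value are illustrative, not the scheme used in the paper.

```python
import re

def shaped_reward(task_reward: float, response: str, penalty: float = 0.1) -> float:
    """Subtract a small penalty when the response skips an explicit reasoning block.
    The <think>-tag check and penalty value are illustrative placeholders."""
    has_reasoning = bool(re.search(r"<think>.+?</think>", response, re.DOTALL))
    return task_reward if has_reasoning else task_reward - penalty
```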

RAGEN, along with its StarPO and StarPO-S frameworks, is now available as an open-source project at https://github.com/RAGEN-AI/RAGEN. However, no explicit license is listed in the GitHub repository at the time of writing, which may limit use or redistribution by others.

The system provides a valuable foundation for those interested in developing AI agents that do more than complete tasks—they think, plan, and evolve.

As AI continues to move toward autonomy, projects like RAGEN help illuminate what it takes to train models that learn not just from data, but from the consequences of their own actions.

Outstanding questions for real-world adoption

While the RAGEN paper offers a detailed technical roadmap, several practical questions remain for those looking to apply these methods in enterprise settings. For example, how transferable is RAGEN’s approach beyond stylized, symbolic tasks? Would businesses need to design entirely new environments and reward functions to use this system in workflows like invoice processing or customer support?

Another critical area is scalability. Even with the enhancements provided by StarPO-S, the paper acknowledges that training still eventually collapses over longer horizons. This raises the question: is there a theoretical or practical path to sustaining reasoning over open-ended or continuously evolving task sequences?

To explore these and other questions—including how non-technical decision-makers should interpret RAGEN’s implications—I reached out to co-author Wang for further insight. At the time of writing, a response is pending. Should any comments arrive, they will be included in a follow-up to this article or integrated as an update.

RAGEN stands out not just as a technical contribution but as a conceptual step toward more autonomous, reasoning-capable AI agents. Whether it becomes part of the enterprise AI stack remains to be seen, but its insights into agent learning dynamics are already helping redefine the frontier of LLM training.