Inside Tongyi DeepResearch: Building Research Agents with Mixture-of-Experts and On-Policy RL

Hook

Most LLMs answer questions. Tongyi DeepResearch spends variable time orchestrating multiple web searches, reading papers, and synthesizing multi-source reports—then learns from its own failures through on-policy reinforcement learning.

Context

The gap between answering a question and conducting research is vast. When you ask a typical LLM about quantum computing advances, you get a snapshot from its training cutoff. When you ask a proper research agent, you want it to search recent papers, cross-reference claims, identify contradictions, and synthesize a comprehensive report—the kind of work that takes a human researcher hours.

Tongyi DeepResearch tackles this long-horizon information-seeking problem with a specialized architecture: a 30.5B parameter mixture-of-experts model with only 3.3B parameters activated per token, trained specifically for multi-step research tasks. The system underwent continual pre-training on synthetic agentic interaction data, supervised fine-tuning on research workflows, and custom reinforcement learning to handle the non-stationary environment of web research. The result is state-of-the-art performance on benchmarks including Humanity’s Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and SimpleQA—tasks requiring multiple coordinated search-and-synthesis steps.

Technical Insight

The architecture centers on a three-stage training pipeline that’s unusual in its commitment to synthetic data generation. Tongyi’s team built a fully automated pipeline generating training data across pre-training, supervised fine-tuning, and RL stages without manual curation. This matters because high-quality agentic interaction data—complete traces of planning, tool calls, error recovery, and synthesis—is scarce and expensive to collect at scale.

The model itself uses a mixture-of-experts architecture where 30.5B total parameters are available, but only 3.3B activate for any given token. This sparse activation pattern keeps inference costs reasonable while maintaining the capacity needed for complex reasoning. The 128K context window is critical for research tasks—when you’re reading multiple academic papers and synthesizing findings, you need room to keep source material in context without aggressive summarization that loses nuance.
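
The mechanics of sparse activation are easiest to see in a toy router: a gate scores every expert for each token, but only the top-k expert networks actually run, so most parameters stay idle. Everything here—the shapes, the linear experts, the softmax gating—is illustrative, not Tongyi's actual implementation:

```python
import numpy as np

def moe_forward(x, experts, gate_weights, top_k=2):
    """Route one token through only the top-k experts (sparse activation)."""
    scores = x @ gate_weights                    # one gate score per expert
    top = np.argsort(scores)[-top_k:]            # indices of the top-k experts
    exp = np.exp(scores[top] - scores[top].max())
    probs = exp / exp.sum()                      # softmax over the chosen experts only
    # Only the top-k expert networks execute; the rest contribute nothing
    return sum(p * experts[i](x) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
num_experts, dim = 8, 16
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((dim, dim)) * 0.1)
           for _ in range(num_experts)]
gate = rng.standard_normal((dim, num_experts))
y = moe_forward(rng.standard_normal(dim), experts, gate)
```

With 8 experts and top_k=2, only a quarter of the expert parameters run per token—the same principle that lets a 30.5B-parameter model activate only 3.3B.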

At inference, the system supports two distinct modes. The lightweight ReAct mode follows the standard reason-act-observe loop. The model generates a thought, decides on a tool call (web search via Serper API, page reading via Jina, file parsing via Dashscope), observes the result, and continues. Here’s pseudocode representative of the architecture:

# Pseudocode representation of the ReAct inference loop
query = "What are the latest breakthroughs in quantum error correction?"
context = []
max_steps = 20

for step in range(max_steps):
    # Model generates a thought and the next action from the query plus history
    thought, action, action_input = model.generate(
        query=query,
        context=context,
        mode="react"
    )

    # A "finish" action means the model judges the research complete
    if action == "finish":
        report = model.synthesize_report(context)
        break

    # Execute the chosen tool call
    if action == "search":
        observation = serper_api.search(action_input)      # web search via Serper
    elif action == "read_page":
        observation = jina_api.read(action_input)          # page reading via Jina
    elif action == "parse_document":
        observation = dashscope_api.parse(action_input)    # file parsing via Dashscope
    else:
        observation = f"Unknown action: {action}"

    # Add the observation to the running context
    context.append({
        "thought": thought,
        "action": action,
        "observation": observation
    })

The ‘Heavy’ mode uses IterResearch-based test-time scaling—essentially allowing the model to spend more compute during inference by iterating on its research plan, backtracking when searches prove unfruitful, and refining queries based on partial findings. This is where the model achieves maximum performance on benchmarks, though with variable response times.
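
The iterate-backtrack-refine pattern can be sketched in a few lines. This is a hypothetical toy, not IterResearch itself: `run_research_round` stands in for a full search-and-read iteration, and the quality score stands in for whatever internal sufficiency check the real system uses:

```python
import random

def run_research_round(query, rng):
    """Toy stand-in for one research iteration: returns (quality, refined_query)."""
    return rng.random(), query + " (refined)"

def heavy_mode(query, max_rounds=4, good_enough=0.8, seed=0):
    """Hypothetical sketch of test-time scaling: iterate, keep the best
    state found so far, and stop early once findings are strong enough."""
    rng = random.Random(seed)
    best_quality, best_query = 0.0, query
    for round_num in range(1, max_rounds + 1):
        quality, refined = run_research_round(best_query, rng)
        if quality > best_quality:       # this round improved on the plan: keep it
            best_quality, best_query = quality, refined
        # otherwise "backtrack": discard the round and retry from the best state
        if best_quality >= good_enough:  # early exit saves compute on easy queries
            break
    return best_quality, round_num

quality, rounds = heavy_mode("quantum error correction breakthroughs")
```

The compute-quality trade-off is explicit: raising `max_rounds` buys more chances to improve the answer at the cost of proportionally longer response times.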

The reinforcement learning stage deserves particular attention. Training agents is notoriously unstable because the environment itself changes as the model learns—a phenomenon called non-stationarity. Tongyi’s team developed a custom Group Relative Policy Optimization (GRPO) framework with three key innovations. First, token-level policy gradients provide fine-grained credit assignment, so the model learns which specific reasoning steps led to successful research outcomes. Second, leave-one-out advantage estimation compares each trajectory against others in its batch to reduce variance. Third, selective filtering of negative samples prevents the model from over-updating on spurious failures caused by API timeouts or transient search failures rather than reasoning errors.
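
The leave-one-out baseline is simple to state: each trajectory's advantage is its reward minus the mean reward of the *other* trajectories in its group, which centers the estimate without needing a learned value function. A minimal sketch with illustrative numbers:

```python
def leave_one_out_advantages(rewards):
    """Advantage of each trajectory = its reward minus the mean of the others."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# A group of 4 rollouts for the same research question: two succeeded, two failed
rewards = [1.0, 0.0, 0.0, 1.0]
advs = leave_one_out_advantages(rewards)
# Successes get positive advantage, failures negative; the group sums to ~0
```

The selective-filtering innovation would then operate on top of this: negative-advantage trajectories whose failure traces back to an API timeout rather than a reasoning error are dropped before the gradient update, so the policy is not punished for the environment's flakiness.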

The continual pre-training phase is also worth examining. Rather than fine-tuning a general LLM on agentic tasks, they continued pre-training on a corpus of synthetic agentic interactions. This teaches the model the structure of multi-step research at a deeper level than instruction-following alone. The model learns implicit patterns like “when initial search results are too general, refine the query with domain-specific terminology” or “when sources contradict, search for meta-analyses or recent reviews.” These aren’t explicitly programmed behaviors—they emerge from exposure to diverse research trajectories during pre-training.

The external tool integration is pragmatic rather than elegant. Web search goes through Serper’s API, page reading through Jina’s reader API, file parsing through Dashscope, and code execution through SandboxFusion. Each integration point is a potential failure mode, which the online demos acknowledge explicitly with warnings about intermittent failures and variable response times. The model includes capabilities to handle tool failures, but the dependency chain remains a fundamental architectural constraint.
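
Because every external API is a failure point, production deployments typically wrap each tool call in retries with exponential backoff. A generic sketch—the wrapper and its parameters are illustrative, not part of the Tongyi codebase:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Wrap fn so transient failures trigger exponential-backoff retries."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except exceptions:
                if attempt == max_attempts:
                    raise                              # surface the error after the last try
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapper

# Usage: a tool call that fails twice (e.g. QPS limit), then succeeds
calls = {"n": 0}
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("tool QPS limit")
    return ["result for " + query]

safe_search = with_retries(flaky_search, max_attempts=3, base_delay=0.01)
results = safe_search("quantum error correction")
```

Retries paper over transient failures but lengthen the dependency chain's worst-case latency—which is exactly the trade-off the demo warnings hint at.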

Gotcha

The operational reality of running Tongyi DeepResearch is significantly more complex than the benchmark numbers suggest. The system requires Python 3.10.0 specifically—not 3.10.x, not 3.11—and a constellation of API keys for external services: Serper for web search, Jina for page reading, Dashscope for document parsing, and SandboxFusion for code execution. Each of these is a paid service with rate limits and potential downtime. The online demos include explicit warnings that responses “may vary or fail intermittently due to model latency and tool QPS limits,” which is a polite way of saying the system isn’t production-ready out of the box.

Latency is the other consideration. The ‘Heavy’ mode that achieves state-of-the-art benchmark performance involves multiple research iterations. This makes sense for deep research tasks—you’re automating significant research work—but it’s a poor fit for interactive applications requiring quick responses. Even the lightweight ReAct mode involves multiple round-trips to external APIs, each adding latency. The 30.5B parameter model also requires substantial GPU memory even with sparse activation. The README’s quick start guide conspicuously lacks actual deployment instructions beyond downloading the weights, suggesting the team expects users to have existing LLM serving infrastructure rather than providing a turnkey solution.

Verdict

Use Tongyi DeepResearch if you’re building applications where research depth justifies extended response times: academic literature reviews, competitive intelligence reports, comprehensive fact-checking workflows, or investigative research assistants. The specialized training for long-horizon information-seeking tasks delivers results optimized for synthesizing information across multiple sources or tracking down technical details. The MoE architecture keeps inference costs reasonable relative to the task complexity, and the 128K context window handles the long documents typical of research workflows. It’s also a valuable resource if you’re researching agent architectures—the synthetic data generation pipeline and custom GRPO implementation offer practical lessons beyond what academic papers typically reveal. Skip it if you need conversational AI, have strict latency requirements, lack the infrastructure to manage multiple external API dependencies, or want a simple pip-installable library. The deployment complexity and external service dependencies make this a poor choice for prototyping or small-scale applications. Tongyi DeepResearch occupies a specific niche: batch research tasks where quality matters more than speed and you have the operational maturity to manage a complex, multi-service deployment.
