Tongyi DeepResearch: The First Open-Source MoE Agent Built for Multi-Step Research

Hook

While most LLMs are trained on static text, Tongyi DeepResearch learned from millions of synthetic agent interactions—teaching it to think like a researcher who knows when to search, when to read, and when to synthesize.

Context

Large language models excel at answering questions they've memorized during training, but they struggle with complex research tasks that require multiple steps: searching for current information, reading through lengthy documents, synthesizing contradictory sources, and iteratively refining queries based on partial findings. Traditional approaches either use general-purpose LLMs with bolted-on tool access (leading to clumsy API calls and poor reasoning about when to use which tool) or rely on hand-crafted agent frameworks that break down on anything beyond simple workflows.

Alibaba's NLP team identified a fundamental gap: no open-source model was specifically optimized for the deep research use case. Models like GPT-4 can use tools, but they weren't trained to think like researchers. Autonomous agent frameworks like AutoGPT provide the scaffolding, but rely on models that don't understand long-horizon planning. DeepResearch attacks this problem at the model level. A three-phase training pipeline teaches a 30.5B parameter Mixture-of-Experts model to perform multi-step information-seeking tasks natively: continual pre-training on synthetic agentic data, then supervised fine-tuning on research trajectories, then end-to-end reinforcement learning with specialized policy optimization.

Technical Insight

The core innovation lies in the training recipe rather than the architecture: the model is trained on synthetic agentic interaction data before fine-tuning. Most instruction-tuned models learn tool use through supervised examples alone, but DeepResearch undergoes large-scale continual pre-training on millions of generated research trajectories. This teaches the model fundamental patterns of when to search versus when to synthesize, how to formulate better follow-up queries based on partial answers, and how to integrate information across multiple sources.

The MoE architecture has 30.5B total parameters with only 3.3B activated per token, trading off inference efficiency against capacity. Each forward pass routes every token through a small subset of expert networks, so the model can develop distinct reasoning patterns for different research subtasks without the per-token compute of a 30B dense model. This matters for production deployments running hundreds of multi-step research queries daily: the active parameter count drives per-token compute and memory bandwidth, while the total parameter count still sets how much GPU memory the weights occupy.
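
To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. It is a toy illustration under stated assumptions, not the model's actual router: the expert count (8), the number of active experts (2), the scalar "experts", and the function names `route_token` and `moe_layer` are all invented for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of only the selected experts."""
    gates = route_token(router_logits, top_k)
    return sum(weight * experts[i](token) for i, weight in gates.items())

# Toy setup (assumed sizes): 8 scalar experts, 2 active per token.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]

gates = route_token(logits, top_k=2)
output = moe_layer(3.0, experts, logits, top_k=2)
print(gates)   # only two experts carry weight for this token
print(output)
```

Only the two selected experts are evaluated, which is why per-token compute tracks the active parameter count even though every expert's weights must stay loaded.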

At inference time, DeepResearch supports two operational modes through different prompting strategies. The standard ReAct mode follows a thought-action-observation loop familiar from other agent frameworks:

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "num_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_page",
            "description": "Read and extract content from a URL",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string"}
                },
                "required": ["url"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="qwen-plus-2025-01-24",  # placeholder model ID; replace with the DeepResearch model name your endpoint exposes
    messages=[
        {"role": "user", "content": "What are the latest developments in quantum error correction as of 2025?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Model generates tool calls based on research needs
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Tool: {tool_call.function.name}")
        print(f"Args: {tool_call.function.arguments}")
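
The request above covers only the first turn. A full thought-action-observation loop keeps calling the model, runs whatever tools it asks for, and feeds the results back until it answers in plain text. The sketch below shows that control flow with no network access. Everything in it is an assumption for illustration: `run_react_loop`, `scripted_model`, and `stub_tools` are hypothetical names, and the message dictionaries are a simplified version of the OpenAI-style shape rather than the exact SDK objects.

```python
import json

def run_react_loop(call_model, tool_handlers, question, max_steps=8):
    """Alternate model turns and tool executions until the model answers in text.

    call_model(messages) must return a dict with "content" (str or None) and
    "tool_calls" (a list of {"id", "name", "arguments"} dicts, possibly empty).
    """
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append({"role": "assistant", **reply})
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"], messages  # no tool requested: final answer
        for call in calls:
            handler = tool_handlers[call["name"]]
            args = json.loads(call["arguments"])  # arguments arrive as a JSON string
            observation = handler(**args)
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": observation}
            )
    return None, messages  # step budget exhausted without an answer

def scripted_model(messages):
    """Stand-in for the real model: search once, read once, then answer."""
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    if tool_turns == 0:
        args = json.dumps({"query": "quantum error correction 2025"})
        return {"content": None, "tool_calls": [{"id": "c1", "name": "web_search", "arguments": args}]}
    if tool_turns == 1:
        args = json.dumps({"url": "https://example.com/qec"})
        return {"content": None, "tool_calls": [{"id": "c2", "name": "read_page", "arguments": args}]}
    return {"content": "Summary based on 2 observations.", "tool_calls": []}

# Stub tools that return canned strings instead of calling real services.
stub_tools = {
    "web_search": lambda query, num_results=5: f"stub results for {query!r}",
    "read_page": lambda url: f"stub page text from {url}",
}

answer, transcript = run_react_loop(
    scripted_model, stub_tools, "What changed in quantum error correction?"
)
print(answer)
print([m["role"] for m in transcript])
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a model that never stops calling tools would otherwise run indefinitely.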

The 'Heavy' mode switches to IterResearch prompting, which implements test-time scaling by encouraging the model to iterate multiple times on the same query, refining its understanding with each pass. This consumes more tokens but produces higher-quality results on complex questions where the first search results don't fully answer the query.
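
The iterate-and-refine pattern can be sketched as an outer loop around individual research passes. This is a rough illustration, not the project's implementation: `iterative_research` and `research_round` are hypothetical names, and the choice to carry forward only a running report between passes (rather than the full history) is an assumption of this sketch.

```python
def iterative_research(research_round, question, max_rounds=3):
    """Run several research passes, carrying forward a distilled report.

    research_round(question, report) is a hypothetical callable returning
    (updated_report, done). Each pass receives the question plus the current
    report and decides whether the answer is complete.
    """
    report = ""
    for round_index in range(1, max_rounds + 1):
        report, done = research_round(question, report)
        if done:
            return report, round_index
    return report, max_rounds  # budget spent; return the best report so far

def stub_round(question, report):
    """Stand-in pass: adds one finding per call and stops after three."""
    findings = report.split("; ") if report else []
    findings.append(f"finding {len(findings) + 1}")
    return "; ".join(findings), len(findings) >= 3

final_report, rounds_used = iterative_research(stub_round, "example question", max_rounds=5)
print(rounds_used)
print(final_report)
```

Each extra round costs another full pass of tokens and tool calls, which is the trade-off the paragraph above describes.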

The reinforcement learning phase uses a customized form of Group Relative Policy Optimization (GRPO). Research agents are awkward to train with standard PPO because the environment is non-stationary: the same action sequence can earn a different reward once web search results change. GRPO drops the learned value baseline. For each question it samples a group of rollouts and scores every rollout against the others in its group, here with leave-one-out advantage estimation, so each rollout's baseline is the mean reward of the remaining rollouts. The policy-gradient loss is then applied at the token level rather than averaged per trajectory. Comparing rollouts of the same question against each other helps keep training stable even when external tool outputs vary.
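
A small numeric sketch may help with the two ingredients named above. It assumes scalar rewards per rollout and plain per-token log-probabilities, and it leaves out the importance ratios and clipping that a full GRPO objective would include. The function names are illustrative, not taken from the project's code.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each rollout: its reward minus the mean reward of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two rollouts per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def token_level_loss(token_logprobs_per_rollout, advantages):
    """Policy-gradient surrogate averaged over every token in the group.

    Averaging over tokens (not over trajectories) means long rollouts are
    not down-weighted relative to short ones.
    """
    weighted = 0.0
    token_count = 0
    for logprobs, adv in zip(token_logprobs_per_rollout, advantages):
        weighted += sum(-adv * lp for lp in logprobs)
        token_count += len(logprobs)
    return weighted / token_count

# One group: four rollouts for the same question, two judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(rewards)

# Made-up per-token log-probabilities for each rollout.
logprobs = [[-0.1, -0.2], [-0.5], [-0.3, -0.3, -0.3], [-0.4]]
loss = token_level_loss(logprobs, advantages)
print(advantages)
print(loss)
```

Successful rollouts get positive advantages and failed ones negative, so minimizing the loss raises the probability of tokens from the rollouts that beat their peers.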

The synthetic data generation pipeline is equally sophisticated. Rather than manually curating research trajectories, the system uses stronger models to generate questions, then has weaker models attempt them while a critic evaluates the quality of tool usage, information extraction, and final synthesis. High-quality trajectories become training data, creating a self-improving flywheel. This automation is crucial. The project reports that the model's score on Humanity's Last Exam (a benchmark of extremely difficult questions) exceeds GPT-4's, which suggests the synthetic data distribution captures real-world research complexity.
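
The filtering step in that pipeline amounts to rejection sampling against a critic score. The sketch below shows the shape of it under heavy assumptions: the `Trajectory` fields, the `0.8` threshold, and the toy `stub_critic` are all invented here, and a real critic would be a model judging tool usage, extraction, and synthesis quality.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    question: str
    steps: list          # tool calls and observations, in order
    final_answer: str

def filter_trajectories(trajectories, critic, threshold=0.8):
    """Keep only trajectories whose critic score meets the threshold."""
    kept = []
    for traj in trajectories:
        score = critic(traj)
        if score >= threshold:
            kept.append((traj, score))
    return kept

def stub_critic(traj):
    """Toy critic: half credit for any tool step, half for a non-empty answer."""
    score = 0.0
    if traj.steps:
        score += 0.5
    if traj.final_answer.strip():
        score += 0.5
    return score

candidates = [
    Trajectory("q1", ["search", "read"], "a grounded answer"),
    Trajectory("q2", [], "an answer with no tool use"),
    Trajectory("q3", ["search"], ""),
]
training_set = filter_trajectories(candidates, stub_critic)
print([traj.question for traj, _ in training_set])
```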

Tool integration happens through standardized APIs: Serper for web search, Jina Reader for HTML-to-markdown conversion, Dashscope for file parsing (PDFs, DOCs), and SandboxFusion for safe Python execution. The model learns to chain these tools effectively through its training rather than relying on hard-coded heuristics. It might search, read three pages, realize it needs more specific information, construct a refined query, search again, then synthesize findings—all autonomously based on the intermediate results.

Gotcha

The deployment story is rougher than the benchmarks suggest. DeepResearch depends on four external tool services (Serper, Jina Reader, Dashscope, and SandboxFusion) just to function, and the free demo's instability (the README warns about 'failures' and 'long response times') is not theoretical. Running this in production means either paying for commercial API access to Serper, Jina, and Dashscope, or replacing them with self-hosted alternatives, which defeats much of the convenience. The hard Python 3.10.0 requirement invites dependency conflicts in mixed environments. You will also need substantial GPU resources: all 30.5B parameters must be resident even though only 3.3B are active per token, so fp16 weights alone take about 61 GB, before the KV cache and activations. Realistically that means an 80 GB A100 or better.
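
The memory figure follows from simple arithmetic on the parameter counts quoted in this article. The sketch below counts weights only and ignores the KV cache, activations, and framework overhead. The 4-bit row is a hypothetical comparison, not a claim that a quantized checkpoint is available.

```python
def weight_memory_gb(param_count, bytes_per_param):
    """Memory for model weights alone, in decimal gigabytes."""
    return param_count * bytes_per_param / 1e9

TOTAL_PARAMS = 30.5e9   # total parameters, from the article
ACTIVE_PARAMS = 3.3e9   # parameters activated per token, from the article

fp16_total = weight_memory_gb(TOTAL_PARAMS, 2)     # all experts must be loaded
fp16_active = weight_memory_gb(ACTIVE_PARAMS, 2)   # weights read per token
int4_total = weight_memory_gb(TOTAL_PARAMS, 0.5)   # hypothetical 4-bit weights

print(f"fp16, all weights:   {fp16_total:.2f} GB")
print(f"fp16, active/token:  {fp16_active:.2f} GB")
print(f"4-bit, all weights:  {int4_total:.2f} GB")
```

The gap between the first two numbers is the MoE trade-off in one line: capacity is sized by total parameters, while per-token work is sized by active ones.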

The model's research quality advantage comes at a latency cost. Multi-step agentic workflows inherently take longer than single-shot generation, and the 'Heavy' mode's test-time scaling can consume thousands of tokens per query. If you're building a user-facing feature where sub-second response times matter, this architecture won't work—you're looking at 10-60 second research sessions depending on query complexity. The tool-calling overhead compounds this; every web search or page read adds network round-trips that block the generation pipeline.

Verdict

Use if: You're building applications that need genuine multi-step research capabilities (scientific literature review, comprehensive fact-checking, competitive intelligence gathering) where accuracy justifies 30+ second response times, you have GPU infrastructure for 30B models or budget for Alibaba's managed Bailian service, and you can handle the operational complexity of multiple API dependencies. The model's specialized training makes it genuinely better at long-horizon research than general-purpose alternatives.

Skip if: You need fast question-answering (under 5 seconds), lack the infrastructure for large model deployment, require simple single-step tool use where GPT-4 with function calling suffices, or want a batteries-included solution without API management overhead.

For most production use cases, Perplexity's managed service or GPT-4 with custom tool integration will be simpler despite slightly lower benchmark scores.
