
Inside Tongyi DeepResearch: How Alibaba Built a 30B-Parameter AI Research Agent with Synthetic Training Data

Hook

What if you could train a state-of-the-art AI research agent without a single manually labeled example? Alibaba's Tongyi DeepResearch does exactly that, using a fully automated synthetic data pipeline to build a model that outperforms larger competitors on complex information-seeking tasks.

Context

Traditional LLMs excel at answering questions they’ve seen during training, but they struggle with the messy, iterative process that defines real research: formulating sub-questions, searching multiple sources, reconciling conflicting information, and synthesizing findings into coherent reports. Existing solutions fall into two camps: closed-source services like Perplexity that hide their architectures behind APIs, or general-purpose models retrofitted with tool-calling capabilities that weren’t designed for long-horizon research workflows.

Tongyi DeepResearch emerges from Alibaba’s realization that research-specific capabilities require research-specific training. Rather than treating web search and document analysis as afterthoughts bolted onto a conversational model, they built an agent from the ground up for information-seeking tasks. The key innovation isn’t just the 30.5B parameter Mixture-of-Experts architecture—it’s the three-stage training pipeline that generates all its own training data through synthetic interactions, eliminating the bottleneck of human annotation that has constrained previous agent development efforts.

Technical Insight

[Figure: system architecture (auto-generated). The base MoE model (3.3B active parameters) feeds a three-stage training pipeline: Stage 1, continual pre-training on synthetic agentic trajectories; Stage 2, supervised fine-tuning on high-quality examples; Stage 3, GRPO with token-level RL. The resulting trained agent runs in either ReAct mode (a fast action-observation loop) or IterResearch mode (multi-branch reasoning via test-time compute), exchanging tool calls and observations with external tools (search, reader, sandbox) and returning an agent response plus its trajectory.]

DeepResearch’s architecture centers on a sparse MoE model with 3.3B active parameters per token, providing the inference efficiency of a small model with the capacity of a much larger one. But the real engineering achievement is the training pipeline. Stage one performs continual pre-training on synthetic agentic trajectories—not human conversations, but programmatically generated sequences of tool calls, observations, and reasoning steps. Stage two applies supervised fine-tuning on higher-quality synthetic examples. Stage three uses Group Relative Policy Optimization (GRPO), an on-policy RL approach with token-level gradients that handles the non-stationary environment created by the agent’s own improving policy.
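
GRPO's core idea is to replace a learned value baseline with statistics computed over a group of rollouts sampled from the same prompt. A simplified illustration of the advantage computation (my own sketch with my own function names, not DeepResearch's implementation):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward
    against the group mean and standard deviation, so no separate
    value network is needed."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four rollouts for one research question, scored by outcome quality.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Positive advantage -> reinforce every token of that trajectory;
# negative -> suppress. The scalar is broadcast token-wise, which is
# what "token-level gradients" refers to in the pipeline description.
```

Because the baseline is recomputed from fresh on-policy rollouts at every step, it tracks the agent's own improving policy rather than a stale value estimate.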

The framework supports two inference modes. The lightweight ReAct mode follows the standard action-observation loop:

from deepresearch import DeepResearchAgent

agent = DeepResearchAgent(
    model_path="Alibaba-NLP/DeepResearch",
    tools=["serper_search", "jina_reader", "python_sandbox"]
)

# ReAct mode for faster inference
response = agent.query(
    question="What are the key differences between attention mechanisms in Vision Transformers versus traditional CNNs?",
    mode="react",
    max_iterations=10
)

print(response.answer)
for step in response.trajectory:
    print(f"Action: {step.action}, Tool: {step.tool}, Observation: {step.observation[:100]}...")

The ‘Heavy’ IterResearch mode implements test-time compute scaling by allowing the model to spawn multiple reasoning branches, explore different search strategies, and iteratively refine its understanding. This trades latency for accuracy on complex queries that benefit from deeper exploration.
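
The intuition behind this mode, spending more inference compute by exploring several branches and keeping the best, can be shown with a toy best-of-n loop. This is my own illustrative sketch of test-time scaling, not the IterResearch algorithm itself:

```python
import random

def run_branch(question, seed):
    """Stand-in for one reasoning branch. In the real system this
    would be a full search/read/reason rollout with its own strategy."""
    rng = random.Random(seed)
    answer = f"draft answer #{seed} to: {question}"
    score = rng.random()  # stand-in for a self-assessed quality score
    return answer, score

def best_of_n(question, n=4):
    """Test-time compute scaling: run n independent branches,
    return the highest-scoring one. Latency grows with n."""
    branches = [run_branch(question, seed) for seed in range(n)]
    return max(branches, key=lambda b: b[1])

answer, score = best_of_n("How do ViT and CNN attention differ?", n=8)
```

The real mode refines branches iteratively rather than scoring one-shot drafts, but the trade-off is the same: n times the tool calls and model passes for a better final answer.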

What makes the synthetic data generation particularly clever is how it handles the chicken-and-egg problem of needing good agent behavior to generate training data for good agent behavior. The system bootstraps from a base model’s reasoning capabilities, generates initial trajectories using heuristic policies, filters them based on outcome quality, and then uses the trained model to generate increasingly sophisticated examples for the next training stage. The RL phase applies leave-one-out advantage estimation and selective negative sampling to avoid reward hacking—filtering out trajectories where the agent appears to succeed through spurious correlations rather than genuine reasoning.
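
Leave-one-out advantage estimation baselines each trajectory's reward against the mean reward of the *other* trajectories in its group, so a rollout is only credited for beating its peers. A minimal sketch (my own, following the standard RLOO formulation):

```python
def loo_advantages(rewards):
    """Leave-one-out advantage: each rollout's reward minus the
    mean reward of the other rollouts in the same group."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Group of 4 rollouts: two succeed (reward 1), two fail (reward 0).
adv = loo_advantages([1.0, 0.0, 0.0, 1.0])
# Each success scores 1 - 1/3 = +2/3; each failure 0 - 2/3 = -2/3.
# Advantages sum to zero, so the group is its own baseline.
```

Selective negative sampling then drops a portion of the negative-advantage trajectories before the gradient step, which is how spurious "failures" (or lucky successes) are kept from dominating the update.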

The tool integration layer abstracts external services behind a unified interface. When the model generates a tool call, the framework handles API authentication, rate limiting, retry logic, and result formatting:

# Example of extending with custom tools
from deepresearch.tools import BaseTool

class ArXivSearchTool(BaseTool):
    def __init__(self, api_key):
        self.api_key = api_key
        self.name = "arxiv_search"
        self.description = "Search academic papers on arXiv"
    
    def execute(self, query: str, max_results: int = 5) -> dict:
        # Implementation with error handling, retries, etc.
        results = self._call_arxiv_api(query, max_results)
        return {
            "papers": results,
            "count": len(results),
            "query": query
        }

agent.register_tool(ArXivSearchTool(api_key=ARXIV_KEY))

This design means you can extend the agent’s capabilities by implementing the BaseTool interface rather than fine-tuning the model for new domains. The MoE architecture further enables this extensibility—different expert modules specialize in different tool interactions, learned implicitly during training rather than through explicit routing rules.
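
If you need the retry-with-backoff behavior the framework is described as providing, but outside the framework (say, inside your own `execute` method), it can be approximated in a few lines. A generic sketch, not DeepResearch's internal code:

```python
import time

def call_with_retries(fn, *args, retries=3, backoff=1.0, **kwargs):
    """Retry a flaky tool call with exponential backoff.
    Re-raises the last exception once retries are exhausted."""
    delay = backoff
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # double the wait between attempts

# e.g. inside ArXivSearchTool.execute:
#   results = call_with_retries(self._call_arxiv_api, query, max_results)
```

In production you would catch only the transient exception types your API client actually raises (timeouts, 429s) rather than bare `Exception`.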

Gotcha

The setup friction is real. You’ll need API keys for Serper (web search), Jina (content extraction), and Dashscope (document parsing), plus a Python execution sandbox if you want the full capability set. Each service has its own rate limits, pricing tiers, and failure modes. During testing, the Jina reader occasionally times out on JavaScript-heavy sites, Serper’s free tier burns through quickly with deep research tasks, and the Python sandbox requires careful security configuration if you’re exposing it beyond localhost. The documentation assumes you’re familiar with these services and doesn’t hold your hand through the gotchas—like Jina’s rendering quirks with certain PDF formats or Serper’s regional availability limitations.

Performance is another consideration. Even with the MoE efficiency gains, you’re running a 30B parameter model. On a single A100, expect 2-3 seconds per token in ReAct mode, and IterResearch can take minutes for complex queries as it explores multiple reasoning paths. The public demo frequently hits capacity limits, and the suggestion to use Alibaba’s managed Bailian service feels less like a helpful alternative and more like an acknowledgment that self-hosting isn’t practical for most users. The specific Python 3.10.0 requirement also caused dependency conflicts with other ML tools in our environment—downgrading PyTorch to a compatible version introduced subtle numerical differences that affected reproducibility.

Verdict

Use if: You’re building research automation for domains with complex, multi-step information needs (competitive intelligence, literature reviews, technical due diligence) and have the infrastructure to run 30B+ parameter models, or you’re willing to use Alibaba Cloud’s managed service. The synthetic training approach and strong benchmark results make this the current state of the art for open-source research agents, and the extensible tool framework means you can adapt it to specialized domains without retraining.

Skip if: You need simple question-answering that GPT-4 with web search handles fine, want a zero-configuration solution without API juggling, have limited compute (anything less than 40GB VRAM will struggle), or require consistent sub-second response times.

For most teams, starting with a simpler agent framework and graduating to DeepResearch only when you hit capability limits makes more practical sense than leading with a 30B parameter model.
