STORM: Building Wikipedia-Style Reports Through Simulated Expert Conversations

Hook

What if the best way to make an LLM research a topic wasn't to ask it directly, but to make it argue with itself? STORM, already tried by more than 70,000 users, bets that simulated expert conversations produce deeper research than a single prompt can.

Context

Ask GPT-4 to write a comprehensive article about quantum computing, and you'll often get something that sounds authoritative but lacks depth, misses important perspectives, and hallucinates citations. The problem isn't the model's capability. A single prompt gives the model one pass at the topic, with no opportunity to discover what it overlooked. Human researchers don't work this way. They start broad, discover perspectives they hadn't considered, ask follow-up questions, chase down references, and iteratively refine their understanding.

STORM, developed by Stanford's OVAL lab and published at NAACL 2024, tackles this by breaking knowledge curation into stages that mirror human research workflows. Instead of generating articles directly, it first conducts research through simulated conversations between a Wikipedia writer persona and topic experts, grounded in web search results. This pre-writing stage discovers diverse perspectives, generates outlines, and collects properly cited references before any article writing begins. The result is a system that over 70,000 users have tried for Wikipedia-style article creation, which suggests that architecture matters as much as model size when building LLM research systems.

Technical Insight

STORM's architecture is built on two foundational insights: perspective diversity and conversational iteration. The pre-writing stage begins with perspective-guided question asking, where the system first identifies similar topics to discover viewpoints the researcher might not have considered. For a topic like "impact of social media on democracy," STORM doesn't just ask the obvious questions. It surveys related articles to surface perspectives from psychology, political science, technology ethics, and media studies.
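To make the perspective-discovery step concrete, here is a minimal sketch of the idea, not STORM's actual implementation. It assumes the section headings of related articles are already available, and `build_perspective_prompt`, `discover_perspectives`, and `fake_llm` are hypothetical names invented for this illustration. The stub LLM returns a canned reply so the example runs offline.

```python
from typing import Callable, Dict, List


def build_perspective_prompt(topic: str, related_outlines: Dict[str, List[str]]) -> str:
    """Fold the section headings of related articles into a single prompt."""
    lines = [f"Topic: {topic}", "Outlines of related articles:"]
    for title, headings in related_outlines.items():
        lines.append(f"- {title}: {', '.join(headings)}")
    lines.append("List distinct expert perspectives, one per line.")
    return "\n".join(lines)


def discover_perspectives(
    topic: str,
    related_outlines: Dict[str, List[str]],
    llm: Callable[[str], str],
    max_perspective: int = 3,
) -> List[str]:
    """Ask the LLM for perspectives, drop duplicates, and cap the count."""
    raw = llm(build_perspective_prompt(topic, related_outlines))
    seen, perspectives = set(), []
    for line in raw.splitlines():
        name = line.strip(" -*\t")  # remove bullet markers and padding
        if name and name.lower() not in seen:
            seen.add(name.lower())
            perspectives.append(name)
    return perspectives[:max_perspective]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return (
        "- Political scientist\n- Psychologist\n"
        "- political scientist\n- Technology ethicist\n- Media scholar"
    )


related = {
    "Filter bubble": ["Concept", "Reactions", "Countermeasures"],
    "Fake news": ["Definition", "Types", "Detection"],
}
perspectives = discover_perspectives("Impact of social media on democracy", related, fake_llm)
print(perspectives)
```

The cap plays the same role as the `max_perspective` setting in the configuration example: each extra perspective adds another full conversation to simulate.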

The system then simulates conversations between a writer agent and multiple expert agents, each grounded in real-time web search results. Here's how you might customize the conversation depth using STORM's Python API:

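import os

from knowledge_storm import STORMWikiLMConfigs, STORMWikiRunner, STORMWikiRunnerArguments
from knowledge_storm.lm import LitellmModel  # older releases expose OpenAIModel here instead
from knowledge_storm.rm import YouRM

# Configure LLM settings for different components
openai_kwargs = {'api_key': os.getenv('OPENAI_API_KEY')}
lm_configs = STORMWikiLMConfigs()
lm_configs.set_conv_simulator_lm(LitellmModel(model='gpt-4', max_tokens=500, **openai_kwargs))
lm_configs.set_question_asker_lm(LitellmModel(model='gpt-4', max_tokens=300, **openai_kwargs))
lm_configs.set_outline_gen_lm(LitellmModel(model='gpt-4', max_tokens=1000, **openai_kwargs))
lm_configs.set_article_gen_lm(LitellmModel(model='gpt-4', max_tokens=3000, **openai_kwargs))

# Configure research depth
args = STORMWikiRunnerArguments(
    output_dir='./storm_output',  # Required: where research notes, outline, and article are written
    max_conv_turn=5,  # Conversation depth per perspective
    max_perspective=3,  # Number of expert perspectives
    search_top_k=10,  # References per search query
)

# The runner also needs a retrieval module; YouRM (You.com search) is one option
rm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=args.search_top_k)

runner = STORMWikiRunner(args, lm_configs, rm)
runner.run(
    topic='Impact of Social Media on Democracy',
    do_research=True,
    do_generate_outline=True,
    do_generate_article=True
)
runner.post_run()  # Persist run configuration and LLM call history
runner.summary()  # Print timing and token usage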

This modular design, built on DSPy, allows you to swap LLM providers (via litellm), customize retrieval sources, or even replace components entirely. The conversation simulation isn't just theater—each turn allows the writer agent to ask follow-up questions based on previous answers, mimicking how humans dig deeper when they encounter surprising information.
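The follow-up behavior described above can be sketched as a loop in which each new question sees the full history of earlier answers. This is an illustrative sketch under stated assumptions, not STORM's code: `Turn`, `simulate_conversation`, and the three stub callables (`scripted_ask`, `fake_search`, `fake_answer`) are hypothetical, and the stubs stand in for real LLM and search calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    question: str
    answer: str
    sources: List[str]


def simulate_conversation(
    topic: str,
    perspective: str,
    ask: Callable[[str, str, List[Turn]], str],
    search: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    max_conv_turn: int = 5,
) -> List[Turn]:
    """Alternate writer questions and grounded expert answers."""
    history: List[Turn] = []
    for _ in range(max_conv_turn):
        # The writer sees every earlier turn, so it can follow up on answers.
        question = ask(topic, perspective, history)
        if not question:  # an empty question means the writer is done
            break
        sources = search(question)  # retrieval happens on every turn
        history.append(Turn(question, answer(question, sources), sources))
    return history


# Hypothetical stand-ins for the LLM and the search backend.
def scripted_ask(topic: str, perspective: str, history: List[Turn]) -> str:
    if not history:
        return f"As a {perspective}, what is known about {topic}?"
    if len(history) == 1:
        return f"You said '{history[-1].answer}'. What evidence supports that?"
    return ""


def fake_search(question: str) -> List[str]:
    return [f"https://example.org/result-{len(question)}"]


def fake_answer(question: str, sources: List[str]) -> str:
    return f"summary of {len(sources)} source(s)"


history = simulate_conversation(
    "social media and democracy", "psychologist", scripted_ask, fake_search, fake_answer
)
for turn in history:
    print(turn.question, "->", turn.answer)
```

Because retrieval runs inside the loop, a follow-up question triggers a fresh search rather than reusing the first batch of results.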

The writing stage then synthesizes the collected outline and references into a coherent article with citations. Because references are collected during the conversation simulation rather than attached after the fact, each cited claim can be traced back to a source the system actually retrieved, which lowers the risk of fabricated citations. STORM keeps track of which search results supported which parts of the conversation and carries those links through to the final article.
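One way to picture this bookkeeping is a small index that assigns each retrieved URL a stable citation number the first time it is used. The sketch below is an assumption about how such tracking could look, not STORM's internal data structure; `CitationIndex` and the example claims and URLs are hypothetical.

```python
from typing import Dict, List


class CitationIndex:
    """Assign each source URL a stable number on first use."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def cite(self, url: str) -> int:
        if url not in self._ids:
            self._ids[url] = len(self._ids) + 1
        return self._ids[url]

    def references(self) -> List[str]:
        # Dicts preserve insertion order, so entries come out numbered 1..n.
        return [f"[{number}] {url}" for url, number in self._ids.items()]


# Hypothetical claims paired with the URLs that supported them during research.
claims = [
    ("Recommendation feeds can amplify partisan content.", ["https://example.org/a"]),
    ("Measured effects vary by platform.", ["https://example.org/b", "https://example.org/a"]),
]

index = CitationIndex()
sentences = [
    f"{text} " + "".join(f"[{index.cite(url)}]" for url in urls)
    for text, urls in claims
]
print("\n".join(sentences))
print("\n".join(index.references()))
```

A source cited twice keeps the same number, so the reference list stays short and every marker in the text resolves to exactly one retrieved URL.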

Co-STORM extends this architecture with human-in-the-loop collaboration, introducing a turn management protocol where human users can interrupt LLM experts, steer conversations, and contribute their own knowledge. The system maintains a dynamic mind map that organizes discovered information hierarchically, reducing cognitive load when exploring complex topics. This collaborative discourse model represents a shift from autonomous agents toward AI systems that augment human research workflows rather than replacing them.
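A hierarchical mind map of this kind can be modeled as a tree whose nodes hold notes. The following is a minimal sketch of that data structure, assuming information arrives as a path of concept names plus a note. `MindMapNode`, its methods, and the sample notes are hypothetical and are not taken from Co-STORM's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MindMapNode:
    name: str
    notes: List[str] = field(default_factory=list)
    children: Dict[str, "MindMapNode"] = field(default_factory=dict)

    def insert(self, path: List[str], note: str) -> None:
        """Walk (or create) the nodes along path and attach the note at the end."""
        node = self
        for part in path:
            if part not in node.children:
                node.children[part] = MindMapNode(part)
            node = node.children[part]
        node.notes.append(note)

    def outline(self, depth: int = 0) -> List[str]:
        """Render the tree as indented lines with a note count per node."""
        lines = [f"{'  ' * depth}{self.name} ({len(self.notes)})"]
        for child in self.children.values():
            lines.extend(child.outline(depth + 1))
        return lines


root = MindMapNode("Social media and democracy")
root.insert(["Polarization", "Echo chambers"], "Expert: feeds narrow exposure.")
root.insert(["Polarization"], "Expert: effect sizes are debated.")
root.insert(["Regulation"], "User: add the EU perspective.")  # a human contribution
print("\n".join(root.outline()))
```

Human and LLM contributions land in the same tree, which is what lets a user steer the discussion by looking at the outline rather than rereading the whole transcript.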

The technical cleverness lies in how STORM avoids the pitfalls of naive RAG implementations. Instead of retrieving once and generating, it interleaves retrieval with iterative questioning. Instead of a single perspective, it explicitly seeks diversity. Instead of direct article generation, it builds scaffolding through outlines and structured conversations. These architectural choices compound: better questions lead to better retrieval, which enables deeper follow-ups, which produce richer outlines, which guide more coherent article generation.

Gotcha

STORM is not a magic "research paper generator" button. The output requires substantial human editing and fact-checking—the developers explicitly position it as a pre-writing tool, not a publication-ready article creator. You're essentially getting a well-researched first draft with an outline and citations that need verification. For Wikipedia articles or comprehensive reports, this is valuable. For blog posts or marketing content where you need polish and voice, the editing overhead might exceed writing from scratch.

The economics can also hurt. Simulating multiple conversation turns across multiple expert perspectives means many LLM API calls per article, potentially dozens of GPT-4 requests for a single topic. In testing, comprehensive reports cost roughly $5-15 in API fees depending on topic complexity and configuration. If you're generating hundreds of reports, those costs add up quickly.

Retrieval quality is another ceiling: STORM can only be as good as its search backend. If your topic requires specialized databases, academic journals, or proprietary sources that aren't web-accessible, the simulated conversations will be grounded in incomplete information. The system also struggles with highly narrow or emerging topics where there aren't enough similar articles to discover diverse perspectives, because the perspective-guided approach assumes a landscape of related content to survey.
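The call volume behind those costs can be estimated with simple arithmetic before running anything. The sketch below is a back-of-the-envelope model, not a measurement: the figure of three LLM calls per conversation turn, the four fixed calls outside the conversations, the average token count, and the price are all placeholder assumptions to replace with numbers from your own runs and your provider's current pricing.

```python
def estimate_llm_calls(
    max_perspective: int,
    max_conv_turn: int,
    calls_per_turn: int = 3,  # assumption: ask question, form queries, write answer
    fixed_calls: int = 4,     # assumption: perspectives, outline, article, polish
) -> int:
    """Conversation calls scale with perspectives x turns; the rest is fixed."""
    return max_perspective * max_conv_turn * calls_per_turn + fixed_calls


def estimate_cost(
    total_calls: int, avg_tokens_per_call: int, price_per_1k_tokens: float
) -> float:
    """Total tokens divided by 1000, times the per-1k-token price."""
    return total_calls * avg_tokens_per_call / 1000 * price_per_1k_tokens


calls = estimate_llm_calls(max_perspective=3, max_conv_turn=5)
print(calls)  # 3 * 5 * 3 + 4 = 49

# Placeholder token count and price, chosen only to exercise the formula.
cost = estimate_cost(calls, avg_tokens_per_call=2000, price_per_1k_tokens=0.03)
print(f"{cost:.2f}")
```

The multiplication is the point: doubling either `max_perspective` or `max_conv_turn` roughly doubles the conversation cost, so those two settings are the first place to look when a run gets expensive.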

Verdict

Use if you're building research tools for academics, journalists, or content strategists who need comprehensive reports on unfamiliar topics with proper citations. STORM excels when research breadth matters more than writing polish, when you're willing to invest API costs for quality, and when your topics are well-covered enough on the web to support perspective discovery. It's particularly valuable if you're creating Wikipedia-style encyclopedic content or need to get up to speed on complex topics quickly with a structured outline and reference list.

Skip if you need publication-ready content without editing, are working with highly specialized topics in closed domains, need fast answers rather than deep research, or are optimizing for API cost efficiency over research thoroughness. For quick summaries or narrow technical documentation, traditional RAG or search-based systems will serve you better at a fraction of the cost.
