STORM: How Stanford Built a Two-Stage Pipeline for Wikipedia-Quality Research Reports
Hook
Over 70,000 users have tried STORM, Stanford’s research system that writes Wikipedia-style articles from scratch. But here’s the catch: even its creators admit it can’t produce publication-ready content without heavy editing.
Context
Writing comprehensive research articles is tedious. You start with a vague topic, spiral through dozens of search results, copy-paste snippets into a document, lose track of sources, realize you missed entire subtopics, and eventually produce something that reads like a patchwork quilt of half-understood ideas. Large language models promised to automate this, but directly prompting GPT-4 to “write an article about X” produces shallow, often hallucinated content that lacks the depth of proper research.
STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking) takes a different approach. Instead of treating article generation as a single prompt, it models the actual research process: discovering different perspectives, asking questions, gathering sources, organizing findings, and synthesizing everything into a coherent narrative. Built by Stanford’s OVAL lab and accepted at NAACL 2024, STORM breaks the problem into two distinct stages—pre-writing and writing—mirroring how human researchers actually work. Its successor, Co-STORM, extends this with collaborative features that let humans steer the research direction in real time.
Technical Insight
STORM’s architecture is built on DSPy, Stanford’s framework for programming language models as composable modules. The system’s core insight is that good research comes from asking good questions, and good questions come from understanding multiple perspectives on a topic.
The pre-writing stage implements what the authors call “perspective-guided question asking.” Instead of directly prompting an LLM to generate questions about a topic, STORM first surveys similar existing articles to discover different viewpoints. If you’re researching “quantum computing,” it might identify perspectives like “hardware engineer,” “algorithm researcher,” or “industry analyst.” Each perspective then guides a simulated conversation between a Wikipedia writer agent and a topic expert agent. The expert answers questions grounded in actual internet sources retrieved through modules like YouRM, BingSearch, or VectorRM (for custom document collections). This conversation pattern enables iterative understanding—the writer can ask follow-ups based on previous answers, mimicking how real research deepens over time.
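Conceptually, the pre-writing loop for a single perspective looks something like the following sketch. The functions `ask_question`, `search`, and `answer_grounded` are stand-ins for LLM and retrieval calls, not real knowledge-storm APIs—this only illustrates the writer/expert exchange and how follow-ups depend on prior turns:

```python
def ask_question(topic, perspective, history):
    # Stand-in for an LLM call: next question from this perspective,
    # informed by everything asked and answered so far
    return f"[{perspective}] question {len(history) + 1} about {topic}"

def search(question):
    # Stand-in for a retrieval call (YouRM / BingSearch / VectorRM)
    return [f"source for: {question}"]

def answer_grounded(question, sources):
    # Stand-in for an LLM answer grounded in the retrieved sources
    return f"answer citing {len(sources)} source(s)"

def simulate_conversation(topic, perspective, max_turns=3):
    """One perspective's writer/expert exchange; each follow-up sees prior turns."""
    history = []
    for _ in range(max_turns):
        question = ask_question(topic, perspective, history)  # writer agent
        sources = search(question)                            # retrieval step
        answer = answer_grounded(question, sources)           # expert agent
        history.append({"question": question, "answer": answer, "sources": sources})
    return history

dialogue = simulate_conversation("quantum computing", "hardware engineer")
```

The key property is that `history` flows back into each new question: the conversation is inherently sequential, which matters for latency later on.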
Here’s how you’d instantiate the basic STORM pipeline using the knowledge-storm package:
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.lm import LitellmModel
from knowledge_storm.rm import YouRM
import os

# Configure language models for different pipeline stages
lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
    'api_key': os.getenv("OPENAI_API_KEY"),
    'temperature': 1.0,
    'top_p': 0.9,
}

# STORM uses different models for different components to balance cost and quality
gpt_35 = LitellmModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)  # cheaper model for conversations
gpt_4 = LitellmModel(model='gpt-4', max_tokens=3000, **openai_kwargs)  # premium model for generation

# Assign each pipeline component its model through the LMConfigs setters
lm_configs.set_conv_simulator_lm(gpt_35)
lm_configs.set_question_asker_lm(gpt_35)
lm_configs.set_outline_gen_lm(gpt_4)
lm_configs.set_article_gen_lm(gpt_4)
lm_configs.set_article_polish_lm(gpt_4)

# Configure retrieval module
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"))

# Set up runner arguments
args = STORMWikiRunnerArguments(
    output_dir='./outputs',
    max_conv_turn=3,   # conversation depth per perspective
    max_perspective=3  # number of perspectives to explore
)

# Initialize runner and execute the full pipeline
runner = STORMWikiRunner(args, lm_configs, rm)
runner.run(
    topic='quantum computing',
    do_research=True,
    do_generate_outline=True,
    do_generate_article=True,
    do_polish_article=True,
)
runner.post_run()
Notice the multi-model architecture: STORM uses cheaper models (GPT-3.5 here) for high-volume tasks like simulating conversations and generating search queries, but reserves expensive models (GPT-4) for outline generation and article writing, where quality matters most. This cost-performance tradeoff is crucial when a single article might trigger hundreds of LLM calls.
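A rough back-of-envelope count shows why the routing matters. The numbers below are illustrative assumptions, not measurements, and real runs climb much higher once per-question search-query generation, polishing, and retries are included:

```python
# Illustrative call-count estimate for one article run (assumed numbers)
perspectives = 3        # max_perspective from the runner arguments
turns = 3               # max_conv_turn from the runner arguments
calls_per_turn = 2      # one question + one grounded answer

# Conversation simulation dominates call volume, so it goes to the cheap model
cheap_calls = perspectives * turns * calls_per_turn

# Outline generation, per-section writing, and polishing go to the premium
# model; assume a 5-section article for illustration
premium_calls = 1 + 5 + 1

total_calls = cheap_calls + premium_calls
print(cheap_calls, premium_calls, total_calls)  # → 18 7 25
```

Even in this minimal accounting, the cheap model absorbs roughly 70% of the calls, which is exactly the volume you want off the premium model's meter.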
The writing stage takes the collected references and generated outline, then synthesizes them into a full article with citations. Each section gets written separately, with the LLM instructed to ground claims in the specific references gathered during pre-writing. The modular DSPy architecture means you can swap components—replace YouRM with BingSearch, use Claude instead of GPT-4, or implement custom retrieval over proprietary documents using VectorRM.
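The per-section grounding pattern can be sketched as follows. This is a hypothetical illustration of the numbered-citation prompt shape the article stage relies on; STORM's actual prompts live inside the knowledge-storm package and differ in detail:

```python
def build_section_prompt(topic, section_title, references):
    """Hypothetical sketch: bundle only this section's references into the
    prompt, numbered so the model can cite them as [n]."""
    numbered = "\n".join(f"[{i + 1}] {ref}" for i, ref in enumerate(references))
    return (
        f"Write the '{section_title}' section of an article on {topic}.\n"
        f"Ground every claim in the sources below and cite them as [n].\n"
        f"Sources:\n{numbered}"
    )

prompt = build_section_prompt(
    "quantum computing",
    "Hardware approaches",
    ["Superconducting qubits overview", "Trapped-ion systems survey"],
)
```

Because each section sees only its own reference subset, citation numbers stay stable within a section and the model can't "borrow" sources gathered for an unrelated part of the outline.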
Co-STORM extends this with a collaborative discourse protocol. Instead of fully automated research, it orchestrates a conversation between LLM expert agents, a moderator agent that asks thought-provoking questions, and a human user who can observe or inject utterances to steer the discussion. The system maintains a dynamic mind map that organizes discovered information into a hierarchical concept structure, reducing cognitive load as conversations deepen. This mind map becomes a shared conceptual space between human and AI—you can see what the system knows, identify gaps, and guide exploration toward areas you care about.
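The mind map's essential behavior is easy to picture with a toy data structure. Co-STORM's real implementation is internal to the package; this nested-dict sketch just shows the idea of filing each discovered snippet under a concept path, so the hierarchy grows as the conversation does:

```python
def insert_into_mind_map(mind_map, concept_path, snippet):
    """Toy mind map: walk (or create) the concept hierarchy along
    concept_path, then attach the snippet at the leaf."""
    node = mind_map
    for concept in concept_path:
        node = node.setdefault("children", {}).setdefault(concept, {})
    node.setdefault("snippets", []).append(snippet)
    return mind_map

mind_map = {}
insert_into_mind_map(mind_map, ["Hardware", "Superconducting"], "Requires millikelvin temperatures")
insert_into_mind_map(mind_map, ["Hardware", "Trapped ions"], "High gate fidelity")
insert_into_mind_map(mind_map, ["Algorithms"], "Shor's algorithm factors integers")
```

A human scanning this tree can spot at a glance that, say, "Algorithms" is thin relative to "Hardware" and steer the next questions accordingly—which is exactly the gap-finding role the mind map plays in Co-STORM.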
Gotcha
STORM’s own creators are refreshingly honest: the system cannot produce publication-ready articles without significant editing. Wikipedia editors who tested it found value in the pre-writing stage, but the final output still requires substantial human refinement. This isn’t a tool you run once and publish; it’s a research assistant that handles the grunt work of source gathering and initial structuring.
The cost and latency implications are real. A comprehensive research session involves multiple API calls across different LLM components (conversation simulation, question generation, outline creation, article writing) plus external search API calls. Even with the cheaper-model-for-conversations optimization, researching a complex topic with multiple perspectives can rack up significant costs. Latency compounds too—you’re waiting for sequential stages to complete, and the pre-writing conversation simulations can’t be trivially parallelized because follow-up questions depend on previous answers.
Quality depends heavily on your retrieval module and LLM choices. If you’re researching niche technical topics, general internet search might not surface the depth you need, and even GPT-4 will struggle to synthesize information it doesn’t have. The VectorRM module helps by letting you ground research in custom documents, but you’re still limited by what sources you can provide upfront.
Verdict
Use STORM if you’re bootstrapping research on unfamiliar topics where you need structured outlines with citations quickly, especially for exploratory work or pre-writing stages. The Co-STORM variant shines when human expertise should guide research direction—think investigative journalism, academic literature reviews, or strategic research where domain knowledge matters. The system excels at Wikipedia-style articles: broad topics with multiple perspectives where internet sources provide adequate coverage. Skip STORM if you need publication-ready content without editing (you won’t get it), have strict cost or latency budgets (the multi-stage pipeline is expensive), or research highly specialized domains where general search engines fall short. For simple factual queries or narrow technical documentation, the two-stage overhead is overkill—you’d be better served by a simpler RAG system. But for deep exploratory research where you’re genuinely learning a new domain, STORM’s perspective-guided approach surfaces angles you wouldn’t have thought to investigate manually.