Feynman: Multi-Agent Research Orchestration with Source-Grounded Citations
Hook
Most AI research assistants give you an answer. Feynman gives you an answer, three dissenting opinions, a peer review, and URLs to every paper it referenced—all from a single command.
Context
Academic research has a reproducibility crisis, and AI-assisted research risks making it worse. LLMs confidently cite papers that don’t exist, hallucinate methodology details, and blend factual claims with plausible-sounding fiction. For researchers trying to survey a field, audit a paper’s claims against its codebase, or replicate experiments, existing tools offer either conversational flexibility without citations (ChatGPT) or structured search without synthesis (Semantic Scholar). You’re left manually cross-referencing claims, hunting down repositories, and verifying whether that clever technique the AI mentioned actually appears in the cited paper.
Feynman takes a different approach: source-grounding as infrastructure, not afterthought. It’s a CLI research agent that refuses to make claims it can’t link to a paper, repository, or documentation URL. Instead of a single LLM call, it orchestrates multiple specialized agents—Researcher, Reviewer, Writer, Verifier—that execute workflows defined as Markdown instruction files called ‘skills.’ The result is research tooling that feels less like talking to an assistant and more like managing a team of junior researchers who compulsively cite their sources.
Technical Insight
Feynman’s architecture centers on the Pi agent runtime, which treats capabilities as composable skills rather than hardcoded functions. When you run feynman deepresearch "mechanistic interpretability", you’re not invoking a monolithic research function—you’re triggering a workflow that dispatches multiple agent instances with different roles and tool access. The Researcher agents run in parallel, each exploring different angles: one queries AlphaXiv for academic papers, another scrapes project documentation, a third examines GitHub repositories for implementation details. The Verifier agent then cross-checks claims, ensuring every assertion links back to a primary source.
This multi-agent pattern addresses the central trade-off of single-agent research: depth versus breadth. A single LLM context window forces you to choose between exploring many sources shallowly or a few sources deeply. Feynman’s parallelized researchers can fan out across dozens of papers simultaneously, then reconverge with synthesized findings. The system inherits this orchestration model from Pi’s skill architecture, where each skill is a Markdown instruction file synced to ~/.feynman/agent/skills/ on startup.
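A minimal sketch of that fan-out-and-reconverge pattern, with stubbed researchers standing in for Feynman’s actual agents (the role names and the Finding shape are assumptions for illustration, not Feynman’s real API):

```typescript
// Sketch of parallel researcher fan-out with a synthesis step.
// Roles and the Finding shape are illustrative, not Feynman's real API.

interface Finding {
  role: string;
  claim: string;
  source: string; // every claim must carry a source URL
}

// Stub researchers; real agents would call AlphaXiv, docs sites, GitHub, etc.
async function paperResearcher(topic: string): Promise<Finding> {
  return { role: "papers", claim: `survey of ${topic}`, source: "https://arxiv.org/abs/..." };
}
async function docsResearcher(topic: string): Promise<Finding> {
  return { role: "docs", claim: `documentation for ${topic}`, source: "https://example.com/docs" };
}
async function repoResearcher(topic: string): Promise<Finding> {
  return { role: "code", claim: `implementations of ${topic}`, source: "https://github.com/..." };
}

// Fan out in parallel, then reconverge: drop anything without a source.
async function deepResearch(topic: string): Promise<Finding[]> {
  const findings = await Promise.all([
    paperResearcher(topic),
    docsResearcher(topic),
    repoResearcher(topic),
  ]);
  return findings.filter((f) => f.source.length > 0);
}

const results = await deepResearch("mechanistic interpretability");
console.log(results.map((f) => f.role)); // roles in dispatch order: papers, docs, code
```

The Promise.all fan-out is the point: each researcher explores its angle concurrently, and the reconvergence step can enforce the source-grounding invariant by dropping any finding that arrives without a URL.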
Here’s what a simplified workflow looks like when you audit a paper against its codebase:
$ feynman audit 2401.12345
# Feynman fetches paper from ArXiv ID
# Researcher agent extracts claimed contributions
# AlphaXiv integration locates associated GitHub repo
# Docker spins up isolated container to run code
# Verifier cross-references paper claims vs. actual implementation
# Output: mismatch report with line-by-line citations
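The mismatch report at the end of that flow could be modeled with a structure like the following sketch. The field names, statuses, and example repository URL are assumptions, not Feynman’s actual output schema:

```typescript
// Hypothetical shape for the audit's mismatch report; Feynman's actual
// output schema is not shown in the README.
interface Mismatch {
  claim: string;     // contribution stated in the paper
  paperUrl: string;  // citation back to the ArXiv source
  repoUrl: string;   // repository (pinned to a commit) that was checked
  status: "confirmed" | "missing" | "diverges";
  note: string;
}

function summarize(report: Mismatch[]): string {
  const mismatches = report.filter((m) => m.status !== "confirmed").length;
  return `${report.length} claim(s) audited, ${mismatches} mismatch(es)`;
}

const report: Mismatch[] = [
  {
    claim: "ablation study reported in Sec. 4",
    paperUrl: "https://arxiv.org/abs/2401.12345",
    repoUrl: "https://github.com/example/repo/tree/abc1234", // hypothetical repo
    status: "missing",
    note: "no ablation script found in the repository",
  },
];
console.log(summarize(report)); // 1 claim(s) audited, 1 mismatch(es)
```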
The AlphaXiv integration deserves special attention. AlphaXiv provides paper search, Q&A, code reading, and annotations through its alpha CLI. When Feynman queries it, the system appears to get back not just paper metadata but contextual information about methodology, datasets, and reproducibility signals, which lets the Researcher agent extract structured information instead of relying on hallucinated guesses.
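In a sketch, “structured rather than hallucinated” means validating what actually came back: only fields present and well-typed in the response survive. The record shape below is an assumption; the alpha CLI’s real output format isn’t documented in the README.

```typescript
// Hypothetical shape of a paper record returned by an AlphaXiv query.
// These fields are assumptions for illustration; the alpha CLI's actual
// output format is not documented in Feynman's README.
interface PaperRecord {
  id: string;
  title: string;
  repoUrl?: string; // a reproducibility signal: is there code at all?
}

// Accept only fields that are actually present in the response;
// anything else is rejected rather than guessed at.
function extractStructured(raw: unknown): PaperRecord | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.id !== "string" || typeof r.title !== "string") return null;
  return {
    id: r.id,
    title: r.title,
    repoUrl: typeof r.repoUrl === "string" ? r.repoUrl : undefined,
  };
}

const rec = extractStructured(
  JSON.parse('{"id":"2401.12345","title":"Example Paper"}'),
);
```

The design choice mirrors the tool’s philosophy: a missing repoUrl stays missing, rather than being filled in by a model’s best guess.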
For experiment replication, Feynman supports hybrid execution. Local experiments run in Docker containers—isolated, reproducible, but limited by your hardware. For GPU-intensive workloads, the /replicate command can burst to Modal (serverless GPU) or RunPod (persistent pods with SSH). This matters because reproducibility research often hits a wall at compute: you can read the paper, find the code, but can’t afford to rent a V100 cluster for three days. Feynman’s Modal integration means you write the experiment locally, then dispatch it to ephemeral cloud GPUs with a flag.
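The routing decision might look something like this sketch. The heuristic, thresholds, and names are assumptions; the README doesn’t document how /replicate actually chooses a backend:

```typescript
// Hypothetical routing for /replicate: where should an experiment run?
type Backend = "docker" | "modal" | "runpod";

interface ExperimentSpec {
  gpus: number;      // GPUs the experiment needs
  estHours: number;  // rough wall-clock estimate
  needsSsh: boolean; // interactive debugging on the pod?
}

function chooseBackend(spec: ExperimentSpec): Backend {
  if (spec.gpus === 0) return "docker";                      // CPU-only: run locally, isolated
  if (spec.needsSsh || spec.estHours > 12) return "runpod";  // persistent pod with SSH
  return "modal";                                            // short GPU bursts: serverless
}

console.log(chooseBackend({ gpus: 0, estHours: 1, needsSsh: false })); // docker
```

The useful property is that the experiment spec stays declarative: the same spec can run locally during development and burst to a cloud GPU backend unchanged.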
The verification layer is where source-grounding becomes structural. Every claim in Feynman’s output carries an inline citation: not an LLM-generated reference, but a direct URL checked by the Verifier agent. If a paper link 404s, the Verifier flags it. If a claim references a GitHub repo, the output includes the commit hash. Because this check runs before the output is finalized, broken references surface as explicit flags rather than slipping through as plausible-looking fabrications.
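A minimal sketch of such a check, with the HTTP call injected so the logic stays testable offline. This is an assumed interface, not the Verifier’s actual implementation; in practice the injected head function would issue an HTTP HEAD request.

```typescript
// Sketch of a citation check: every URL must resolve before output ships.
// The interface is an assumption about how Feynman's Verifier might work.
type Status = "ok" | "broken";

interface Citation {
  claim: string;
  url: string;
}

// head is injected so the check can run without network access;
// in production it would be an HTTP HEAD request returning a status code.
async function verifyCitations(
  citations: Citation[],
  head: (url: string) => Promise<number>,
): Promise<Array<Citation & { status: Status }>> {
  return Promise.all(
    citations.map(async (c) => {
      const status: Status = (await head(c.url)) === 404 ? "broken" : "ok";
      return { ...c, status };
    }),
  );
}

// Stub fetcher: pretend one link 404s.
const stub = async (url: string) => (url.endsWith("/gone") ? 404 : 200);
const checked = await verifyCitations(
  [
    { claim: "technique X", url: "https://example.com/paper" },
    { claim: "technique Y", url: "https://example.com/gone" },
  ],
  stub,
);
console.log(checked.map((c) => c.status)); // the dead link comes back flagged
```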
Extensibility comes through the skills system. The one-line installer bundles a full terminal app, but you can also install just the skills library into ~/.codex/skills/feynman and use them from any Pi-compatible agent runtime. This separation means you can treat Feynman’s research workflows as building blocks—import the /lit skill into your own agent, customize the Reviewer prompts, or chain research outputs into downstream automation. Skills are versioned Markdown files, so you can pin specific research behaviors or fork capabilities without touching TypeScript.
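As a rough sketch, a skill file for the /lit workflow might read like this. The layout below is invented for illustration: the README describes skills as Markdown instruction files but doesn’t show their actual format.

```markdown
# /lit — literature review skill (hypothetical layout)

## Role
You are a Researcher agent. Ground every claim in a primary source.

## Steps
1. Query AlphaXiv for papers matching the user's topic.
2. For each paper, record title, URL, and any linked repository.
3. Hand findings to the Verifier before writing the summary.

## Output
A bulleted summary where every bullet ends with a source URL.
```

Because the skill is plain Markdown, forking a behavior is a text edit: copy the file, tweak the steps, and point your Pi-compatible runtime at the new version.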
Gotcha
The source-grounding guarantee comes with infrastructure dependencies. Feynman requires AlphaXiv for paper search, though the README doesn’t document error handling for API failures, rate limits, or what happens when a paper exists but has no associated code repository. You’re depending on a third-party service for core functionality.
Cloud GPU replication introduces cost and configuration complexity. Modal and RunPod aren’t free, and the README provides no guidance on cost implications—how much does replicating a typical experiment cost? What’s the billing model? How do you set budget limits? For the /replicate workflow to work, you need Modal or RunPod accounts configured, which means API keys, billing setup, and understanding two different cloud platforms. The local Docker fallback works for simple scripts, but real ML experiments need GPUs, and suddenly you’re managing infrastructure.
The multi-agent orchestration lacks transparency about decision-making. When Researcher agents disagree—one paper says technique X improves accuracy, another shows no effect—the README doesn’t explain how Feynman synthesizes consensus, whether there are voting mechanisms, or how confidence is scored. You get cited outputs, but the reasoning chain that produced them is hidden. For a tool targeting reproducibility-conscious researchers, this lack of visibility into the agent decision process is a significant limitation.
Verdict
Use Feynman if you’re doing ML/AI literature reviews, need to audit papers against codebases, or want reproducible research artifacts with citation trails baked in. The multi-agent deepresearch and paper audit features are genuinely useful for researchers tired of manually cross-referencing claims. The skills-as-Markdown architecture makes it easy to extend or integrate into existing Pi-based workflows. Best suited for command-line-comfortable researchers who value traceable outputs over conversational UX and who already work in the academic ML space where AlphaXiv has coverage.

Skip it if you need a GUI, require transparent agent reasoning chains, or want to avoid cloud GPU billing complexity. Also skip if your research domain is outside ML/AI—AlphaXiv’s paper coverage appears domain-specific, and Feynman’s value proposition depends heavily on it. For web-scale research across arbitrary domains, you’re better off with Perplexity Pro or building custom LangChain workflows against Semantic Scholar’s API.