Feynman: Building Multi-Agent Research Pipelines with Source-Grounded AI

Hook

Many AI research tools hallucinate citations, generating plausible-looking references to papers that don't exist. Feynman takes the opposite approach: it refuses to make claims without linking directly to source URLs, treating research as an evidence-gathering problem rather than a text generation task.

Context

The promise of AI-assisted research has been hobbled by a fundamental trust problem. LLMs confidently cite non-existent papers, misattribute findings, and blend fact with fabrication in ways that require manual verification of every claim. For developers conducting literature reviews, investigating competing approaches, or building on academic work, this means AI tools become research assistants that create more work than they save.

Feynman emerged from this frustration as a CLI-based research agent that prioritizes verifiable evidence over fluent prose. Built on the Pi agent runtime framework, it orchestrates multiple specialized agents (Researcher, Reviewer, Writer, and Verifier), each configured through 'skills' (Markdown instruction files) to perform a distinct role in a research pipeline. The tool integrates deeply with AlphaXiv, a platform for searching and analyzing academic papers, and supports various LLM providers including local models. Rather than treating research as a single-shot question-answering task, Feynman structures it as a multi-stage investigation where claims must survive peer review simulation and citation verification before reaching the final output.

Technical Insight

The core architectural innovation in Feynman is its skill-based agent orchestration system. Instead of monolithic prompts, agents consume Pi skills—structured Markdown files that define behaviors, constraints, and tool access. This creates a separation of concerns where the agent runtime handles execution while skills encode domain expertise.
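
To make the idea of a skill as a plain Markdown file concrete, here is a minimal sketch of how a runtime might load one. It assumes a skill file starts with a small front-matter header followed by free-form instructions; the field names (name, description), the Skill type, and parseSkill are illustrative assumptions, not Pi's or Feynman's documented format.

```typescript
// Hypothetical shape of a loaded skill; not taken from Pi's or Feynman's source.
interface Skill {
  name: string;
  description: string;
  instructions: string; // the Markdown body the agent receives as guidance
}

// Parse a Markdown skill file with an assumed "---" delimited header of key: value lines.
function parseSkill(markdown: string): Skill {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(markdown);
  if (!match) {
    throw new Error("skill file is missing a front-matter header");
  }
  const header = match[1] ?? "";
  const body = match[2] ?? "";

  const fields = new Map<string, string>();
  for (const line of header.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // ignore lines that are not key: value pairs
    fields.set(line.slice(0, idx).trim(), line.slice(idx + 1).trim());
  }

  const name = fields.get("name");
  if (!name) {
    throw new Error("skill header needs a name");
  }
  return {
    name,
    description: fields.get("description") ?? "",
    instructions: body.trim(),
  };
}
```

The point of the sketch is the separation the article describes: the runtime only needs to read and hand over text, while the behavior lives in the Markdown itself.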

Here's how a basic research workflow gets orchestrated:

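// Simplified, illustrative example of agent coordination. The Agent class and its
// options are stand-ins for readability, not a confirmed Feynman or Pi API.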
const researchPipeline = async (query: string) => {
  // Phase 1: Evidence gathering
  const researcher = new Agent('researcher', {
    skills: ['alphaxiv-search', 'github-analysis', 'web-search'],
    constraints: ['cite-all-claims', 'prefer-primary-sources']
  });
  
  const evidence = await researcher.investigate(query);
  
  // Phase 2: Critical evaluation
  const reviewer = new Agent('reviewer', {
    skills: ['peer-review-simulation', 'methodology-critique'],
    context: evidence
  });
  
  const critiques = await reviewer.evaluate(evidence);
  
  // Phase 3: Synthesis with citations
  const writer = new Agent('writer', {
    skills: ['technical-writing', 'citation-formatting'],
    constraints: ['source-grounded-only', 'include-direct-urls']
  });
  
  const draft = await writer.synthesize(evidence, critiques);
  
  // Phase 4: Verification
  const verifier = new Agent('verifier', {
    skills: ['citation-validation', 'fact-checking']
  });
  
  return await verifier.validate(draft);
};

This multi-agent approach creates natural checkpoints where claims get challenged before entering the final output. The Reviewer agent acts as an adversarial filter, simulating peer review by questioning methodology, identifying gaps, and flagging overclaimed conclusions.
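
The checkpoint idea can be sketched as a simple filter: claims that draw a serious critique never reach the writer. The Claim and Critique shapes, the two severity levels, and applyReview are assumptions made for illustration; the article does not describe how Feynman actually represents review results.

```typescript
type Severity = "minor" | "major";

// Hypothetical data shapes for the hand-off between Researcher and Reviewer.
interface Claim {
  id: string;
  text: string;
  sourceUrl: string;
}

interface Critique {
  claimId: string;
  severity: Severity;
  note: string;
}

interface ReviewOutcome {
  accepted: Claim[];
  rejected: Claim[];
}

// Drop every claim that received at least one major critique.
function applyReview(claims: Claim[], critiques: Critique[]): ReviewOutcome {
  const blocked = new Set(
    critiques.filter((c) => c.severity === "major").map((c) => c.claimId),
  );
  return {
    accepted: claims.filter((c) => !blocked.has(c.id)),
    rejected: claims.filter((c) => blocked.has(c.id)),
  };
}
```

A real reviewer would likely return richer feedback than a two-level severity, but even this coarse gate shows why a separate review stage catches overclaimed conclusions before they are written up.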

The AlphaXiv integration deserves special attention because it transforms how the agents interact with academic literature. Rather than treating papers as opaque PDFs, AlphaXiv makes them queryable artifacts. Agents can ask questions about specific papers ("What evaluation metrics did this computer vision paper use?"), search across semantic relationships ("Find papers that critique transformer attention mechanisms"), and even analyze associated code repositories. This creates a research graph rather than a linear reading list.
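
A research graph of this kind can be pictured as typed nodes and edges. The sketch below is a generic in-memory structure, not AlphaXiv's API or data model; the node kinds, relation names, and the ResearchGraph class are assumptions chosen to mirror the examples in the paragraph above.

```typescript
type NodeKind = "paper" | "repo";
type Relation = "cites" | "critiques" | "implements";

interface GraphNode {
  id: string;
  kind: NodeKind;
  title: string;
}

interface GraphEdge {
  from: string;
  to: string;
  relation: Relation;
}

class ResearchGraph {
  private nodes = new Map<string, GraphNode>();
  private edges: GraphEdge[] = [];

  addNode(node: GraphNode): void {
    this.nodes.set(node.id, node);
  }

  // Edges may only connect nodes that are already in the graph.
  addEdge(edge: GraphEdge): void {
    if (!this.nodes.has(edge.from) || !this.nodes.has(edge.to)) {
      throw new Error(`unknown node in edge ${edge.from} -> ${edge.to}`);
    }
    this.edges.push(edge);
  }

  // All nodes that point at targetId through the given relation,
  // e.g. every paper that critiques a given paper.
  incoming(targetId: string, relation: Relation): GraphNode[] {
    const result: GraphNode[] = [];
    for (const edge of this.edges) {
      if (edge.to !== targetId || edge.relation !== relation) continue;
      const node = this.nodes.get(edge.from);
      if (node) result.push(node);
    }
    return result;
  }
}
```

With this shape, a question like "which papers critique this one?" becomes an edge lookup rather than a rereading of a flat list.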

For compute-intensive work—like running experiments from papers or analyzing large codebases—Feynman delegates to containerized environments. It supports Docker for local execution and integrates with cloud GPU providers (Modal for serverless functions, RunPod for persistent instances). This architectural decision keeps the CLI lightweight while enabling resource-intensive operations:

# Launch a research query that may need to run experiments
feynman research "Compare transformer implementations in PyTorch vs JAX" \
  --verify-code \
  --compute-provider modal

# The agent can now:
# 1. Find relevant papers and repos
# 2. Spin up GPU containers
# 3. Clone code, run benchmarks
# 4. Report verifiable performance comparisons
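
One way to read the division of labor described above is as a routing decision per job. The sketch below encodes that reading as a tiny function; the Job shape and the policy itself are assumptions for illustration and say nothing about how Feynman actually chooses a provider.

```typescript
type Provider = "docker" | "modal" | "runpod";

// Hypothetical description of a unit of compute work.
interface Job {
  needsGpu: boolean;
  longRunning: boolean;
}

// Assumed policy: local Docker for CPU work, Modal for short GPU bursts,
// RunPod when a GPU instance has to stay up.
function pickProvider(job: Job): Provider {
  if (!job.needsGpu) return "docker";
  return job.longRunning ? "runpod" : "modal";
}
```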

The autonomous research mode (/autoresearch) takes this further by creating self-directed investigation loops. Instead of answering a single query, the agent formulates sub-questions, explores promising tangents, and builds a research knowledge graph over multiple iterations. The /watch command enables recurring monitoring, which is useful for tracking emerging research areas or keeping tabs on specific topics.
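
A self-directed loop of this kind can be sketched as a bounded breadth-first exploration of sub-questions. Everything here is an assumption for illustration: the expand callback stands in for whatever model call proposes follow-up questions, and maxSteps is a hypothetical budget so the loop always terminates.

```typescript
interface Visited {
  question: string;
  depth: number; // how many hops away from the original query
}

// Explore sub-questions breadth-first, skipping repeats, until the budget runs out.
function autoResearch(
  root: string,
  expand: (question: string) => string[],
  maxSteps: number,
): Visited[] {
  const queue: Visited[] = [{ question: root, depth: 0 }];
  const seen = new Set<string>([root]);
  const visited: Visited[] = [];

  while (queue.length > 0 && visited.length < maxSteps) {
    const current = queue.shift();
    if (!current) break;
    visited.push(current);

    for (const sub of expand(current.question)) {
      if (seen.has(sub)) continue; // avoid circling back to a question already queued
      seen.add(sub);
      queue.push({ question: sub, depth: current.depth + 1 });
    }
  }
  return visited;
}
```

The deduplication set and the step budget are the two pieces that keep an open-ended investigation from looping forever or running up unbounded cost.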

One particularly clever design choice: Feynman separates the bundled CLI app (which includes a Node.js runtime for standalone execution) from the skills library. This means you can install just the skills and integrate them with other agent frameworks like Codex or OpenCode, treating Feynman's research methodology as a reusable component rather than a monolithic tool. The skills themselves are versioned and can be customized, creating a plugin-like ecosystem without formal API boundaries.

The source-grounding constraint manifests in the output format. Every claim links to a specific URL—a paper on arXiv, a GitHub repo, a documentation page. The Writer agent is instructed to prefer direct quotes with citations over paraphrasing, and the Verifier agent checks that links are valid and actually support the claims made. This creates research artifacts that function as audit trails, where readers can verify every assertion by following the citation chain.
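
The Verifier's job can be approximated with a very literal check: does the quoted text actually appear at the cited URL? The sketch below assumes page text has already been fetched into a map keyed by URL; the CitedClaim shape, the Verdict values, and verifyClaim are illustrative assumptions, and a literal match is a much weaker test than judging whether a source truly supports a claim.

```typescript
// Hypothetical shape of a claim as the Writer might emit it.
interface CitedClaim {
  text: string;
  quote: string; // the direct quote offered as evidence
  url: string;
}

type Verdict = "supported" | "unreachable" | "quote-not-found";

// Collapse whitespace and case so line breaks in the source do not cause false misses.
function normalize(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// fetchedPages maps a URL to the page text retrieved earlier; a missing entry
// means the link could not be resolved.
function verifyClaim(claim: CitedClaim, fetchedPages: Map<string, string>): Verdict {
  const page = fetchedPages.get(claim.url);
  if (page === undefined) return "unreachable";
  return normalize(page).includes(normalize(claim.quote))
    ? "supported"
    : "quote-not-found";
}
```

Preferring direct quotes over paraphrase is what makes a check like this possible at all: a quote can be matched mechanically, while a paraphrase needs another round of judgment.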

Gotcha

The tight integration with external services creates fragility that becomes apparent under real-world usage. AlphaXiv is essential for academic paper analysis, but if the service is unavailable or rate-limited, large portions of Feynman's functionality break. Similarly, the cloud GPU integrations with Modal and RunPod add operational complexity—you need accounts, API keys, and budget allocation for compute-intensive research. There's no graceful degradation; research queries that trigger experiment execution will simply fail without proper GPU access configured.

The Docker requirement for safe code execution is sensible from a security perspective but creates friction in restricted environments. Many corporate networks block Docker, and cloud development environments (like GitHub Codespaces or Replit) often have limitations on container execution. The installation script downloads pre-bundled binaries, which raises legitimate security concerns in enterprise contexts where executable provenance matters. You're trusting the build pipeline of a relatively young project (its 6,798 GitHub stars at the time of writing suggest growing but not yet mature adoption).

The TypeScript codebase lacks comprehensive plugin documentation, a sign of a project still finding its API boundaries. While the skills-based approach is elegant, there's no formal extension system: customization means forking skills or writing new ones from examples. Community contribution guidelines are minimal, and the agent orchestration logic isn't well-documented for developers who want to extend or modify behavior. If you need to integrate Feynman into existing workflows or build custom research pipelines, expect to read source code and experiment with undocumented patterns.

Verdict

Use if: You're conducting systematic research requiring evidence synthesis across academic papers, codebases, and technical documentation, especially literature reviews, claim verification, or experiment replication where citation accuracy matters more than writing polish. The multi-agent peer review simulation is valuable when you need adversarial checking of research conclusions. Also consider it if you're building agent-based tools and want to integrate research capabilities via the skills library.

Skip if: You need lightweight Q&A without external service dependencies, work in environments that restrict Docker or cloud GPU access, or require a mature ecosystem with stable APIs and extensive community plugins. Also skip if you're uncomfortable with installation scripts that download binaries, or if your research primarily involves non-academic sources where AlphaXiv integration provides limited value.

For simple web research with citations, Perplexity Pro offers better UX; for custom academic workflows, building on Semantic Scholar's API with LangChain gives you more control.