Feynman: Building Research Agents That Actually Show Their Work

Hook

Many AI research assistants hallucinate citations or cherry-pick results. Feynman takes the opposite approach: every claim it makes links directly to a paper, repository, or documentation URL. That guarantee comes from a skill-based architecture in which verification is handled by a dedicated, first-class agent.

Context

If you've ever tried to stay current with machine learning research, you know the problem: thousands of papers publish monthly on arXiv, each with code repositories of varying quality, datasets scattered across Hugging Face and institutional servers, and experimental setups that rarely reproduce cleanly. Manual literature reviews take weeks. Writing survey papers means tracking down dozens of citations, verifying claims against original sources, and hoping you didn't miss a critical reference.

Traditional search tools like Google Scholar or Semantic Scholar help with discovery, but they don't read papers for you, synthesize findings, or attempt to reproduce experiments. General-purpose LLMs can summarize abstracts, but they hallucinate citations and struggle with the systematic rigor that academic research demands. Feynman emerged from this gap—built by the team at Companion to create an autonomous research agent that doesn't just find papers, but audits them, replicates experiments, and grounds every output in verifiable sources.

Technical Insight

[Figure: System architecture (auto-generated). Components: CLI Interface; Pi Agent Runtime hosting the Researcher, Reviewer, Writer, and Verifier agents around a Shared Context with Provenance; External Sources (alphaXiv paper search; web search via Exa/Perplexity/Gemini); Compute Resources (Docker, Modal, RunPod). Labeled flows: Research Query, Search Papers, Search Web, PDFs + Metadata, URLs + Content, Source-grounded Data, Cross-reference Claims, Validated Info, Draft with Citations, Content to Verify, Final Output, Run Experiments, Results + Logs.]

Feynman's architecture centers on the Pi agent runtime, which orchestrates four specialized agents: Researcher, Reviewer, Writer, and Verifier. Unlike monolithic LLM applications, each agent has a narrow responsibility and communicates through a shared context that maintains provenance for every piece of information.
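To make the idea of a provenance-carrying shared context concrete, here is a minimal sketch. The names `ProvenanceEntry` and `SharedResearchContext` are hypothetical and are not taken from Feynman or Pi. The sketch only illustrates the constraint described above: nothing enters the context without a source URL and the agent that contributed it.

```typescript
// Hypothetical sketch: not Feynman's or Pi's actual API.
type AgentRole = "researcher" | "reviewer" | "writer" | "verifier";

interface ProvenanceEntry {
  content: string;    // the fact or excerpt being shared
  sourceUrl: string;  // where it came from (paper, repo, docs page)
  addedBy: AgentRole; // which agent contributed it
}

class SharedResearchContext {
  private entries: ProvenanceEntry[] = [];

  // Reject anything that does not carry a well-formed http(s) source URL.
  add(entry: ProvenanceEntry): void {
    if (!/^https?:\/\/\S+$/.test(entry.sourceUrl)) {
      throw new Error(`Missing or invalid source URL for: ${entry.content}`);
    }
    this.entries.push(entry);
  }

  // Return every entry contributed by a given agent.
  byAgent(role: AgentRole): ProvenanceEntry[] {
    return this.entries.filter((entry) => entry.addedBy === role);
  }

  // Return the distinct source URLs, for example to build a reference list.
  sources(): string[] {
    return this.entries
      .map((entry) => entry.sourceUrl)
      .filter((url, index, all) => all.indexOf(url) === index);
  }
}
```

Putting the check in `add` means downstream agents can assume every entry is citable, instead of re-validating on every read.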

The Researcher agent handles discovery through alphaXiv integration—a specialized paper search API that indexes arXiv with enhanced metadata. When you run a literature review, it doesn't just keyword search; it constructs semantic queries, filters by citation count and recency, then retrieves full PDFs for analysis. For web content, it routes through Exa, Perplexity, or Gemini APIs depending on your configuration, always preserving the source URL for later citation.
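The provider routing described above might be organized along these lines. This is a sketch under stated assumptions: the type names, the fallback order, and the shape of a raw search hit are all hypothetical, and no real Exa, Perplexity, or Gemini client call is shown.

```typescript
// Hypothetical sketch: names and fallback order are assumptions, not Feynman's API.
type WebProvider = "exa" | "perplexity" | "gemini";

interface WebSearchConfig {
  preferred?: WebProvider;                       // the user's first choice, if any
  apiKeys: Partial<Record<WebProvider, string>>; // only configured providers have keys
}

interface SourcedSnippet {
  text: string;
  sourceUrl: string;     // kept alongside the text so it can be cited later
  provider: WebProvider; // which backend returned it
}

// Pick the preferred provider when it has a key, otherwise the first configured one.
function selectProvider(config: WebSearchConfig): WebProvider {
  const fallbackOrder: WebProvider[] = ["exa", "perplexity", "gemini"];
  if (config.preferred && config.apiKeys[config.preferred]) {
    return config.preferred;
  }
  for (const candidate of fallbackOrder) {
    if (config.apiKeys[candidate]) {
      return candidate;
    }
  }
  throw new Error("No web search provider is configured");
}

// Attach provenance to raw hits and drop any hit that has no URL to cite.
function toSourcedSnippets(
  provider: WebProvider,
  hits: Array<{ url?: string; text: string }>
): SourcedSnippet[] {
  const snippets: SourcedSnippet[] = [];
  for (const hit of hits) {
    if (hit.url) {
      snippets.push({ text: hit.text, sourceUrl: hit.url, provider });
    }
  }
  return snippets;
}
```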

Here's what a skill invocation looks like when you want to audit a paper's reproducibility:

import { runSkill } from '@pi-agent/runtime';
import { auditPaper } from '@feynman/research-skills';

const result = await runSkill(auditPaper, {
  paperUrl: 'https://arxiv.org/abs/2304.12345',
  checkDatasets: true,
  attemptReplication: 'local', // or 'modal', 'runpod'
  gpuConfig: {
    type: 'A100',
    memory: '40GB'
  }
});

// result.citations contains direct links to:
// - Original paper PDF
// - GitHub repo (if exists)
// - Dataset URLs on HuggingFace
// - Documentation pages referenced

The Reviewer agent parses this structured output and cross-references claims. If a paper states "trained on Common Crawl," the agent queries the Hugging Face Hub API to verify dataset availability and matches version numbers. If code exists, it inspects requirements.txt and Dockerfiles to identify dependencies. This verification step is what enables source-grounded outputs—the agent can't make a claim without a URL to back it up.
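As an illustration of the dependency inspection step, the sketch below parses the text of a requirements.txt file and reports which packages are pinned to an exact version. `parseRequirements` and `DependencySpec` are hypothetical names, not part of Feynman. The parser handles only simple `name` plus version-specifier lines and skips option lines such as `-r` or `-e`.

```typescript
// Hypothetical sketch: a simplified requirements.txt reader, not Feynman's implementation.
interface DependencySpec {
  name: string;
  constraint: string | null; // e.g. "==2.1.0", or null when no version is given
  pinned: boolean;           // true only for a single exact "==" constraint
}

function parseRequirements(text: string): DependencySpec[] {
  const specs: DependencySpec[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    // Strip trailing inline comments, then surrounding whitespace.
    const line = rawLine.replace(/\s+#.*$/, "").trim();
    // Skip blank lines, full-line comments, and pip options like "-r base.txt".
    if (line === "" || line.charAt(0) === "#" || line.charAt(0) === "-") {
      continue;
    }
    // Package name, optional extras in brackets, then the rest as the constraint.
    const match = /^([A-Za-z0-9][A-Za-z0-9._-]*)(\[[^\]]*\])?\s*(.*)$/.exec(line);
    if (!match) {
      continue;
    }
    const constraint = match[3].trim();
    specs.push({
      name: match[1],
      constraint: constraint === "" ? null : constraint,
      pinned: /^==[^,]+$/.test(constraint),
    });
  }
  return specs;
}
```

Unpinned entries are the interesting ones for a reproducibility audit, since they can resolve to different versions on different days.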

For experiment replication, Feynman leverages computational backends through a unified interface. You can run locally with Docker, or scale to cloud GPUs via Modal or RunPod. The system automatically packages the experiment environment, injects the verified code and data, and streams logs back to your terminal. This design borrows from Hugging Face's ml-intern project but extends it with automatic recipe generation—the Writer agent produces a reproducible workflow document that another researcher (or Feynman itself) can follow.
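A unified backend interface of the kind described above could be sketched as follows. Everything here is an assumption made for illustration: the `ComputeBackend` interface, the job shape, and the dry-run backends that only log what they would do. No Docker, Modal, or RunPod API is called.

```typescript
// Hypothetical sketch: interface and names are assumptions, not Feynman's API.
type BackendName = "local" | "modal" | "runpod";

interface ExperimentJob {
  image: string;     // container image holding the packaged environment
  command: string[]; // command to run inside it
  gpuType?: string;  // e.g. "A100"; optional for CPU-only runs
}

interface ExperimentOutcome {
  backend: BackendName;
  exitCode: number;
}

interface ComputeBackend {
  name: BackendName;
  // onLog receives log lines as they are produced, so callers can stream them.
  run(job: ExperimentJob, onLog: (line: string) => void): Promise<ExperimentOutcome>;
}

// Stand-in backend: reports what it would run instead of launching anything.
function makeDryRunBackend(name: BackendName): ComputeBackend {
  return {
    name,
    async run(job, onLog) {
      onLog(`[${name}] would run ${job.image}: ${job.command.join(" ")}`);
      return { backend: name, exitCode: 0 };
    },
  };
}

const backends: Record<BackendName, ComputeBackend> = {
  local: makeDryRunBackend("local"), // would wrap Docker in a real setup
  modal: makeDryRunBackend("modal"),
  runpod: makeDryRunBackend("runpod"),
};

// Callers choose a target by name and never touch backend-specific code.
function runExperiment(
  target: BackendName,
  job: ExperimentJob,
  onLog: (line: string) => void
): Promise<ExperimentOutcome> {
  return backends[target].run(job, onLog);
}
```

Because every backend satisfies the same interface, switching from a local run to a cloud GPU is a one-argument change at the call site.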

The skill-based architecture is the real innovation here. Each capability (paper search, web scraping, dataset inspection, GPU deployment) ships as a Pi package, so you can install Feynman's research skills into existing agent frameworks like Codex or Claude Desktop without adopting the full CLI. The packages expose standard interfaces:

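// Skill signature for paper search (declaration only, so it needs `declare`;
// `Skill<Input, Output>` is assumed to be the runtime's generic skill type)
export declare const searchPapers: Skill<{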
  query: string;
  maxResults: number;
  filters?: {
    yearStart?: number;
    minCitations?: number;
    categories?: string[];
  };
}, {
  papers: Array<{
    title: string;
    authors: string[];
    url: string;
    abstract: string;
    pdfUrl: string;
  }>;
  sourceQuery: string; // for citation purposes
}>;

This modularity extends to the distribution model. The CLI ships as a standalone native bundle with an embedded Node.js runtime—you don't need npm or Node installed on your system. Updates work through feynman update, which refreshes Pi packages but not the core runtime (you reinstall the bundle for major version upgrades). This makes deployment to research teams straightforward: download one binary, configure API keys, and start running literature reviews.

The Verifier agent closes the loop. After the Writer produces a summary or report, Verifier traces every assertion back to its source citation. Claims without backing links get flagged for human review. This adversarial design between Writer and Verifier mirrors peer review processes—one agent tries to synthesize compellingly, the other demands receipts.
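The Verifier's core check can be sketched in a few lines. This is a simplified illustration with hypothetical names (`DraftClaim`, `verifyDraft`), not Feynman's implementation: a claim passes only if it cites at least one URL and every cited URL is among the sources actually gathered during research.

```typescript
// Hypothetical sketch: not Feynman's actual Verifier.
interface DraftClaim {
  text: string;
  citations: string[]; // URLs the Writer attached to this claim
}

interface VerificationResult {
  verified: DraftClaim[];
  flagged: DraftClaim[]; // needs human review
}

function verifyDraft(claims: DraftClaim[], knownSources: string[]): VerificationResult {
  const verified: DraftClaim[] = [];
  const flagged: DraftClaim[] = [];
  for (const claim of claims) {
    // A claim is backed only if it cites something, and everything it cites
    // was actually collected earlier in the pipeline.
    const backed =
      claim.citations.length > 0 &&
      claim.citations.every((url) => knownSources.indexOf(url) !== -1);
    if (backed) {
      verified.push(claim);
    } else {
      flagged.push(claim);
    }
  }
  return { verified, flagged };
}
```

Note that this checks only that a citation exists and is known. Confirming that the cited source supports the claim is a harder problem that the sketch does not attempt.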

Gotcha

Feynman's dependence on external APIs is both a strength and a weakness. alphaXiv access, web search providers (Exa, Perplexity, Gemini), and cloud GPU platforms (Modal, RunPod) all introduce costs that scale with usage. A comprehensive literature review touching 50+ papers can rack up search API calls quickly, and rate limits become real constraints during bulk operations. If you're auditing an entire research domain, expect to write your own retry and backoff logic, because neither is built into the current release.
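The retry and backoff logic mentioned above could look roughly like this. It is a generic sketch, not tied to any Feynman API: `withRetry`, `backoffDelay`, and `RetryOptions` are hypothetical names, and the wrapper retries on any thrown error, whereas a production version would likely retry only on rate-limit or transient failures.

```typescript
// Hypothetical sketch: generic exponential backoff, not part of Feynman.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // upper bound on the delay after the first failure
  maxDelayMs: number;  // cap on any single delay
}

// Upper bound on the delay after failed attempt `attempt` (1-based):
// base * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt - 1));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

async function withRetry<T>(task: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts) {
        // Random jitter keeps parallel workers from retrying in lockstep.
        await sleep(Math.random() * backoffDelay(attempt, opts));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

Wrapping each search or download call in `withRetry` keeps the backoff policy in one place, so limits can be tuned per provider without touching the calling code.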

The local-first story is incomplete. While Feynman supports local LLMs through LM Studio, Ollama, and vLLM, the research workflows still lean heavily on cloud services. Paper PDFs come from arXiv servers, datasets from Hugging Face, and web content from search APIs. If you're working in an air-gapped environment or prioritizing privacy, you'll hit friction immediately. There's no offline mode that caches papers or operates on a local corpus.

The standalone bundle's update mechanism has a sharp edge: feynman update only refreshes Pi packages, not the core application runtime. If a security patch or critical bug fix lands in the base system, you must manually download and reinstall the entire bundle. This split between package-level and runtime-level updates will confuse users expecting a unified upgrade path. The documentation mentions this, but it's easy to miss until you're troubleshooting why a known issue persists after running update.

Verdict

Use if: You're conducting systematic literature reviews in ML/AI research and need citation provenance for every claim, you want to replicate experiments from papers with automated environment setup and cloud GPU access, or you're building custom agent workflows and want pre-built research skills you can compose with other capabilities. The source-grounded approach makes Feynman particularly valuable for writing survey papers, auditing reproducibility in your research domain, or onboarding new team members to a subfield through automated summaries with verifiable links.

Skip if: You need a fully offline research tool without external API dependencies, you work outside academic/ML research domains where paper-centric workflows don't apply, or you want a polished GUI experience rather than terminal-based agent orchestration. Teams already using Elicit for systematic reviews or building custom integrations directly against the Semantic Scholar API will find Feynman's multi-agent complexity overkill for simpler discovery workflows.
