Robin: Multi-Agent Scientific Discovery from Literature to Therapeutic Candidates

Hook

What if you could generate, evaluate, and rank therapeutic candidates for a disease using nothing but a disease name and API keys? Robin demonstrates that this workflow is no longer science fiction—it’s a simple Python configuration.

Context

Scientific discovery in biomedicine has traditionally been a painstakingly manual process: researchers spend weeks combing through literature, formulating hypotheses, designing experiments, and evaluating potential therapeutic approaches. The explosion of biomedical publications has made comprehensive literature review increasingly difficult for individual researchers. While large language models have shown promise in scientific tasks, they struggle with the specialized, multi-step reasoning required for drug discovery.

Robin, from FutureHouse, tackles this challenge by orchestrating multiple specialized AI agents into a coherent scientific discovery pipeline. Rather than asking a single LLM to handle everything from literature search to candidate generation, Robin coordinates domain-specific agents—Crow for literature search, Falcon for hypothesis generation, and Finch for data analysis—into a workflow that mirrors how interdisciplinary research teams actually operate. The system generates experimental assays, ranks them through pairwise comparison, selects the most promising assay, generates therapeutic candidates, and can even incorporate experimental data to refine its recommendations. This multi-agent approach represents a pragmatic middle ground between fully manual research and the aspirational goal of fully autonomous scientific discovery.

Technical Insight

Robin’s architecture centers on a stage-based orchestration pattern that separates concerns between different types of scientific reasoning. The core workflow operates through the RobinConfiguration class, which serves as both the configuration interface and execution coordinator. Users specify their research target and model preferences, and Robin handles the complex choreography of agent interactions:

# Import path per the Robin repo; adjust if your version differs
from robin.configuration import RobinConfiguration

config = RobinConfiguration(
    disease_name="Chronic Kidney Disease",
    num_queries=10,
    num_assays=25,
    num_candidates=50,
    llm_name="o1-mini",
    futurehouse_api_key="your_futurehouse_api_key_here"
)

This configuration kicks off a sophisticated multi-stage process. The first stage generates experimental assays by having the Crow agent search literature based on LLM-generated queries, then using Falcon to synthesize findings into concrete experimental protocols. Rather than having an LLM directly score these assays, Robin implements pairwise comparison—presenting pairs of assays to the LLM and asking which is more promising. This approach produces more consistent rankings compared to absolute scoring.
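The pairwise-ranking idea is worth pausing on. A minimal sketch of how round-robin pairwise comparison can produce a ranking looks like the following; here `judge` is a stand-in for the LLM call asking "which assay is more promising?", and the item names are hypothetical, not Robin's internals:

```python
import itertools
from collections import Counter

def rank_by_pairwise_comparison(items, judge):
    """Rank items by counting wins across all round-robin pairs.

    `judge(a, b)` returns the preferred item; in Robin's setting this
    would be an LLM call comparing two assays.
    """
    wins = Counter({item: 0 for item in items})
    for a, b in itertools.combinations(items, 2):
        wins[judge(a, b)] += 1
    # Most pairwise wins first
    return [item for item, _ in wins.most_common()]

# Toy judge (prefers the longer description) standing in for the LLM
assays = [
    "calcium flux assay",
    "simple viability assay",
    "phenotypic imaging screen",
]
ranking = rank_by_pairwise_comparison(assays, judge=lambda a, b: max(a, b, key=len))
print(ranking[0])  # → "phenotypic imaging screen"
```

Each item is compared against every other, so relative judgments accumulate into a stable ordering even when any single comparison is noisy.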

The second stage follows an identical pattern but focuses on therapeutic candidates. Using the top-ranked experimental assay as context, Robin generates queries, retrieves relevant literature, and proposes specific therapeutic interventions. These candidates are again ranked through pairwise comparison, yielding a final list of recommendations grounded in scientific literature.

Robin’s integration with LiteLLM is particularly clever from an engineering perspective. Rather than hard-coding OpenAI or Anthropic APIs, it leverages LiteLLM’s unified interface to support multiple providers:

# Robin internally uses LiteLLM, allowing provider flexibility
# Users can specify any LiteLLM-supported model:
config = RobinConfiguration(
    disease_name="Sarcopenia",
    llm_name="claude-3-5-sonnet-20241022",  # Or gpt-4o, o1-mini, etc.
    llm_config={"temperature": 0.7}
)

This abstraction means Robin can take advantage of improvements in any provider’s models without code changes. It also allows researchers to compare results across different LLMs or optimize for cost versus quality.
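The core of that abstraction is that the provider is inferred from the model-name string, so one call signature covers every backend. A toy illustration of the routing idea (LiteLLM does this internally; this is not Robin's or LiteLLM's actual code):

```python
# Toy version of name-based provider routing, as LiteLLM's unified
# interface does internally. Prefix lists are illustrative only.
def infer_provider(model_name: str) -> str:
    if model_name.startswith(("gpt-", "o1-", "o3-")):
        return "openai"
    if model_name.startswith("claude-"):
        return "anthropic"
    return "unknown"

print(infer_provider("o1-mini"))                     # openai
print(infer_provider("claude-3-5-sonnet-20241022"))  # anthropic
```

Because Robin only stores the model name, swapping providers is a one-line configuration change rather than a code change.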

The output structure demonstrates thoughtful design for reproducibility. Each run creates a timestamped directory containing not just final rankings but the complete provenance: every generated hypothesis, literature review, and pairwise comparison. For example, the experimental_assay_detailed_hypotheses/ folder contains individual text files for each proposed assay, while experimental_assay_ranking_results.csv preserves every comparison decision. This granular record-keeping is crucial for scientific workflows where understanding why a particular candidate was ranked highly matters as much as the ranking itself.
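A sketch of that provenance layout, using the folder and file names described above (the contents written here are illustrative placeholders, not Robin's actual output):

```python
import csv
import os
from datetime import datetime

# Timestamped run directory, mirroring Robin's per-run output layout
run_dir = f"robin_output_{datetime.now():%Y%m%d_%H%M%S}"
hyp_dir = os.path.join(run_dir, "experimental_assay_detailed_hypotheses")
os.makedirs(hyp_dir, exist_ok=True)

# One text file per proposed assay...
with open(os.path.join(hyp_dir, "assay_01.txt"), "w") as f:
    f.write("Detailed hypothesis text for assay 1 (placeholder)...")

# ...and every pairwise comparison decision preserved in a CSV
csv_path = os.path.join(run_dir, "experimental_assay_ranking_results.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["assay_a", "assay_b", "winner"])  # columns assumed
    writer.writerow(["assay_01", "assay_02", "assay_01"])
```

Because every intermediate artifact lands on disk, a reviewer can replay the reasoning chain behind any final ranking.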

The optional data analysis stage, powered by the Finch agent, shows where Robin aims to go beyond pure hypothesis generation. When experimental data is available, Finch analyzes it and feeds insights back into a second round of therapeutic candidate generation. This creates a feedback loop where computational predictions can be refined by empirical results—though this feature is currently in closed beta, limiting immediate accessibility.

From a software architecture perspective, Robin makes an important tradeoff: it tightly couples to the FutureHouse platform agents rather than providing a generic multi-agent framework. This sacrifices generalizability for domain specificity. Crow is optimized for biomedical literature retrieval, and Falcon understands experimental design terminology and can structure protocols appropriately. This specialization is why Robin can produce useful scientific output rather than generic LLM summaries, but it also means you can’t easily adapt Robin to other research domains without access to equivalent domain agents.

Gotcha

Robin’s most significant limitation is its dependency on proprietary FutureHouse platform agents. You need a FUTUREHOUSE_API_KEY to access Crow and Falcon, and the Finch data analysis component requires closed beta access that must be explicitly requested via platform.futurehouse.org/profile. The README is upfront about this: “Without access, all the hypothesis and experiment generation code can still be run,” but losing the data analysis feedback loop eliminates one of Robin’s most compelling features. This isn’t a standalone open-source tool you can fork and run independently—it’s more accurately described as an open-source orchestration layer on top of platform services.

Cost and rate limits present another practical barrier. The documentation explicitly warns that “full paper-quality runs exceed free tier rate limits,” and examining the workflow explains why. A single run with default parameters (10 queries, 25 assays, 50 candidates) involves hundreds of LLM calls for generation plus pairwise comparisons, which scale quadratically. Combined with the costs of Crow literature searches and Falcon hypothesis generation, a comprehensive run could consume substantial API credits. The provided examples for 10 diseases are helpful precisely because reproducing them all would be expensive for most researchers. The Python 3.12+ requirement is also worth noting—not a dealbreaker, but it excludes users on older stable environments who can’t easily upgrade.
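A quick back-of-envelope calculation makes the quadratic scaling concrete. Assuming a full round-robin over the default parameters (Robin may sample pairs rather than compare all of them), the comparison calls alone number in the thousands:

```python
from math import comb

num_assays, num_candidates = 25, 50  # Robin's default parameters

# Full round-robin pairwise comparison needs C(n, 2) LLM calls
assay_comparisons = comb(num_assays, 2)          # 300
candidate_comparisons = comb(num_candidates, 2)  # 1225

total = assay_comparisons + candidate_comparisons
print(total)  # → 1525 comparison calls, before any generation or search calls
```

Doubling `num_candidates` roughly quadruples the comparison cost, which is why default runs exceed free-tier limits so quickly.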

Verdict

Use Robin if you’re a computational biologist or drug discovery researcher who needs to systematically explore therapeutic approaches for a specific disease and has both FutureHouse platform access and budget for substantial API usage. The system excels at generating structured, literature-grounded hypotheses and providing transparent reasoning chains you can audit. It’s particularly valuable if you’re comparing multiple diseases or therapeutic modalities and need consistent evaluation frameworks. The examples directory provides excellent templates for understanding what “good” output looks like, making it useful even for learning about multi-agent scientific workflows. Skip Robin if you need a standalone tool without platform dependencies, are working outside biomedical research domains, or require immediate access to the full data analysis pipeline (the Finch closed beta is a significant access barrier). Also skip it if you’re on a tight budget—experimenting with Robin for multiple diseases will quickly exhaust free tiers. For general-purpose multi-agent orchestration or non-biomedical scientific domains, you’re better served by frameworks like LangGraph where you can build custom agents, despite the higher implementation overhead.
