Robin: Building a Multi-Agent System That Generates Drug Discovery Hypotheses

Hook

What if an AI system could read thousands of scientific papers, generate therapeutic hypotheses for a disease, and rank them by feasibility—all while you sleep? Robin does exactly that, automating the early stages of drug discovery that typically take researchers weeks.

Context

Drug discovery begins with a research question: What molecules might treat this disease? Answering it requires surveying massive amounts of literature, understanding biological mechanisms, identifying therapeutic targets, and evaluating experimental evidence. A single disease might have thousands of relevant papers published across decades. Human researchers spend weeks manually reviewing literature, synthesizing findings, and generating hypotheses—a bottleneck that slows the path from scientific insight to clinical trials.

Robin emerged from FutureHouse's work on AI-assisted scientific discovery. Rather than building another chatbot that answers questions about papers, the team created an autonomous system that executes complete research workflows. Robin combines LLM-powered agents with specialized scientific APIs from the Edison platform: Crow for literature search, Falcon for paper analysis, and Finch for experimental data processing. The result is a pipeline that turns a disease name into ranked therapeutic candidates with supporting evidence, automating the hypothesis generation phase that kicks off drug discovery projects.

Technical Insight

Robin's architecture revolves around orchestrating multiple specialized agents in a four-stage pipeline. The system doesn't try to make a single LLM do everything; instead, it delegates specific tasks to purpose-built tools and uses LLMs primarily for synthesis and reasoning.

The first stage generates search queries using an LLM prompted with disease context. Rather than executing a single generic search, Robin creates multiple targeted queries: one for experimental assays related to the disease, another for existing therapeutic candidates. This query diversification broadens literature coverage beyond what one query would return. The system uses LiteLLM as an abstraction layer, so you can swap between OpenAI, Anthropic, or local models by changing only the model string:

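from litellm import completion

disease = "Disease X"  # placeholder input

def parse_search_queries(text):
    """Illustrative helper (not Robin's actual parser): one query per non-empty line."""
    return [line.strip("-*• ").strip() for line in text.splitlines() if line.strip()]

# Robin uses LiteLLM for model flexibility; change `model` to switch providers
response = completion(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a scientific research assistant."},
        {"role": "user", "content": f"Generate literature search queries for: {disease}. Return one query per line."}
    ]
)

# Extract queries from response
queries = parse_search_queries(response.choices[0].message.content)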
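
The query diversification described above can be sketched without calling any model. The function name `build_query_prompts`, the two focus labels, and the prompt wording are illustrative assumptions, not Robin's actual prompts.

```python
def build_query_prompts(disease: str, queries_per_focus: int = 5) -> dict[str, list[dict[str, str]]]:
    """Return one chat-message list per search focus (hypothetical helper)."""
    # Two search angles named in the article: assays and existing candidates.
    focuses = {
        "assays": "experimental assays and disease models used to study",
        "candidates": "existing or proposed therapeutic candidates for",
    }
    system = {"role": "system", "content": "You are a scientific research assistant."}
    return {
        name: [
            system,
            {
                "role": "user",
                "content": (
                    f"Write {queries_per_focus} literature search queries about "
                    f"{angle} {disease}. Return one query per line."
                ),
            },
        ]
        for name, angle in focuses.items()
    }


prompts = build_query_prompts("Disease X")
for focus, messages in prompts.items():
    print(focus, "->", messages[1]["content"])
```

Each message list can be passed as the `messages` argument of `litellm.completion`, which keeps the per-focus prompts separate from the model call.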

The second stage executes these queries against Edison's Crow API for literature search and Falcon for paper analysis. This is where Robin's dependency on external services becomes apparent—it's not scraping PubMed or parsing PDFs itself. Instead, it calls specialized APIs that have already indexed scientific literature and built extraction pipelines. The tradeoff is clear: you get production-quality literature search without building that infrastructure, but you're locked into Edison's pricing and availability.
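
One way to picture this delegation is a thin interface that the pipeline calls for every generated query. The sketch below is an assumption for illustration only: `LiteratureBackend`, `Paper`, `gather_literature`, and `InMemoryBackend` are hypothetical names, and the real Edison client will have a different API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Paper:
    title: str
    identifier: str  # e.g. a DOI; the exact format is an assumption


class LiteratureBackend(Protocol):
    """Hypothetical interface standing in for an external search service."""

    def search(self, query: str) -> list[Paper]: ...


def gather_literature(backend: LiteratureBackend, queries: list[str]) -> dict[str, list[Paper]]:
    """Run every query and keep each paper only under the first query that found it."""
    seen: set[str] = set()
    results: dict[str, list[Paper]] = {}
    for query in queries:
        fresh = []
        for paper in backend.search(query):
            if paper.identifier not in seen:
                seen.add(paper.identifier)
                fresh.append(paper)
        results[query] = fresh
    return results


class InMemoryBackend:
    """Offline stand-in used only to exercise the sketch."""

    def __init__(self, index: dict[str, list[Paper]]):
        self._index = index

    def search(self, query: str) -> list[Paper]:
        return self._index.get(query, [])


p1 = Paper("Placeholder paper 1", "id-1")
p2 = Paper("Placeholder paper 2", "id-2")
backend = InMemoryBackend({"assay query": [p1, p2], "candidate query": [p2]})
print(gather_literature(backend, ["assay query", "candidate query"]))
```

Coding against an interface like this would also be the natural place to plug in a different search service, though the article gives no indication that Robin exposes such a seam.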

The third stage demonstrates Robin's most interesting architectural decision: pairwise hypothesis ranking. Instead of asking an LLM to score hypotheses on a 1-10 scale (which produces inconsistent results), Robin generates all possible pairs of therapeutic candidates and asks the LLM which is more promising:

from litellm import completion

# compute_elo_rankings is assumed to be defined elsewhere; it turns
# pairwise outcomes into a global ordering.

def rank_hypotheses(candidates, literature_context):
    """Rank therapeutic candidates using pairwise comparisons"""
    comparisons = []
    
    # Generate all pairs
    for i, candidate_a in enumerate(candidates):
        for candidate_b in candidates[i+1:]:
            prompt = f"""
            Based on the literature review, which therapeutic approach 
            is more promising for this disease?
            
            Literature review: {literature_context}
            
            A: {candidate_a.description}
            Evidence: {candidate_a.supporting_papers}
            
            B: {candidate_b.description}
            Evidence: {candidate_b.supporting_papers}
            
            Answer with 'A' or 'B' and explain your reasoning.
            """
            
            result = completion(model="gpt-4", messages=[{"role": "user", "content": prompt}])
            comparisons.append((candidate_a, candidate_b, result))
    
    # Convert pairwise wins to global ranking
    return compute_elo_rankings(comparisons)

This pairwise approach mirrors how humans evaluate research directions: we are better at comparing two specific options than at assigning absolute scores. It also avoids the position bias and anchoring effects that appear when an LLM scores a long list, although a pairwise judge can still favor whichever option is shown first, so randomizing the A/B order is a sensible precaution. The number of comparisons grows as O(n²): n candidates need n(n-1)/2 comparisons, so 10 candidates require 45. For the typical use case of ranking 5-15 therapeutic targets, the quality improvement justifies that cost.
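
To make the conversion from pairwise wins to a global ranking concrete, here is a minimal Elo-style sketch. It is an illustration under stated assumptions, not Robin's implementation: the function name `elo_rankings`, the K-factor of 32, and the starting rating of 1000 are all choices made for this example, and the input is assumed to be already parsed into (winner, loser) name pairs.

```python
from itertools import combinations


def elo_rankings(names, outcomes, k=32.0, start=1000.0):
    """Turn (winner, loser) pairs into ratings, best first.

    Illustrative Elo update; not taken from Robin's source.
    """
    ratings = {name: start for name in names}
    for winner, loser in outcomes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        shift = k * (1.0 - expected)
        ratings[winner] += shift
        ratings[loser] -= shift
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


names = ["A", "B", "C"]
outcomes = [("A", "B"), ("A", "C"), ("B", "C")]  # (winner, loser)
print([name for name, _ in elo_rankings(names, outcomes)])  # ['A', 'B', 'C']

# Number of unordered pairs among 10 candidates.
print(len(list(combinations(range(10), 2))))  # 45
```

Sequential Elo updates depend on the order in which comparisons are processed, so shuffling the outcomes or averaging over several passes may give a more stable ordering.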

The optional fourth stage integrates experimental data analysis via Edison's Finch API. If you have assay results or gene expression data, Robin can incorporate quantitative evidence alongside literature findings. This modularity—being able to run just the hypothesis generation stages or add experimental analysis—reflects good pipeline design for research workflows where data availability varies.

All outputs are structured and timestamped. Robin generates CSV files with ranked candidates, detailed hypothesis reports in markdown, and complete literature reviews organized by disease and run date. This isn't just about getting an answer; it's about creating an auditable trail that scientists can verify, cite, and build upon. The emphasis on provenance and structured outputs shows Robin was built by people familiar with scientific validation requirements.
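
A timestamped, per-disease output layout of this kind could be produced along the following lines. The directory naming scheme, the `ranked_candidates.csv` file name, and the column names are assumptions made for this sketch; Robin's actual layout may differ.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_ranked_candidates(base_dir, disease, ranked, now=None):
    """Write (candidate, score) rows, best first, into a per-run directory."""
    now = now or datetime.now(timezone.utc)
    slug = "_".join(disease.lower().split())
    # One directory per disease and run time keeps earlier runs intact.
    run_dir = Path(base_dir) / f"{slug}_{now:%Y-%m-%d_%H-%M-%S}"
    run_dir.mkdir(parents=True, exist_ok=True)
    out_path = run_dir / "ranked_candidates.csv"
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["rank", "candidate", "score"])
        for rank, (name, score) in enumerate(ranked, start=1):
            writer.writerow([rank, name, score])
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_ranked_candidates(tmp, "Disease X", [("candidate_a", 1016.0), ("candidate_b", 984.0)])
    print(path.read_text(encoding="utf-8"))
```

Writing each run into its own directory is what makes the trail auditable: a later run never overwrites the evidence behind an earlier ranking.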

Gotcha

Robin's Edison API dependency is both its strength and biggest limitation. The system requires paid API credits for core functionality—literature search, paper analysis, and data processing all hit Edison endpoints. There's no fallback to free alternatives like PubMed or Semantic Scholar. This makes Robin fundamentally inaccessible for unfunded researchers or anyone wanting to experiment without financial commitment. The README mentions Docker as the recommended setup to avoid dependency conflicts, which suggests the local installation path is fragile enough that containerization became necessary.

The pairwise ranking approach, while clever, has scaling challenges the documentation doesn't fully address. With 20 therapeutic candidates, you're making 190 LLM calls for ranking alone. At $0.03 per GPT-4 call (assuming ~1000 tokens per comparison), that's $5.70 just for ranking, plus Edison API costs for the literature search that generated those candidates. For a well-funded lab exploring multiple diseases, this is trivial. For an individual researcher or startup, costs compound quickly. The system also lacks guidance on what to do when pairwise comparisons produce cycles (A beats B, B beats C, C beats A)—a real possibility with LLM judges that have high temperature settings or close candidate quality.
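
The arithmetic and the cycle problem above can both be checked in a few lines. This is a sketch under the article's own assumption of $0.03 per comparison; the function names are hypothetical, and the cycle check only looks for three-way cycles among already parsed (winner, loser) pairs.

```python
from itertools import combinations


def pairwise_call_count(n_candidates):
    """Number of unordered pairs: n(n-1)/2."""
    return n_candidates * (n_candidates - 1) // 2


def estimated_ranking_cost(n_candidates, cost_per_call=0.03):
    """Ranking cost in dollars, using an assumed flat price per LLM call."""
    return pairwise_call_count(n_candidates) * cost_per_call


def find_three_cycles(names, outcomes):
    """Return each intransitive triple (a beats b, b beats c, c beats a) once."""
    beats = set(outcomes)  # (winner, loser) pairs
    cycles = []
    for a, b, c in combinations(names, 3):
        # A triple can cycle in exactly two directions; check both.
        if {(a, b), (b, c), (c, a)} <= beats:
            cycles.append((a, b, c))
        elif {(a, c), (c, b), (b, a)} <= beats:
            cycles.append((a, c, b))
    return cycles


print(pairwise_call_count(20))  # 190
print(round(estimated_ranking_cost(20), 2))  # 5.7
print(find_three_cycles(["A", "B", "C"], [("A", "B"), ("B", "C"), ("C", "A")]))  # [('A', 'B', 'C')]
```

Flagging cycles before computing a ranking gives a reviewer the chance to rerun those comparisons, for example at a lower temperature, instead of trusting an ordering built on contradictory judgments.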

Verdict

Use Robin if you're part of a computational biology team or drug discovery group with budget for API services, need to rapidly generate and rank therapeutic hypotheses for specific diseases, and have the wet-lab capacity to validate promising candidates. It's particularly valuable when you're starting research on a new disease indication and need to survey the landscape efficiently. The system excels at turning "What should we investigate for Disease X?" into a prioritized list with evidence backing each candidate.

Skip it if you need free and open-source tools, work outside biomedical research domains, lack experimental validation capabilities, require real-time interactive assistance rather than batch processing, or want to modify the core literature search and analysis components (which are locked behind Edison's APIs).

Individual researchers and early-stage startups should explore Elicit.org or build custom pipelines with Semantic Scholar's free API and LangChain before committing to Robin's commercial dependencies.
