Building AI Research Tools with Search-Based Aggregation: Inside Company Researcher
Hook
Company Researcher makes no direct API calls to LinkedIn, Crunchbase, or Twitter, yet it aggregates data from all of them. The entire application is built on a single search API that crawls pages and extracts information through targeted queries.
Context
Traditional company research tools follow one of two paths: they either maintain expensive, curated databases (Crunchbase Pro, PitchBook) or build scraper infrastructure that breaks whenever a target site changes its HTML. Both approaches are capital-intensive and hard to maintain. Company Researcher demonstrates a third approach: using an AI-native search engine as an orchestration layer. Instead of scraping LinkedIn directly or paying for Crunchbase API access, it uses Exa.ai’s search API with domain filters, keyword targeting, and live crawling to retrieve much of the same information. The tool doesn’t store data; it is a real-time aggregation engine that composes multiple search queries into a unified company profile. The interesting consequence is that complexity shifts from scraping and data validation to query design and result aggregation.
Technical Insight
The core architecture revolves around parallel search execution with domain-specific targeting. When you input a company URL, the application fans out multiple Exa API calls simultaneously, each configured to target specific platforms and content types (the README documents 16 distinct data source categories including LinkedIn profiles, Crunchbase data, news coverage, and social media presence).
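The fan-out described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `SectionTask`, `SectionResult`, and `fanOut` are hypothetical, and the demo tasks are stand-ins for real Exa calls so the sketch runs without network access. It assumes each data source is wrapped in an async function and that one failing source should not sink the whole profile.

```typescript
// Hypothetical types: one async task per data-source section.
type SectionTask<T> = () => Promise<T>;

type SectionResult<T> =
  | { section: string; ok: true; data: T }
  | { section: string; ok: false; error: string };

// Start every section's search at once and wait for all of them.
// A rejected task becomes an error entry instead of failing the whole batch.
async function fanOut<T>(
  tasks: Record<string, SectionTask<T>>
): Promise<SectionResult<T>[]> {
  const sections = Object.keys(tasks);
  return Promise.all(
    sections.map(async (section): Promise<SectionResult<T>> => {
      try {
        return { section, ok: true, data: await tasks[section]() };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return { section, ok: false, error };
      }
    })
  );
}

// Stand-ins for real search calls (assumed shapes, for illustration only).
const demoTasks: Record<string, SectionTask<string[]>> = {
  linkedin: async () => ["https://linkedin.com/in/example-founder"],
  funding: async () => {
    throw new Error("rate limited");
  },
  news: async () => [],
};
```

Calling `fanOut(demoTasks)` resolves to one entry per section, so a UI could render the sections that succeeded and show a placeholder for the ones that failed.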
Based on the API playground examples provided in the README, the LinkedIn founder search appears to use keyword-based search with domain filtering following this pattern:
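// Pattern derived from README playground link
// Assumes the exa-js client; the EXA_API_KEY variable name is an assumption
import Exa from "exa-js";
const exa = new Exa(process.env.EXA_API_KEY);

const foundersSearch = await exa.search(
  "exa.ai founder's Linkedin page:",
  {
    type: "keyword",                   // keyword matching rather than neural search
    numResults: 5,
    includeDomains: ["linkedin.com"]   // only return LinkedIn pages
  }
);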
The includeDomains parameter restricts results to LinkedIn, while the query string uses natural language to target founder profiles specifically. This pattern repeats across data sources: Crunchbase funding data uses includeDomains: ["crunchbase.com"] with includeText: ["exa.ai"] to ensure results mention the company.
The most sophisticated queries documented in the README use Exa’s summary feature with custom prompts for structured extraction. For funding information, the playground link shows:
// Pattern from README API playground
// In exa-js, content options (text, summary, livecrawl) go through searchAndContents
const fundingData = await exa.searchAndContents(
  "exa.ai Funding:",
  {
    type: "keyword",
    text: true,
    numResults: 1,
    livecrawl: "always",   // bypass the cache and crawl the page now
    summary: {
      query: "Tell me all about the funding (and if available, the valuation) of this company in detail. Do not tell me about the company, just give all the funding information in detail. If funding or valuation info is not preset, just reply with one word 'NO'."
    },
    includeText: ["exa.ai"]
  }
);
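The livecrawl: "always" parameter forces real-time crawling rather than using cached results, which matters for financial data that changes frequently. The summary query acts as a post-processing filter over the crawled page: Exa generates the summary as part of the same request, so the response already carries either the extracted funding details or the one-word 'NO' sentinel. The Anthropic API key that the README’s environment setup requires is therefore presumably used elsewhere in the application rather than for this step.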
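Because the prompt asks for a one-word 'NO' when no funding data exists, the caller has to treat that sentinel as "no data" rather than display it. A minimal sketch of that check, assuming the summary arrives as a plain string; the helper name `parseFundingSummary` is hypothetical and not taken from the project:

```typescript
// Hypothetical helper: returns the summary text, or null when the summary is
// missing, empty, or just the 'NO' sentinel requested by the prompt.
function parseFundingSummary(
  summary: string | null | undefined
): string | null {
  if (!summary) return null;
  const trimmed = summary.trim();
  // Tolerate quotes, trailing punctuation, and case differences around the sentinel.
  const bare = trimmed.replace(/^['"\s]+|['".!\s]+$/g, "").toUpperCase();
  if (bare === "" || bare === "NO") return null;
  return trimmed;
}
```

A funding section could then be hidden whenever the helper returns null, which avoids rendering a bare "NO" to the user. Language models do not always follow a one-word instruction exactly, so the normalization here is deliberately lenient, but a longer refusal such as "No funding information was found" would still pass through as text.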
Subpage discovery demonstrates another pattern documented in the playground links:
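// Pattern from README
// In exa-js, subpage and livecrawl options go through searchAndContents
const subpages = await exa.searchAndContents(
  "exa.ai",
  {
    type: "neural",
    text: true,
    numResults: 1,
    livecrawl: "always",
    subpages: 10,                                        // crawl up to 10 subpages
    subpageTarget: ["about", "pricing", "faq", "blog"],  // page types to look for
    includeDomains: ["exa.ai"]
  }
);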
The subpages: 10 setting caps how many subpages Exa crawls, and subpageTarget steers that crawl toward common company page types so fewer irrelevant pages come back. This is more efficient than crawling an entire domain and filtering client-side.
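Once subpages come back, a client still has to decide which page fills which section of the profile. The sketch below shows one possible way to group returned URLs by the same target keywords. It is an illustration only: `bucketSubpages` is a hypothetical helper, it works on a plain list of URLs rather than on Exa's actual response shape, and it assumes a simple "first matching keyword in the path wins" rule.

```typescript
// Hypothetical helper: group subpage URLs under the first target keyword
// found in their path; anything unmatched lands in "other".
function bucketSubpages(
  urls: string[],
  targets: string[]
): Record<string, string[]> {
  const buckets: Record<string, string[]> = { other: [] };
  for (const target of targets) buckets[target] = [];

  for (const raw of urls) {
    // Strip scheme and host so a keyword in the domain name cannot match.
    const path = raw.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]+/i, "").toLowerCase();
    const match = targets.find((t) => path.includes(t.toLowerCase()));
    buckets[match !== undefined ? match : "other"].push(raw);
  }
  return buckets;
}
```

For example, `bucketSubpages(urls, ["about", "pricing", "faq", "blog"])` would place a URL ending in `/pricing` under `pricing` and a `/careers` page under `other`. Matching on substrings is crude, so a path like `/blog/about-us` goes to whichever keyword appears first in the target list.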
The Next.js frontend (built with App Router, TailwindCSS, and TypeScript per the README’s tech stack) appears to orchestrate these calls, with the Vercel AI SDK handling AI integration. The playground links in the README serve double duty as documentation and testing interface, showing exactly how each query is structured.
What makes this architecture notable is its composability: adding a new data source means adding a new Exa query configuration, not building a new scraper.
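One way to picture that composability is a registry in which each data source is just a query template plus an options object. The sketch below is an assumption about how such a registry could look, not the project's actual structure. The LinkedIn entry mirrors the playground pattern shown earlier; the Crunchbase query string and numResults value, the GitHub entry, and the names `SourceConfig`, `sources`, and `buildRequests` are all hypothetical.

```typescript
// Hypothetical shape for a data source: how to phrase the query and which
// search options to send for a given company.
type ExaOptions = Record<string, unknown>;

interface SourceConfig {
  buildQuery: (company: string) => string;
  buildOptions: (company: string) => ExaOptions;
}

const sources: Record<string, SourceConfig> = {
  linkedinFounders: {
    buildQuery: (c) => `${c} founder's Linkedin page:`,
    buildOptions: () => ({
      type: "keyword",
      numResults: 5,
      includeDomains: ["linkedin.com"],
    }),
  },
  crunchbase: {
    // Query wording and numResults are assumed for illustration.
    buildQuery: (c) => `${c} crunchbase page:`,
    buildOptions: (c) => ({
      type: "keyword",
      numResults: 1,
      includeDomains: ["crunchbase.com"],
      includeText: [c],
    }),
  },
};

// Adding a source is one more entry (hypothetical example, not from the README).
sources.github = {
  buildQuery: (c) => `${c} Github:`,
  buildOptions: () => ({
    type: "keyword",
    numResults: 1,
    includeDomains: ["github.com"],
  }),
};

// Expand every registered source into a concrete request for one company.
function buildRequests(company: string) {
  return Object.keys(sources).map((name) => ({
    name,
    query: sources[name].buildQuery(company),
    options: sources[name].buildOptions(company),
  }));
}
```

Each request produced by `buildRequests("exa.ai")` could then be handed to the search client, which keeps source-specific knowledge in data rather than in scraping code.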
Gotcha
The tool’s fundamental limitation is that it’s only as good as Exa’s search results. If a company has minimal web presence, which is common for early-stage startups and for B2B companies that don’t publicize funding, you’ll get sparse, incomplete data. The LinkedIn founder search is particularly hit-or-miss: given the keyword-based approach shown in the playground examples, a founder whose profile doesn’t explicitly mention the company name or URL is unlikely to appear.
The tool also aggregates whatever Exa returns, with no apparent validation or cross-referencing between sources. If Crunchbase has outdated funding information, that’s what you’ll see. Live crawling, while powerful, makes the tool dependent on external websites being reachable; if LinkedIn is rate-limiting or a target page is down, that section may come back incomplete.
This appears to be a demo application showcasing Exa’s capabilities rather than a production-ready tool. The README says nothing about caching, error recovery, or edge cases such as companies with common names or multiple entities sharing similar URLs. Running it also requires your own Exa API key and Anthropic API key; the README’s environment setup lists both as required.
Verdict
Use Company Researcher if you need a quick starting point for company research and want to understand how AI-native search APIs can replace traditional scraping pipelines. It works best for aggregating publicly available information about well-known companies with a strong online presence, and the codebase serves as a reference implementation for building similar aggregation tools with Exa or comparable search APIs. The playground links make it a useful learning resource for search query design: the README documents 16 query configurations with live examples.
Skip it if you need verified, audit-quality data for investment decisions or competitive intelligence, because data quality depends entirely on what Exa retrieves from the web. Also skip it if you’re researching private companies, regional businesses, or startups without significant web footprints. This is fundamentally a demonstration of what’s possible with search-based aggregation, not a replacement for professional research platforms.