Raven: A Cautionary Tale of LinkedIn Scraping and the Fragility of Reconnaissance Tools

Hook

Raven has 797 stars on GitHub but hasn't worked properly since 2018. Yet its architecture teaches us more about reconnaissance engineering than most actively maintained tools.

Context

In the mid-2010s, penetration testers faced a recurring challenge during the reconnaissance phase: manually gathering employee information from LinkedIn to generate potential email addresses for social engineering or password-spraying attacks. The process was tedious: search Google with a dork such as site:linkedin.com/in "TargetCorp", click through profiles, copy names and titles into spreadsheets, then manually generate email permutations (firstname.lastname@target.com, flastname@target.com, etc.).

Raven emerged in 2018 as an elegant solution to this workflow. Built in Go, it automated the entire pipeline: scraping LinkedIn profiles via Google dorking, storing employee data in SQLite, and generating email addresses in configurable formats. The tool represented a shift from simple scrapers to stateful reconnaissance frameworks—scan once, export many times with different email formats without re-running the scrape. But Raven's story is less about what it achieved and more about what it reveals: the inherent fragility of tools built on adversarial platforms. LinkedIn has spent years hardening defenses against exactly this type of automation, and Raven's unmaintained codebase is now a fossil of a bygone era in OSINT tooling.

Technical Insight

Raven's architecture demonstrates sophisticated thinking about the reconnaissance workflow. Rather than building a simple scraper, the developers created a three-layer system: collection, storage, and transformation. The tool uses Selenium WebDriver through the agouti library to control Chrome, navigating Google search results for LinkedIn profiles. This browser automation approach was clever—it could execute JavaScript and appear more human-like than raw HTTP requests.

The data flow is elegant. After scraping, employee information lands in SQLite with a schema separating raw profile data from generated artifacts. Here's the core insight: by decoupling data collection from email generation, Raven allowed penetration testers to iterate on email formats without re-running the scrape and tripping rate limits again. Expressed as Go types, the data model behind that schema likely looked something like this:

// Simplified representation of Raven's data model (illustrative, not the actual source)
package raven

import "strings"

// Employee is one scraped profile, tied to the scan that found it.
type Employee struct {
    ID        int
    FirstName string
    LastName  string
    Position  string
    Company   string
    LinkedIn  string // profile URL
    ScanID    int
}

// EmailFormat is an export-time template for building addresses.
type EmailFormat struct {
    Pattern string // e.g., "{first}.{last}@{domain}"
}

// GenerateEmails fills the pattern once per employee.
// Generation happens at export time, not scrape time.
func GenerateEmails(employees []Employee, format EmailFormat, domain string) []string {
    emails := make([]string, 0, len(employees))
    for _, emp := range employees {
        email := strings.ReplaceAll(format.Pattern, "{first}", strings.ToLower(emp.FirstName))
        email = strings.ReplaceAll(email, "{last}", strings.ToLower(emp.LastName))
        email = strings.ReplaceAll(email, "{domain}", domain)
        emails = append(emails, email)
    }
    return emails
}

The interactive shell interface, built with the readline library, provided a familiar command structure. Commands like new_scan, config, and export allowed organizing multiple reconnaissance operations. The export command accepted format strings such as {first}.{last}, {f}{last}, and {first}{last}, so a single command could turn a stored scan into hundreds of candidate addresses.
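The format strings above can be expanded in a single pass over the stored names. What follows is a minimal sketch rather than Raven's actual code: the Person type and ExpandFormat function are hypothetical names, and it assumes only the three tokens mentioned in the text ({first}, {last}, {f}), with the domain appended after the pattern.

```go
package main

import (
	"fmt"
	"strings"
)

// Person holds the two name fields needed for address generation.
// Hypothetical type for illustration; not taken from Raven's source.
type Person struct {
	First, Last string
}

// ExpandFormat fills a format such as "{f}{last}" for one person and appends
// "@domain". It returns false when either name part is empty.
func ExpandFormat(format string, p Person, domain string) (string, bool) {
	first := strings.ToLower(strings.TrimSpace(p.First))
	last := strings.ToLower(strings.TrimSpace(p.Last))
	if first == "" || last == "" {
		return "", false
	}
	// Take the first rune, not the first byte, so non-ASCII names stay valid UTF-8.
	initial := string([]rune(first)[:1])
	r := strings.NewReplacer("{first}", first, "{last}", last, "{f}", initial)
	return r.Replace(format) + "@" + domain, true
}

func main() {
	people := []Person{{First: "Ada", Last: "Lovelace"}, {First: "Grace", Last: "Hopper"}}
	formats := []string{"{first}.{last}", "{f}{last}", "{first}{last}"}
	for _, format := range formats {
		for _, p := range people {
			if addr, ok := ExpandFormat(format, p, "target.com"); ok {
				fmt.Println(addr)
			}
		}
	}
}
```

Because the names are read from storage and the format is applied last, trying a new convention costs one more loop over the same rows rather than another scrape.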

What truly elevated Raven was its HaveIBeenPwned integration. After generating email addresses, the tool could query HIBP's API to identify which emails appeared in known data breaches. This feature transformed a simple OSINT tool into a prioritization engine—penetration testers could focus on employees whose credentials had already leaked, significantly improving the efficiency of social engineering campaigns.
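A breach lookup of this kind can be sketched as follows. This is not Raven's code: it targets the current HIBP v3 breachedaccount endpoint, which requires a paid API key sent in the hibp-api-key header, whereas Raven predates that requirement. The function names, the HIBP_API_KEY environment variable, and the user agent string are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

const hibpBase = "https://haveibeenpwned.com/api/v3"

// breachURL builds the breachedaccount lookup URL with the address path-escaped.
func breachURL(base, email string) string {
	return base + "/breachedaccount/" + url.PathEscape(email)
}

// parseBreaches interprets an HIBP response: 200 carries a JSON array of
// breaches, 404 means the account appears in no known breach. Anything else
// (401 bad key, 429 rate limited, ...) is returned as an error.
func parseBreaches(status int, body []byte) ([]string, error) {
	switch status {
	case http.StatusOK:
		var items []struct{ Name string }
		if err := json.Unmarshal(body, &items); err != nil {
			return nil, err
		}
		names := make([]string, 0, len(items))
		for _, it := range items {
			names = append(names, it.Name)
		}
		return names, nil
	case http.StatusNotFound:
		return nil, nil
	default:
		return nil, fmt.Errorf("hibp: unexpected status %d", status)
	}
}

// lookup performs one request and returns the names of matching breaches.
func lookup(client *http.Client, apiKey, email string) ([]string, error) {
	req, err := http.NewRequest(http.MethodGet, breachURL(hibpBase, email), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("hibp-api-key", apiKey)
	req.Header.Set("User-Agent", "recon-sketch") // HIBP rejects requests without a user agent
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseBreaches(resp.StatusCode, body)
}

func main() {
	key := os.Getenv("HIBP_API_KEY") // assumed variable name
	if key == "" {
		fmt.Println("set HIBP_API_KEY to run a live lookup")
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	names, err := lookup(client, key, "ada.lovelace@target.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("%d breach(es): %v\n", len(names), names)
}
```

Sorting generated addresses by the number of breaches returned is one simple way to get the prioritization described above. A real run over hundreds of addresses would also need to respect the rate limit signalled by 429 responses.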

The Go implementation was also strategic. A compiled binary meant no dependency hell for end users—just download the executable and chromedriver, configure the path, and run. Go's concurrency primitives would have enabled parallel scraping of search result pages, though LinkedIn's aggressive rate limiting likely made this less useful in practice. The SQLite choice avoided requiring a database server, keeping the tool portable for fieldwork on isolated networks.
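To make the concurrency point concrete, here is a sketch of a bounded worker pool with paced dispatch. It is an illustration of the pattern, not a description of how Raven scraped: FetchFunc, FetchPages, and ErrBlocked are hypothetical names, and the fetch function is a stand-in for whatever drives the browser.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrBlocked is a hypothetical sentinel for a page that hit a CAPTCHA or block.
var ErrBlocked = errors.New("blocked")

// FetchFunc returns the profile URLs found on one search-result page.
// Hypothetical signature; a real implementation would drive the browser.
type FetchFunc func(page int) ([]string, error)

// FetchPages runs fetch for pages 0..n-1 on at most `workers` goroutines and
// waits `delay` between dispatching jobs. Results keep page order; pages that
// return an error are skipped.
func FetchPages(n, workers int, delay time.Duration, fetch FetchFunc) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, n)
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for page := range jobs {
				if urls, err := fetch(page); err == nil {
					results[page] = urls // each index is written by exactly one worker
				}
			}
		}()
	}
	for page := 0; page < n; page++ {
		jobs <- page
		time.Sleep(delay) // crude pacing: at most one new request per delay
	}
	close(jobs)
	wg.Wait()

	var all []string
	for _, urls := range results {
		all = append(all, urls...)
	}
	return all
}

func main() {
	fake := func(page int) ([]string, error) {
		if page == 2 {
			return nil, ErrBlocked
		}
		return []string{fmt.Sprintf("https://www.linkedin.com/in/example-%d", page)}, nil
	}
	for _, u := range FetchPages(4, 2, 50*time.Millisecond, fake) {
		fmt.Println(u)
	}
}
```

The delay parameter is where rate limiting bites: once it has to be long enough to avoid blocks, extra workers add little, which matches the caveat above.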

Gotcha

Raven is unmaintained and broken, and that is a fundamental problem rather than a minor inconvenience. LinkedIn has spent years hardening against automated scraping: it now requires authentication for most profile views, applies aggressive bot detection and browser fingerprinting, and can serve different HTML structures to suspected bots. The Google dorking approach that Raven relied on now frequently triggers CAPTCHAs and returns fewer usable results, since LinkedIn limits what crawlers and logged-out visitors can see.

The Selenium/Chromedriver dependency creates operational headaches. Chrome updates constantly, breaking compatibility with older chromedriver versions. Managing this version pairing across different operating systems and keeping it updated is maintenance overhead that the penetration tester has to absorb. Additionally, running headless Chrome is resource-intensive and leaves telltale automation fingerprints (navigator.webdriver property, missing plugins, consistent timing patterns) that modern anti-bot systems easily detect. Even if you got Raven running today, your session would likely be challenged or blocked within minutes. The legal and ethical concerns are also non-trivial: automated scraping violates LinkedIn's Terms of Service, and depending on jurisdiction and use case, could expose penetration testers to legal liability under the Computer Fraud and Abuse Act or GDPR.
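On the version-pairing problem specifically, a pre-flight check can at least fail fast before a scan starts. The sketch below assumes the usual rule that chromedriver must share Chrome's major version, and assumes the inputs look like typical --version output ("Google Chrome 114.0.5735.90", "ChromeDriver 114.0.5735.90 (...)"). The function names are hypothetical and nothing here comes from Raven.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorVersion returns the leading integer of the first version-like field
// in a --version line, e.g. 114 for "Google Chrome 114.0.5735.90".
func majorVersion(output string) (int, error) {
	for _, field := range strings.Fields(output) {
		if field[0] >= '0' && field[0] <= '9' {
			head, _, _ := strings.Cut(field, ".")
			return strconv.Atoi(head)
		}
	}
	return 0, fmt.Errorf("no version number in %q", output)
}

// Compatible reports whether the browser and driver share a major version.
func Compatible(chromeOut, driverOut string) (bool, error) {
	chrome, err := majorVersion(chromeOut)
	if err != nil {
		return false, err
	}
	driver, err := majorVersion(driverOut)
	if err != nil {
		return false, err
	}
	return chrome == driver, nil
}

func main() {
	// Sample strings; a real check would capture the output of both binaries.
	ok, err := Compatible("Google Chrome 115.0.5790.98", "ChromeDriver 114.0.5735.90 (abc123)")
	if err != nil {
		fmt.Println("could not compare versions:", err)
		return
	}
	if !ok {
		fmt.Println("chromedriver does not match the installed Chrome; update one of them")
	}
}
```

A check like this only addresses the pairing issue; it does nothing about the automation fingerprints or blocking described above.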

Verdict

Use if: You're studying reconnaissance tool architecture and want to understand stateful OSINT design patterns, or you're building internal tools that scrape less adversarial platforms and need architectural inspiration for separating collection from transformation. Raven's codebase is educational even if non-functional.

Skip if: You need working LinkedIn reconnaissance for actual engagements; the tool is dead in the water. Instead, use theHarvester for multi-source OSINT that's actively maintained, explore APIs like Hunter.io or Snov.io for legal employee enumeration, or pivot to GitHub, company directories, and conference speaker lists where anti-automation is less aggressive. The era of simple LinkedIn scraping is over; modern reconnaissance requires either paying for legitimate data services or accepting far more sophisticated evasion techniques than Raven provides.
