
Sherlock: How 400+ Social Networks Get Queried with Pattern Matching and Concurrent HTTP Requests

Hook

A single username can be checked across 400+ social networks in under 60 seconds. The bottleneck isn't the code—it's the anti-bot measures standing in your way.

Context

Before tools like Sherlock, OSINT investigators faced a grinding manual process: visiting dozens of social networks individually, typing in a target username, and documenting whether accounts existed. For penetration testers conducting reconnaissance or forensic analysts mapping digital footprints, this could consume hours or even days. The problem compounds when you consider the sheer proliferation of social platforms—from giants like Twitter and Instagram to niche communities, regional networks, and specialized forums. Manual enumeration doesn't scale.

Sherlock emerged to solve this scalability problem with a simple premise: maintain a centralized database of social network URL patterns and detection signatures, then automate the checking process. Named after the famous detective (because it hunts down digital evidence), the tool has become a staple in the OSINT community with over 83,000 GitHub stars. It transforms what was once a multi-day investigation into a single command-line execution, democratizing username reconnaissance for security researchers, law enforcement, and anyone needing to map online identities across platforms.

Technical Insight

At its core, Sherlock operates on a surprisingly elegant architecture: a JSON database (resources/data.json) that defines detection patterns for each social network, paired with a concurrent HTTP request engine that queries these platforms in parallel. Each entry in the JSON specifies the URL pattern (with {} as the username placeholder), the HTTP method, and crucially, the detection mechanism for determining whether an account exists.

The detection methods are what make Sherlock interesting from an engineering perspective. Unlike naive implementations that only check HTTP status codes (200 = exists, 404 = doesn't exist), Sherlock supports three detection strategies, selected by the errorType field: status_code for simple cases, message to look for a known error string in the response body, and response_url to detect a redirect away from the profile URL. Here's an example configuration for a hypothetical platform:

{
  "ExampleSocial": {
    "errorType": "message",
    "errorMsg": "The page you're looking for isn't here",
    "url": "https://examplesocial.com/{}",
    "urlMain": "https://examplesocial.com/",
    "username_claimed": "blue",
    "username_unclaimed": "noonewouldeverusethisname7"
  }
}

The errorMsg field contains a string (or a list of strings) to search for in the HTTP response body; the match is a plain substring check, not a regex. If this text appears, Sherlock concludes the account doesn't exist. The username_claimed and username_unclaimed fields are test cases the maintainers use to validate that detection logic still works as platforms evolve. This becomes critical when sites redesign their error pages or change authentication flows.
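To make the role of the two test usernames concrete, here is a minimal sketch of how an entry could be checked against them. It is not Sherlock's own test code: the function names account_exists and validate_entry, and the injected fetch callable (which stands in for a real HTTP request so the sketch runs offline), are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# A fetcher takes a URL and returns (status_code, body). Injecting it keeps
# the sketch testable without network access. (Hypothetical helper type.)
Fetcher = Callable[[str], Tuple[int, str]]


def account_exists(entry: dict, status_code: int, body: str) -> Optional[bool]:
    """Apply an entry's detection rule to a single response."""
    if entry["errorType"] == "status_code":
        return status_code == 200
    if entry["errorType"] == "message":
        # The account exists when the "not found" text is absent.
        return entry["errorMsg"] not in body
    return None  # other detection types are out of scope for this sketch


def validate_entry(entry: dict, fetch: Fetcher) -> bool:
    """True if the claimed name is detected and the unclaimed name is not."""
    def probe(username: str) -> Optional[bool]:
        status_code, body = fetch(entry["url"].format(username))
        return account_exists(entry, status_code, body)

    return (probe(entry["username_claimed"]) is True
            and probe(entry["username_unclaimed"]) is False)
```

If a site rewords its error page, the unclaimed username starts looking claimed and validate_entry returns False, which is the signal that the entry needs updating.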

The request engine builds on Python's requests library. Upstream Sherlock dispatches requests through the requests-futures package, whose FuturesSession runs them on a concurrent.futures.ThreadPoolExecutor. Requests to the platforms are issued in parallel, each with a configurable timeout (default 60 seconds per request). Here's a simplified version of the core checking logic, written directly against ThreadPoolExecutor:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_username(username, site_data, timeout=60):
    """Return True if the account appears to exist, False if not, None if unknown."""
    url = site_data['url'].format(username)

    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        return None  # Network error or timeout: uncertain result

    if site_data['errorType'] == 'status_code':
        # Simple status code check
        return response.status_code == 200

    if site_data['errorType'] == 'message':
        # The account exists if the "not found" message is absent
        error_msg = site_data.get('errorMsg')
        if not error_msg:
            return None  # No signature to compare against
        return error_msg not in response.text

    return None  # Detection type not handled in this simplified version

def hunt_username(username, site_database, max_workers=20, timeout=60):
    results = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one check per site and map each future back to its site name
        future_to_site = {
            executor.submit(check_username, username, data, timeout): site_name
            for site_name, data in site_database.items()
        }

        # Collect results in completion order, not submission order
        for future in as_completed(future_to_site):
            site = future_to_site[future]
            results[site] = future.result()

    return results

This concurrent approach is why Sherlock can query hundreds of sites quickly—it's not waiting for each request to complete serially. The trade-off is network bandwidth and the risk of triggering rate limits, which we'll explore in the gotchas.

Sherlock also supports a simple pattern-matching feature through the {?} marker. As the project's usage text describes it, the marker goes in the username you pass on the command line rather than in the JSON database: Sherlock replaces {?} with each of _, -, and . in turn. For instance, john{?}doe expands to john_doe, john-doe, and john.doe, widening the search without manual intervention.
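A minimal sketch of this kind of expansion follows. It assumes, per Sherlock's usage text, that the {?} marker appears in the supplied username and is replaced by each of _, -, and . in turn. The function name expand_username and the SEPARATORS constant are hypothetical, not names taken from the project.

```python
from typing import List

# Assumed substitution characters; treat this list as an assumption of the sketch.
SEPARATORS = ["_", "-", "."]


def expand_username(username: str, marker: str = "{?}") -> List[str]:
    """Expand a username containing the marker into one variant per separator."""
    if marker not in username:
        return [username]  # nothing to expand
    return [username.replace(marker, sep) for sep in SEPARATORS]


print(expand_username("john{?}doe"))
```

Each variant would then be checked as if it were a separate username, so one marker triples the number of requests.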

The output flexibility deserves mention too. Results can be exported to TXT, CSV, XLSX, or JSON formats, making integration into larger investigation workflows straightforward. For red team operations, the --proxy and --tor flags route traffic through intermediaries, providing operational security when you don't want requests originating from your actual IP address. This thoughtfulness about real-world usage contexts—not just the happy path—demonstrates maturity in the tool's design.
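As an illustration of how results might feed a larger workflow, here is a small sketch that writes a results mapping (site name to True, False, or None, as returned by the simplified hunt_username above) to CSV. This is not Sherlock's exporter: the function name write_results_csv, the column names, and the status labels are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, Optional, TextIO


def write_results_csv(username: str,
                      results: Dict[str, Optional[bool]],
                      out: TextIO) -> None:
    """Write one row per site, sorted by site name for stable output."""
    writer = csv.writer(out)
    writer.writerow(["username", "site", "status"])
    for site in sorted(results):
        found = results[site]
        if found is None:
            status = "unknown"  # request failed or detection was inconclusive
        else:
            status = "claimed" if found else "available"
        writer.writerow([username, site, status])


# Usage: write to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_results_csv("blue", {"ExampleSocial": True, "OtherSite": None}, buffer)
print(buffer.getvalue())
```

Keeping "unknown" as its own status matters downstream: a timed-out request is not evidence that the username is free.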

Gotcha

The Achilles' heel of Sherlock is database maintenance. Social networks constantly evolve—they redesign pages, change error messages, implement new bot detection, or shut down entirely. The JSON database relies on community contributions to stay current, which means there's inherent lag between when a site changes and when Sherlock's detection logic updates. You'll encounter false positives (claiming an account exists when it doesn't) and false negatives (missing real accounts) more often than you'd like. Before trusting Sherlock results for critical investigations, manually verify findings on high-priority platforms.

Rate limiting is the second major challenge. Making concurrent requests to 400+ platforms looks like bot behavior because it is bot behavior. Many sites will block your IP, return CAPTCHAs, or throttle responses. The tool includes a --timeout flag and proxy support to mitigate this, but aggressive scanning can still trigger defensive measures. Some platforms actively fingerprint and block known OSINT tools like Sherlock. If you're conducting sensitive investigations, expect to need residential proxies or significant delays between requests, which defeats the speed advantage. Additionally, third-party package distributions (mentioned in the repo docs) are broken on some Linux distributions like Ubuntu 24.04 and ParrotOS, forcing you to install from source rather than using convenient package managers.
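One common mitigation for throttling is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not something Sherlock is documented to do: it treats HTTP 429 as the only rate-limit signal (many sites respond with CAPTCHAs or 403s instead), and the fetch and sleep parameters are injected so the logic can be exercised without real requests or real waiting.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(fetch: Callable[[str], int],
                       url: str,
                       max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Call fetch(url), retrying with exponential backoff while it returns 429.

    Returns the first non-429 status code, or None if every attempt was
    rate limited.
    """
    for attempt in range(max_attempts):
        status_code = fetch(url)
        if status_code != 429:
            return status_code
        if attempt < max_attempts - 1:
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

The cost is visible in the arithmetic: with the defaults, a site that keeps throttling adds 1 + 2 + 4 = 7 seconds before the check gives up, which is how backoff erodes the speed advantage across hundreds of sites.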

Verdict

Use if: You need broad-spectrum reconnaissance across hundreds of platforms quickly, you're conducting OSINT investigations where comprehensive coverage matters more than perfect accuracy, or you're doing initial mapping of a digital footprint where false positives can be manually filtered. Sherlock excels at giving you a wide net to cast before focusing on specific platforms. It's particularly valuable for penetration testing reconnaissance phases, competitive intelligence gathering, or forensic investigations where you need to discover which platforms a person uses.

Skip if: You need guaranteed accuracy for legal proceedings (false positives/negatives are too common), you're only investigating 5-10 major platforms where direct API access or manual checking would be more reliable, you require complete stealth (concurrent requests are inherently noisy and detectable), or you're working in time-sensitive scenarios where encountering rate limits or broken site configurations would derail your timeline. For high-stakes investigations, use Sherlock as a discovery tool but validate everything manually.