CloudScraper: Finding Cloud Storage Leaks Through Regex-Powered Web Spidering

Hook

Most web scrapers miss the cloud storage URLs buried in minified JavaScript because they're designed to parse clean HTML. CloudScraper finds them by deliberately ignoring best practices.

Context

Cloud storage misconfigurations remain one of the most common security weaknesses in modern web applications. Capital One's 2019 breach exposed data on roughly 100 million people after an attacker exploited a misconfigured web application firewall to reach files stored in S3. Tesla, Uber, and many others have suffered cloud-related exposures of their own. Yet many reconnaissance tools focus on AWS S3 alone, leaving Azure Blob Storage and DigitalOcean Spaces largely ignored.

Traditional web scraping approaches use sophisticated HTML parsers like BeautifulSoup or full frameworks like Scrapy. These tools excel at extracting structured data from well-formed markup, following href attributes and parsing DOM trees. But modern web applications frequently embed cloud storage URLs in unconventional locations: inline JavaScript, dynamically constructed strings, configuration objects, or even HTML comments. A parser looking for anchor tags will sail right past a bucket URL constructed with template literals inside a minified React bundle. CloudScraper takes a different approach: treat the entire page as text and match patterns with regex. It's inelegant, occasionally messy, and surprisingly effective for security reconnaissance where recall matters more than precision.

Technical Insight

[Figure: system architecture (auto-generated). URL input (single or list) -> recursive crawler (depth-limited DFS) -> HTTP fetcher (raw response) -> pattern matcher (three cloud regexes: AWS, Azure, DigitalOcean) -> cloud resources (S3/Azure/DO). A link extractor queues newly discovered URLs back into the crawler, and seed URLs are processed in parallel.]

CloudScraper's architecture reveals an intentional trade-off between sophistication and coverage. The core logic fetches raw HTML content and applies regex patterns across the entire response body without attempting to parse structure. Here's the essential pattern matching from the source:

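import re

# Patterns as quoted from the source: a run of hostname characters
# followed by each provider's storage domain.
aws_regex = r'[a-zA-Z0-9-\.\-\_]+\.s3\.amazonaws\.com'
azure_regex = r'[a-zA-Z0-9-\.\-\_]+\.blob\.core\.windows\.net'
digitalocean_regex = r'[a-zA-Z0-9-\.\-\_]+\.digitaloceanspaces\.com'

# response is the fetched HTTP response; the regexes run over its raw body text.
aws_buckets = re.findall(aws_regex, response.text)
azure_blobs = re.findall(azure_regex, response.text)
digitalocean_spaces = re.findall(digitalocean_regex, response.text)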

This approach catches resources in contexts that would confuse traditional parsers. Consider a JavaScript configuration object embedded mid-page:

const config = {apiUrl: 'https://api.example.com', assets: 'https://prod-assets.s3.amazonaws.com', fallback: 'backup-bucket.digitaloceanspaces.com/uploads'};

An HTML parser that only follows href attributes never looks inside script bodies, and URLs assembled at runtime would require executing the JavaScript to recover. CloudScraper just matches the pattern. The regex doesn't care whether the string appears in valid HTML, malformed markup, comments, or plain text responses.
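To make this concrete, here is a minimal sketch that runs the three quoted patterns over the configuration snippet above. The wrapping `<script>` tag, the `page_text` variable, and the `find_cloud_hosts` helper are illustrative assumptions, not part of the tool.

```python
import re

# Patterns as quoted in the article (character class kept verbatim).
AWS_REGEX = r'[a-zA-Z0-9-\.\-\_]+\.s3\.amazonaws\.com'
AZURE_REGEX = r'[a-zA-Z0-9-\.\-\_]+\.blob\.core\.windows\.net'
DO_REGEX = r'[a-zA-Z0-9-\.\-\_]+\.digitaloceanspaces\.com'

# Hypothetical page body: the config object from the article inside a script tag.
page_text = (
    "<script>const config = {apiUrl: 'https://api.example.com', "
    "assets: 'https://prod-assets.s3.amazonaws.com', "
    "fallback: 'backup-bucket.digitaloceanspaces.com/uploads'};</script>"
)


def find_cloud_hosts(text):
    """Return every provider hostname found anywhere in the raw text."""
    return {
        "aws": re.findall(AWS_REGEX, text),
        "azure": re.findall(AZURE_REGEX, text),
        "digitalocean": re.findall(DO_REGEX, text),
    }


hits = find_cloud_hosts(page_text)
print(hits)
```

Note that the scheme (`https://`) and the path (`/uploads`) fall outside the character class, so only the hostname is captured.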

The recursive crawling mechanism implements a depth-first search with a configurable depth limit. Starting from a seed URL, CloudScraper extracts all links using another permissive regex pattern and follows each same-domain link until the specified depth is reached. Python's multiprocessing module handles multiple target domains in parallel. The simplified sketch below shows the recursion only, not the parallel dispatch:

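import requests

def spider(url, depth, maxdepth, target):
    # Simplified illustration; extract_links and is_same_domain are assumed helpers.
    if depth > maxdepth:
        return
    # Fetch the raw response body; no HTML parsing is attempted
    page_content = requests.get(url, timeout=10).text
    links = extract_links(page_content)
    for link in links:
        # Stay on the target domain, then recurse one level deeper
        if is_same_domain(link, target):
            spider(link, depth + 1, maxdepth, target)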

The tool maintains a visited URL set to prevent infinite loops, though the implementation is local to each process rather than shared across parallel workers. This means multiple processes might redundantly crawl the same page when processing related targets, but it avoids the synchronization overhead of shared state.
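The interaction between the depth limit and a per-crawl visited set can be sketched without any network access. Everything here is an assumption for illustration: the `SITE` dictionary stands in for real pages, and `crawl` is not the tool's function. Because each call creates its own `visited` set, two workers crawling related targets would not share it, which is the redundancy described above.

```python
from urllib.parse import urlparse

# Hypothetical in-memory "site": page URL -> links found on that page.
SITE = {
    "https://a.test/": [
        "https://a.test/about",
        "https://a.test/js/app.js",
        "https://other.test/",
    ],
    "https://a.test/about": ["https://a.test/", "https://a.test/team"],
    "https://a.test/js/app.js": [],
    "https://a.test/team": ["https://a.test/about"],
}


def crawl(url, target_host, maxdepth, depth=0, visited=None):
    """Depth-limited DFS that skips pages already seen in this crawl."""
    if visited is None:
        visited = set()  # local to this crawl, not shared across workers
    if depth > maxdepth or url in visited:
        return visited
    visited.add(url)
    for link in SITE.get(url, []):
        # Only follow links on the target host.
        if urlparse(link).hostname == target_host:
            crawl(link, target_host, maxdepth, depth + 1, visited)
    return visited


shallow = crawl("https://a.test/", "a.test", maxdepth=1)
deep = crawl("https://a.test/", "a.test", maxdepth=2)
print(sorted(shallow))
print(sorted(deep))
```

The cycle between `/about` and `/team` terminates because of the visited set, and `/team` only appears once the depth limit is raised to 2.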

CloudScraper's output strategy favors immediate results over structured reports. Discovered resources print to stdout in real time, so security researchers can pipe results directly into other tools or monitor progress during long-running scans. This Unix-philosophy approach makes it composable:

python cloudscraper.py -u target.com -d 3 | grep 's3.amazonaws' | tee s3-findings.txt

The deliberate simplicity means you can wrap CloudScraper with your own validation logic, integrate it into automation pipelines, or combine it with tools like aws-cli to immediately test bucket permissions. The tool doesn't make assumptions about what you'll do with discovered resources. It just finds them.
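One way to wrap the output is a small triage step that deduplicates findings and groups them by provider. This is a sketch under assumptions: the exact format of CloudScraper's output lines is not specified here, so the code simply pulls hostname-like tokens out of each line, and `triage` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Storage domain suffix for each provider the article covers.
PROVIDER_SUFFIXES = {
    "aws": ".s3.amazonaws.com",
    "azure": ".blob.core.windows.net",
    "digitalocean": ".digitaloceanspaces.com",
}
# Hostname-like tokens; any surrounding decoration on the line is ignored.
HOST_RE = re.compile(r"[A-Za-z0-9._-]+")


def triage(lines):
    """Group unique storage hostnames by provider from arbitrary text lines."""
    found = defaultdict(set)
    for line in lines:
        for token in HOST_RE.findall(line):
            host = token.lower().strip(".")
            for provider, suffix in PROVIDER_SUFFIXES.items():
                if host.endswith(suffix):
                    found[provider].add(host)
    return {provider: sorted(hosts) for provider, hosts in found.items()}
```

In a pipeline, `triage(sys.stdin)` would consume the scanner's output line by line, since any iterable of strings works as input.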

One subtle architectural choice involves how CloudScraper handles URL normalization. Unlike enterprise scrapers that deduplicate URLs after normalizing paths and query parameters, CloudScraper treats each unique string as a distinct target. This creates redundant requests but prevents the tool from missing resources served under slightly different URL variations. For reconnaissance, the performance cost of extra requests is acceptable compared to the risk of missing a misconfigured bucket accessible only through a specific parameter combination.
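For contrast, this is roughly what the normalization step that CloudScraper skips might look like. The `normalize` function and the example URLs are assumptions for illustration; the point is only that three distinct strings collapse into one crawl target once host case, trailing slashes, parameter order, and fragments are canonicalized.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url):
    """Canonicalize host case, trailing slash, query order, and fragment."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))


urls = [
    "https://Example.com/files/?b=2&a=1",
    "https://example.com/files?a=1&b=2",
    "https://example.com/files?a=1&b=2#top",
]

raw_unique = set(urls)                          # string-level uniqueness
normalized_unique = {normalize(u) for u in urls}  # what a normalizing crawler would see
print(len(raw_unique), len(normalized_unique))
```

A string-level crawler fetches all three; a normalizing crawler fetches one and accepts the risk that the variants would have returned different content.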

Gotcha

The regex-everything approach generates false positives in predictable ways. CloudScraper will happily report 'example-bucket.s3.amazonaws.com' even if it appears in documentation, error messages, or example code snippets. Minified JavaScript creates particularly messy results: because the character class accepts dots, hyphens, and underscores, a match can absorb adjacent identifiers and yield something like 's3MethodName.blob.core.windows.net'. You'll spend time filtering noise from results, especially when crawling developer documentation sites or technical blogs that discuss cloud storage.
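Some of this noise can be dropped with a plausibility check based on provider naming rules. The sketch below assumes the current S3 bucket rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; alphanumeric at both ends) and the Azure storage account rules (3 to 24 lowercase letters and digits). The placeholder denylist and the `plausible` helper are hypothetical, and a filter this strict may discard legacy bucket names.

```python
import re

S3_SUFFIX = ".s3.amazonaws.com"
AZURE_SUFFIX = ".blob.core.windows.net"

# S3: 3-63 chars, lowercase letters, digits, dots, hyphens; alphanumeric ends.
S3_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
# Azure storage account: 3-24 lowercase letters and digits.
AZURE_ACCOUNT = re.compile(r"^[a-z0-9]{3,24}$")
# Hypothetical denylist of names that usually come from docs and examples.
PLACEHOLDERS = {"example-bucket", "my-bucket", "bucket-name"}


def plausible(host):
    """Return False for hostnames that cannot be real or look like examples."""
    if host.endswith(S3_SUFFIX):
        name = host[: -len(S3_SUFFIX)]
        return bool(S3_NAME.match(name)) and name not in PLACEHOLDERS
    if host.endswith(AZURE_SUFFIX):
        name = host[: -len(AZURE_SUFFIX)]
        return bool(AZURE_ACCOUNT.match(name))
    return True  # no naming rule encoded for other providers in this sketch
```

With these rules, the camel-case match from the minified bundle is rejected because Azure account names cannot contain uppercase letters.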

More critically, CloudScraper only discovers resources; it doesn't validate them. The tool reports a potential S3 bucket URL, but it cannot tell you whether that bucket exists, whether it's publicly accessible, or whether it contains anything interesting. You'll need separate tooling to test permissions and enumerate contents. This makes CloudScraper the first step in a workflow, not a complete solution. For comprehensive cloud security assessment, you'd typically run CloudScraper for discovery, then pipe results into specialized tools like S3Scanner for validation and content enumeration.
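As a rough idea of what that follow-up step checks, the sketch below classifies the response to an unauthenticated request against a candidate S3 hostname. The mapping reflects commonly observed S3 behavior (404 with `NoSuchBucket` for a missing bucket, 403 for an existing private one, 200 with a `ListBucketResult` document for a publicly listable one, a redirect when the bucket lives in another region) and should be treated as an assumption rather than a complete rule set. `classify_s3_response` is a hypothetical helper, not part of CloudScraper or S3Scanner.

```python
def classify_s3_response(status_code, body=""):
    """Map an anonymous S3 response to a coarse verdict (heuristic only)."""
    if status_code == 404 and "NoSuchBucket" in body:
        return "missing"
    if status_code == 403:
        return "exists-private"
    if status_code == 200 and "<ListBucketResult" in body:
        return "exists-listable"
    if status_code in (301, 307):
        return "exists-other-region"
    return "unknown"
```

The function takes a status code and body rather than performing the request itself, so it can sit behind whichever HTTP client the rest of the pipeline already uses.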

The hardcoded provider list is another practical limitation. Adding support for Google Cloud Storage, Wasabi, Backblaze B2, or any custom cloud provider requires modifying the source code to add new regex patterns. There's no plugin system or configuration file for extending provider coverage. If your target uses Cloudflare R2 or another S3-compatible service with custom domains, CloudScraper won't find it unless you fork the repository and add detection patterns.
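A fork could make providers data-driven with a single pattern table, as sketched here. The table layout, the `scan` helper, and the Google Cloud Storage entry are assumptions for illustration; the GCS pattern covers both the virtual-hosted form (`BUCKET.storage.googleapis.com`) and the path form (`storage.googleapis.com/BUCKET`). The three existing patterns are written with a simplified but equivalent character class.

```python
import re

PROVIDER_PATTERNS = {
    "aws": r"[a-zA-Z0-9._-]+\.s3\.amazonaws\.com",
    "azure": r"[a-zA-Z0-9._-]+\.blob\.core\.windows\.net",
    "digitalocean": r"[a-zA-Z0-9._-]+\.digitaloceanspaces\.com",
    # Assumed addition, not present in the tool:
    "gcs": (
        r"[a-zA-Z0-9._-]+\.storage\.googleapis\.com"
        r"|storage\.googleapis\.com/[a-zA-Z0-9._-]+"
    ),
}
COMPILED = {name: re.compile(p) for name, p in PROVIDER_PATTERNS.items()}


def scan(text):
    """Return unique matches per provider, omitting providers with no hits."""
    results = {}
    for name, pattern in COMPILED.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            results[name] = matches
    return results
```

Adding another provider then means adding one dictionary entry rather than a new variable and a new `re.findall` call.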

Verdict

Use CloudScraper if you're conducting security reconnaissance or bug bounty hunting across multiple targets and need lightweight cloud storage enumeration without framework overhead. It excels at discovering resources in JavaScript-heavy applications where traditional parsers struggle, and the simple output format integrates cleanly into existing security workflows. The parallel processing makes it practical for scanning dozens of domains efficiently. Skip it if you need production-grade accuracy with low false positives, because the regex approach trades precision for recall. Also skip it if you require built-in validation of discovered resources or need cloud provider coverage beyond AWS, Azure, and DigitalOcean. For enterprise security assessments, pair CloudScraper with dedicated validation tools, or consider mature alternatives like cloud_enum that offer mutation-based discovery and broader provider support out of the box.
