Auditing S3 Bucket Exposures: A Security Researcher's Download Tool

Hook

Every year, millions of records leak through misconfigured S3 buckets. When you find one during a security audit, how do you systematically document what's exposed without downloading terabytes of irrelevant data?

Context

Amazon S3 bucket misconfigurations remain one of the most common cloud security vulnerabilities. Despite AWS's warnings and tools, companies regularly expose sensitive data by setting buckets to public-read or granting access to 'any authenticated AWS user.' When security researchers, bug bounty hunters, or compliance auditors discover these exposures, they face a documentation challenge: they need to prove what was accessible, download representative samples for analysis, and create an audit trail—all without downloading everything or triggering excessive costs for the bucket owner.

The AWS CLI is optimized for legitimate data management, not security auditing. It assumes configured IAM credentials by default (anonymous access needs the --no-sign-request flag), doesn't preserve the raw XML responses that prove public accessibility, and offers only glob-style include/exclude filters rather than regular expressions for selective downloads. aws-s3-downloader fills this niche by operating directly against S3's REST API, treating bucket enumeration as a security assessment activity rather than a data synchronization task. It preserves complete XML listings as evidence, handles AWS's pagination transparently, and provides regex-based filtering to download only files matching specific patterns, which is critical when you need to prove PII exposure without downloading an entire data lake.

Technical Insight

[Figure: System architecture (auto-generated). Components: CLI Interface, S3 Downloader, S3 REST API, XML Parser, Pattern Filter, Download Queue, Local Storage. Labeled flows: bucket name, filters; GET /?max-keys=1000&marker=X; XML listing response; extract keys & IsTruncated; matching keys; NextMarker if truncated; GET /object-key; object data; raw XML.]

The tool's architecture bypasses boto3 and the AWS SDK entirely, making raw HTTP requests to S3's REST endpoints. This design choice is deliberate: it demonstrates what any attacker with public bucket access can do without specialized AWS tooling. The core operation starts by requesting the bucket's XML listing with a GET request to https://bucket-name.s3.amazonaws.com/?max-keys=1000. AWS returns an XML document containing up to 1000 object keys, plus an <IsTruncated> flag that says whether more objects exist. (S3 only includes a <NextMarker> element when the request specifies a delimiter; without one, the last key in the page serves as the next marker.)

The pagination logic is straightforward but essential for completeness. After parsing the initial XML response, the tool checks for <IsTruncated>true</IsTruncated> and extracts the last key from the listing. It then makes subsequent requests with the marker parameter set to that key, effectively asking S3 'give me the next 1000 objects after this one.' This continues until IsTruncated is false, ensuring you retrieve every object in the bucket regardless of size:

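import xml.etree.ElementTree as ET

import requests

S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def download_bucket(bucket_name, start_after=None, include_patterns=None, exclude_patterns=None):
    # save_xml_listing, should_download, and download_object are helpers defined elsewhere
    marker = start_after
    is_truncated = True

    while is_truncated:
        url = f"https://{bucket_name}.s3.amazonaws.com/"
        params = {"max-keys": 1000}
        if marker:
            params["marker"] = marker

        response = requests.get(url, params=params, timeout=30)
        xml_content = response.content

        # Save the raw XML as evidence, even when the request failed
        save_xml_listing(xml_content, marker)
        if response.status_code != 200:
            break  # an <Error> document has no listing to parse

        root = ET.fromstring(xml_content)
        contents = root.findall("s3:Contents", S3_NS)
        for content in contents:
            key = content.find("s3:Key", S3_NS).text
            if should_download(key, include_patterns, exclude_patterns):
                download_object(bucket_name, key)
            marker = key  # Last key seen becomes the next page's marker

        # Stop when S3 reports no more pages, or when a page comes back empty
        truncated_text = root.findtext("s3:IsTruncated", default="false", namespaces=S3_NS)
        is_truncated = bool(contents) and truncated_text == "true"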

The filtering system uses Python's re module to apply multiple include and exclude patterns. Include patterns are OR'd together (a file matching any include pattern passes), and exclude patterns act as a veto on top of that result (a file matching any exclude pattern is rejected, even if it also matched an include). This gives you precise control: --include '\.pdf$' --include '\.xlsx$' --exclude 'test/' downloads all PDFs and Excel files except those in test directories.
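The include/exclude rule described above can be sketched as a small predicate. This is a minimal illustration, not the tool's actual source: the function body, the use of re.search (rather than re.match), and the "no include patterns means include everything" default are all assumptions.

```python
import re


def should_download(key, include_patterns=None, exclude_patterns=None):
    """Return True if `key` passes the include/exclude filters.

    Assumed semantics: any exclude match rejects the key; with no include
    patterns, everything that is not excluded passes.
    """
    includes = include_patterns or []
    excludes = exclude_patterns or []

    # Exclusions win: a single matching exclude pattern rejects the key.
    if any(re.search(pattern, key) for pattern in excludes):
        return False

    # With no include patterns, everything not excluded passes.
    if not includes:
        return True

    # Otherwise the key must match at least one include pattern.
    return any(re.search(pattern, key) for pattern in includes)
```

Checking excludes first keeps the veto explicit, so the order in which patterns were given on the command line cannot change the outcome.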

One architectural detail worth noting is how the tool handles authentication. For buckets configured with 'any authenticated AWS user' permissions, you can pass AWS credentials that the tool uses to sign requests. This doesn't require the AWS SDK: the tool uses the requests-aws4auth library to compute the SigV4 signature and add the proper Authorization header. This distinction matters for security auditors, because it demonstrates that anyone with an AWS account (even a free tier one) can access improperly secured corporate data.

The tool also preserves the complete directory structure locally, creating nested folders that mirror the bucket's key prefixes. Since S3 doesn't have real directories—just keys with slashes—this reconstruction helps maintain context. A key like users/2023/january/data.json creates the full users/2023/january/ path before downloading the file. This preservation is crucial when documenting exposure patterns: you can show that an entire user directory tree was publicly accessible, not just individual files.
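A minimal sketch of that key-to-path mapping follows. The function name local_path_for_key and the decision to drop "." and ".." segments are assumptions made for illustration; the second one matters in an audit setting because object keys are attacker-controlled strings.

```python
import os


def local_path_for_key(output_dir, key):
    """Map an S3 key to a path under output_dir, creating parent folders."""
    base = os.path.abspath(output_dir)
    # Drop empty, "." and ".." segments so a hostile key cannot escape base.
    parts = [part for part in key.split("/") if part not in ("", ".", "..")]
    if not parts:
        raise ValueError(f"key {key!r} has no usable path segments")
    path = os.path.join(base, *parts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path
```

For example, local_path_for_key("evidence", "users/2023/january/data.json") creates evidence/users/2023/january/ and returns the full path of data.json inside it.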

Error handling reveals the tool's security-research orientation. When download attempts fail (403 Forbidden, 404 Not Found), the tool saves the error XML response instead of skipping the file silently. This behavior creates a complete audit record showing which objects were enumerable but not downloadable—valuable evidence when demonstrating that bucket listing was public even if some individual objects had object-level ACLs.
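One way to keep failed downloads in the audit record is to write the error body next to where the object would have gone. The sketch below is hypothetical: save_result, the ".<status>.error.xml" suffix, and the fallback file name are illustrative choices, not the tool's documented behavior.

```python
import os


def save_result(output_dir, key, status_code, body):
    """Write `body` for `key` and return the path that was written.

    Non-200 responses are stored under a hypothetical
    "<key>.<status>.error.xml" name so they are not mistaken for object data.
    """
    # Sanitize the key the same way a normal download would.
    parts = [part for part in key.split("/") if part not in ("", ".", "..")]
    path = os.path.join(os.path.abspath(output_dir), *(parts or ["_empty_key"]))
    if status_code != 200:
        path += f".{status_code}.error.xml"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(body)
    return path
```

Putting the status code in the file name makes it easy to count enumerable-but-forbidden objects afterwards with a simple file search.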

Gotcha

The most significant limitation is performance. Downloads are sequential, so a bucket with 10,000 small files pays a full request round trip for every object and can take far longer than a parallel client such as the AWS CLI's sync command. There's no connection pooling, no parallel downloads, and no resume capability beyond manually specifying a start marker. If your audit requires downloading gigabytes of data, this tool will frustrate you with its single-threaded approach. The author clearly prioritized correctness and evidence collection over speed, which makes sense for security research but limits practical use cases.
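If you modify the source yourself, a thread pool is one common way to lift the sequential bottleneck. The following is a sketch of that idea, not part of the tool: download_all and its fetch callback are hypothetical names, and fetch stands in for whatever function downloads a single key.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(keys, fetch, max_workers=8):
    """Call fetch(key) for every key on a thread pool.

    Returns {key: result_or_exception} so failures stay in the audit record
    instead of aborting the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[key] = exc
    return results
```

Keep max_workers modest during an audit: more concurrency means more load and request cost on a bucket you do not own.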

The error handling also has gaps that become apparent in production use. The tool attempts to download every enumerated object regardless of size, so pointing it at a bucket with multi-gigabyte files will hammer your bandwidth and disk without warning. There's no size filtering, no dry-run mode to preview what you're about to download, and no graceful handling of network timeouts or rate limiting. AWS will occasionally return 503 SlowDown responses when you're making too many requests too quickly, and the tool doesn't implement exponential backoff. For serious security work, you'll want to wrap this in a shell script that handles retries and rate limiting externally, or modify the source to add these features yourself.
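The missing retry behavior can be approximated with a small wrapper. This is a sketch under stated assumptions: get_with_backoff is a hypothetical helper, fetch is any callable returning an object with a status_code attribute (for example a lambda around requests.get), and the delay schedule is an arbitrary choice.

```python
import random
import time


def get_with_backoff(fetch, url, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on HTTP 503 with exponential backoff.

    `sleep` is injectable so the schedule can be tested without waiting.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 503:
            return response
        if attempt < max_attempts - 1:
            # Delays of 0.5s, 1s, 2s, ... plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    return response  # still 503 after all attempts; the caller decides
```

With the requests library, a call might look like get_with_backoff(lambda u: requests.get(u, timeout=30), url). The timeout also addresses the unbounded waits on stalled connections mentioned above.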

Verdict

Use if: You're conducting security research or compliance audits on potentially misconfigured S3 buckets and need to document exactly what was publicly accessible, complete with XML evidence. The filtering capabilities make it well suited to targeted downloads: proving that PII exists without downloading an entire data warehouse. It's also valuable for bug bounty work where you need to demonstrate exposure without exceeding scope or causing unnecessary bandwidth costs.

Skip if: You're doing legitimate data management with proper credentials (use the AWS CLI instead), need high-performance bulk downloads (tools such as rclone or s3cmd will typically be much faster), or require production-grade reliability with retry logic and parallel downloads. This is a specialized tool for a specific use case: security assessment and exposure documentation, not general-purpose S3 data transfer.