Cartography: Mapping Multi-Cloud Attack Surfaces with Neo4j Graph Queries

Hook

When Capital One's 2019 breach exposed 100 million records, investigators needed 47 days to understand the blast radius. The hardest part wasn't finding the vulnerability—it was mapping the relationships between a misconfigured WAF, an IAM role, 700+ S3 buckets, and downstream access patterns across federated identity systems.

Context

Modern infrastructure exists as a tangled web of implicit relationships. An EC2 instance doesn't exist in isolation—it trusts an IAM role, which assumes another role in a different account, which has permissions to access an S3 bucket containing secrets that authenticate to a third-party SaaS platform. These relationship chains are invisible in traditional asset inventory tools that treat resources as isolated rows in a database.

Cartography emerged from Lyft's security engineering team in 2018 to solve exactly this problem. They needed to answer questions like "show me every resource accessible by this compromised OAuth token" or "which EC2 instances can reach the internet and also access our production database?" These queries require traversing multi-hop relationship paths across different cloud services, identity providers, and networking boundaries. Relational databases struggle with this: every additional hop is another join, and queries beyond three or four hops become hard to write and slow to run. Graph databases like Neo4j are purpose-built for these traversals, because following a relationship is a direct lookup from one node to its neighbors rather than a join over whole tables. Lyft open-sourced the tool in 2019, it joined the CNCF as a sandbox project in 2024, and it now supports 30+ platforms including the major clouds, Kubernetes, Okta, CrowdStrike, and even AI platforms like OpenAI.

Technical Insight

[Figure: System architecture (auto-generated). Components: CLI Entry Point, Configuration Loader, Sync Engine, AWS Collector (boto3), GCP Collector, Azure Collector, Data Transformer (normalize to graph model), Neo4j Graph DB, Stale Data Cleanup. Edge labels: raw API responses, Cypher MERGE queries, remove old nodes.]

Cartography's architecture centers on a modular sync engine with a three-stage pipeline: collect, transform, load. Each module (AWS, GCP, Azure, etc.) implements collectors that query external APIs using platform-specific SDKs. The AWS module, for example, uses boto3 to enumerate EC2 instances, IAM roles, S3 buckets, and VPC configurations across all enabled regions. These raw API responses flow into transformation logic that normalizes data into a unified node/relationship model, then writes to Neo4j using Cypher queries.

Here's what a typical Cartography sync looks like for AWS EC2 instances:

# Simplified collector pattern from cartography.intel.aws.ec2
def sync_ec2_instances(neo4j_session, boto3_session, regions, current_aws_account_id, update_tag):
    for region in regions:
        # Stage 1: Collect from the AWS API (paginate so large regions are not truncated)
        ec2_client = boto3_session.client('ec2', region_name=region)
        paginator = ec2_client.get_paginator('describe_instances')
        reservations = []
        for page in paginator.paginate():
            reservations.extend(page['Reservations'])

        # Stage 2: Transform raw reservations into flat dicts for the graph model
        # (transform_ec2_instances is elided in this simplified example)
        instances = transform_ec2_instances(reservations)

        # Stage 3: Load to Neo4j
        load_ec2_instances(neo4j_session, instances, region,
                           current_aws_account_id, update_tag)

def load_ec2_instances(neo4j_session, instances, region, account_id, update_tag):
    # Upsert nodes using MERGE (create if not exists, update if exists).
    # account_id is accepted but not used in this simplified version.
    neo4j_session.run(
        """
        UNWIND $instances AS instance
        MERGE (i:EC2Instance{id: instance.instanceid})
        ON CREATE SET i.firstseen = timestamp()
        SET i.instancetype = instance.instancetype,
            i.launchtime = instance.launchtime,
            i.publicipaddress = instance.publicip,
            i.privateipaddress = instance.privateip,
            i.region = $region,
            i.lastupdated = $update_tag
        WITH i, instance

        // Link to the IAM role (simplified: assumes the transform step already
        // resolved the instance profile to its role ARN)
        MATCH (role:AWSRole{arn: instance.iaminstanceprofile})
        MERGE (i)-[r:STS_ASSUMEROLE_ALLOW]->(role)
        ON CREATE SET r.firstseen = timestamp()
        SET r.lastupdated = $update_tag
        """,
        instances=instances,
        region=region,
        update_tag=update_tag
    )

The update_tag timestamp is crucial to Cartography's cleanup strategy. After syncing all resources, a cleanup job runs to delete any nodes with lastupdated older than the current sync timestamp—these represent resources that no longer exist in the source platform. This approach handles eventual consistency gracefully without maintaining separate delete queues.
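The mark-and-sweep idea behind update_tag can be sketched without a database. What follows is a minimal in-memory illustration, not Cartography's cleanup code: `sync` and `cleanup` are hypothetical helpers, a plain dict stands in for Neo4j, and the instance IDs and timestamps are made up.

```python
def sync(graph, resource_ids, update_tag):
    """Upsert every resource seen in this run and stamp it with update_tag."""
    for rid in resource_ids:
        # setdefault mimics MERGE ... ON CREATE SET firstseen
        node = graph.setdefault(rid, {"firstseen": update_tag})
        node["lastupdated"] = update_tag


def cleanup(graph, update_tag):
    """Delete nodes the current run did not touch; return their ids."""
    stale = [rid for rid, node in graph.items() if node["lastupdated"] != update_tag]
    for rid in stale:
        del graph[rid]
    return stale


graph = {}
sync(graph, ["i-aaa", "i-bbb"], update_tag=1700000000)

# Second run: i-bbb has been terminated and i-ccc is new.
sync(graph, ["i-aaa", "i-ccc"], update_tag=1700003600)
removed = cleanup(graph, update_tag=1700003600)
```

Against Neo4j, the sweep would presumably be a single Cypher statement along the lines of `MATCH (n:EC2Instance) WHERE n.lastupdated <> $update_tag DETACH DELETE n`, scoped to the account being synced so that one account's run does not delete another's nodes.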

The real power emerges when querying cross-platform relationships. Want to find every EC2 instance accessible by a compromised Okta user?

// Find attack path from Okta user to EC2 instances
MATCH (user:OktaUser{email: 'compromised@company.com'})
MATCH (user)-[:MEMBER_OF_OKTA_GROUP]->(group:OktaGroup)
MATCH (group)<-[:ALLOWED_BY]-(role:AWSRole)
MATCH (role)<-[:STS_ASSUMEROLE_ALLOW]-(instance:EC2Instance)
WHERE instance.publicipaddress IS NOT NULL
RETURN instance.id, instance.publicipaddress, role.name

This single query traverses identity federation (Okta to AWS), IAM role assumption, and network exposure—a question that would require joining data from three different systems using traditional tooling.

Cartography also includes a rules framework for security compliance checks. A rule is essentially a Cypher query that returns every resource violating a desired state; remediation is a separate step that you would wire up around the findings:

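# Illustrative example: detect S3 buckets accessible from the internet
# (class shape and property names are a sketch; check the schema docs for the real ones)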
class S3BucketPublicAccessRule:
    query = """
    MATCH (bucket:S3Bucket)
    WHERE bucket.policy_allows_public = true
       OR bucket.acl_allows_public = true
    RETURN bucket.arn AS resource,
           'S3 bucket allows public access' AS finding
    """
    
    # Optional: Auto-remediation logic
    def remediate(self, bucket_arn):
        # Call AWS API to update bucket policy
        pass

The module system is highly extensible. Organizations frequently write custom modules for internal platforms. Each module just needs to implement the collector/transformer interface and define its graph schema. The Neo4j schema itself is documented through Python dataclasses that generate CREATE/MERGE statements, providing type safety and self-documentation.
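To make the dataclass-to-Cypher idea concrete, here is a minimal sketch of how a schema object could drive query generation. `NodeSchema` and `build_merge_query` are hypothetical names invented for this illustration and are not Cartography's actual classes; the generated statement simply mirrors the hand-written MERGE shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSchema:
    """Hypothetical schema description for one node type."""
    label: str          # Neo4j node label, e.g. "EC2Instance"
    id_property: str    # key in each input dict used as the MERGE identity
    properties: tuple   # other keys copied from each input dict onto the node


def build_merge_query(schema: NodeSchema) -> str:
    """Generate an UNWIND/MERGE upsert statement from a schema."""
    set_lines = [f"i.{p} = item.{p}" for p in schema.properties]
    set_lines.append("i.lastupdated = $update_tag")
    set_clause = ",\n    ".join(set_lines)
    return (
        "UNWIND $items AS item\n"
        f"MERGE (i:{schema.label}{{id: item.{schema.id_property}}})\n"
        "ON CREATE SET i.firstseen = timestamp()\n"
        f"SET {set_clause}"
    )


ec2_schema = NodeSchema(
    label="EC2Instance",
    id_property="instanceid",
    properties=("instancetype", "launchtime"),
)
query = build_merge_query(ec2_schema)
print(query)
```

The appeal of this design is that adding a property means editing one tuple rather than a hand-maintained query string, and the schema object doubles as documentation of what the node carries.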

One sophisticated pattern Cartography employs is relationship inference. For AWS VPC peering, the tool doesn't just record that Peering Connection X exists—it creates bidirectional CONNECTED_TO relationships between VPCs, allowing network reachability queries without understanding peering topology. These inferred relationships make the graph more intuitive to query for security analysts who aren't infrastructure experts.
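The inference step can be sketched in a few lines. This is an illustration under stated assumptions rather than Cartography's implementation: `infer_connected_to` is a hypothetical helper, and the `requester_vpc` / `accepter_vpc` keys are made-up field names for a peering record.

```python
def infer_connected_to(peerings):
    """Expand each peering record into two directed CONNECTED_TO edges."""
    edges = set()
    for p in peerings:
        a, b = p["requester_vpc"], p["accepter_vpc"]
        edges.add((a, b))
        edges.add((b, a))
    return edges


peerings = [
    {"id": "pcx-1", "requester_vpc": "vpc-a", "accepter_vpc": "vpc-b"},
    {"id": "pcx-2", "requester_vpc": "vpc-c", "accepter_vpc": "vpc-a"},
]
edges = infer_connected_to(peerings)

# An analyst can now ask "what is vpc-a connected to?" without knowing
# which side requested or accepted each peering.
neighbors_of_a = sorted(dst for src, dst in edges if src == "vpc-a")
```

One caution when querying edges like these: AWS VPC peering is not transitive, so vpc-b and vpc-c above cannot reach each other through vpc-a. Multi-hop traversals over inferred peering edges overstate real network reachability.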

Gotcha

The biggest operational challenge is Neo4j itself. Graph databases aren't part of most infrastructure teams' standard toolkit. You need to size Neo4j appropriately (expect 10-50GB for medium environments with 10K+ resources), understand Cypher query optimization (graph traversals can explode without proper LIMIT clauses or relationship filtering), and monitor for long-running queries that lock the database during sync operations. The community edition works for proof-of-concepts, but production deployments often need Neo4j Enterprise for clustering and better performance, which adds licensing costs.

The periodic sync model means your data is always slightly stale. Cartography typically runs hourly or daily syncs, so recent changes won't appear immediately. This is fine for security posture analysis and compliance reporting, but of limited use for real-time incident response. If someone spins up a misconfigured EC2 instance at 2pm and you sync at midnight, you have a 10-hour blind spot. Some teams run more frequent syncs (every 15 minutes), but this increases API costs and database load. There's also no incremental sync capability: each run pulls full inventories, which becomes expensive at scale with AWS environments containing 50K+ resources. Finally, because cleanup deletes anything the latest sync did not see, the graph only holds current state. There is no built-in history or retention policy, so if you need point-in-time snapshots for investigations you have to export or archive them yourself.
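The blind-spot arithmetic is easy to check, and it gets slightly worse once the duration of the sync itself is counted. The sketch below uses a hypothetical `blind_spot` helper; the dates and the 45-minute run time are arbitrary values chosen for illustration.

```python
from datetime import datetime, timedelta


def blind_spot(change_time, next_sync_start, sync_duration=timedelta(0)):
    """Time between a change and the moment the graph reflects it."""
    return (next_sync_start + sync_duration) - change_time


created = datetime(2024, 1, 1, 14, 0)        # instance launched at 2pm
midnight_sync = datetime(2024, 1, 2, 0, 0)   # next sync starts at midnight

gap = blind_spot(created, midnight_sync)
gap_with_run = blind_spot(created, midnight_sync, sync_duration=timedelta(minutes=45))
```

Here `gap` is the 10 hours from the example above, and `gap_with_run` adds the time the sync takes to finish before the new instance becomes queryable.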

Verdict

Use if: You're managing multi-cloud infrastructure (AWS + GCP + Azure), need to answer complex relationship questions for security investigations ("show me every internet-accessible resource this leaked IAM key can reach"), perform blast radius analysis for vulnerabilities, audit federated identity access paths, or enforce compliance across heterogeneous platforms. It's particularly powerful for organizations with 5+ AWS accounts, multiple cloud providers, or complex identity federation through Okta/Entra ID where native tooling can't bridge the gaps.

Skip if: You're single-cloud and AWS Config or Azure Resource Graph already answers your questions, need real-time security monitoring (use cloud SIEM instead), lack resources to operate Neo4j infrastructure, have simple asset inventory needs (Steampipe with SQL is easier), or your team isn't comfortable learning Cypher queries. Also skip if you're a small startup with <100 cloud resources: the operational overhead exceeds the value until you reach meaningful infrastructure complexity.
