CyberBattleSim: Training RL Agents to Navigate Network Graphs Like an APT

Hook

Most reinforcement learning setups assume memoryless policies that act on the current observation alone, but lateral movement fundamentally requires remembering which credentials you stole three hops ago, and that's where the interesting problems begin.

Context

Autonomous penetration testing sounds like science fiction until you realize it's just graph search with extra steps. The promise of RL-driven offensive security has tantalized researchers for years: train an agent to autonomously pivot through networks, chain credentials, and maximize ownership before defenders detect the breach. But most attempts either use toy environments that teach nothing about real networks, or they simulate actual exploits and immediately face the dual-use problem of essentially publishing an autonomous hacking framework.

CyberBattleSim, originally developed by Microsoft Research and forked here by jcabrale, threads this needle by modeling lateral movement at exactly the right abstraction level. It strips away packet-level simulation, timing dynamics, and real CVE mechanics, reducing the problem to its graph-theoretic essence: you have nodes (machines), edges (exploitable paths), and keys (credentials). Your agent must discover vulnerabilities, remember stolen credentials, and apply them strategically to expand network ownership while a probabilistic defender scans for compromised nodes. It's abstract enough to avoid weaponization concerns but concrete enough to capture the multi-step planning and memory challenges that make autonomous lateral movement genuinely hard for RL algorithms.

Technical Insight

The architecture wraps a Markov Decision Process in the OpenAI Gym interface, but the devil lives in the combinatorial action space. At each timestep, an agent observes a state vector containing owned nodes, discovered vulnerabilities, cached credentials, and the portion of the network topology discovered so far. Actions take the form of "exploit vulnerability V on node N using credential C", and that's where things explode. With 50 nodes, 10 vulnerabilities per node, and 20 cached credentials, you're selecting from 50 × 10 × 20 = 10,000 possible actions, most of which are invalid in any given state. Standard DQN and policy gradient methods, which score every action with a flat output layer, choke on this immediately.
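To make the blow-up concrete, here is a small arithmetic sketch. The function name and the two larger network sizes are illustrative assumptions; only the 50/10/20 figures come from the text above.

```python
def flat_action_count(num_nodes: int, vulns_per_node: int, num_credentials: int) -> int:
    """Number of (node, vulnerability, credential) triples in a flat action space."""
    return num_nodes * vulns_per_node * num_credentials

# The figures quoted above, followed by two larger hypothetical networks.
sizes = [50, 500, 5000]
counts = {n: flat_action_count(n, vulns_per_node=10, num_credentials=20) for n in sizes}
for n, count in counts.items():
    print(f"{n:>5} nodes -> {count:>9,} flat actions")
```

The count grows linearly in each factor, so adding nodes and credentials together quickly pushes a flat output layer past what a small network can score usefully.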

Here's what a minimal network definition looks like:

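from cyberbattle.simulation import model as m

# create_network expects a dict keyed by node ID; the IDs here are illustrative.
nodes = {
    "Portal": m.NodeInfo(
        services=[m.ListeningService("HTTPS")],
        value=100,
        vulnerabilities=dict(
            ScanPageContent=m.VulnerabilityInfo(
                description="Webpage vuln",
                type=m.VulnerabilityType.REMOTE,
                outcome=m.LeakedCredentials(
                    credentials=[m.CachedCredential(
                        node="WebServer",
                        port="SSH",
                        credential="admin:pass123"
                    )]
                ),
                rates=m.Rates(probingDetectionRate=0.1, exploitDetectionRate=0.5)
            )
        )
    ),
    # Target of the leaked credential: its SSH service accepts "admin:pass123".
    "WebServer": m.NodeInfo(
        services=[m.ListeningService("SSH", allowedCredentials=["admin:pass123"])],
        value=200
    ),
}

# No global (node-independent) vulnerabilities in this example.
vulnerability_library = dict([])

network = m.Environment(
    network=m.create_network(nodes),
    vulnerability_library=vulnerability_library,
    identifiers=m.infer_constants_from_nodes(list(nodes.items()), vulnerability_library)
)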

Notice the explicit modeling of detection rates and credential leakage. When an agent successfully exploits ScanPageContent, it doesn't get immediate access to the WebServer—it gets a cached credential that must be applied in a subsequent action. This two-step process (exploit to get credential, then use credential to access node) is precisely what breaks stateless RL policies.
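The two-step flow can be sketched in plain Python, independent of the library. `Credential` and `ToyAttacker` are hypothetical names invented for this illustration and are not part of the CyberBattleSim API; the point is only that ownership depends on something remembered from an earlier step.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Credential:
    node: str
    port: str
    secret: str

@dataclass
class ToyAttacker:
    """Hypothetical attacker state; not part of the CyberBattleSim API."""
    owned: set = field(default_factory=set)
    cached: list = field(default_factory=list)

    def exploit(self, leaked: list) -> None:
        # Step 1: a successful exploit only adds credentials to memory.
        self.cached.extend(leaked)

    def connect(self, node: str, port: str) -> bool:
        # Step 2: ownership requires a previously cached, matching credential.
        for cred in self.cached:
            if cred.node == node and cred.port == port:
                self.owned.add(node)
                return True
        return False

attacker = ToyAttacker()
before = attacker.connect("WebServer", "SSH")   # nothing cached yet
attacker.exploit([Credential("WebServer", "SSH", "admin:pass123")])
after = attacker.connect("WebServer", "SSH")    # credential is now in memory
print(before, after, attacker.owned)            # False True {'WebServer'}
```

The same `connect` call fails or succeeds depending on history, which is exactly the information a policy conditioned only on the latest observation lacks unless the observation itself carries the credential cache.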

The observation space is particularly elegant in its handling of partial observability. Agents don't see the full network topology upfront—they discover nodes and vulnerabilities through reconnaissance actions. The state representation uses a sparse matrix for owned nodes, a dictionary for cached credentials, and adjacency lists for discovered edges. This means agents must balance exploration (scanning for new nodes) against exploitation (leveraging known credentials).
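As a rough illustration of discovery under partial observability, the sketch below keeps a hidden topology and an agent-side view that grows only through scans. The topology, the `PartialView` class, and its `scan` method are assumptions made for this example and do not mirror the library's actual data structures.

```python
from typing import Dict, List

# Hidden ground truth the agent never sees directly (illustrative topology).
HIDDEN_EDGES: Dict[str, List[str]] = {
    "client": ["web", "mail"],
    "web": ["db"],
    "mail": [],
    "db": [],
}

class PartialView:
    """What the agent knows so far; names here are hypothetical."""

    def __init__(self, start: str) -> None:
        self.discovered = {start}
        self.edges = {}  # adjacency lists for edges revealed so far

    def scan(self, node: str) -> List[str]:
        # Reconnaissance is only possible from nodes already discovered.
        if node not in self.discovered:
            raise ValueError(f"{node} has not been discovered yet")
        neighbors = HIDDEN_EDGES[node]
        self.edges[node] = list(neighbors)
        newly_found = [n for n in neighbors if n not in self.discovered]
        self.discovered.update(neighbors)
        return newly_found

view = PartialView("client")
first = view.scan("client")
second = view.scan("web")
print(first, second, sorted(view.discovered))
```

Each scan costs a timestep that could have been spent using a known credential, which is the exploration versus exploitation trade-off described above.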

The defender operates in parallel with configurable parameters:

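from cyberbattle._env.cyberbattle_env import DefenderConstraint
from cyberbattle._env.defender import ScanAndReimageCompromisedMachines

# Class names follow the upstream Microsoft repository; check them against this fork.
defender_agent = ScanAndReimageCompromisedMachines(
    probability=0.2,   # chance of detecting a compromised node during a scan
    scan_capacity=2,   # nodes scanned per sweep
    scan_frequency=5)  # timesteps between sweeps
defender_constraint = DefenderConstraint(maintain_sla=0.9)  # uptime requirement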

When a node is detected as compromised, the defender triggers a multi-step reimaging process that temporarily removes the node from the network. This creates genuine adversarial tension: aggressive exploitation increases detection risk, while slow careful attacks may fail to meet episode time limits. The SLA constraint adds another wrinkle—defenders can't just reimage everything simultaneously because uptime requirements prevent wholesale network shutdowns.
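The interaction between detection, reimaging, and the SLA can be sketched as a toy defender tick. This is a simplified model written for this article, not the library's defender: the function name, the three knobs (`detect_probability`, `reimage_duration`, `sla_target`), and the rule that the SLA caps the number of nodes offline at once are all assumptions.

```python
import random

def defender_step(compromised, reimaging, total_nodes, rng,
                  detect_probability=0.2, reimage_duration=5, sla_target=0.9):
    """One defender tick over a toy network (hypothetical model, not the library's).

    compromised: set of node ids currently owned by the attacker
    reimaging:   dict mapping node id -> remaining timesteps offline
    """
    # Advance ongoing reimages and bring finished nodes back online.
    for node in list(reimaging):
        reimaging[node] -= 1
        if reimaging[node] <= 0:
            del reimaging[node]

    # SLA: at least sla_target of all nodes must stay online at any time.
    # The small epsilon guards against floating-point error in (1 - sla_target).
    max_offline = int(total_nodes * (1 - sla_target) + 1e-9)

    for node in sorted(compromised):
        if len(reimaging) >= max_offline:
            break  # reimaging another node would violate the SLA
        if rng.random() < detect_probability:
            compromised.discard(node)          # attacker loses the node
            reimaging[node] = reimage_duration
    return compromised, reimaging

rng = random.Random(0)
compromised, reimaging = set(range(10)), {}
for _ in range(50):
    compromised, reimaging = defender_step(compromised, reimaging, 20, rng)
    assert len(reimaging) <= 2  # never more than 10% of 20 nodes offline
```

With 20 nodes and a 0.9 target, at most two nodes can be offline together, so an attacker holding ten nodes cannot be evicted in a single sweep even if every scan succeeds.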

The reward function defaults to simple ownership percentage, but the framework supports custom shaping. You might penalize detected actions, reward credential collection separately from node ownership, or create asymmetric payoffs for high-value targets. In practice, reward shaping becomes critical because sparse rewards (only at episode end) lead to catastrophically slow learning on anything larger than trivial chains.
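One way the shaping ideas above could be combined is shown below. The function, its arguments, and the weights are hypothetical and chosen only for illustration; they are not the framework's defaults or its reward interface.

```python
def shaped_reward(newly_owned_values, new_credentials, detected,
                  credential_bonus=5.0, detection_penalty=20.0):
    """Hypothetical per-step reward; weights are illustrative, not library defaults."""
    ownership = float(sum(newly_owned_values))         # high-value nodes pay more
    credentials = credential_bonus * new_credentials   # dense signal between captures
    penalty = detection_penalty if detected else 0.0   # discourage noisy actions
    return ownership + credentials - penalty

print(shaped_reward([100], 0, False))   # 100.0
print(shaped_reward([], 2, False))      # 10.0
print(shaped_reward([], 0, True))       # -20.0
```

The credential term is what turns the sparse signal into a denser one: the agent is paid for the intermediate step that eventually makes ownership possible.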

What makes this genuinely useful for RL research is the topology parameterization. You can programmatically generate chain networks (linear A→B→C→D), mesh networks (every node connects to every other), DMZ architectures (external zone, DMZ, internal zone with firewall boundaries), or realistic enterprise topologies with segmented VLANs. Training on chain-10 and evaluating on chain-20 gives you a clean transfer learning benchmark. Training on mesh-10 and evaluating on chain-10 tells you whether your agent learned "credential chaining" as a concept or just memorized a specific topology.
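The chain and mesh shapes are simple enough to sketch as adjacency dictionaries. These helpers are stand-ins written for this article, not the generators shipped with the framework, and they model only connectivity (no vulnerabilities or credentials).

```python
def chain(n):
    """Directed chain 0 -> 1 -> ... -> n-1 as an adjacency dict."""
    return {i: ([i + 1] if i + 1 < n else []) for i in range(n)}

def mesh(n):
    """Fully connected directed graph without self-loops."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def edge_count(graph):
    return sum(len(neighbors) for neighbors in graph.values())

# A train/eval split for the transfer benchmark described above.
train_env, eval_env = chain(10), chain(20)
print(edge_count(train_env), edge_count(eval_env), edge_count(mesh(10)))  # 9 19 90
```

A chain of n nodes has n - 1 edges while a mesh has n(n - 1), so the two families stress very different parts of a policy: long-horizon sequencing versus choosing among many simultaneously available targets.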

Gotcha

The brutal truth is that sample efficiency is abysmal. The baseline DQN implementation in the included notebooks requires 300-500 episodes to learn a trivial 10-node chain network. That's not a typo—hundreds of full episodes to learn "exploit node 1, use the credential on node 2, repeat." The combinatorial action space murders vanilla RL algorithms, and the included baselines (DQN, random, greedy) don't provide sophisticated solutions. If you want to train on anything approaching realistic network sizes (enterprise networks have thousands of nodes), you'll need hierarchical RL, action space decomposition, or graph neural network policies—none of which are provided.
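Action space decomposition, one of the options mentioned above, can be illustrated with index arithmetic alone. The sketch assumes a policy with three separate heads (node, vulnerability, credential) instead of one flat head; the helper names are hypothetical and no such policy ships with the repository.

```python
def factored_outputs(n_nodes, n_vulns, n_creds):
    """Output units needed when each component gets its own head."""
    return n_nodes + n_vulns + n_creds

def encode(node, vuln, cred, n_vulns, n_creds):
    """Pack a (node, vulnerability, credential) triple into a flat action index."""
    return (node * n_vulns + vuln) * n_creds + cred

def decode(index, n_vulns, n_creds):
    """Unpack a flat action index back into its triple."""
    node, rest = divmod(index, n_vulns * n_creds)
    vuln, cred = divmod(rest, n_creds)
    return node, vuln, cred

print(factored_outputs(50, 10, 20))                  # 80
print(decode(encode(7, 3, 12, 10, 20), 10, 20))      # (7, 3, 12)
```

Three heads need 80 outputs where a flat head over the same 50/10/20 example needs 10,000, at the cost of assuming the components can be chosen somewhat independently or conditioned on one another in sequence.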

The abstraction choices also create blind spots. By eliminating temporal dynamics entirely, the simulation can't model timing-based evasion, rate limiting, or the multi-hour credential stuffing attacks that dominate real breaches. Exploits succeed or fail instantly with fixed probabilities. There's no concept of a "loud" vs. "quiet" exploit, no traffic volume considerations, and no IDS signature modeling. Credentials are atomic tokens with no internal structure—you can't model password cracking, privilege escalation paths, or the Kerberos/NTLM protocol attacks that drive real lateral movement. The defender is purely reactive with fixed detection probabilities; it can't learn, adapt, or employ honeypots. These aren't bugs—they're deliberate design choices to keep the problem tractable—but they mean insights won't transfer cleanly to operational offensive security.

Verdict

Use if: You're publishing RL research on hierarchical planning, memory-augmented policies, or transfer learning in graph environments and need a citable benchmark with clean OpenAI Gym semantics. The problem formulation is intellectually honest, the engineering is solid (reproducible experiments, multiple baselines, proper interfaces), and the Microsoft backing gives academic legitimacy. It's also genuinely useful for teaching RL concepts to security practitioners: the domain is intuitive enough that students understand why naive approaches fail.

Skip if: You're building practical offensive automation or trying to learn real lateral movement techniques. The abstraction eliminates timing, traffic patterns, and credential protocol complexities that matter operationally. Red teamers should use CALDERA with RL plugins for actual ATT&CK techniques; blue teamers should focus on realistic detection problems like EMBER malware classification. This is a research platform dressed in security flavor, not a security tool with RL capabilities. Understand the difference before you clone the repo.
