PenGym: Training Reinforcement Learning Agents on Real Penetration Testing Infrastructure

Hook

Most reinforcement learning frameworks for cybersecurity train agents in simulated environments. PenGym takes a radically different approach: it executes actual nmap scans and Metasploit exploits against real virtual machines, creating a training ground where agents learn from genuine network responses rather than mathematical approximations.

Context

The gap between simulation and reality has plagued cybersecurity AI research for years. An RL agent that masters a simulated network attack might fail spectacularly against real infrastructure because simulations inevitably simplify complex behaviors—delayed responses, unexpected service configurations, nuanced protocol implementations. PenGym emerged from research collaboration between Japan Advanced Institute of Science and Technology (JAIST) and KDDI Research to address this fundamental limitation.

Rather than building yet another simulation engine, PenGym created a Gymnasium-compatible wrapper around actual penetration testing tools. When an agent decides to scan a network, PenGym translates that abstract action into a real nmap command executed against live Ubuntu VMs. When the agent attempts exploitation, PenGym invokes Metasploit’s RPC interface to launch genuine attack modules. The framework observes actual network responses—connection timeouts, service banners, exploit successes and failures—and translates them back into states and rewards the agent can learn from. This architecture forces agents to confront the messy reality of real networks from day one of training.
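
The wrapper idea can be sketched in a few lines. This is a minimal illustration, not PenGym's code: the class `RealToolEnv`, the handler names, and the reward numbers are all hypothetical, and the handlers are stubs standing in for the nmap and Metasploit calls a real deployment would make. Only the `reset()`/`step()` return shapes follow the Gymnasium convention.

```python
import copy
from typing import Callable, Dict, Tuple

# A handler receives a target host and returns (success, observed_facts).
# In a real deployment it would run nmap or call Metasploit; these are stubs.
Handler = Callable[[str], Tuple[bool, dict]]


class RealToolEnv:
    """Gymnasium-style environment that delegates each action to a handler."""

    def __init__(self, handlers: Dict[str, Handler], goal_host: str, max_steps: int = 20):
        self.handlers = handlers
        self.goal_host = goal_host
        self.max_steps = max_steps
        self.known: Dict[str, dict] = {}
        self.steps = 0

    def reset(self, seed=None, options=None):
        # Gymnasium convention: reset returns (observation, info).
        self.known = {}
        self.steps = 0
        return copy.deepcopy(self.known), {}

    def step(self, action: Tuple[str, str]):
        name, target = action
        self.steps += 1
        success, facts = self.handlers[name](target)
        if success:
            self.known.setdefault(target, {}).update(facts)
        compromised = bool(self.known.get(self.goal_host, {}).get("shell"))
        # Hypothetical reward: goal bonus minus a flat per-step cost.
        reward = (100.0 if compromised else 0.0) - 1.0
        terminated = compromised
        truncated = (not terminated) and self.steps >= self.max_steps
        # Gymnasium convention: (obs, reward, terminated, truncated, info).
        return copy.deepcopy(self.known), reward, terminated, truncated, {"success": success}


def fake_scan(target: str):
    return True, {"services": ["ftp"]}


def fake_exploit(target: str):
    return True, {"shell": True}


env = RealToolEnv(
    {"service_scan": fake_scan, "exploit_ftp": fake_exploit},
    goal_host="10.0.0.5",
)
obs, info = env.reset()
obs, first_reward, first_done, _, _ = env.step(("service_scan", "10.0.0.5"))
obs, reward, terminated, truncated, info = env.step(("exploit_ftp", "10.0.0.5"))
```

Because the agent only ever sees `reset()` and `step()`, swapping the stub handlers for functions that drive real tools would not change the training loop.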

Technical Insight

PenGym’s architecture consists of three interconnected layers. At the foundation sits the Cyber Range Instantiation System (CyRIS), which automates the creation of virtual network environments using KVM virtualization. CyRIS uses descriptions from the RangeDB database to provision Ubuntu 20.04 VMs with deliberately vulnerable services like vsftpd-2.3.4, proftpd-1.3.3, and the infamous Apache httpd-2.4.49. These aren’t abstractions—they’re the actual vulnerable binaries that real-world attackers target.

The core innovation lives in the Action/State Module, built as an extension to NASim (Network Attack Simulator). While NASim operates purely in simulation mode, PenGym intercepts action execution and routes it to real infrastructure. The module maintains the Gymnasium API contract—agents call standard step() and reset() methods—but underneath, it’s orchestrating actual security tools. Configuration happens through pengym/CONFIG.yml, where you specify absolute paths and network topology. A typical configuration looks like this:

pengym_source: /home/researcher/PenGym
cyber_range_dir: /home/researcher/cyris/cyber_range
host_mgmt_addr: 192.168.122.1
host_virbr_addr: 192.168.122.1
host_account: researcher
guest_basevm_config_file: /home/researcher/cyris/config/base_vm.yml
scenario_name: medium-multi-site

The reconnaissance phase shows the translation layer at work. When an agent selects a network scanning action, the Action/State Module invokes python-nmap to execute actual port scans. The framework parses nmap’s XML output, extracts open ports and service banners, and constructs an observation that matches NASim’s expected format. This bidirectional translation, from abstract RL actions to concrete tool invocations and from tool output back to RL state representations, is what lets PenGym keep the Gymnasium API unchanged.
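
The tool-output half of that translation might look roughly like the following. This sketch uses the standard library's XML parser rather than python-nmap so that it runs without nmap installed. The sample XML is a hand-trimmed fragment in the shape of nmap's `-oX` output, and the function names, the `SERVICE_ORDER` list, and the flat 0/1 vector are illustrative assumptions, not NASim's actual observation layout.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like `nmap -sV -oX -` output (heavily trimmed).
SAMPLE_XML = """<nmaprun>
  <host>
    <status state="up"/>
    <address addr="192.168.122.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="21">
        <state state="open"/>
        <service name="ftp" product="vsftpd" version="2.3.4"/>
      </port>
      <port protocol="tcp" portid="22">
        <state state="closed"/>
        <service name="ssh"/>
      </port>
      <port protocol="tcp" portid="80">
        <state state="open"/>
        <service name="http" product="Apache httpd" version="2.4.49"/>
      </port>
    </ports>
  </host>
</nmaprun>
"""


def open_services(nmap_xml: str) -> dict:
    """Return {ipv4: {port: {"name": ..., "banner": ...}}} for open ports only."""
    root = ET.fromstring(nmap_xml)
    result = {}
    for host in root.iter("host"):
        addr_el = host.find("address[@addrtype='ipv4']")
        if addr_el is None:
            continue
        services = {}
        for port in host.iter("port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue
            svc = port.find("service")
            name = svc.get("name", "unknown") if svc is not None else "unknown"
            product = svc.get("product", "") if svc is not None else ""
            version = svc.get("version", "") if svc is not None else ""
            services[int(port.get("portid"))] = {
                "name": name,
                "banner": f"{product} {version}".strip(),
            }
        result[addr_el.get("addr")] = services
    return result


# Hypothetical fixed service ordering, so every host maps to a same-length vector.
SERVICE_ORDER = ["ftp", "ssh", "http"]


def service_vector(services: dict) -> list:
    names = {entry["name"] for entry in services.values()}
    return [1 if name in names else 0 for name in SERVICE_ORDER]


found = open_services(SAMPLE_XML)
vector = service_vector(found["192.168.122.10"])
```

A fixed ordering matters here: the agent's policy expects each position in the vector to mean the same service on every host and every step.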

Exploitation becomes even more interesting. PenGym starts the Metasploit RPC daemon (msfrpcd -P my_password) and communicates through pymetasploit3. When an agent attempts exploitation, PenGym selects an appropriate Metasploit module based on the discovered service, configures RHOST and LHOST parameters, executes the exploit, and monitors for session establishment. A successful exploit might grant a Meterpreter shell, which PenGym then uses for privilege escalation attempts or lateral movement to adjacent network segments.
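
Module selection can be pictured as a lookup from a discovered service to a Metasploit module path. The table below is an assumption made for illustration: the three module paths are ones believed to ship with Metasploit, but the pairing with these service versions is not taken from PenGym's source, and `build_exploit_plan` is a hypothetical helper. Note that current Metasploit modules name the target option `RHOSTS`, and that `LHOST` belongs to the payload rather than the exploit module.

```python
from typing import Optional

# Assumed mapping from (product, version) to a Metasploit module path.
# Illustrative only; PenGym's real selection logic may differ.
MODULE_FOR_SERVICE = {
    ("vsftpd", "2.3.4"): "exploit/unix/ftp/vsftpd_234_backdoor",
    ("proftpd", "1.3.3"): "exploit/unix/ftp/proftpd_133c_backdoor",
    ("httpd", "2.4.49"): "exploit/multi/http/apache_normalize_path_rce",
}


def build_exploit_plan(product: str, version: str, rhost: str, lhost: str) -> Optional[dict]:
    """Describe which module to run and with which options, or None if unknown."""
    module = MODULE_FOR_SERVICE.get((product.lower(), version))
    if module is None:
        return None
    return {
        "module": module,
        # Target address is a module option.
        "module_options": {"RHOSTS": rhost},
        # Callback address is a payload option; only reverse payloads use it.
        "payload_options": {"LHOST": lhost},
    }


plan = build_exploit_plan("vsftpd", "2.3.4", "192.168.122.10", "192.168.122.1")
unknown = build_exploit_plan("nginx", "1.25.0", "192.168.122.10", "192.168.122.1")
```

Keeping the plan as plain data separates the decision (which module, which options) from the side effect of launching it, so the selection logic can be checked without a running Metasploit RPC daemon.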

The framework translates real outcomes into learning signals. If an exploit fails because the target service crashed, the agent receives negative feedback. If network latency causes a timeout, that becomes part of the learning experience. This creates training signals based on real-world behavior—something purely simulated environments may not capture fully.
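
One way to picture this outcome-to-reward step is shown below. The outcome categories, the `classify` and `reward_for` helpers, and the numeric values are hypothetical. The reward shape (host value gained minus action cost) follows the general NASim style and is an assumption about how PenGym scores real outcomes.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SESSION_OPENED = "session_opened"
    TIMEOUT = "timeout"
    SERVICE_DOWN = "service_down"
    NO_SESSION = "no_session"


def classify(session_id: Optional[int], elapsed_s: float, timeout_s: float,
             port_open_after: bool) -> Outcome:
    """Reduce raw tool results to one outcome label."""
    if session_id is not None:
        return Outcome.SESSION_OPENED
    if elapsed_s >= timeout_s:
        return Outcome.TIMEOUT
    if not port_open_after:
        # The port no longer answers: the exploit probably crashed the service.
        return Outcome.SERVICE_DOWN
    return Outcome.NO_SESSION


def reward_for(outcome: Outcome, action_cost: float, host_value: float) -> float:
    """Value gained (only on success) minus the cost of taking the action."""
    gain = host_value if outcome is Outcome.SESSION_OPENED else 0.0
    return gain - action_cost


success = classify(session_id=1, elapsed_s=4.2, timeout_s=30.0, port_open_after=True)
crashed = classify(session_id=None, elapsed_s=4.2, timeout_s=30.0, port_open_after=False)
slow = classify(session_id=None, elapsed_s=30.0, timeout_s=30.0, port_open_after=True)

success_reward = reward_for(success, action_cost=3.0, host_value=100.0)
crashed_reward = reward_for(crashed, action_cost=3.0, host_value=100.0)
```

Every failed attempt still pays the action cost, which is how a crash or a timeout ends up as negative feedback.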

PenGym’s scenario descriptions mirror NASim’s YAML format but represent actual network configurations. The included demonstration uses the medium-multi-site scenario where agents must compromise hosts across network segments to reach specific objectives. These aren’t theoretical attack graphs; they’re executable penetration tests orchestrated by RL policies.

Gotcha

PenGym’s realism comes with substantial operational burden. The setup process demands multiple interdependent systems working in harmony: CyRIS for range instantiation, KVM for virtualization, Metasploit RPC daemon running continuously, and carefully prepared vulnerable software packages stored in database/resources. The README explicitly lists required vulnerable versions—vsftpd-2.3.4, proftpd-1.3.3, samba-4.5.9, opensmtpd-6.6.1p1, httpd-2.4.49—which you must obtain and prepare manually in advance.

The framework requires Ubuntu 20.04 LTS for all guest VMs, limiting scenario diversity. Modern penetration testing often targets Windows environments, containerized applications, or cloud infrastructure—none of which appear to be currently supported. Configuration requires editing multiple files with absolute paths that must precisely match your filesystem layout. A single mismatch between pengym_source in CONFIG.yml and your actual installation directory can break the entire pipeline.
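
A small pre-flight check can catch path mismatches before a run starts. The sketch below is not part of PenGym. It assumes the configuration has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`), and `check_paths` is a hypothetical helper that inspects only the three path-valued keys shown in the sample configuration.

```python
import os
import tempfile

# Path-valued keys taken from the sample CONFIG.yml.
PATH_KEYS = ("pengym_source", "cyber_range_dir", "guest_basevm_config_file")


def check_paths(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means all paths look fine."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: missing")
        elif not os.path.isabs(value):
            problems.append(f"{key}: not an absolute path ({value})")
        elif not os.path.exists(value):
            problems.append(f"{key}: path does not exist ({value})")
    return problems


# Demonstration with one valid directory, one relative path, and one missing key.
existing_dir = tempfile.mkdtemp()
problems = check_paths({
    "pengym_source": existing_dir,
    "cyber_range_dir": "cyris/cyber_range",
})
```

Running a check like this before starting the Metasploit daemon or creating the cyber range turns a mid-pipeline failure into an immediate, readable error.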

With 56 GitHub stars, the project appears to be primarily focused on academic research rather than widespread community adoption. Documentation centers on basic setup, and the demonstration uses a single deterministic agent with a pre-defined 16-step action sequence to reach scenario goals.

Most critically, operating PenGym requires careful ethical boundaries. The README’s warning deserves emphasis: these are real exploitation tools executing real attacks. Even in your own lab network, a misconfigured agent could cause unintended damage or, if network isolation fails, potentially affect systems beyond your intended scope.

Verdict

Use PenGym if you’re conducting academic research on RL-based penetration testing and need training environments that expose agents to real tool behavior and network complexity. You should also have the infrastructure to run multiple Ubuntu VMs safely isolated from production networks, and be willing to invest significant setup time preparing vulnerable software packages and configuring absolute paths across multiple interdependent systems. The realism PenGym provides (actual nmap scans, genuine Metasploit exploits, real network latency) is valuable for research that must eventually transfer to real-world security operations.

Skip PenGym if you need rapid prototyping (pure simulation tools may offer faster iteration), lack dedicated cybersecurity lab infrastructure with proper isolation, require scenarios beyond Ubuntu environments or the limited set of pre-configured vulnerable services, or want a framework with extensive documentation and active community support. For most developers exploring RL in cybersecurity, starting with a pure simulation tool like NASim makes more sense; graduate to PenGym only when simulation limitations become the bottleneck in your research.
