PenGym: Training Reinforcement Learning Agents Against Real Vulnerable Systems
Hook
What if your AI pentester trained by exploiting real vulnerable servers instead of stepping through a simulator? That is what happens when reinforcement learning meets a live cyber range.
Context
Reinforcement learning has made impressive strides in games, robotics, and simulated environments, but cybersecurity presents a unique challenge: the sim-to-real gap. An RL agent that performs perfectly in a simplified network simulator often fails badly when confronted with real systems where exploits time out, services behave unpredictably, and network noise obscures clear state observations.
Traditional approaches to training autonomous penetration testing agents rely on pure simulation environments like NASim or CybORG, where network interactions are abstracted into clean state transitions and deterministic outcomes. While these simulators enable rapid experimentation, they gloss over the messy reality of actual exploitation—the three-second delay while nmap probes a service, the 60% success rate of a temperamental exploit, the partial information from incomplete scan results. PenGym emerged from cyb3rlab to address this gap by creating a framework that lets RL agents interact with real vulnerable virtual machines while maintaining the standard Gymnasium API that modern RL algorithms expect.
Technical Insight
PenGym's architecture consists of three interlocking components: NASim for state abstraction and simulation scaffolding, CyRIS for automated cyber range creation, and a custom Action/State Module that serves as the critical translation layer between RL agent commands and actual penetration testing tools.
The genius of PenGym lies in how it maintains the Gymnasium interface while executing real-world operations under the hood. When an agent calls env.step(action), that action—represented as an integer in the standard RL fashion—gets mapped through a carefully designed action space into concrete penetration testing commands. For instance, action 45 might translate to "execute nmap service scan on host 192.168.1.10" while action 87 becomes "attempt vsftpd 2.3.4 backdoor exploit against target."
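To make the integer-to-command mapping concrete, here is a minimal sketch of how a flat action index could be decoded into a scan or exploit description. The `PentestAction` class, its fields, the `build_action_table` helper, and the exploit labels are all hypothetical names chosen for illustration; PenGym's real action space is derived from the NASim scenario and may be organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PentestAction:
    kind: str    # "service_scan" or "exploit"
    target: str  # IP address of the target host
    detail: str  # tool-specific detail (hypothetical field)


def build_action_table(hosts, exploits):
    """Enumerate one service scan per host, then every (host, exploit) pair."""
    table = []
    for host in hosts:
        table.append(PentestAction("service_scan", host, "nmap -sV"))
    for host in hosts:
        for exploit in exploits:
            table.append(PentestAction("exploit", host, exploit))
    return table


# Illustrative two-host network with two exploit labels (assumptions).
hosts = ["192.168.1.10", "192.168.1.11"]
exploits = ["vsftpd_234_backdoor", "proftpd_133c_backdoor"]
ACTIONS = build_action_table(hosts, exploits)


def decode(action_index):
    """Translate the integer chosen by the agent into a concrete action."""
    return ACTIONS[action_index]
```

With this layout the action space grows as hosts × (1 + exploits), which is why even small networks produce action indices in the dozens.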
Here's what a basic interaction loop looks like:
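import gymnasium as gym
import pengym  # assumed to register the environment on import

# Initialize environment with cyber range configuration.
# NOTE: the environment ID and keyword arguments are illustrative;
# check the PenGym README for the actual constructor.
env = gym.make(
    'PenGym-v0',
    cyris_config='path/to/range_config.yml',
    vm_snapshots='path/to/vulnerable_vms/',
)

# Reset creates or restores the cyber range to its initial state
obs, info = env.reset()

# Agent interacts with REAL vulnerable systems.
# `agent` is any object exposing select_action(obs), defined elsewhere.
for step in range(100):
    action = agent.select_action(obs)  # Your RL agent decides
    # This step executes ACTUAL nmap/Metasploit commands
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated:
        print(f"Scenario completed in {step + 1} steps")
        break
    if truncated:
        break  # step limit hit before reaching the goal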
Behind this clean interface, PenGym's Action/State Module orchestrates the real tools. When the agent selects a scanning action, the module connects over SSH to a designated attacker VM and runs nmap with the appropriate flags. It captures stdout, parses the semi-structured output to extract discovered services and versions, then translates that information back into the state representation NASim expects: typically a matrix encoding the agent's knowledge of each host's services, vulnerabilities, and access level.
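The parsing step can be sketched as follows. This is a simplified illustration that assumes nmap's normal (human-readable) port table as input; the function names and the binary service vector are assumptions for the example, not PenGym's actual parser, which may rely on structured output instead.

```python
import re

# Matches port-table rows such as "21/tcp open  ftp  vsftpd 2.3.4".
PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+(\S+)\s+(\S+)\s*(.*)$")


def parse_open_services(nmap_stdout):
    """Return {service_name: {"port": int, "version": str}} for open ports."""
    services = {}
    for line in nmap_stdout.splitlines():
        match = PORT_LINE.match(line.strip())
        if match and match.group(3) == "open":
            port, _proto, _state, name, version = match.groups()
            services[name] = {"port": int(port), "version": version.strip()}
    return services


def to_service_vector(services, known_services):
    """Encode discovered services as one binary flag per known service."""
    return [1 if name in services else 0 for name in known_services]


# Hand-written sample in the shape of nmap -sV output (not a captured log).
SAMPLE = """\
PORT   STATE  SERVICE VERSION
21/tcp open   ftp     vsftpd 2.3.4
22/tcp open   ssh     OpenSSH 8.2p1
80/tcp closed http
"""

row = to_service_vector(parse_open_services(SAMPLE), ["ftp", "ssh", "http"])
```

Each host contributes one such row, and stacking the rows gives the kind of matrix observation described above.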
The exploit execution path is even more complex. When the agent attempts an exploit action, PenGym must:
- Verify prerequisites are met (target host known, appropriate service discovered)
- Generate and execute a Metasploit resource script with the specific exploit module
- Wait for execution with appropriate timeout handling
- Parse Metasploit's verbose output to determine success or failure
- If successful, establish the new access level and update the state representation
- Calculate an appropriate reward signal based on the outcome
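The steps above can be sketched roughly as follows. This is a sketch under stated assumptions, not PenGym's implementation: the function names, the outcome labels, and the injectable `runner` callable are hypothetical, and treating a "session N opened" line in the console output as the success marker is a simplification. The `msfconsole -q -r` invocation and the `exploit -z` and `exit -y` console commands are standard Metasploit usage.

```python
import re
import subprocess
import tempfile

# Assumed success marker: Metasploit reports "... session N opened ...".
SESSION_OPENED = re.compile(r"session \d+ opened", re.IGNORECASE)


def build_resource_script(module, rhost):
    """Build a Metasploit resource script for a single exploit attempt."""
    lines = [f"use {module}", f"set RHOSTS {rhost}", "exploit -z", "exit -y"]
    return "\n".join(lines) + "\n"


def msfconsole_runner(script, timeout):
    """Run the script through msfconsole; raises TimeoutExpired on timeout."""
    # delete=False keeps the file on disk so msfconsole can read it.
    with tempfile.NamedTemporaryFile("w", suffix=".rc", delete=False) as handle:
        handle.write(script)
    result = subprocess.run(
        ["msfconsole", "-q", "-r", handle.name],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout


def attempt_exploit(known_services, required_service, module, rhost,
                    runner=msfconsole_runner, timeout=60):
    """Return 'prerequisite_missing', 'timeout', 'success', or 'failure'."""
    if required_service not in known_services:
        return "prerequisite_missing"  # never launch a doomed attempt
    try:
        output = runner(build_resource_script(module, rhost), timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "success" if SESSION_OPENED.search(output) else "failure"
```

Passing the runner as a parameter keeps the decision logic testable without a live Metasploit installation; the state update and reward calculation would then branch on the returned label.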
The framework includes a deterministic agent example that serves two purposes: it validates that the environment works correctly, and it provides a baseline for comparing RL algorithms. This agent follows a hardcoded 16-step sequence: a network sweep with nmap, targeted service enumeration, exploitation of the vsftpd 2.3.4 backdoor on a boundary host, lateral movement through the compromised system, and eventual compromise of high-value targets deeper in the network.
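A scripted baseline of this kind needs very little code. The sketch below replays a fixed list of action indices against any environment that follows the Gymnasium `reset`/`step` convention; `run_scripted_episode` and the `StubEnv` stand-in are hypothetical names, and the stub replaces a real cyber range only so the example runs anywhere.

```python
def run_scripted_episode(env, action_sequence):
    """Replay fixed actions; return (steps_taken, total_reward, reached_goal)."""
    env.reset()
    total_reward = 0.0
    for steps_taken, action in enumerate(action_sequence, start=1):
        _obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            return steps_taken, total_reward, terminated
    return len(action_sequence), total_reward, False


class StubEnv:
    """Stand-in environment: the episode ends when the goal action is taken."""

    def __init__(self, goal_action):
        self.goal_action = goal_action

    def reset(self):
        return 0, {}

    def step(self, action):
        done = action == self.goal_action
        reward = 100.0 if done else -1.0
        return 0, reward, done, False, {}


result = run_scripted_episode(StubEnv(goal_action=3), [0, 1, 2, 3])
```

If the scripted sequence stops reaching the goal after a configuration change, the environment is the suspect rather than the learning algorithm, which is what makes such an agent useful for validation.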
CyRIS integration handles the infrastructure complexity of maintaining reproducible cyber ranges. Rather than manually configuring vulnerable VMs for each experiment, researchers define network topology and vulnerability distribution in YAML description files. CyRIS interprets these specifications to automatically clone VM templates, configure network interfaces, inject specific vulnerable service versions, and snapshot the resulting environment. This automation is crucial for research reproducibility: the same description file will generate functionally identical cyber ranges across different physical infrastructure.
The reward structure deserves special attention. PenGym must translate the messy outcomes of real exploitation attempts into scalar reward signals suitable for RL training. Successfully compromising a new host yields positive reward proportional to its value in the network topology. Failed exploit attempts incur small negative rewards to encourage efficiency. The framework tracks which hosts have been compromised and adjusts rewards to prevent agents from repeatedly exploiting the same vulnerability for infinite reward. Sophisticated reward shaping can encode domain knowledge—for instance, penalizing noisy actions that would trigger intrusion detection systems in realistic scenarios.
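A minimal sketch of such a reward rule is shown below. It assumes the NASim-style convention of host value minus action cost; the function name, the outcome labels, and the default cost are hypothetical, and the detection penalty mentioned above is omitted for brevity.

```python
def compute_reward(outcome, host, host_values, compromised, action_cost=1.0):
    """Scalar reward for one exploit attempt.

    A host pays out its value only the first time it is compromised, so
    re-exploiting the same vulnerability just costs the action.
    """
    if outcome == "success" and host not in compromised:
        compromised.add(host)  # remember the host to block repeat payouts
        return host_values.get(host, 0.0) - action_cost
    return -action_cost


# Illustrative values: one high-value target and one stepping-stone host.
host_values = {"10.0.2.5": 100.0, "10.0.1.3": 0.0}
compromised = set()
first = compute_reward("success", "10.0.2.5", host_values, compromised)
repeat = compute_reward("success", "10.0.2.5", host_values, compromised)
```

Here the first success returns the host value less the action cost, while the repeat returns only the negative cost, so looping on one vulnerability strictly loses reward.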
Gotcha
The most significant limitation is execution speed. While NASim can simulate thousands of attack sequences per second on standard hardware, PenGym is bottlenecked by the actual time required to perform network scans and exploit attempts. A single nmap scan might take 3-10 seconds depending on the target's response characteristics. Metasploit exploit modules can require 30-60 seconds including payload generation, connection establishment, and session handling. This means training iterations that would take minutes in pure simulation expand to hours or days with PenGym.
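A back-of-the-envelope estimate shows how quickly this adds up. The per-action times below are the midpoints of the ranges quoted above; the episode count, steps per episode, and the even split between scans and exploits are assumptions chosen only for illustration, and environment reset time is ignored.

```python
def estimated_training_hours(episodes, steps_per_episode, scan_fraction,
                             scan_seconds, exploit_seconds):
    """Wall-clock hours spent waiting on real tools, ignoring resets."""
    seconds_per_step = (scan_fraction * scan_seconds
                        + (1.0 - scan_fraction) * exploit_seconds)
    return episodes * steps_per_episode * seconds_per_step / 3600.0


# 1,000 episodes of 20 steps, half scans (6.5 s) and half exploits (45 s).
hours = estimated_training_hours(
    episodes=1000, steps_per_episode=20, scan_fraction=0.5,
    scan_seconds=6.5, exploit_seconds=45.0,
)
```

Under these assumptions the tools alone account for about 143 hours, or roughly six days, for a training run that a simulator processing thousands of steps per second would finish in seconds.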
The infrastructure requirements are substantial and inflexible. PenGym specifically targets Ubuntu 20.04 as the base VM image, and the setup process requires manually preparing vulnerable service packages—vsftpd-2.3.4, proftpd-1.3.3, and other deliberately outdated software with known exploits. These packages must be compiled from source with specific configurations that preserve exploitable vulnerabilities, a process that's both time-consuming and fragile across different system configurations. The documentation warns explicitly that executing real exploits can cause system instability or damage, restricting PenGym to isolated lab networks with no connection to production infrastructure. The legal and ethical constraints cannot be overstated—this framework executes actual exploitation tools, making it unsuitable for anything beyond carefully controlled academic research environments.
Verdict
Use if: You're researching sim-to-real transfer for autonomous penetration testing, need ground truth validation of RL agents against actual vulnerable systems, have the infrastructure and authorization for isolated cyber ranges, and are willing to accept training times measured in hours or days rather than minutes. PenGym excels at exposing the reality gap that pure simulation glosses over: the timing delays, partial observability, and probabilistic outcomes that define real-world exploitation.
Skip if: You're prototyping RL algorithms and need rapid iteration cycles, lack dedicated lab infrastructure with proper isolation, are exploring early-stage research questions better answered by pure simulation, or need to scale training across thousands of episodes.
For most RL experimentation in cybersecurity, pure simulators like NASim or CybORG deliver most of the value at a fraction of the complexity. PenGym is the specialized tool for the final validation step, where you need to show that your agent actually works against real systems.