PenGym: Teaching Machines to Hack Using Real Exploits, Not Simulations

Hook

Most reinforcement learning frameworks for cybersecurity train agents on toy simulations that bear little resemblance to real networks. PenGym takes a radically different approach: your RL agent actually executes nmap scans and Metasploit exploits against live virtual machines.

Context

The promise of autonomous penetration testing has captivated security researchers for years. Imagine an AI agent that can systematically probe networks, identify vulnerabilities, and chain exploits together—all without human intervention. The challenge? Training such agents requires thousands of iterations against realistic environments, but spinning up and tearing down actual vulnerable infrastructure at that scale is prohibitively expensive and complex.

Existing solutions have taken two paths: pure simulation frameworks like NASim and CyberBattleSim that abstract away the messy reality of actual exploits, or manual pentesting tools like Metasploit that require human expertise. The former is fast but unrealistic—agents learn patterns that don’t transfer to real networks. The latter is realistic but can’t train RL agents that need millions of trial-and-error attempts. PenGym, developed by researchers at Japan Advanced Institute of Science and Technology and KDDI Research, attempts to bridge this gap by creating a Gymnasium-compatible framework where agents execute real pentesting tools against actual (virtualized) vulnerable systems.

Technical Insight

[Figure: System architecture (auto-generated diagram)]

PenGym’s architecture is built on three interconnected layers: a Gymnasium-compatible environment wrapper, an action/state translation module, and a backend that can operate in either simulation mode (via NASim) or real execution mode (via CyRIS cyber ranges). This dual-mode design is its most interesting architectural decision—you prototype with fast simulations, then validate against real infrastructure without changing your agent code.
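A minimal sketch of the dual-mode idea, assuming nothing about PenGym's actual class names: `SimulatedBackend`, `RealBackend`, `PentestEnv`, and `attack_all` are all hypothetical, and the "real" backend takes an injected callable so the example runs without any infrastructure. The point is only that the agent-facing code stays the same when the backend is swapped.

```python
class SimulatedBackend:
    """Hypothetical stand-in for a simulator: outcome comes from a lookup."""

    def __init__(self, vulnerable_hosts):
        self.vulnerable_hosts = set(vulnerable_hosts)

    def exploit(self, host):
        return host in self.vulnerable_hosts


class RealBackend:
    """Hypothetical stand-in for real execution.

    The injected callable would launch an actual tool in a real deployment;
    here it is a parameter so the sketch needs no cyber range.
    """

    def __init__(self, run_tool):
        self.run_tool = run_tool

    def exploit(self, host):
        return self.run_tool("exploit", host)


class PentestEnv:
    """Environment wrapper that does not care which backend it talks to."""

    def __init__(self, backend):
        self.backend = backend
        self.compromised = set()

    def step(self, host):
        success = self.backend.exploit(host)
        if success:
            self.compromised.add(host)
        reward = 100 if success else -1
        return sorted(self.compromised), reward


def attack_all(env, hosts):
    """Agent-side code: identical for both backends."""
    return sum(env.step(host)[1] for host in hosts)


hosts = ["10.0.0.1", "10.0.0.2"]
sim_env = PentestEnv(SimulatedBackend({"10.0.0.2"}))
real_env = PentestEnv(RealBackend(lambda tool, host: host.endswith(".2")))
print(attack_all(sim_env, hosts), attack_all(real_env, hosts))
```

Both calls go through the same `attack_all` function; only the object passed to `PentestEnv` changes.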

The action space is discrete and deliberately constrained. Rather than exposing the unbounded space of possible pentesting commands, PenGym defines specific action types: scanning services, exploiting known vulnerabilities, privilege escalation, and lateral movement. When your agent selects an action like “exploit vsftpd on host 192.168.1.10,” the framework translates this into actual Metasploit commands executed via the pymetasploit3 library:

from pymetasploit3.msfrpc import MsfRpcClient

class MetasploitExecutor:
    def __init__(self, server='127.0.0.1', port=55553):
        self.client = MsfRpcClient('password', 
                                   server=server, 
                                   port=port, 
                                   ssl=True)
    
    def exploit_service(self, target_ip, service, port):
        # Map service to actual Metasploit exploit module
        exploit_map = {
            'vsftpd-2.3.4': 'exploit/unix/ftp/vsftpd_234_backdoor',
            'proftpd-1.3.3': 'exploit/unix/ftp/proftpd_133c_backdoor'
        }
        
        exploit = self.client.modules.use('exploit', 
                                         exploit_map[service])
        exploit['RHOSTS'] = target_ip
        exploit['RPORT'] = port
        
        # execute() returns Metasploit's RPC response (typically a dict with a
        # job ID); any resulting session is listed under client.sessions.list
        result = exploit.execute(payload='cmd/unix/interact')
        return result
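To make the "discrete and constrained" point concrete, here is one way a flat action index could be decoded into a (host, action type) pair before it reaches an executor like the one above. This is a sketch under assumptions: the action names, host list, and `decode` helper are illustrative and are not taken from PenGym's source.

```python
from itertools import product

# Assumed, illustrative vocabularies; a real scenario would define its own.
ACTION_TYPES = ["service_scan", "exploit_vsftpd", "privesc"]
HOSTS = ["192.168.1.10", "192.168.1.11"]

# Every (host, action type) pair gets one integer index, which is the
# shape a discrete RL action space expects.
ACTIONS = list(product(HOSTS, ACTION_TYPES))


def decode(index):
    """Turn a flat action index into a structured action."""
    host, kind = ACTIONS[index]
    return {"type": kind, "target": host}


print(len(ACTIONS))   # size of the discrete action space
print(decode(4))
```

With two hosts and three action types the agent chooses among six integers, which is what keeps the problem tractable for standard algorithms.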

The state representation is equally pragmatic. Rather than exposing raw network packets or command outputs, PenGym provides a structured observation space: a network topology matrix showing which hosts are known, which services have been discovered, which vulnerabilities have been identified, and which hosts have been compromised. This abstraction makes it feasible for standard RL algorithms (DQN, PPO, etc.) to learn meaningful policies without drowning in irrelevant details.
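As a rough illustration of such a structured observation, the sketch below encodes each host as a fixed-length row of binary flags and concatenates the rows into one vector. The field names and the `SERVICES` vocabulary are assumptions made for the example, not PenGym's actual layout.

```python
SERVICES = ["ftp", "ssh", "http"]  # assumed service vocabulary


def encode_host(known, compromised, services):
    """One host becomes [known, compromised, one flag per service]."""
    flags = [int(known), int(compromised)]
    flags += [int(name in services) for name in SERVICES]
    return flags


def encode_network(hosts):
    """Concatenate per-host rows into a single flat observation vector."""
    vector = []
    for host in hosts:
        vector.extend(encode_host(**host))
    return vector


network = [
    {"known": True, "compromised": True, "services": {"ftp"}},
    {"known": True, "compromised": False, "services": {"ssh", "http"}},
    {"known": False, "compromised": False, "services": set()},
]
observation = encode_network(network)
print(observation)
```

A fixed-length vector like this can be fed directly to a DQN or PPO network, which is the practical payoff of hiding raw tool output.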

The integration with CyRIS (Cyber Range Instantiation System) handles the infrastructure orchestration. CyRIS uses QEMU/KVM to instantiate entire vulnerable networks based on YAML topology definitions. When your agent starts a training episode, PenGym can automatically clone base VM images, configure network interfaces, and restore clean snapshots between episodes. This automation is critical—manual infrastructure management would make iterative RL training impossible.

The most revealing code is in the demo implementation. Rather than showing a trained RL agent, the repository includes a deterministic agent with hardcoded action sequences:

class DeterministicAgent:
    def __init__(self, action_sequence):
        self.actions = action_sequence
        self.step = 0
    
    def select_action(self, observation):
        if self.step < len(self.actions):
            action = self.actions[self.step]
            self.step += 1
            return action
        return None  # Episode complete

# Hardcoded attack sequence: scan, exploit, escalate
agent = DeterministicAgent([
    {'type': 'scan', 'target': '192.168.1.10'},
    {'type': 'exploit', 'target': '192.168.1.10', 'service': 'vsftpd'},
    {'type': 'privesc', 'target': '192.168.1.10'}
])

This reveals both a strength and a limitation. The framework’s architecture correctly separates the RL environment from the agent implementation, following Gymnasium conventions. But the lack of included trained agents or training scripts suggests the framework is primarily a research platform rather than a complete RL training pipeline. You’ll need to bring your own RL algorithms and invest significant effort in reward shaping, hyperparameter tuning, and training infrastructure.
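For readers wondering what "bring your own agent" looks like in practice, here is a minimal interaction loop written against the Gymnasium `reset`/`step` signature. `ToyPentestEnv` is a made-up stand-in, not PenGym's environment, and its reward values are arbitrary; any policy that maps an observation to an action index can be dropped into `run_episode`.

```python
class ToyPentestEnv:
    """Stand-in with Gymnasium-style reset/step; not PenGym's real env.

    The attack chain is scan (0), exploit (1), privesc (2), in that order.
    """

    def reset(self, seed=None):
        self.stage = 0  # how far along the attack chain we are
        return self.stage, {}

    def step(self, action):
        reward = -1.0  # every action costs something
        if action == self.stage:  # correct next step in the chain
            self.stage += 1
        terminated = self.stage == 3
        if terminated:
            reward += 100.0  # bonus for completing the chain
        return self.stage, reward, terminated, False, {}


def run_episode(env, policy, max_steps=20):
    """Drive one episode and return the total reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            break
    return total


def scripted(obs):
    return obs  # in this toy, the right action equals the current stage


def always_scan(obs):
    return 0


print(run_episode(ToyPentestEnv(), scripted))     # 97.0
print(run_episode(ToyPentestEnv(), always_scan))  # -20.0
```

Replacing `scripted` with a learned policy is where the reward shaping and tuning effort mentioned above would go.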

Gotcha

The setup complexity is significant and poorly documented. You need Ubuntu 20.04 VMs (specifically that version), a working CyRIS installation with QEMU/KVM, Metasploit Framework with RPC enabled, and manual preparation of vulnerable service images. The repository README glosses over these prerequisites, but expect to spend days getting everything configured correctly. The vulnerable services are frozen in time—vsftpd 2.3.4, ProFTPD 1.3.3, and other intentionally backdoored versions from over a decade ago. This makes PenGym excellent for controlled research but disconnected from modern vulnerability landscapes.

The bigger limitation is training scalability. Each episode that runs against real VMs takes orders of magnitude longer than simulation. Network scanning with nmap is slow. Metasploit exploit attempts involve actual network connections and process execution. VM snapshot restoration between episodes adds overhead. Where you might train an agent for millions of episodes in NASim overnight, PenGym’s real execution mode might complete hundreds of episodes per day. This fundamentally limits the complexity of RL agents you can practically train. The framework is best suited for validating agents trained in simulation or for research comparing sim-to-real transfer, not for end-to-end RL training from scratch against real infrastructure.
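A back-of-envelope calculation shows where the gap comes from. All timings below are assumed for illustration (they are not measurements of PenGym or NASim); the function simply divides the seconds in a day by the time one episode takes.

```python
SECONDS_PER_DAY = 86_400


def episodes_per_day(steps_per_episode, seconds_per_step, reset_seconds):
    """Episodes that fit in 24 hours when run back to back."""
    episode_seconds = steps_per_episode * seconds_per_step + reset_seconds
    return SECONDS_PER_DAY / episode_seconds


# Assumed: 20 actions per episode, 10 s per real scan or exploit attempt,
# 60 s to restore VM snapshots between episodes.
real = episodes_per_day(20, 10, 60)

# Assumed: the same episode in simulation at 0.1 ms per step, no reset cost.
simulated = episodes_per_day(20, 0.0001, 0)

print(f"real execution: about {real:.0f} episodes/day")
print(f"simulation:     about {simulated:.0f} episodes/day")
```

Under these assumptions real execution lands in the hundreds of episodes per day while simulation reaches tens of millions, which is consistent with the orders-of-magnitude gap described above.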

Verdict

Use if: You’re a cybersecurity researcher investigating sim-to-real transfer for autonomous pentesting, have access to dedicated infrastructure and weeks to invest in setup, or need to validate RL agents against actual exploit frameworks rather than simulations. PenGym provides a unique bridge between abstract RL environments and real pentesting tools that doesn’t exist elsewhere.

Skip if: You want rapid prototyping of RL agents (use pure simulations like NASim instead), need production-ready automated pentesting (use Metasploit automation directly), lack virtualization infrastructure, or expect modern vulnerability coverage beyond ancient backdoored services. The framework’s research-oriented design and heavyweight requirements make it impractical for most developers outside academic security labs.
