HackingBuddyGPT: Teaching LLMs to Think Like Penetration Testers

Hook

A team at TU Wien published research showing that GPT-4 could autonomously escalate Linux privileges in 73% of test scenarios—and they built the framework in under 50 lines of code per use case.

Context

Penetration testing has always been a game of iteration: probe a system, observe the response, adjust your approach, repeat. The best pentesters maintain mental models of dozens of exploitation techniques, recognizing patterns in configuration files, process listings, and file permissions that hint at privilege escalation paths or lateral movement opportunities. This expertise takes years to develop and is difficult to scale.

Traditional automation tools like Metasploit or custom scripts excel at executing known exploits deterministically, but they lack contextual reasoning. They can't look at a misconfigured sudo rule, connect it to an unusual SUID binary, and improvise a novel escalation chain. Large language models promised to bridge this gap—they've ingested security documentation, CVE databases, and pentesting guides during training. But raw ChatGPT conversations aren't enough; you need structured interaction patterns, command execution capabilities, and feedback loops. HackingBuddyGPT, developed by TU Wien's IPA-Lab and backed by peer-reviewed research presented at FSE'23, provides exactly this: a minimal framework that transforms LLMs into autonomous security agents while keeping the barrier to entry remarkably low.

Technical Insight

The genius of HackingBuddyGPT lies in its simplicity. At its core, it's an event loop that maintains a conversation with an LLM, executes the commands the model suggests, and feeds the results back. The framework provides two key abstractions: a UseCase base class that defines agent behavior, and connection handlers (SSHConnection or LocalConnection) that execute commands in target environments.

Here's what a minimal privilege escalation agent looks like:

from hackingBuddyGPT.usecases import UseCase
from hackingBuddyGPT.capabilities import SSHConnection

class PrivEscAgent(UseCase):
    conn: SSHConnection
    
    def init(self):
        self.max_turns = 15
        self.system_prompt = """
        You are a penetration tester on a Linux system with limited privileges.
        Your goal: escalate to root access.
        Suggest ONE command at a time. Explain your reasoning briefly.
        Format: COMMAND: <cmd>
        """
        
    async def run(self):
        context = f"Current user: {self.conn.run('whoami')}"
        
        for turn in range(self.max_turns):
            # Query LLM with accumulated context
            response = await self.llm.get_response(
                self.system_prompt + "\n" + context
            )
            
            # Extract command from LLM response
            cmd = self.parse_command(response)
            if not cmd:
                break
                
            # Execute and capture output
            result = self.conn.run(cmd)
            context += f"\n\nCOMMAND: {cmd}\nOUTPUT: {result}"
            
            # Check if we achieved root (strip the trailing newline before comparing)
            if self.conn.run('id -u').strip() == '0':
                return True
        return False

This pattern—prompt, execute, observe, repeat—mirrors how human pentesters work, but compressed into a tight loop. The framework handles OpenAI API authentication, rate limiting, and conversation history management behind the scenes.
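
The example above calls self.parse_command() without defining it. A minimal sketch of what such a helper might look like follows, assuming the model respects the "COMMAND: <cmd>" format requested in the system prompt. The function name matches the example, but the regex and the backtick handling are illustrative assumptions, not the framework's actual parser.

```python
import re
from typing import Optional

# Hypothetical helper: one plausible way to pull the command out of a reply.
# Matches a line of the form "COMMAND: <cmd>", allowing leading indentation.
_COMMAND_RE = re.compile(r"^[ \t]*COMMAND:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_command(response: str) -> Optional[str]:
    """Return the first 'COMMAND: <cmd>' payload in an LLM reply, or None."""
    match = _COMMAND_RE.search(response)
    if match is None:
        return None
    # Models often wrap commands in backticks, so remove them if present.
    cmd = match.group(1).strip().strip("`").strip()
    return cmd or None


reply = "sudo -l lists what this user may run as root.\nCOMMAND: sudo -l"
print(parse_command(reply))
print(parse_command("I have no further ideas."))
```

Returning None when no command is found lets the loop stop cleanly instead of executing free-form text from the model.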

What makes this particularly powerful is the state management. The context string grows with each iteration, giving the LLM an expanding view of what's been tried and what the system looks like. The published research shows this conversational memory is critical; single-shot prompts rarely succeed at complex exploitation. The LLM needs to build understanding incrementally, just like a human would when enumerating a new target.
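
The growing context can also be kept as structured turns rather than one concatenated string. The sketch below is an assumption-laden illustration: the History class, its method names, and the max_output_chars cap are hypothetical and not part of the framework. The cap is added only because very long command output would otherwise crowd out earlier turns.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class History:
    """Accumulates (command, output) pairs and renders them as prompt context."""

    max_output_chars: int = 2000  # assumed cap, not a framework setting
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, cmd: str, output: str) -> None:
        # Truncate very long output so a single turn cannot dominate the prompt.
        if len(output) > self.max_output_chars:
            output = output[: self.max_output_chars] + "\n[output truncated]"
        self.turns.append((cmd, output))

    def render(self) -> str:
        # Same COMMAND/OUTPUT layout the example agent appends to its context.
        return "\n\n".join(f"COMMAND: {c}\nOUTPUT: {o}" for c, o in self.turns)


history = History(max_output_chars=20)
history.add("whoami", "lowpriv")
history.add("ls -la /etc", "x" * 100)
print(history.render())
```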

The connection abstraction is equally elegant. SSHConnection wraps paramiko to execute commands remotely, while LocalConnection uses subprocess for local testing. Both implement the same interface, so switching between sandbox testing and real targets requires changing a single line. The SSH implementation is particularly clever—it maintains a persistent session and captures both stdout and stderr, which many exploitation techniques rely on:

# The framework handles session persistence automatically
conn = SSHConnection(
    hostname="target.lab",
    username="lowpriv",
    password="password123"
)

# Commands execute in the same shell session
conn.run("cd /tmp")
result = conn.run("pwd")  # Returns: /tmp
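
The shared-interface idea can be sketched without the framework. In the example below, Connection, LocalShell, and ScriptedConnection are hypothetical names chosen for illustration, and the assumption is that run() takes a command string and returns its output as text. Unlike the persistent SSH session described above, this LocalShell starts a fresh shell per command, so state such as the working directory does not carry over.

```python
import subprocess
from typing import Dict, Protocol


class Connection(Protocol):
    """Anything with a run(cmd) -> str method can serve as a connection."""

    def run(self, cmd: str) -> str: ...


class LocalShell:
    """Runs each command in a fresh local shell and returns stdout plus stderr."""

    def run(self, cmd: str) -> str:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=30
        )
        return (proc.stdout + proc.stderr).strip()


class ScriptedConnection:
    """Test double that returns canned output instead of executing anything."""

    def __init__(self, replies: Dict[str, str]) -> None:
        self.replies = replies

    def run(self, cmd: str) -> str:
        return self.replies.get(cmd, "")


def current_user(conn: Connection) -> str:
    # Agent code depends only on the interface, not on the transport behind it.
    return conn.run("whoami")


print(current_user(ScriptedConnection({"whoami": "lowpriv"})))
print(LocalShell().run("echo hello"))
```

Because current_user only relies on run(), swapping the scripted double for a real connection is a one-line change at the call site.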

The researchers benchmarked this approach against various LLMs using a standardized set of intentionally vulnerable Linux VMs. GPT-4 achieved 73% success rate on privilege escalation tasks, while GPT-3.5 managed 34%. The difference? GPT-4's superior ability to connect disparate clues—noticing that a writeable systemd service file combined with a specific kernel version created an exploitable condition.

But the framework's minimalism is also strategic. By keeping core abstractions simple, the maintainers enable rapid experimentation. Want to add tool-using capabilities? Extend the command parser to recognize function calls. Need to incorporate vulnerability databases? Inject CVE context into the system prompt. The 50-line constraint forces clarity; there's no sprawling class hierarchy to navigate, just a straightforward event loop you can reason about completely.

Gotcha

The elephant in the room: this framework executes arbitrary commands suggested by an AI model that can hallucinate, misunderstand context, or be manipulated by adversarial inputs. The repository documentation is explicit about this—use isolated VMs, never run on production systems, expect potential data loss. But even in sandboxed environments, there are subtler issues.

LLM reliability remains fundamentally unpredictable. A command sequence that worked perfectly in testing might fail inexplicably when the LLM's response varies slightly due to temperature sampling. The researchers found that even GPT-4 would sometimes "give up" on solvable challenges or pursue obviously dead-end approaches for multiple turns. Unlike traditional tools where you can file a bug report with a reproducible test case, debugging LLM agent failures often feels like reading tea leaves. Was the system prompt too vague? Did the context window fill with irrelevant information? Did the model simply have a bad day?

Cost and latency are practical concerns too. A single privilege escalation attempt might require 10-15 LLM calls at several thousand tokens each. With GPT-4 pricing around $0.03 per 1K prompt tokens, you're looking at dollars per test run, and minutes of wall-clock time. This makes the framework excellent for targeted research scenarios but impractical for the kind of broad scanning that production pentesting requires. You wouldn't use this to check 1,000 servers for misconfigurations; you'd use it to deeply explore one interesting target.
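
The arithmetic behind "dollars per test run" can be made concrete. This is a rough estimate only: the 3,000 and 5,000 tokens-per-call figures are assumed stand-ins for "several thousand", the $0.03 rate is the one quoted above, and completion tokens are ignored.

```python
def estimate_prompt_cost(
    calls: int, tokens_per_call: int, usd_per_1k_tokens: float
) -> float:
    """Prompt-side cost in USD for one run, ignoring completion tokens."""
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


# Assumed bounds: 10 calls at 3,000 tokens up to 15 calls at 5,000 tokens.
low = estimate_prompt_cost(calls=10, tokens_per_call=3000, usd_per_1k_tokens=0.03)
high = estimate_prompt_cost(calls=15, tokens_per_call=5000, usd_per_1k_tokens=0.03)
print(f"${low:.2f} to ${high:.2f} per run")
```

Under these assumptions a single run lands at roughly $0.90 to $2.25 in prompt tokens, and because the context grows every turn, later calls in a run cost more than earlier ones.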

Finally, the research codebase shows its academic origins. Error handling is minimal, logging is sparse, and the documentation assumes significant context about both pentesting and LLM interaction patterns. This is a framework for people who already know what they're doing with both security and AI—not a turnkey solution for less experienced practitioners.

Verdict

Use HackingBuddyGPT if: you're researching autonomous security agents and need a clean foundation for experimentation; you want to benchmark different LLMs on realistic pentesting tasks; you're exploring novel attack automation ideas and value rapid prototyping over production polish; or you're teaching advanced security courses and need a platform that demonstrates AI-assisted exploitation.

Skip if: you need deterministic, cost-effective tools for production pentesting workflows; you lack the infrastructure to safely sandbox arbitrary command execution; you expect enterprise-grade error handling and safety guarantees; or you're looking for a turnkey solution rather than a research platform.

This is a framework for the security researcher who wants to push boundaries, not the practitioner who needs reliable automation for client engagements.
