HackingBuddyGPT: Teaching LLMs to Find Privilege Escalation Vulnerabilities
Hook
A rainy weekend question—‘Can LLMs hack systems?’—turned into a peer-reviewed ESEC/FSE ‘23 paper exploring how LLMs perform on Linux privilege escalation challenges. The results were promising enough (or disturbing, depending on your perspective) to warrant a full research framework.
Context
Penetration testing remains frustratingly manual. Security professionals spend countless hours poking at systems, running enumeration scripts, cross-referencing vulnerability databases, and piecing together attack chains. Meanwhile, large language models have demonstrated remarkable capabilities in code generation, reasoning, and following multi-step procedures. HackingBuddyGPT emerged from TU Wien’s IPA-Lab to bridge this gap—not by replacing human expertise, but by providing a research platform to systematically evaluate what LLMs can and cannot do in offensive security contexts.
The framework’s origin story matters. Andreas Happe’s weekend experiment produced initial results compelling enough to pursue with academic rigor: a published ESEC/FSE ‘23 paper, systematic benchmarks using isolated Linux environments, and empirical comparisons of different LLMs’ effectiveness at privilege escalation tasks. This isn’t speculative AI hype; it’s grounded research with reproducible results. The project was selected for GitHub Accelerator 2024, reflecting growing interest in applying AI to security research. The framework explicitly targets security researchers and academics who need to experiment with LLM-based autonomous agents without building orchestration infrastructure from scratch.
Technical Insight
HackingBuddyGPT’s architecture revolves around a feedback loop: capture system state, feed it to an LLM with context about the objective, execute the LLM’s suggested command, observe results, and repeat. The framework abstracts this pattern into reusable components—connection handlers (SSH or local shell), prompt templates, state management, and result parsing—allowing researchers to create custom ‘use-cases’ (essentially autonomous agents) with minimal boilerplate.
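That feedback loop can be sketched in a few lines. Everything below is illustrative, not the framework's actual API: `suggest_command` stands in for the LLM call, `run_command` stands in for SSH or local execution, and the success heuristic is deliberately crude.

```python
def suggest_command(history, goal):
    """Stand-in for an LLM call that picks the next shell command."""
    # A real agent would send `history` and `goal` to an LLM here.
    return "id" if not history else "sudo -l"

def run_command(cmd):
    """Stand-in for SSH/local execution; returns canned output."""
    canned = {"id": "uid=1000(lowpriv) gid=1000(lowpriv)",
              "sudo -l": "(root) NOPASSWD: /bin/bash"}
    return canned.get(cmd, "")

def agent_loop(goal, max_rounds=5):
    history = []
    for _ in range(max_rounds):
        cmd = suggest_command(history, goal)  # 1. ask the "LLM"
        output = run_command(cmd)             # 2. execute the suggestion
        history.append((cmd, output))         # 3. feed result back as state
        if "NOPASSWD" in output:              # 4. crude success heuristic
            break
    return history

transcript = agent_loop("escalate to root")
```

The key design point is that the loop is model-agnostic and transport-agnostic: swap the two stand-in functions for a real LLM client and a real shell, and the control flow stays identical.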
The ‘50 lines of code’ claim comes straight from the repository description, and the architecture backs it up. A typical use-case inherits from base classes that handle LLM provider integration (OpenAI, Anthropic, local models via Ollama), maintain conversation history, and manage the execution loop. Your custom agent primarily defines the objective, system prompts, and any domain-specific state extraction logic. The framework supports both SSH connections to remote vulnerable VMs and local shell execution for development and testing, though the latter comes with explicit warnings about executing arbitrary LLM-generated commands on your host system.
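The shape of such a use-case looks roughly like the sketch below. The class and method names (`BaseUseCase`, `next_command`) are assumptions for illustration, not hackingBuddyGPT's real base-class API; the point is the division of labor between inherited plumbing and agent-specific logic.

```python
class BaseUseCase:
    """Stand-in for the framework's base class: owns history and the loop."""
    def __init__(self):
        self.history = []

    def run(self, rounds=3):
        for _ in range(rounds):
            self.history.append(self.next_command())
        return self.history

class LinuxPrivEsc(BaseUseCase):
    """A custom agent mostly just declares its objective and prompting."""
    objective = "gain root on the target host"

    def next_command(self):
        # A real use-case would build a prompt from self.objective,
        # self.history, and captured system state, then call the LLM.
        if not self.history:
            return "find / -perm -4000 2>/dev/null"
        return "uname -a"

agent = LinuxPrivEsc()
commands = agent.run(rounds=2)
```

Under this pattern, the ‘50 lines’ are essentially the subclass: objective, prompts, and state extraction.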
The benchmarking infrastructure provides research rigor. HackingBuddyGPT integrates with reusable Linux privilege escalation benchmarks—isolated vulnerable environments where success is objectively measurable (did the agent gain root access?). This transforms subjective ‘can LLMs hack?’ questions into quantifiable experiments comparing model performance, prompt engineering strategies, and tool access patterns. The framework captures full transcripts of agent behavior: which commands were attempted, how the LLM interpreted output, and decision-making rationale.
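What makes the benchmark objectively measurable is that root access has a machine-checkable signature. A minimal sketch of such a success check (the helper name is hypothetical):

```python
def gained_root(id_output: str) -> bool:
    """True if captured `id` output shows effective uid 0, i.e. root."""
    return id_output.strip().startswith("uid=0(")

win = gained_root("uid=0(root) gid=0(root) groups=0(root)")
lose = gained_root("uid=1000(lowpriv) gid=1000(lowpriv)")
```

A check like this turns each benchmark run into a binary outcome, which is what allows success rates to be compared across models and prompt strategies.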
Prompt engineering is critical. The framework allows researchers to experiment with different context presentation strategies—how much command output to include, whether to provide summaries of previous attempts, how to represent file system state. The published paper comparing multiple LLMs shows that these decisions measurably affect success rates on privilege escalation challenges. Cost becomes a practical constraint: privilege escalation attempts often require multiple LLM calls, each processing substantial context about system state.
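One concrete knob from that design space is capping how much of each command's output reaches the prompt, which directly trades context fidelity against token cost. A minimal sketch, with illustrative names and limits:

```python
def truncate_output(output: str, max_lines: int = 10) -> str:
    """Cap command output so long listings don't blow up the prompt."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    omitted = len(lines) - max_lines
    return "\n".join(lines[:max_lines]) + f"\n[... {omitted} lines omitted]"

def build_prompt(goal, history, max_lines=10):
    """Assemble the goal plus (command, output) pairs into one context string."""
    parts = [f"Goal: {goal}"]
    for cmd, out in history:
        parts.append(f"$ {cmd}\n{truncate_output(out, max_lines)}")
    return "\n\n".join(parts)

# e.g. a `ps aux` dump of 50 lines gets trimmed before hitting the LLM
long_listing = "\n".join(f"process {i}" for i in range(50))
prompt = build_prompt("escalate to root", [("ps aux", long_listing)])
```

Alternatives in the same space include summarizing old turns instead of truncating them, or keeping only commands (not output) from earlier rounds; each variant changes both cost and what the model can reason about.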
Extensibility is designed in. While initial research focused on Linux privilege escalation, the architecture aims to support web application testing and API fuzzing use-cases. The connection abstraction separates command execution from LLM orchestration, potentially supporting handlers for web requests, database queries, or cloud API calls.
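The separation described above amounts to hiding command execution behind one narrow interface. A sketch of the idea, assuming hypothetical names (`Connection`, `LocalShell`, `SSHConnection`) rather than the framework's real classes:

```python
from typing import Protocol

class Connection(Protocol):
    """The single seam between LLM orchestration and command transport."""
    def exec(self, command: str) -> str: ...

class LocalShell:
    """Would wrap subprocess locally; stubbed here to avoid real execution."""
    def exec(self, command: str) -> str:
        return f"[local] ran: {command}"

class SSHConnection:
    """Would wrap an SSH client such as paramiko; stubbed for the sketch."""
    def __init__(self, host: str):
        self.host = host
    def exec(self, command: str) -> str:
        return f"[{self.host}] ran: {command}"

def step(conn: Connection, command: str) -> str:
    # Orchestration code sees only the interface, never the transport,
    # so a web-request or cloud-API handler could slot in unchanged.
    return conn.exec(command)

local_result = step(LocalShell(), "id")
remote_result = step(SSHConnection("vuln-vm"), "id")
```

This is why web application testing or API fuzzing are plausible extensions: a handler that issues HTTP requests satisfies the same interface as one that runs shell commands.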
Gotcha
This is explicitly research software, and the README’s warnings are there for good reason. The framework executes arbitrary commands suggested by LLMs on live systems. In local mode, you’re giving an AI agent shell access to your actual machine. The README explicitly warns: ‘This software will execute commands on live environments. When using local shell mode, commands will be executed on your local system, which could potentially lead to data loss, system modification, or security vulnerabilities.’ Even with SSH to isolated VMs, the LLM might suggest commands that corrupt data, exhaust resources, or trigger other unintended side effects.
Effectiveness varies by LLM choice and task complexity. The published research comparing multiple LLMs shows that performance differs significantly between models on privilege escalation benchmarks. For real-world penetration testing workflows, this creates a practical reality: you can’t reliably depend on autonomous agents to complete objectives without human supervision. The framework is valuable for studying LLM capabilities and limitations, not for automating away the human pentester. Additionally, running extensive experiments gets expensive quickly when using commercial APIs—systematic benchmarking across multiple scenarios and models can accumulate significant costs.
There’s also a security warning in the README about scams: the project explicitly states it’s not involved in any crypto coin, and warns that a Twitter account claiming association is attempting fraud.
Verdict
Use HackingBuddyGPT if you’re a security researcher exploring AI capabilities in offensive security, an academic studying autonomous agent behavior in constrained adversarial environments, or a curious professional pentester who wants to experiment with LLM-augmented workflows in isolated lab settings. The framework excels at enabling systematic, reproducible experiments comparing different LLMs and prompt strategies against standardized benchmarks. Skip it if you need production-ready penetration testing automation, lack dedicated isolated testing infrastructure (VMs or containers you’re willing to potentially compromise), want plug-and-play security tools for client engagements, or expect LLMs to autonomously replace human security expertise. This is fundamentally a research platform—its value proposition is enabling systematic study of what LLMs can do in security contexts, not providing operational tooling. The 50-line simplicity makes it perfect for rapid prototyping and experimentation, but that same simplicity means you’re responsible for understanding both the security implications and the AI safety considerations of autonomous agents with command execution privileges.