Building a Multi-Agent Pentesting System with AutoGen: When LLMs Orchestrate Security Workflows

Hook

What if your penetration testing team wasn’t humans passing reports back and forth, but AI agents coordinating through structured conversation—with a human able to intervene at every step, only at the end, or not at all?

Context

Traditional penetration testing follows a predictable workflow: reconnaissance, vulnerability scanning, exploitation research, command execution, and report generation. Each phase requires switching between tools, consulting vulnerability databases, crafting exploits, and documenting findings. The process is time-consuming and repetitive, yet requires constant human judgment to avoid false positives and dangerous commands.

Pentesting-AI tackles this friction by modeling the entire pentest workflow as a multi-agent system. Rather than building a monolithic tool, it creates specialized AI agents—each an expert in one phase—that collaborate through AutoGen’s group chat mechanism. A Scanner agent handles reconnaissance, a Searcher queries CVE databases, an Exploiter crafts payloads, and a Writer generates reports. The twist? You control how much the system runs autonomously versus requiring your approval at each step. This architecture represents an early experiment in applying LLM-based orchestration to security workflows, where the goal isn’t replacing human pentesters but automating the mechanical parts while preserving human oversight.

Technical Insight

The architecture centers on AutoGen’s group chat pattern, which allows multiple AI agents to communicate in a shared conversation thread managed by a Group Chat Manager. Each agent receives a system prompt defining its role and constraints. For example, the Pentester Scanner agent is configured as an “info-gathering expert” responsible for creating reconnaissance commands, while the Code Checker agent validates those commands before execution. This separation of concerns mimics real security team dynamics where different specialists contribute their expertise.
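The core pattern—role-scoped agents sharing one conversation thread under a manager that picks the next speaker—can be sketched in plain Python. This is an illustrative toy, not the project’s code or AutoGen’s actual API: the agent names and prompts echo the README, the `reply` method stands in for an LLM call, and real AutoGen managers can select speakers via an LLM rather than round-robin.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    system_prompt: str  # role and constraints, as an AutoGen system message would carry

    def reply(self, history):
        # Stand-in for an LLM call: act on the most recent message.
        last = history[-1]["content"] if history else ""
        return f"[{self.name}] acting on: {last}"

class GroupChatManager:
    """Shares one message thread among agents and decides who speaks next."""
    def __init__(self, agents):
        self.agents = agents
        self.history = []

    def run(self, task, rounds=3):
        self.history.append({"sender": "user", "content": task})
        for i in range(rounds):
            # Round-robin here for simplicity; a real manager routes by context.
            speaker = self.agents[i % len(self.agents)]
            msg = speaker.reply(self.history)
            self.history.append({"sender": speaker.name, "content": msg})
        return self.history

scanner = Agent("Scanner", "You are an info-gathering expert; propose recon commands only.")
checker = Agent("CodeChecker", "You validate proposed commands before execution.")
chat = GroupChatManager([scanner, checker]).run("Scan 10.0.0.5", rounds=2)
for m in chat:
    print(m["sender"], ":", m["content"])
```

The separation matters: each agent sees the full shared history but answers only within its role, which is how the Scanner/Checker division of labor described above is enforced.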

The system defines three interaction modes that fundamentally change how the User Proxy Agent behaves. When you set INTERACTION_MODE to ALWAYS, the user must approve every agent action. With NEVER, the agents run completely autonomously until completion. The TERMINATE mode only prompts for input when the AI believes it’s finished. This flexibility is critical for security work—you might want full automation for routine scans but human-in-the-loop oversight when the Exploiter agent starts crafting payloads. The README shows this configuration happens at the top of main.py by modifying a simple variable before building the Docker container.
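The gating logic these three modes imply can be written as a small predicate. The function and variable names below are illustrative—the README only tells us the mode names and where the variable lives, not how main.py implements the check:

```python
INTERACTION_MODE = "TERMINATE"  # one of "ALWAYS", "NEVER", "TERMINATE"

def needs_human_approval(mode: str, agent_believes_done: bool) -> bool:
    """Decide whether to pause for human input before proceeding."""
    if mode == "ALWAYS":
        return True                  # approve every agent action
    if mode == "NEVER":
        return False                 # fully autonomous run
    if mode == "TERMINATE":
        return agent_believes_done   # prompt only when the AI thinks it is finished
    raise ValueError(f"unknown mode: {mode}")
```

In AutoGen itself, this behavior is typically controlled by the user proxy agent’s human-input setting, which takes the same three values.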

Execution happens through two key agents: the Code Executor Agent runs shell commands locally within the Docker container, while the File Reader Agent retrieves command output. This design isolates dangerous pentest operations inside Docker while giving agents the ability to iterate based on results. For instance, the Scanner might generate an nmap scan, the Executor runs it, the File Reader retrieves output from /groupchat/outputs, and then the Vulnerabilities Searcher queries CVE databases based on discovered services. The agents continue this loop—propose, execute, analyze, iterate—until they’ve exhausted their attack surface or the human intervenes.
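One leg of that loop—execute a command, persist its output, read it back for the next agent—might look like the sketch below. This is an assumption-laden illustration: the temp directory stands in for the project’s /groupchat/outputs path, and the real Executor and File Reader agents presumably wrap similar calls behind LLM-driven tool use.

```python
import subprocess
import tempfile
from pathlib import Path

OUTPUT_DIR = Path(tempfile.mkdtemp())  # stands in for /groupchat/outputs

def execute(command: list, name: str) -> Path:
    """Code Executor role: run a command and persist its stdout to a file."""
    result = subprocess.run(command, capture_output=True, text=True)
    out = OUTPUT_DIR / f"{name}.txt"
    out.write_text(result.stdout)
    return out

def read_output(path: Path) -> str:
    """File Reader role: retrieve saved output for the next agent in the chat."""
    return path.read_text()

# The Scanner proposes a command, the Executor runs it, and the
# File Reader feeds the result back so the Searcher can consume it.
saved = execute(["echo", "22/tcp open ssh"], "nmap-scan")
print(read_output(saved))
```

Separating execution from retrieval means agents never need shell access themselves—they only ever exchange file paths and text through the conversation.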

The Webpage Fetcher and Webpage Communicator agents add web-specific capabilities beyond traditional command-line pentesting tools. The Fetcher performs HTTP requests and returns HTML, while the Communicator appears to interact with webpages based on HTML reports, performing actions like clicking and filling input fields. This suggests the system may be able to handle multi-step web exploitation scenarios where discovering a vulnerability requires navigating through forms or authenticated sections—something static scanners struggle with.
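To make the Communicator’s job concrete, here is a minimal stdlib sketch of the kind of HTML analysis such an agent would need before it can click or fill anything: extracting form targets and input field names from fetched markup. The project’s actual implementation is not shown in its README; this is only an illustration of the idea.

```python
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Collect form actions and input field names from fetched HTML."""
    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"action": attrs.get("action", ""), "inputs": []})
        elif tag == "input" and self.forms:
            self.forms[-1]["inputs"].append(attrs.get("name", ""))

html = '<form action="/login"><input name="user"><input name="pass"></form>'
finder = FormFinder()
finder.feed(html)
print(finder.forms)
```

An agent with this kind of structured view of a page can then decide which fields to fill and which action to submit—the multi-step navigation that static scanners miss.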

According to the agent descriptions, a workflow emerges: the Scanner generates reconnaissance commands and the Executor runs them (output saved to /groupchat/outputs). The Vulnerabilities Searcher consumes scan results to query CVE and CAPEC databases for known vulnerabilities. The Exploiter creates exploitation commands based on findings, which the Code Checker validates before the Executor runs them. Finally, the Report Writer summarizes everything and saves findings to /groupchat/reports. The Group Chat Manager orchestrates this sequence, deciding which agent speaks next based on conversation context.
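The sequence above can be summarized as an ordered phase list. The duties are paraphrased from the agent descriptions; in the running system the ordering is not hard-coded like this but decided dynamically by the Group Chat Manager.

```python
# The described workflow as data (paths from the README; ordering is nominal).
WORKFLOW = [
    ("Scanner",                  "generate reconnaissance commands"),
    ("Code Executor",            "run commands; save output to /groupchat/outputs"),
    ("Vulnerabilities Searcher", "query CVE/CAPEC databases against scan results"),
    ("Exploiter",                "craft exploitation commands from findings"),
    ("Code Checker",             "validate commands before execution"),
    ("Report Writer",            "summarize findings to /groupchat/reports"),
]
for agent, duty in WORKFLOW:
    print(f"{agent}: {duty}")
```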

The Docker-first approach isn’t just for convenience. Running AI-generated pentest commands directly on your host system is dangerous—a hallucinating LLM could generate destructive commands or accidentally pivot to unintended targets. By requiring docker build -t pentesting-ai . and docker run -it --name pentesting-ai-container pentesting-ai, the project ensures all command execution happens in an isolated environment. You interact with the container interactively (-it flag), observing agents collaborate while knowing your host system remains protected.

The API configuration requires Azure OpenAI credentials set “at the top of the main.py file.” This dependency on Azure rather than standard OpenAI or local models is significant—it means you’re subject to Azure’s rate limits, pricing, and availability. The README mentions setting “your API key and endpoint, as well as your model name (and your API type if you don’t want to use Azure OpenAI),” indicating some flexibility, though Azure appears to be the default configuration. Each agent request costs tokens, and a full session with multiple collaborating agents can burn through them quickly, making cost management a practical concern for extensive use.
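AutoGen conventionally takes credentials as a list of model-config dictionaries. The fragment below shows the shape of an Azure-flavored entry; all values are placeholders, and the project’s exact variable names in main.py may differ.

```python
# Placeholder values; the project reads its equivalents "at the top of main.py".
config_list = [{
    "model": "gpt-4",                                 # your Azure deployment/model name
    "api_key": "<YOUR_AZURE_OPENAI_KEY>",
    "base_url": "https://<resource>.openai.azure.com/",
    "api_type": "azure",                              # change this to use a non-Azure backend
    "api_version": "2024-02-01",
}]
```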

Gotcha

The project’s experimental nature shows in several ways. With only 28 GitHub stars and no version tags or releases, this is clearly an early-stage research project rather than production-ready tooling. The README provides no information about testing, validation of agent outputs, or accuracy rates for vulnerability detection. When you’re relying on LLMs to identify security flaws and craft exploits, the risk of hallucinated vulnerabilities or ineffective exploitation attempts is high. There’s no discussion of how the system handles false positives, validates that exploits actually work, or prevents agents from getting stuck in unproductive conversation loops.

The Azure OpenAI dependency creates both cost and accessibility barriers. Unlike tools that can run with local LLMs or free APIs, this requires an Azure account with OpenAI access provisioned—something not all security professionals or researchers have. The README doesn’t provide guidance on expected token consumption or cost estimation for a typical pentest session. More critically, there’s no fallback or error handling described for API rate limits, which could cause the entire multi-agent workflow to halt mid-assessment. The lack of discussion around prompt injection risks is also concerning—if an agent fetches webpage content containing adversarial text designed to manipulate LLM behavior, could an attacker hijack the agent’s decision-making? These aren’t theoretical concerns; they’re fundamental challenges when applying LLMs to security contexts where adversarial input is the norm.

Verdict

Use if: You’re a security researcher exploring AI-augmented pentesting workflows and want to experiment with multi-agent LLM orchestration patterns. This project offers valuable insights into how AutoGen’s group chat mechanism can model complex security workflows, and the three interaction modes provide a practical framework for balancing automation with human oversight. It’s particularly useful if you already have Azure OpenAI access and want to prototype AI-assisted reconnaissance and vulnerability research phases. The Dockerized approach makes it safe to experiment with AI-generated commands without risking your host system.

Skip if: You need reliable, production-ready penetration testing automation for professional engagements. The project’s immaturity, lack of validation mechanisms, and dependence on costly Azure APIs make it unsuitable for anything beyond research and experimentation. Traditional tools like Metasploit, Burp Suite, or OWASP ZAP provide proven, auditable results that clients and compliance frameworks accept—LLM-generated pentest reports don’t yet meet that bar. Also skip if you lack Azure OpenAI credentials or need cost-predictable tooling, as token consumption for multi-agent conversations could become expensive.

This is a fascinating proof-of-concept for where AI-driven security tooling might go, but it’s not ready to replace your existing pentest workflow.
