Building a Multi-Agent Penetration Testing System with AutoGen and LLMs
Hook
What happens when you give an AI system the ability to autonomously scan networks, research vulnerabilities, and execute exploitation code? The Pentesting-AI project explores this controversial frontier by orchestrating multiple LLM agents into a coordinated hacking team.
Context
Penetration testing has always been a labor-intensive process requiring specialized expertise across multiple domains: reconnaissance, vulnerability analysis, exploit development, and report writing. A single comprehensive security assessment might take days or weeks as human pentesters methodically work through the attack surface. While tools like Metasploit and Nmap have automated portions of this workflow, they still require significant human judgment to interpret results, chain exploits, and adapt to unexpected findings.
The emergence of large language models with reasoning capabilities opened a new possibility: what if AI agents could handle not just individual tasks, but collaborate on complex multi-stage security assessments? Pentesting-AI explores this concept using Microsoft’s AutoGen framework to create specialized agents that mirror how real pentesting teams operate—with different experts handling reconnaissance, exploitation, documentation, and code safety checks. The project attempts to compress hours of manual work into automated workflows while maintaining a human-in-the-loop control mechanism for oversight.
Technical Insight
The architecture centers on AutoGen’s group chat orchestration model, where a manager agent coordinates communication between specialized agents, each powered by Azure OpenAI’s GPT models. The system defines distinct roles that map to traditional penetration testing phases: a Scanner Agent for reconnaissance, an Exploiter Agent for active attacks, a Webpage Fetcher for target enumeration, a Vulnerability Searcher for CVE research, and a Report Writer for documentation.
Here’s how the core agent initialization works:
import os

from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

# Note: for Azure OpenAI specifically, llm_config typically also needs
# api_type="azure", the deployment endpoint, and an api_version.
scanner_agent = AssistantAgent(
    name="Scanner",
    system_message="""You are a reconnaissance specialist. Your role is to gather
information about the target using nmap, whois, and other scanning tools.
Provide structured output about open ports, services, and potential attack vectors.""",
    llm_config={"model": "gpt-4", "api_key": os.environ["AZURE_OPENAI_KEY"]},
)

exploiter_agent = AssistantAgent(
    name="Exploiter",
    system_message="""You are an exploitation specialist. Based on discovered
vulnerabilities, propose and execute safe exploitation attempts. Always explain
your reasoning and wait for approval before running destructive commands.""",
    llm_config={"model": "gpt-4", "api_key": os.environ["AZURE_OPENAI_KEY"]},
)

user_proxy = UserProxyAgent(
    name="Security_Engineer",
    human_input_mode="TERMINATE",  # Can be ALWAYS, NEVER, or TERMINATE
    code_execution_config={"work_dir": "pentesting_workspace", "use_docker": True},
)

groupchat = GroupChat(
    agents=[scanner_agent, exploiter_agent, user_proxy],
    messages=[],
    max_round=50,
)

manager = GroupChatManager(groupchat=groupchat, llm_config={"model": "gpt-4"})
The critical design decision here is the code_execution_config with Docker isolation. When agents generate reconnaissance or exploitation commands, they execute inside containers to prevent accidental damage to the host system. This is essential because LLMs can hallucinate dangerous commands or misinterpret target scope.
The workflow operates as a conversation thread where agents “speak” in sequence. The Scanner might propose running nmap -sV target.com, the Code Executor validates the command’s safety, execution happens in Docker, results feed back into the chat, and the Vulnerability Searcher cross-references discovered services against CVE databases. The Exploiter then crafts targeted attacks based on this intelligence.
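A run of this conversation loop is started from the user proxy. A minimal sketch, assuming the agents and manager defined above, valid Azure OpenAI credentials, and "scanme.lab" as a hypothetical in-scope target:

```python
# Sketch only: requires the agents/manager defined earlier plus valid API
# credentials. "scanme.lab" is a hypothetical authorized lab target.
user_proxy.initiate_chat(
    manager,
    message="Enumerate scanme.lab: identify open ports and running services, "
            "cross-reference them against known CVEs, then summarize findings.",
)
```

From here the manager routes each turn to the appropriate agent until the round limit is reached or a termination condition fires.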
What makes this approach powerful is the emergent behavior from agent collaboration. Unlike linear automation scripts, agents can debate approaches, request additional reconnaissance when exploitation fails, and adapt strategies based on intermediate results. The Group Chat Manager uses its own LLM to decide which agent should speak next based on conversation context—mimicking how a penetration testing team lead would delegate tasks.
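In pyautogen this delegation behavior is configurable via GroupChat's speaker_selection_method parameter; a sketch reusing the agents defined earlier:

```python
# "auto" (the default) lets the manager's LLM pick the next speaker from
# conversation context; "round_robin" trades that flexibility for a fixed,
# deterministic speaking order.
groupchat = GroupChat(
    agents=[scanner_agent, exploiter_agent, user_proxy],
    messages=[],
    max_round=50,
    speaker_selection_method="auto",
)
```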
The human-in-the-loop modes provide a crucial gradient of control. ALWAYS mode requires approval for every command: tedious, but safe for production targets. TERMINATE mode runs autonomously and only pauses for human input when the agents signal completion or the conversation hits its round limit. NEVER mode is full automation, appropriate only for sandboxed test environments. This flexibility lets practitioners calibrate risk based on target sensitivity and assessment goals.
One particularly clever pattern is the Code Checker agent that reviews proposed commands before execution. This agent receives the exploitation code and target context, then provides a safety assessment:
code_checker = AssistantAgent(
    name="CodeChecker",
    system_message="""Review proposed commands for safety issues. Check for:
1. Commands that might affect systems outside the authorized scope
2. Destructive operations without proper safeguards
3. Credentials or sensitive data that might be exposed
Flag concerns and suggest safer alternatives.""",
    llm_config={"model": "gpt-4", "api_key": os.environ["AZURE_OPENAI_KEY"]},
)
This creates a multi-layer safety mechanism where exploitation intent must pass through review before execution, similar to peer code review in development teams. However, it’s important to note that this LLM-based safety check is heuristic, not deterministic—it can miss subtle issues or be fooled by creative prompt engineering.
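Because the LLM review is heuristic, a deterministic gate layered in front of it is cheap insurance. A minimal sketch of what such a pre-check could look like (the scope list and patterns here are illustrative assumptions, not part of the project):

```python
import re

# Illustrative deterministic pre-check: reject commands that touch hosts
# outside the authorized scope or match known-destructive patterns.
AUTHORIZED_SCOPE = {"10.0.0.5", "10.0.0.6"}  # hypothetical engagement scope
DESTRUCTIVE_PATTERNS = [r"\brm\s+-rf\b", r"\bmkfs\b", r"\bdd\s+if=", r"\bshutdown\b"]

def precheck(command: str) -> tuple[bool, str]:
    """Return (allowed, reason). Runs before any LLM-based review."""
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command):
            return False, f"destructive pattern: {pattern}"
    # Every IPv4 address in the command must belong to the authorized scope.
    for host in re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", command):
        if host not in AUTHORIZED_SCOPE:
            return False, f"host out of scope: {host}"
    return True, "ok"

print(precheck("nmap -sV 10.0.0.5"))     # in scope: allowed
print(precheck("nmap -sV 192.168.1.1"))  # out of scope: rejected
print(precheck("rm -rf /tmp/loot"))      # destructive pattern: rejected
```

The point of the layering is that the regex gate fails closed and cannot be argued out of its decision, while the LLM reviewer catches the subtler, context-dependent issues the regexes cannot express.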
Gotcha
The elephant in the room is reliability and safety. LLMs are probabilistic systems that sometimes hallucinate tools that don’t exist, generate syntactically correct but functionally wrong commands, or misunderstand target scope. I tested similar multi-agent systems and watched them confidently propose exploitation steps for vulnerabilities that didn’t actually exist in the discovered services. The Docker isolation helps contain blast radius, but doesn’t prevent wasted time chasing false positives or the reputational damage of an AI generating overly aggressive scans that trip intrusion detection systems.
The Azure OpenAI dependency creates both cost and availability concerns. A comprehensive penetration test might involve hundreds of agent interactions, each consuming API tokens. For a complex target, you could easily burn through $50-100 in API costs per assessment. More problematically, you’re sending target information and discovered vulnerabilities to Microsoft’s cloud infrastructure, which may violate client confidentiality requirements or compliance frameworks. There’s no mention of local LLM support (Llama, Mistral, etc.) that would keep sensitive reconnaissance data on-premises. The 28-star count and minimal community activity also mean you’re largely on your own for troubleshooting—this isn’t battle-tested code with established best practices and edge case handling.
Verdict
Use if: You’re a security researcher exploring AI-assisted methodologies in controlled lab environments with explicit authorization, you have budget for Azure OpenAI API costs and are comfortable with cloud data processing, you want to experiment with multi-agent orchestration patterns for complex workflows, or you’re building proof-of-concepts to demonstrate AI capabilities to stakeholders. This is an excellent learning platform for understanding how LLM agents can collaborate on technical tasks.

Skip if: You need production-ready penetration testing tools with liability coverage and safety guarantees, you’re working with sensitive targets where reconnaissance data cannot leave your infrastructure, you lack the expertise to validate AI-generated exploitation attempts and catch dangerous hallucinations, or you’re looking for cost-effective automation (established tools like Nuclei provide better ROI).

This is fundamentally a research prototype that demonstrates possibility, not a replacement for professional penetration testing services or mature commercial tools.