VulnBot: Teaching AI Agents to Hack Like a Security Team
Hook
What if your penetration testing team never slept, never forgot a CVE, and could coordinate reconnaissance, exploitation, and reporting in parallel—all while learning from your organization’s past security assessments?
Context
Penetration testing remains one of the most labor-intensive activities in cybersecurity. A typical assessment requires coordinating specialists with different expertise: reconnaissance experts who map attack surfaces, vulnerability researchers who identify weaknesses, exploit developers who chain vulnerabilities into working attacks, and report writers who translate technical findings into business risk. This coordination overhead means most organizations test their systems quarterly at best, leaving months-long windows where new vulnerabilities go undetected.
Traditional automation hasn’t solved this problem because existing tools operate in isolation. Nmap discovers hosts but doesn’t reason about what those discoveries mean. Metasploit exploits vulnerabilities but doesn’t autonomously decide which exploit chain makes sense for a given target. These tools lack the contextual reasoning and collaborative decision-making that human security teams use to conduct sophisticated assessments. VulnBot emerged from academic research asking whether large language models could bridge this gap—not by replacing security tools, but by orchestrating them the way a human team leader coordinates specialists, complete with knowledge sharing, tactical decision-making, and adaptive strategy.
Technical Insight
VulnBot’s architecture centers on specialized agent roles that mirror real penetration testing team structures. Each agent runs as an independent LLM-powered entity with a specific responsibility: reconnaissance agents gather target information, scanning agents identify vulnerabilities, exploitation agents attempt to compromise systems, and reporting agents synthesize findings. The critical innovation is how these agents collaborate through a structured interaction protocol.
The framework implements agent communication through a message-passing system where agents can request information, share findings, and coordinate multi-stage attacks. Here’s a simplified version of how agents interact during a typical workflow:
```python
class PentestAgent:
    def __init__(self, role, llm_client, tool_executor, vector_store):
        self.role = role  # 'recon', 'scanner', 'exploiter', 'reporter'
        self.llm = llm_client
        self.tools = tool_executor
        self.knowledge_base = vector_store
        self.interaction_history = []

    def execute_task(self, target_info, context_from_agents):
        # Retrieve relevant security knowledge using RAG
        relevant_docs = self.knowledge_base.similarity_search(
            query=f"{self.role} tactics for {target_info}",
            k=5
        )

        # Construct prompt with role-specific instructions and context
        prompt = self._build_prompt(
            target=target_info,
            previous_findings=context_from_agents,
            knowledge=relevant_docs
        )

        # LLM decides what tools to use and how
        action_plan = self.llm.generate(prompt)

        # Execute actual security tools via Kali Linux
        tool_results = self.tools.execute(action_plan['commands'])

        # Share findings with other agents
        return {
            'agent': self.role,
            'findings': tool_results,
            'next_steps': action_plan['recommendations']
        }
```
The RAG integration through Langchain-Chatchat and Milvus is what elevates this from a glorified script executor to a knowledge-driven system. Before each action, agents query a vector database populated with CVE databases, exploit documentation, past penetration test reports, and security research papers. This grounds the LLM’s reasoning in actual security knowledge rather than relying solely on training data. When a scanner agent identifies an open port running Apache 2.4.49, the RAG system retrieves documentation about CVE-2021-41773 path traversal vulnerabilities specific to that version, which the exploitation agent then uses to craft targeted attacks.
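To make the retrieval step concrete, here is a toy stand-in for that lookup: a tiny in-memory CVE corpus scored by token overlap with the agent's query. VulnBot's actual pipeline uses real embeddings via Langchain-Chatchat and Milvus; this sketch (corpus contents and the `similarity_search` helper are illustrative) only shows the shape of the interaction, not the vector math.

```python
# Toy stand-in for the Milvus-backed retrieval step. The corpus entries
# and scoring function are illustrative, not VulnBot's actual data.
CORPUS = [
    "CVE-2021-41773: path traversal in Apache HTTP Server 2.4.49",
    "CVE-2014-0160: Heartbleed memory disclosure in OpenSSL 1.0.1",
    "CVE-2017-0144: SMBv1 remote code execution (EternalBlue)",
]

def similarity_search(query, k=2):
    # Score documents by shared tokens with the query (embeddings in the real system)
    q = set(query.lower().split())
    scored = sorted(
        CORPUS,
        key=lambda doc: len(q & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

# A scanner agent that just saw Apache 2.4.49 would retrieve the matching CVE doc first
hits = similarity_search("exploiter tactics for Apache 2.4.49 open port 80")
```

The exploitation agent then receives those retrieved documents as part of its prompt context, which is what grounds its attack planning in version-specific vulnerability details.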
The tool execution layer deserves particular attention because it’s where AI reasoning meets real security infrastructure. VulnBot doesn’t just generate theoretical attack plans—it interfaces with actual Kali Linux installations to run nmap, nikto, sqlmap, and Metasploit. The framework includes safety wrappers that parse tool outputs, handle errors, and feed results back to agents for interpretation:
```python
import subprocess

class KaliToolExecutor:
    def execute(self, tool_command):
        # Validate command against whitelist
        if not self._is_safe_command(tool_command):
            return {'error': 'Command not in approved tool list'}

        # Execute on Kali Linux (via SSH or container)
        result = subprocess.run(
            tool_command,
            capture_output=True,
            timeout=300,
            shell=True
        )

        # Parse tool-specific output formats
        parsed_output = self._parse_tool_output(
            tool_command.split()[0],  # tool name
            result.stdout
        )

        # Store in MySQL for persistence and cross-agent access
        self.db.insert_scan_result(parsed_output)
        return parsed_output
```
The multi-agent coordination happens through a configurable interaction limit system. Without constraints, agents could fall into infinite loops where reconnaissance triggers scanning, which suggests more reconnaissance, ad infinitum. VulnBot implements a turn-based system where each assessment phase has a maximum interaction count, and agents must converge on findings within those constraints. This mirrors how human teams operate with time-boxed assessment phases.
MySQL serves as the shared memory system where all agents log discoveries, track which targets have been tested, and record exploitation attempts. This persistence layer means assessments can pause and resume, multiple agents can work in parallel without duplicating effort, and the reporting agent has complete visibility into everything that happened during the test. The database schema tracks relationships between discovered hosts, identified vulnerabilities, exploitation attempts, and successfully compromised systems—essentially building a graph of the attack path that the LLM can reason over when deciding next moves.
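The shape of that shared store can be sketched with a few joined tables. This uses sqlite3 as a stand-in for MySQL, and the table and column names are illustrative; VulnBot's actual schema may differ.

```python
import sqlite3

# sqlite3 stand-in for the MySQL shared-memory layer; schema names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (
    id INTEGER PRIMARY KEY,
    address TEXT NOT NULL
);
CREATE TABLE vulnerabilities (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES hosts(id),
    cve TEXT,
    service TEXT
);
CREATE TABLE exploit_attempts (
    id INTEGER PRIMARY KEY,
    vuln_id INTEGER REFERENCES vulnerabilities(id),
    succeeded INTEGER DEFAULT 0
);
""")

# A recon agent logs a host, a scanner links a vulnerability to it,
# and an exploiter records a successful attempt: one edge of the attack graph.
conn.execute("INSERT INTO hosts (address) VALUES ('10.0.0.5')")
conn.execute(
    "INSERT INTO vulnerabilities (host_id, cve, service) "
    "VALUES (1, 'CVE-2021-41773', 'Apache 2.4.49')"
)
conn.execute("INSERT INTO exploit_attempts (vuln_id, succeeded) VALUES (1, 1)")

# The reporting agent walks the joins to reconstruct the attack path
path = conn.execute("""
    SELECT h.address, v.cve, e.succeeded
    FROM exploit_attempts e
    JOIN vulnerabilities v ON e.vuln_id = v.id
    JOIN hosts h ON v.host_id = h.id
""").fetchone()
```

Because every agent reads and writes through the same tables, the attack path falls out of ordinary joins rather than ad hoc message history.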
Gotcha
The infrastructure requirements are substantial and might surprise developers expecting a pip-install-and-go experience. You need a working Kali Linux environment (VM or container), MySQL database, Milvus vector store, and a Langchain-Chatchat deployment—all before you run your first scan. The documentation assumes familiarity with setting up these components, which means you’re looking at hours or days of configuration depending on your environment. This isn’t a framework you demo on your laptop during a lunch break.
LLM costs and quality variance present operational challenges. Running comprehensive penetration tests generates hundreds or thousands of LLM API calls as agents reason through findings, plan attacks, and coordinate with each other. At current API pricing, a thorough assessment of even a small network could cost $50-200 in LLM fees. More concerning is the non-deterministic nature of LLM reasoning—run the same test twice and you might get different exploitation paths, different vulnerability prioritizations, or different conclusions about system compromise. This makes VulnBot challenging to use for compliance-driven assessments where you need reproducible results and clear audit trails. The system also inherits all the typical LLM failure modes: hallucinated vulnerabilities that don’t exist, misinterpreted tool outputs, and confidently incorrect attack strategies. While RAG helps reduce these issues, it doesn’t eliminate them, and you need human review of findings before acting on them in production environments.
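The cost math is easy to model yourself before committing to an assessment. The call counts, token sizes, and per-million-token prices below are illustrative assumptions, not measured VulnBot figures:

```python
# Back-of-envelope LLM cost model; all inputs are illustrative assumptions.
def estimate_cost(calls, in_tokens, out_tokens, price_in_per_m, price_out_per_m):
    # Cost per call = prompt tokens + completion tokens at their respective rates
    return calls * (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1_000_000

# e.g. 4,000 agent calls, ~5k prompt / ~1k completion tokens each,
# at $2.50 / $10.00 per million tokens:
cost = estimate_cost(4000, 5000, 1000, 2.50, 10.00)  # $90 for the assessment
```

Prompt size dominates here because each call carries retrieved documents and cross-agent context, which is why costs scale with target complexity rather than linearly with hosts.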
Verdict
Use if: You’re a security researcher exploring AI-driven automation, have the infrastructure chops to deploy and maintain the required stack, work in red team environments where you can validate AI-generated findings before acting on them, or want to augment human penetration testers with AI assistants that handle routine reconnaissance and documentation. This tool shines when you have experienced security practitioners who can supervise the agents and leverage their work as a force multiplier.
Skip if: You need production-ready security scanning with deterministic results and compliance reporting, lack the resources to maintain Kali Linux + MySQL + Milvus + Langchain infrastructure, require cost-predictable assessments (LLM API costs scale unpredictably with target complexity), or expect plug-and-play deployment. Traditional frameworks like Metasploit or commercial scanners remain more appropriate for most enterprise security programs until VulnBot matures beyond its current research-oriented state.