
VulnBot: Teaching AI Agents to Hack Like a Security Team


Hook

What if your penetration testing team never slept, never forgot a CVE, and could coordinate reconnaissance, exploitation, and reporting in parallel—all while learning from your organization’s past security assessments?

Context

Penetration testing remains one of the most labor-intensive activities in cybersecurity. A typical assessment requires coordinating specialists with different expertise: reconnaissance experts who map attack surfaces, vulnerability researchers who identify weaknesses, exploit developers who chain vulnerabilities into working attacks, and report writers who translate technical findings into business risk. This coordination overhead means most organizations test their systems quarterly at best, leaving months-long windows where new vulnerabilities go undetected.

Traditional automation hasn’t solved this problem because existing tools operate in isolation. Nmap discovers hosts but doesn’t reason about what those discoveries mean. Metasploit exploits vulnerabilities but doesn’t autonomously decide which exploit chain makes sense for a given target. These tools lack the contextual reasoning and collaborative decision-making that human security teams use to conduct sophisticated assessments. VulnBot emerged from academic research asking whether large language models could bridge this gap—not by replacing security tools, but by orchestrating them the way a human team leader coordinates specialists, complete with knowledge sharing, tactical decision-making, and adaptive strategy.

Technical Insight

System architecture (auto-generated diagram): an agent orchestration layer coordinates the Recon, Scanner, Exploit, and Reporter agents, which pass target info, vulnerabilities, and exploit results between one another. A knowledge and tools layer backs them: an LLM client for reasoning, a Milvus vector store answering RAG queries over security knowledge, Kali Linux tools that execute commands and return tool output, and a MySQL database that persists findings and historical data. The pipeline runs against the target system and produces the final report.

VulnBot’s architecture centers on specialized agent roles that mirror real penetration testing team structures. Each agent runs as an independent LLM-powered entity with a specific responsibility: reconnaissance agents gather target information, scanning agents identify vulnerabilities, exploitation agents attempt to compromise systems, and reporting agents synthesize findings. The critical innovation is how these agents collaborate through a structured interaction protocol.

The framework implements agent communication through a message-passing system where agents can request information, share findings, and coordinate multi-stage attacks. Here’s a simplified version of how agents interact during a typical workflow:

class PentestAgent:
    def __init__(self, role, llm_client, tool_executor, vector_store):
        self.role = role  # 'recon', 'scanner', 'exploiter', 'reporter'
        self.llm = llm_client
        self.tools = tool_executor
        self.knowledge_base = vector_store
        self.interaction_history = []
    
    def execute_task(self, target_info, context_from_agents):
        # Retrieve relevant security knowledge using RAG
        relevant_docs = self.knowledge_base.similarity_search(
            query=f"{self.role} tactics for {target_info}",
            k=5
        )
        
        # Construct prompt with role-specific instructions and context
        prompt = self._build_prompt(
            target=target_info,
            previous_findings=context_from_agents,
            knowledge=relevant_docs
        )
        
        # LLM decides what tools to use and how; the response is assumed
        # to be parsed into {'commands': [...], 'recommendations': [...]}
        action_plan = self.llm.generate(prompt)
        
        # Execute actual security tools via Kali Linux
        tool_results = self.tools.execute(action_plan['commands'])
        
        # Share findings with other agents
        return {
            'agent': self.role,
            'findings': tool_results,
            'next_steps': action_plan['recommendations']
        }

The RAG integration through Langchain-Chatchat and Milvus is what elevates this from a glorified script executor to a knowledge-driven system. Before each action, agents query a vector database populated with CVE databases, exploit documentation, past penetration test reports, and security research papers. This grounds the LLM’s reasoning in actual security knowledge rather than relying solely on training data. When a scanner agent identifies an open port running Apache 2.4.49, the RAG system retrieves documentation about CVE-2021-41773 path traversal vulnerabilities specific to that version, which the exploitation agent then uses to craft targeted attacks.
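The retrieval step can be sketched with a toy in-memory store standing in for the Milvus deployment. Everything here is illustrative: `SecurityDoc` and `ToyKnowledgeBase` are hypothetical names, and keyword overlap is a crude stand-in for the embedding similarity the real system computes.

```python
from dataclasses import dataclass

@dataclass
class SecurityDoc:
    title: str
    text: str

class ToyKnowledgeBase:
    """In-memory stand-in for the Milvus-backed knowledge base."""

    def __init__(self, docs):
        self.docs = docs

    def similarity_search(self, query, k=5):
        # Score documents by keyword overlap with the query (a crude
        # proxy for vector similarity in a real embedding store).
        query_terms = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(query_terms & set(d.text.lower().split())),
            reverse=True,
        )
        return scored[:k]

kb = ToyKnowledgeBase([
    SecurityDoc("CVE-2021-41773",
                "Apache 2.4.49 path traversal and remote code execution"),
    SecurityDoc("CVE-2014-0160",
                "OpenSSL 1.0.1 Heartbleed memory disclosure"),
])

hits = kb.similarity_search("exploit tactics for Apache 2.4.49", k=1)
print(hits[0].title)  # → CVE-2021-41773
```

The point of the pattern is the same regardless of backend: the agent's role and the current target string become the retrieval query, so the documents grounding the LLM's next decision are specific to the software version actually observed.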

The tool execution layer deserves particular attention because it’s where AI reasoning meets real security infrastructure. VulnBot doesn’t just generate theoretical attack plans—it interfaces with actual Kali Linux installations to run nmap, nikto, sqlmap, and Metasploit. The framework includes safety wrappers that parse tool outputs, handle errors, and feed results back to agents for interpretation:

import subprocess

class KaliToolExecutor:
    def __init__(self, db, allowed_tools):
        self.db = db                      # MySQL persistence layer
        self.allowed_tools = allowed_tools  # command whitelist

    def execute(self, tool_command):
        # Validate command against whitelist
        if not self._is_safe_command(tool_command):
            return {'error': 'Command not in approved tool list'}
        
        # Execute on Kali Linux (via SSH or container)
        result = subprocess.run(
            tool_command,
            capture_output=True,
            text=True,       # decode stdout/stderr to str for parsing
            timeout=300,
            shell=True
        )
        
        # Parse tool-specific output formats
        parsed_output = self._parse_tool_output(
            tool_command.split()[0],  # tool name
            result.stdout
        )
        
        # Store in MySQL for persistence and cross-agent access
        self.db.insert_scan_result(parsed_output)
        
        return parsed_output
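The parsing step necessarily varies per tool. As one hedged example, a parser for nmap's grepable (`-oG`) output format might look like the sketch below; `parse_nmap_grepable` is a hypothetical helper for illustration, not part of VulnBot's published code.

```python
import re

def parse_nmap_grepable(output):
    """Extract host -> open ports from `nmap -oG -` style lines."""
    results = {}
    for line in output.splitlines():
        match = re.search(r"Host: (\S+).*Ports: (.+)", line)
        if not match:
            continue
        host, ports_field = match.groups()
        # Port entries look like "22/open/tcp//ssh///"; keep open ones.
        open_ports = [
            int(entry.split("/")[0])
            for entry in ports_field.split(",")
            if "/open/" in entry
        ]
        results[host] = open_ports
    return results

sample = ("Host: 10.0.0.5 ()\tPorts: 22/open/tcp//ssh///, "
          "80/open/tcp//http///, 443/closed/tcp//https///")
print(parse_nmap_grepable(sample))  # → {'10.0.0.5': [22, 80]}
```

Structured output like this is what lets the scanner agent hand the exploitation agent a machine-readable list of targets instead of raw terminal text.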

The multi-agent coordination happens through a configurable interaction limit system. Without constraints, agents could fall into infinite loops where reconnaissance triggers scanning, which suggests more reconnaissance, ad infinitum. VulnBot implements a turn-based system where each assessment phase has a maximum interaction count, and agents must converge on findings within those constraints. This mirrors how human teams operate with time-boxed assessment phases.
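A minimal sketch of such a turn-based cap, with toy functions standing in for the LLM-backed agents (the names and cap values are illustrative, not VulnBot's actual configuration):

```python
# Illustrative per-phase interaction caps; the real framework exposes
# these as configuration, with different keys and defaults.
MAX_INTERACTIONS_PER_PHASE = {"recon": 3, "scan": 5, "exploit": 5}

def run_phase(phase, agents, state):
    """Run one assessment phase, forcing convergence within the cap."""
    for turn in range(MAX_INTERACTIONS_PER_PHASE[phase]):
        progressed = False
        for agent in agents:
            result = agent(state)                   # one agent action
            state.setdefault(phase, []).append(result)  # shared memory
            progressed = progressed or bool(result["new_findings"])
        if not progressed:
            break  # converged early: no agent produced new findings
    return state

# Toy agents: recon finds something only on its first turn.
def recon_agent(state):
    first_turn = len(state.get("recon", [])) < 2
    return {"agent": "recon",
            "new_findings": ["host 10.0.0.5"] if first_turn else []}

def scanner_agent(state):
    return {"agent": "scanner", "new_findings": []}

final = run_phase("recon", [recon_agent, scanner_agent], {})
print(len(final["recon"]))  # → 4 (two turns, then early convergence)
```

The cap guarantees termination even when agents keep suggesting more work, while the early-exit check lets a phase end as soon as a full turn produces nothing new.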

MySQL serves as the shared memory system where all agents log discoveries, track which targets have been tested, and record exploitation attempts. This persistence layer means assessments can pause and resume, multiple agents can work in parallel without duplicating effort, and the reporting agent has complete visibility into everything that happened during the test. The database schema tracks relationships between discovered hosts, identified vulnerabilities, exploitation attempts, and successfully compromised systems—essentially building a graph of the attack path that the LLM can reason over when deciding next moves.
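A hypothetical sketch of that schema, using SQLite in place of MySQL so the example is self-contained (the table and column names are illustrative; the project's real schema may differ):

```python
import sqlite3

# Illustrative schema for the shared-memory layer: hosts link to
# vulnerabilities, which link to exploitation attempts.
DDL = """
CREATE TABLE hosts (
    id INTEGER PRIMARY KEY,
    address TEXT NOT NULL
);
CREATE TABLE vulnerabilities (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES hosts(id),
    cve TEXT,
    service TEXT
);
CREATE TABLE exploit_attempts (
    id INTEGER PRIMARY KEY,
    vuln_id INTEGER REFERENCES vulnerabilities(id),
    succeeded INTEGER,   -- 1 if the attempt compromised the host
    output TEXT
);
"""

db = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL
db.executescript(DDL)

db.execute("INSERT INTO hosts (address) VALUES ('10.0.0.5')")
db.execute("INSERT INTO vulnerabilities (host_id, cve, service) "
           "VALUES (1, 'CVE-2021-41773', 'Apache 2.4.49')")
db.execute("INSERT INTO exploit_attempts (vuln_id, succeeded, output) "
           "VALUES (1, 1, 'path traversal successful')")

# Joining the three tables reconstructs the attack path the reporting
# agent reasons over.
row = db.execute("""
    SELECT h.address, v.cve, a.succeeded
    FROM exploit_attempts a
    JOIN vulnerabilities v ON a.vuln_id = v.id
    JOIN hosts h ON v.host_id = h.id
""").fetchone()
print(row)  # → ('10.0.0.5', 'CVE-2021-41773', 1)
```

Because each row references the row that produced it, a single join walks from compromised system back to the host discovery that started the chain, which is exactly the attack-path graph described above.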

Gotcha

The infrastructure requirements are substantial and might surprise developers expecting a pip-install-and-go experience. You need a working Kali Linux environment (VM or container), MySQL database, Milvus vector store, and a Langchain-Chatchat deployment—all before you run your first scan. The documentation assumes familiarity with setting up these components, which means you’re looking at hours or days of configuration depending on your environment. This isn’t a framework you demo on your laptop during a lunch break.

LLM costs and quality variance present operational challenges. Running comprehensive penetration tests generates hundreds or thousands of LLM API calls as agents reason through findings, plan attacks, and coordinate with each other. At current API pricing, a thorough assessment of even a small network could cost $50-200 in LLM fees.

More concerning is the non-deterministic nature of LLM reasoning: run the same test twice and you might get different exploitation paths, different vulnerability prioritizations, or different conclusions about system compromise. This makes VulnBot challenging to use for compliance-driven assessments where you need reproducible results and clear audit trails.

The system also inherits all the typical LLM failure modes: hallucinated vulnerabilities that don't exist, misinterpreted tool outputs, and confidently incorrect attack strategies. While RAG helps reduce these issues, it doesn't eliminate them, and you need human review of findings before acting on them in production environments.

Verdict

Use if: You're a security researcher exploring AI-driven automation, have the infrastructure chops to deploy and maintain the required stack, work in red team environments where you can validate AI-generated findings before acting on them, or want to augment human penetration testers with AI assistants that handle routine reconnaissance and documentation. This tool shines when you have experienced security practitioners who can supervise the agents and leverage their work as a force multiplier.

Skip if: You need production-ready security scanning with deterministic results and compliance reporting, lack the resources to maintain Kali Linux + MySQL + Milvus + Langchain infrastructure, require cost-predictable assessments (LLM API costs scale unpredictably with target complexity), or expect plug-and-play deployment. Traditional frameworks like Metasploit or commercial scanners remain more appropriate for most enterprise security programs until VulnBot matures beyond its current research-oriented state.
