Guardian CLI: When AI Agents Orchestrate Your Penetration Tests


Hook

What if your penetration testing toolkit could think about which tool to use next based on what it just discovered? Guardian CLI doesn’t just automate security scans—it orchestrates them using multi-agent AI that adapts its testing strategy in real-time.

Context

Traditional penetration testing automation has always been deterministic: run Nmap, feed results to Nikto, run SQLMap on discovered forms, generate report. Tools like OpenVAS and Metasploit follow predefined workflows that execute regardless of context. If you discover a Jenkins server on port 8080, your automation doesn’t know to pivot and check for CVE-2024-23897. A human pentester would, but your bash script won’t.

This gap between rigid automation and adaptive human reasoning is what Guardian CLI targets. Built by zakirkun, it’s a Python-based orchestration layer that wraps 19 battle-tested security tools (Nmap, Nuclei, SQLMap, Nikto, WPScan, and others) with a multi-agent AI system powered by LangChain. Instead of hardcoded sequences, Guardian uses specialized AI agents—Planner, Tool Selector, Analyst, and Reporter—that collaborate to determine testing strategies based on discovered attack surface. It’s the difference between a robotic checklist and a junior pentester who actually reads tool output before deciding what to run next.

Technical Insight

[System architecture — auto-generated diagram: a YAML Workflow Definition (workflow config) feeds the Planner Agent + LLM, whose testing strategy drives the Tool Selector Agent; the selected tools go to the Tool Executor in the execution layer, and its raw output + metadata flow to the Results Analyzer Agent; parsed findings plus context + recommendations loop back into the LLM-powered decision layer, while an audit trail of tool results is written to the Evidence Store.]

Guardian’s architecture centers on workflow-driven orchestration with AI decision-making at each phase. The system breaks penetration testing into four distinct agent responsibilities, each backed by an LLM call that reasons about the current state.

The workflow definition lives in YAML files that specify target scope, testing phases, and tool configurations. Here’s what a simplified workflow looks like:

workflow:
  name: "web-app-recon"
  target: "example.com"
  phases:
    - name: "discovery"
      tools:
        - nmap
        - subfinder
      max_duration: 300
    - name: "vulnerability_scan"
      tools:
        - nuclei
        - nikto
      depends_on: "discovery"
  llm_provider: "gemini"
  model: "gemini-pro"
  autonomous: false
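A workflow like this is straightforward to validate before anything runs. The sketch below is a hypothetical loader, not Guardian's actual parser; the field names mirror the YAML example above, but the real schema may differ:

```python
# Hypothetical loader for a Guardian-style workflow; field names follow the
# YAML example above, but Guardian's real schema may differ.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    tools: list
    max_duration: int = 0
    depends_on: str = ""

def load_workflow(doc: dict) -> list:
    """Parse the phases section, checking that each dependency names an earlier phase."""
    phases, seen = [], set()
    for raw in doc["workflow"]["phases"]:
        dep = raw.get("depends_on", "")
        if dep and dep not in seen:
            raise ValueError(f"phase {raw['name']!r} depends on unknown phase {dep!r}")
        seen.add(raw["name"])
        phases.append(Phase(raw["name"], raw["tools"],
                            raw.get("max_duration", 0), dep))
    return phases
```

Catching a dangling `depends_on` at load time is cheaper than discovering it mid-scan.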

The Planner agent receives this workflow and the target specification, then generates a strategic testing plan by querying the LLM with context about the target type, available tools, and discovered services. This isn’t pre-programmed logic—it’s genuine reasoning. The LLM might decide to prioritize subdomain enumeration if the target is a large organization, or focus immediately on web application testing for a single-domain target.
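For a sense of what "querying the LLM with context" means in practice, here is an illustrative prompt assembler. `build_planner_prompt` is a hypothetical helper invented for this article, not Guardian's API:

```python
# Illustrative only: folding target context, available tools, and discovered
# services into one planner query. This helper is hypothetical, not Guardian's.
def build_planner_prompt(target: str, target_type: str,
                         tools: list, services: list) -> str:
    lines = [
        f"Target: {target} ({target_type})",
        "Available tools: " + ", ".join(sorted(tools)),
        "Discovered services: " + (", ".join(services) or "none yet"),
        "Propose an ordered testing plan. Prefer broad subdomain enumeration",
        "for large organizations; go straight to web app testing for a",
        "single-domain target.",
    ]
    return "\n".join(lines)
```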

What makes Guardian production-ready is its evidence traceability system. Every tool execution captures not just the results, but the complete command invocation and up to 2000 characters of raw output. This creates an audit trail that’s essential for professional engagements:

# Simplified evidence capture pattern from the codebase
evidence_record = {
    "timestamp": datetime.utcnow().isoformat(),
    "tool": "nmap",
    "command": "nmap -sV -p- example.com",
    "exit_code": 0,
    "output_snippet": raw_output[:2000],
    "findings": parsed_vulnerabilities,
    "next_recommended_tools": ["nuclei", "nikto"]
}
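Records in this shape lend themselves to an append-only store. A minimal sketch, assuming a JSON-lines file (Guardian's actual persistence layer may differ):

```python
# Minimal append-only evidence store as JSON lines; an assumption for
# illustration, not Guardian's actual persistence layer.
import datetime
import json
import pathlib

def append_evidence(store: pathlib.Path, record: dict) -> None:
    """Append one evidence record as a JSON line, stamping it if unstamped."""
    record.setdefault(
        "timestamp",
        datetime.datetime.now(datetime.timezone.utc).isoformat())
    with store.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Append-only files make the audit trail tamper-evident under normal review: each tool run adds a line, and nothing rewrites history.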

The Tool Selector agent operates with context awareness that traditional automation lacks. After Nmap discovers a WordPress installation on port 443, the agent doesn’t just blindly queue WPScan—it checks if WordPress-specific vulnerabilities were already found by Nuclei, considers whether the client authorized CMS-specific testing in the scope, and evaluates if remaining time budget justifies deep enumeration. This decision tree happens via prompt engineering that feeds tool metadata and current findings to the LLM.
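Before any prompt is sent, a selector can cheaply prune candidates that the LLM should never see. The filter below is a sketch of that pre-LLM step; the scope set, prior-findings set, and per-tool time costs are assumptions, not Guardian's real schema:

```python
# Sketch of pre-LLM candidate pruning for a selector agent. The scope set,
# prior-findings set, and time-cost table are illustrative assumptions.
def eligible_tools(candidates, installed, scope_allowed,
                   prior_findings, seconds_left):
    COST = {"wpscan": 600, "sqlmap": 900, "nikto": 300, "nuclei": 300}  # rough estimates
    out = []
    for tool in candidates:
        if tool not in installed or tool not in scope_allowed:
            continue                      # binary missing, or client didn't authorize it
        if tool in prior_findings:
            continue                      # another tool already covered this ground
        if COST.get(tool, 60) > seconds_left:
            continue                      # would blow the remaining time budget
        out.append(tool)
    return out
```

Only the survivors, with their metadata and the current findings, need to reach the LLM for the nuanced ranking decision.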

Provider abstraction is implemented through LangChain’s unified interface, allowing seamless switching between OpenAI GPT-4, Anthropic Claude, Google Gemini, and OpenRouter. This isn’t just vendor flexibility—it’s cost optimization. You can configure expensive models like GPT-4 for the Analyst agent that requires nuanced vulnerability correlation, while using faster, cheaper Gemini models for the Tool Selector that makes simpler decisions:

# Configuration example showing per-agent model selection
agents:
  planner:
    provider: "gemini"
    model: "gemini-pro"
    temperature: 0.7
  analyst:
    provider: "openai"
    model: "gpt-4-turbo"
    temperature: 0.3
  reporter:
    provider: "claude"
    model: "claude-3-sonnet"
    temperature: 0.2

The async execution engine manages tool processes with timeout controls and graceful degradation. If SQLMap isn’t installed, Guardian doesn’t crash—the Tool Selector simply removes it from consideration and adapts the plan. The Analyst agent receives structured findings from all successful tools and performs correlation analysis that identifies attack chains: “Port 22 accepts password auth (Nmap) + User enumeration possible (custom script) + Weak password policy detected (Nuclei template) = High-priority credential stuffing opportunity.”
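The timeout-and-degrade behavior described above can be sketched with `asyncio` subprocesses. This is a minimal illustration of the pattern, not Guardian's executor:

```python
# Sketch of an async tool executor with a timeout and graceful degradation
# for missing binaries; illustrative, not Guardian's actual executor.
import asyncio
import shutil

async def run_tool(cmd: list, timeout: float) -> dict:
    if shutil.which(cmd[0]) is None:
        # Missing tool: skip rather than crash, so the plan can adapt.
        return {"tool": cmd[0], "status": "skipped", "reason": "not installed"}
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT)
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return {"tool": cmd[0], "status": "timeout"}
    return {"tool": cmd[0], "status": "ok", "exit_code": proc.returncode,
            "output_snippet": out.decode(errors="replace")[:2000]}
```

Note the 2000-character snippet cap, matching the evidence-capture convention described earlier.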

Reporting happens through the Reporter agent that generates multiple output formats (JSON, HTML, Markdown, PDF) with executive summaries and technical details. The LLM contextualizes findings with business impact, something static report generators can’t do: “The exposed Jenkins instance allows unauthenticated access to build logs, potentially leaking AWS credentials used in CI/CD pipelines.”
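As a toy illustration of the multi-format output, here is a severity-sorted Markdown renderer; real Guardian reports are LLM-authored and far richer:

```python
# Toy Markdown renderer showing the shape of one output format;
# the finding dict layout is an assumption for illustration.
def render_markdown(findings: list) -> str:
    order = {"high": 0, "medium": 1, "low": 2}
    lines = ["# Findings", ""]
    for f in sorted(findings, key=lambda f: order[f["severity"]]):
        lines.append(f"## [{f['severity'].upper()}] {f['title']}")
        lines.append(f["detail"])
        lines.append("")
    return "\n".join(lines)
```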

Gotcha

Guardian is an orchestrator, not a scanner. It requires you to manually install all 19 external security tools it might invoke. There’s no bundled binary or Docker image with everything pre-configured—you’re responsible for having Nmap, Nuclei, SQLMap, Nikto, WPScan, Subfinder, and the rest available in your PATH. For teams managing multiple pentesting machines, this dependency sprawl becomes operational overhead. One missing tool doesn’t break Guardian, but it silently reduces your testing coverage, and you won’t know until you review the workflow results.
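A cheap mitigation is a preflight check that reports missing binaries before any workflow starts, instead of letting coverage shrink silently. A minimal sketch (tool list abbreviated from the 19 Guardian wraps):

```python
# Preflight check that surfaces missing tools up front; the tool list is
# abbreviated from the 19 external tools Guardian can invoke.
import shutil

REQUIRED = ["nmap", "nuclei", "sqlmap", "nikto", "wpscan", "subfinder"]

def preflight(tools=REQUIRED) -> list:
    """Return the subset of required tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]
```

Running this in CI on each pentesting machine turns silent coverage gaps into loud setup failures.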

Autonomous mode is both Guardian's biggest strength and its biggest risk. When enabled, the AI agents make every decision without human approval: tool selection, parameter tuning, test depth. That's powerful for overnight reconnaissance scans, but dangerous for two reasons: cost and scope creep. A complex target can trigger hundreds of LLM API calls as agents iterate through planning cycles; at $0.01 per 1K tokens for GPT-4, a thorough scan might cost $5-20 in API fees alone. More critically, AI agents sometimes hallucinate or misinterpret scope. The codebase includes blacklist validation, but there is no evidence of built-in rate limiting or budget caps to stop an autonomous agent from hammering a target beyond authorized intensity. At roughly 1,025 GitHub stars, this is still a young project; expect rough edges around error handling, token-consumption monitoring, and the retry logic that production tools only develop through years of hardening.
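The missing budget cap is easy to approximate in a wrapper. This guard is a sketch of what such a ceiling could look like, not a feature Guardian ships:

```python
# Sketch of the budget cap the project appears to lack: stop autonomous
# iteration once estimated API spend crosses a ceiling. Not Guardian code.
class BudgetGuard:
    def __init__(self, max_usd: float, usd_per_1k_tokens: float = 0.01):
        self.max_usd = max_usd
        self.rate = usd_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens: int) -> bool:
        """Record one LLM call; return False once the ceiling is exceeded."""
        self.spent += tokens / 1000 * self.rate
        return self.spent <= self.max_usd
```

An agent loop would check `charge()` before each planning cycle and fall back to human approval once it returns False.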

Verdict

Use if: you're a penetration tester or security team conducting authorized assessments and want AI-assisted decision-making during reconnaissance; it's especially valuable for junior analysts who need guidance on logical next steps. Use it when audit trails and evidence preservation matter for compliance reporting, or when you want to experiment with different LLM providers to optimize cost across testing phases. It's ideal for teams already comfortable with Python ecosystems and with managing multiple security-tool dependencies.

Skip if: you need deterministic, repeatable scans for compliance checklists, where AI variability is a bug rather than a feature; you require air-gapped operation without internet access for LLM APIs; you want a single-binary tool instead of 19+ external dependencies; or you're sensitive to LLM API costs accumulating during long-running autonomous scans. Also skip it if you're doing unauthorized testing: Guardian is explicitly designed for ethical, authorized penetration testing only.
