Batea: Teaching Machines to Think Like Penetration Testers

Hook

When you scan a corporate network and get 5,000 hosts back, which ones deserve your attention? Most security teams rely on gut instinct and grep. Batea taught a machine learning algorithm to replicate that intuition.

Context

Network reconnaissance generates overwhelming amounts of data. Run nmap against a typical enterprise network and you'll get thousands of hosts, tens of thousands of open ports, and hundreds of different service banners. The traditional approach is either exhaustive manual review—impractical at scale—or rigid rule-based filtering that misses context. A web server on port 80 might be completely expected in your DMZ but highly suspicious on your SCADA network.

The security community has long recognized this triage problem. Experienced penetration testers develop an intuitive sense for what's interesting: the single Windows XP machine in a fleet of Windows 10 hosts, the SSH server with an unusually old OpenSSH version, the manufacturing device that suddenly has HTTP enabled. This intuition is fundamentally about detecting anomalies within context. Batea, developed by Delve Labs, formalizes this approach using unsupervised machine learning to automatically score and rank network devices based on how unusual they are relative to their peers in the same scan.

Technical Insight

Batea's architecture implements a classic machine learning pipeline: data ingestion, feature engineering, model training, and prediction. The ingestion layer parses nmap's XML output format (or CSV exports) and constructs an internal graph representation of the network. Each host becomes a node with associated properties extracted from the scan data—operating system fingerprints, open ports, service versions, script output.
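
To make the ingestion step concrete, here is a minimal sketch of reading nmap's XML output with Python's standard library. It is not Batea's own parser: the sample document is hand-written, `parse_hosts` is a hypothetical helper, and it returns plain dictionaries rather than the graph representation described above.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample in the shape of nmap's -oX output (not real scan data).
SAMPLE_XML = """<nmaprun>
  <host>
    <address addr="10.0.0.5" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open"/>
        <service name="ssh" product="OpenSSH" version="8.2p1"/>
      </port>
      <port protocol="tcp" portid="80">
        <state state="open"/>
        <service name="http" product="Apache httpd" version="2.4.41"/>
      </port>
      <port protocol="tcp" portid="443">
        <state state="closed"/>
      </port>
    </ports>
  </host>
</nmaprun>"""


def parse_hosts(xml_text):
    """Return one dict per host: its address and a list of open ports."""
    hosts = []
    for host in ET.fromstring(xml_text).iter("host"):
        address = host.find("address").get("addr")
        open_ports = []
        for port in host.iter("port"):
            state = port.find("state")
            # Skip ports that are closed, filtered, or missing state information
            if state is None or state.get("state") != "open":
                continue
            service = port.find("service")
            open_ports.append({
                "port": int(port.get("portid")),
                "service": service.get("name") if service is not None else None,
                "version": service.get("version") if service is not None else None,
            })
        hosts.append({"address": address, "ports": open_ports})
    return hosts


hosts = parse_hosts(SAMPLE_XML)
```

Each resulting record carries the raw material (port numbers, service names, version strings) that the feature engineering stage turns into numbers.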

The feature engineering stage is where Batea's real intelligence lives. The framework converts qualitative network properties into numerical vectors that machine learning algorithms can process. For a basic implementation, this means transforming each host into a feature array where dimensions might represent: number of open ports, presence of specific services (as binary flags), version numbers parsed into comparable integers, and derived metrics like port density or service diversity. Here's a simplified example of how Batea might represent two hosts:

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative columns (not Batea's actual feature set):
# [open_port_count, has_ssh, has_http, ssh_version, http_version, has_smbv1]

# Host A: Standard workstation (ports 22 and 80 open, ssh v8.2, http v2.4)
host_a_features = np.array([2, 1, 1, 8.2, 2.4, 0])

# Host B: Unusual device (135, 139, 445 open, old SMBv1, no SSH or HTTP)
host_b_features = np.array([3, 0, 0, 0.0, 0.0, 1])

# In reality, feature vectors are much longer and every host differs a little.
# Here Host A is repeated to stand in for a fleet of similar machines.
feature_matrix = np.vstack([host_a_features] * 20 + [host_b_features])

# Train Isolation Forest on all hosts in the scan
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(feature_matrix)

# Get anomaly scores (lower = more anomalous)
scores = model.decision_function(feature_matrix)
host_b_score = scores[-1]  # Host B is the last row; likely flagged as anomalous

The choice of Isolation Forest as the ML algorithm is particularly clever for this use case. Unlike density-based approaches that struggle in high-dimensional spaces, Isolation Forest works on the principle that anomalies are easier to isolate—they require fewer random partitions to separate from the bulk of data. In network scanning terms, that unusual SMBv1 server will be split off from the cluster of modern hosts very quickly during the tree-building process.
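
The "fewer random partitions" idea can be illustrated with a one-dimensional toy. This is a sketch of the intuition only, not scikit-learn's implementation: `isolation_depth` and `mean_depth` are hypothetical helpers, and the port counts are made up.

```python
import random


def isolation_depth(values, target, rng):
    """Count random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining values are identical and cannot be split further
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        current = [v for v in current if (v < split) == (target < split)]
        depth += 1
    return depth


def mean_depth(values, target, trials=200, seed=0):
    """Average the isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials


# Hypothetical feature: open-port count per host. Most hosts expose 3 to 6 ports,
# one exposes 40.
port_counts = [3, 4, 4, 5, 5, 5, 6, 6, 4, 5, 40]

outlier_depth = mean_depth(port_counts, 40)
typical_depth = mean_depth(port_counts, 5)
print(f"outlier: {outlier_depth:.2f} splits, typical host: {typical_depth:.2f} splits")
```

The host with 40 open ports is usually cut off by the very first split, while a typical host needs several splits before it is separated from its neighbors. An Isolation Forest turns that average path length, taken over many random trees and many features, into the anomaly score.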

Batea's extensibility shines in its feature engineering layer. The framework allows security teams to define custom feature extractors that encode domain-specific knowledge. Want to flag hosts running services on non-standard ports? Create a feature that counts services listening somewhere other than their default port. Concerned about outdated TLS versions? Write an extractor that parses SSL script output and scores protocol age. This is where security expertise gets codified into the model:

class CustomPortFeature:
    def extract(self, host):
        """Flag services running on unexpected ports"""
        suspicious_score = 0
        expected_ports = {'http': 80, 'https': 443, 'ssh': 22}
        
        for service in host.services:
            if service.name in expected_ports:
                if service.port != expected_ports[service.name]:
                    suspicious_score += 1
        
        return suspicious_score
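
To see an extractor like this produce numbers, the sketch below applies the same logic to two stand-in hosts. The `Host` and `Service` dataclasses and the `unexpected_port_count` function are hypothetical scaffolding for illustration and are not Batea's own classes or extension interface.

```python
from dataclasses import dataclass, field


@dataclass
class Service:  # hypothetical stand-in, not Batea's own class
    name: str
    port: int


@dataclass
class Host:  # hypothetical stand-in, not Batea's own class
    address: str
    services: list = field(default_factory=list)


EXPECTED_PORTS = {"http": 80, "https": 443, "ssh": 22}


def unexpected_port_count(host):
    """Count known services listening somewhere other than their usual port."""
    return sum(
        1
        for service in host.services
        if service.name in EXPECTED_PORTS
        and service.port != EXPECTED_PORTS[service.name]
    )


hosts = [
    Host("10.0.0.5", [Service("ssh", 22), Service("http", 80)]),
    Host("10.0.0.9", [Service("ssh", 2222), Service("http", 8080), Service("smtp", 25)]),
]

# One value per host: this becomes a single column in the feature matrix.
feature_column = [unexpected_port_count(h) for h in hosts]
print(feature_column)
```

Services without an entry in `EXPECTED_PORTS` (here, `smtp`) are ignored rather than penalized, so the feature only reflects ports the team has explicitly declared as expected.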

One architectural decision worth noting is Batea's support for model persistence. You can train an Isolation Forest on a comprehensive baseline scan of your network, serialize the model to disk, and reuse it for incremental scans. This is particularly valuable for continuous monitoring scenarios where you want to detect new anomalies without retraining from scratch each time. The model captures what "normal" looks like for your specific environment, making the anomaly detection truly context-driven.
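
The train-once, reuse-later pattern might look like the following with scikit-learn and the standard `pickle` module. This is a sketch under assumptions: the baseline is synthetic, the four features are made up, and Batea's own serialization format and command-line options are not shown here.

```python
import pickle

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic baseline: 200 hosts with 4 made-up numeric features each.
baseline = rng.normal(loc=0.0, scale=1.0, size=(200, 4))

model = IsolationForest(random_state=42).fit(baseline)

# Serialize the fitted model; in practice these bytes would be written to a file.
blob = pickle.dumps(model)

# Later, for an incremental scan: restore the model and score new hosts
# against the stored baseline without refitting.
restored = pickle.loads(blob)
new_hosts = np.array([
    [0.1, -0.2, 0.0, 0.3],   # resembles the baseline
    [8.0, 9.0, -7.5, 10.0],  # far from anything in the baseline
])
new_scores = restored.decision_function(new_hosts)
print(new_scores)
```

Two practical caveats apply to this approach: new scans must be converted with exactly the same feature definitions used for the baseline, and pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.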

The output is elegantly simple: a ranked list of hosts sorted by anomaly score. The top-ranked devices are the statistical outliers that deserve manual investigation. Batea doesn't tell you why something is interesting—that determination still requires human expertise—but it dramatically reduces the haystack you need to search through.
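
Producing such a ranking from anomaly scores takes only a sort. The sketch below assumes scikit-learn's convention that lower `decision_function` values are more anomalous; the addresses and feature values are synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic scan: 50 similar hosts plus one that differs sharply (all values made up).
addresses = [f"10.0.0.{i}" for i in range(1, 52)]
features = rng.normal(loc=5.0, scale=0.5, size=(51, 3))
features[-1] = [40.0, 0.0, 25.0]  # the odd one out, 10.0.0.51

model = IsolationForest(random_state=0).fit(features)
scores = model.decision_function(features)

# Lower score = more anomalous, so an ascending sort puts outliers first.
order = np.argsort(scores)
ranked = [(addresses[i], float(scores[i])) for i in order]

# Show the five hosts most worth a manual look
for address, score in ranked[:5]:
    print(f"{address}\t{score:.3f}")
```

The list says where to look first, not what is wrong: the analyst still has to open the scan record for each top-ranked host and decide whether the deviation matters.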

Gotcha

Batea's effectiveness is directly proportional to the richness of your nmap scans. If you're running basic SYN scans without service detection, version enumeration, or OS fingerprinting, you're feeding the model impoverished data. The feature engineering layer needs detailed service banners, script output, and operating system details to construct meaningful vectors. Running nmap -sS will give you open ports but not enough signal for the ML algorithm to distinguish interesting anomalies from mundane variations. You need comprehensive scans with flags like -sV and -O, or -A (which enables both, along with script scanning and traceroute), and these are significantly slower and noisier.

The unsupervised nature of Isolation Forest cuts both ways. It's powerful because it doesn't require labeled training data—you don't need to pre-classify hosts as "interesting" or "boring." But this also means the model has no feedback mechanism. If Batea ranks a host highly but manual investigation reveals it's a false positive, there's no built-in way to incorporate that ground truth back into the model. The contamination parameter (what percentage of hosts you expect to be anomalous) requires manual tuning, and getting it wrong means either missing real anomalies or drowning in false positives. In highly heterogeneous networks with legitimate diversity—think a university campus with IoT devices, research clusters, and administrative systems all mixed together—the concept of "anomaly" becomes fuzzy, and Batea may struggle to provide actionable rankings.
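
The effect of the contamination setting is easy to observe directly. The sketch below uses scikit-learn's `IsolationForest` on synthetic data with made-up features, so the exact counts are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
features = rng.normal(size=(500, 5))  # 500 synthetic hosts, made-up features

flagged = {}
for contamination in (0.01, 0.05, 0.20):
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(features)  # -1 = anomaly, 1 = normal
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

In scikit-learn, contamination only moves the threshold that separates "anomaly" from "normal"; the underlying ordering of hosts by score does not change. That is one reason a ranked list is often more useful than a hard flag: reading the ranking from the top down sidesteps the need to guess the right percentage up front.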

Verdict

Use if: You're conducting large-scale network assessments (Class B or C ranges) where manual triage is impossible, you're running comprehensive nmap scans that include service/OS detection, and you need automated prioritization to surface the 1% of hosts that deserve deep investigation. Batea excels at encoding security intuition into reusable models and scales beautifully to networks with thousands of devices. It's particularly valuable for red team operations where you need to quickly identify pivot points and high-value targets.

Skip if: You're scanning small networks where manual review is feasible (under 50-100 hosts), you're running minimal nmap scans for performance reasons, or you need deterministic compliance checking against specific vulnerability signatures. For the latter, traditional rule-based tools or nmap NSE scripts will be more appropriate. Also skip if you're in a highly regulated environment where you need to document exactly why a host was flagged: ML anomaly scores are harder to justify in audit reports than rule violations.
