Caldera: When Your Red Team Needs a Planning Algorithm, Not Just Another C2
Hook
Most adversary emulation platforms replay pre-scripted attack sequences. Caldera treats adversary emulation as a planning problem, using preconditions and facts to chain techniques automatically based on what each step discovers, much like an attacker who actually adapts.
Context
Traditional red team exercises face a reproducibility problem. An operator discovers a credential, uses it to move laterally, finds a misconfigured service, and pivots again—but that entire decision tree lives in their head. When you want to re-run the assessment, validate defenses, or train new analysts, you're stuck manually re-executing commands or writing brittle scripts that break when the environment changes.
MITRE built Caldera to solve this orchestration challenge for their own ATT&CK evaluations. Rather than building yet another command-and-control framework focused on stealth and post-exploitation, they created an adversary emulation platform that treats attack sequences as planning problems. The goal wasn't operational security for actual intrusions—it was creating repeatable, documented, ATT&CK-mapped exercises where the system itself decides which techniques to execute next based on discovered environmental facts. This architectural choice permeates everything: agents are compiled per-operation rather than reused across campaigns, abilities declare their preconditions explicitly, and the entire data model optimizes for documentation and reporting rather than real-time performance.
Technical Insight
Caldera's core innovation is treating adversary emulation as a state machine in which each ability (a concrete implementation of an ATT&CK technique) adds to or changes the set of known facts about the environment. The planning engine uses a STRIPS-like approach: abilities declare preconditions that must be satisfied before execution and postconditions describing the facts they will discover or create.
Here's what an ability definition looks like in YAML:
id: 9a30740d-3aa8-4c23-8efa-d51215e8a5b9
name: Scan network for SMB
tactic: discovery
technique:
  attack_id: T1018
  name: Remote System Discovery
platforms:
  windows:
    psh:
      command: |
        $hosts = 1..254 | ForEach-Object {"192.168.1.$_"}
        $hosts | ForEach-Object {
          if (Test-Connection -Count 1 -Quiet $_) {
            if (Test-Path "\\$_\C$") { $_ }
          }
        }
      parsers:
        plugins.stockpile.app.parsers.basic:
          - source: host.ip.address
            edge: has_smb_share
preconditions:
  - source: host.ip.address
The precondition declares that this ability requires at least one known host IP address to target. The parser turns each discovered IP with an SMB share into a new fact, specifically a host.ip.address fact carrying a has_smb_share relationship. These newly discovered facts then unlock abilities that require SMB-enabled hosts as preconditions, such as lateral movement techniques. Note that Caldera's shipped ability files express this field under a requirements key; preconditions is used here as the planning-literature term.
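To make the parsing step concrete, here is a minimal sketch of how line-oriented command output could be turned into facts with an edge attached. The `Fact`, `Relationship`, and `parse_lines` names are hypothetical and invented for illustration; this is not Caldera's parser API, only the shape of the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    trait: str   # e.g. "host.ip.address"
    value: str   # e.g. "192.168.1.10"


@dataclass(frozen=True)
class Relationship:
    source: Fact
    edge: str
    target: Optional[Fact] = None


def parse_lines(output: str, source_trait: str, edge: str) -> List[Relationship]:
    """Treat every non-empty output line as one fact value (hypothetical helper)."""
    relationships = []
    for line in output.splitlines():
        value = line.strip()
        if value:
            relationships.append(Relationship(Fact(source_trait, value), edge))
    return relationships


# Simulated output of the SMB scan: one IP per line, with a blank line in between.
found = parse_lines("192.168.1.10\n\n192.168.1.22\n", "host.ip.address", "has_smb_share")
```

Each entry in `found` pairs a `host.ip.address` fact with the `has_smb_share` edge, which is the kind of record a later ability's precondition could match against.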
The planner evaluates all abilities in the operation's adversary profile, filters to those whose preconditions match existing facts, and selects the next technique to execute. This creates genuinely dynamic attack chains—if credential dumping discovers domain admin credentials, privilege escalation abilities automatically become eligible; if those credentials fail, the planner tries alternative paths.
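The filter-then-select loop described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not Caldera's planner: the `Ability` class, the `eligible` and `plan` functions, and the "take the first eligible ability" rule are all invented here, and every ability is assumed to succeed and produce the traits it advertises.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Ability:
    name: str
    preconditions: FrozenSet[str]  # traits that must already be known
    produces: FrozenSet[str]       # traits the ability is expected to discover


def eligible(abilities: Iterable[Ability], known: Set[str], done: Set[str]) -> List[Ability]:
    """Abilities not yet run whose preconditions are all satisfied by known traits."""
    return [a for a in abilities if a.name not in done and a.preconditions <= known]


def plan(abilities: List[Ability], initial_traits: Iterable[str]) -> List[str]:
    known = set(initial_traits)
    order: List[str] = []
    while True:
        candidates = eligible(abilities, known, set(order))
        if not candidates:
            break
        chosen = candidates[0]      # assumption: naive "first eligible" selection
        order.append(chosen.name)
        known |= chosen.produces    # assumption: the ability succeeded
    return order


profile = [
    Ability("lateral_move",
            frozenset({"host.ip.address", "host.user.password"}),
            frozenset({"host.session"})),
    Ability("scan_smb", frozenset({"host.ip.address"}), frozenset({"host.ip.address"})),
    Ability("dump_creds",
            frozenset({"host.ip.address"}),
            frozenset({"host.user.name", "host.user.password"})),
]
chain = plan(profile, {"host.ip.address"})
```

Although `lateral_move` is listed first in the profile, it only becomes eligible after `dump_creds` has contributed a password trait, so the resulting chain is driven by facts rather than by listing order.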
The agent architecture supports this model through the Sandcat plugin, which compiles agent executables on request with the C2 configuration embedded:
# Simplified from plugins/sandcat/app/sand_api.py
from aiohttp import web

routes = web.RouteTableDef()

@routes.post('/sand/sandcat.go')
async def download_sandcat(request: web.Request) -> web.Response:
    # Build options arrive as HTTP headers on the download request
    headers = request.headers
    platform = headers.get('platform', 'windows')
    server = headers.get('server', 'http://localhost:8888')
    # Generate per-operation agent with embedded config.
    # generate_agent stands in for the compile step and is not defined here.
    payload = await generate_agent(
        platform=platform,
        server=server,
        group=headers.get('group', 'red'),
        c2=headers.get('c2', 'HTTP'),
    )
    return web.Response(body=payload)
Each operation can request agents with different C2 protocols (HTTP, TCP, DNS, WebSocket) and group memberships, avoiding the signature accumulation problem where a single reused agent binary becomes heavily signatured. The agents poll for "instructions"—essentially ability commands with their required facts injected as variables.
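As a rough sketch of what "facts injected as variables" means, the snippet below expands a command template into concrete instructions. The `#{trait}` placeholder style is modeled on Caldera's ability commands, but `build_instructions` is a hypothetical helper written for this article, and pairing every value with every other value is a simplification that ignores relationships between facts.

```python
import re
from itertools import product
from typing import Dict, List

VARIABLE = re.compile(r"#\{([^}]+)\}")


def build_instructions(template: str, facts: Dict[str, List[str]]) -> List[str]:
    """Expand a command template once per combination of known fact values."""
    traits = list(dict.fromkeys(VARIABLE.findall(template)))  # unique, in order
    if any(not facts.get(trait) for trait in traits):
        return []  # a required fact is unknown, so nothing can be issued yet
    commands = []
    for combo in product(*(facts[trait] for trait in traits)):
        values = dict(zip(traits, combo))
        # A function replacement avoids backslash escapes being interpreted.
        commands.append(VARIABLE.sub(lambda m: values[m.group(1)], template))
    return commands


template = r"net use \\#{host.ip.address}\C$ /user:#{host.user.name}"
commands = build_instructions(
    template,
    {"host.ip.address": ["192.168.1.10", "192.168.1.22"],
     "host.user.name": ["administrator"]},
)
```

Two known IP addresses and one known username yield two instructions, one per target. With no known IP address the function returns an empty list, which corresponds to the ability not being eligible at all.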
The facts system is brilliantly simple but powerful. Every ability execution returns structured output parsed into typed facts:
facts:
  - trait: host.user.name
    value: administrator
    source: 5c4dd985-89e3-48a3-89e3-6726c70c9682  # Ability execution ID
  - trait: host.user.password
    value: P@ssw0rd123
    source: 5c4dd985-89e3-48a3-89e3-6726c70c9682
    relationships:
      - edge: has_password
        target: 0  # References the username fact above
This creates a dependency graph in which relationships link facts together: a password fact knows which username it belongs to, and a process fact knows which host it is running on. Abilities can require specific relationship chains: "give me any user for which a has_password relationship exists AND which has a has_admin_on relationship to the current host."
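The quoted requirement can be read as a set intersection over edges. The sketch below assumes a flat list of (source, edge, target) triples; the `Edge` tuple, the function name, and the sample values other than the administrator credentials shown earlier are made up for illustration and do not reflect Caldera's internal data model.

```python
from typing import List, NamedTuple, Set


class Edge(NamedTuple):
    source: str  # value of the source fact, e.g. a username
    edge: str    # relationship name
    target: str  # value of the target fact


def users_with_password_and_admin(edges: List[Edge], host: str) -> Set[str]:
    """Users that have a known password AND admin rights on the given host."""
    has_password = {e.source for e in edges if e.edge == "has_password"}
    admin_here = {e.source for e in edges
                  if e.edge == "has_admin_on" and e.target == host}
    return has_password & admin_here


edges = [
    Edge("administrator", "has_password", "P@ssw0rd123"),
    Edge("administrator", "has_admin_on", "192.168.1.10"),
    Edge("svc_backup", "has_password", "example-password"),  # no admin edge
    Edge("jdoe", "has_admin_on", "192.168.1.10"),            # no password edge
]
matches = users_with_password_and_admin(edges, "192.168.1.10")
```

Only `administrator` satisfies both halves of the chain, so only that user would unlock an ability with this requirement on that host.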
The plugin system expands this model dramatically. Each plugin defines its own data models, REST API endpoints, and abilities. The Atomic Red Team plugin consumes their entire YAML test library wholesale, mapping Atomic tests to Caldera abilities. The Emu plugin ingests CTID adversary emulation plans. The Debrief plugin generates ATT&CK Navigator layers and detailed reports from operation results.
The architectural flexibility shows in the Response plugin, which inverts the entire framework for incident response—same abilities and agents, but orchestrating remediation instead of compromise. An ability that dumps credentials becomes an ability that rotates them; lateral movement becomes lateral containment.
Gotcha
The YAML file storage backend creates a hard scaling ceiling. Operations are stored as single YAML files containing all abilities, facts, relationships, and execution results. Once an operation hits 500-1000 ability executions—common in multi-day exercises with extensive lateral movement—file I/O becomes the bottleneck. The planner must load the entire operation state, evaluate preconditions against thousands of facts, and write results back. There's no database backend option, no operation sharding, and no incremental fact processing.
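A back-of-the-envelope model shows why this pattern degrades faster than linearly. Assuming, as described above, that the whole operation file is rewritten after every ability execution and that each result adds a roughly constant number of bytes (both assumptions, and the per-result size is a free parameter rather than a measured value), the cumulative bytes written grow with the square of the execution count.

```python
def cumulative_bytes_written(executions: int, bytes_per_result: int) -> int:
    """Closed form: the k-th rewrite writes k * bytes_per_result bytes."""
    return bytes_per_result * executions * (executions + 1) // 2


def simulated_bytes_written(executions: int, bytes_per_result: int) -> int:
    """Same quantity computed by stepping through every rewrite."""
    total = 0
    file_size = 0
    for _ in range(executions):
        file_size += bytes_per_result  # the new result is appended to the state
        total += file_size             # then the whole file is written back
    return total


# Ratio of total write volume between a 1000-execution and a 100-execution run.
growth = cumulative_bytes_written(1000, 1) / cumulative_bytes_written(100, 1)
```

Under this model a tenfold increase in executions costs roughly a hundredfold increase in total bytes written, which is consistent with performance falling off sharply somewhere in the hundreds-to-thousands range rather than degrading gradually.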
Agent operational security is functional but dated by modern red team standards. Sandcat, the default Go agent, has basic string obfuscation and jitter in callback timing, but lacks syscall unhooking, ETW patching, AMSI bypassing, or process injection techniques that commercial C2s now include as table stakes. The agent deployment model—compile per-operation, use once—actually helps avoid signature accumulation, but the techniques themselves are well-known to EDR vendors. If you're testing detection engineering against sophisticated threat actors or actually trying to evade competent blue teams, Caldera agents will get caught. The framework was designed for purple team exercises where detection is validated cooperatively, not bypassed adversarially.
Verdict
Use if: You're running structured purple team programs where repeatable, documented, ATT&CK-mapped exercises matter more than raw offensive capability. The planning system enables realistic attack chains that adapt to environmental discoveries rather than replaying static playbooks, and the plugin ecosystem (SCADA/ICS environments, ML attacks, cloud platforms) provides breadth no competitor matches. It's ideal for compliance-driven adversary emulations, teaching adversary tactics, or building custom security automation where the REST API and plugin architecture let you compose workflows.
Skip if: You need operational security against competent defenders for actual red team engagements; the agents are well-signatured and lack modern evasion techniques. Also skip if you're running multi-week operations with extensive lateral movement across hundreds of hosts; the YAML storage backend doesn't scale and you'll hit painful performance walls past ~1000 ability executions per operation. For real offensive work, Cobalt Strike or Mythic remain superior choices.