CALDERA: Building Autonomous Adversary Emulation with MITRE's ATT&CK Framework

Hook

Many security teams test their defenses by running canned scripts in sequence. CALDERA takes a different approach: it uses what it learns from your environment to decide which attack technique to execute next, mimicking how real adversaries adapt their tactics in real time.

Context

The cybersecurity industry has long struggled with a fundamental testing problem: how do you validate that your defensive controls actually work against real-world adversary behavior? Traditional penetration testing happens too infrequently and focuses on finding vulnerabilities rather than testing detection capabilities. Meanwhile, running isolated attack scripts doesn't reveal whether your security operations center can detect a multi-stage intrusion where each technique builds on information gathered from previous steps.

MITRE, the organization behind the ATT&CK framework that catalogs adversary tactics and techniques, recognized this gap and built CALDERA as an automated adversary emulation platform. Released as open source, it transforms the ATT&CK framework from a static knowledge base into an executable testing platform. Rather than manually scripting attack sequences, CALDERA lets you define adversary profiles using ATT&CK technique IDs, deploy agents to target systems, and run operations that either follow predetermined playbooks or autonomously chain techniques together based on what they discover. This enables purple team exercises where red and blue teams collaborate to systematically validate detection coverage across the ATT&CK matrix.

Technical Insight

CALDERA's architecture centers on an asynchronous Python backend (using aiohttp) that serves both a REST API and a VueJS-based web interface. The real architectural innovation lies in its plugin system and autonomous planning engine. Everything beyond the core framework—agents, technique implementations, reporting, even the UI itself—is a plugin. This modularity means you can swap out components or add capabilities without touching core code.

The fundamental building blocks are "abilities" (implementations of specific ATT&CK techniques), "adversaries" (collections of abilities grouped by campaign or threat actor), and "operations" (execution instances of an adversary profile). Each ability is defined in YAML with executors for different platforms. Here's a simplified example of an ability that discovers running processes:

- id: 43b3754c-def4-4699-aeea-8f3726f9b9e0
  name: Discover running processes
  description: Identify active processes
  tactic: discovery
  technique:
    attack_id: T1057
    name: Process Discovery
  platforms:
    darwin:
      sh:
        command: ps aux
    linux:
      sh:
        command: ps -ef
    windows:
      psh:
        command: Get-Process | Select-Object ProcessName, Id, Path | ConvertTo-Json
  access:
    min: User
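
To make the per-platform executor layout concrete, here is a minimal sketch of how a command could be picked for a given agent. It is an illustration, not CALDERA's own code: the `ABILITY` dict is a hand-written mirror of the YAML above, and `select_command` is a hypothetical helper name.

```python
# Hypothetical in-memory form of the ability above, as a YAML loader might return it.
ABILITY = {
    "id": "43b3754c-def4-4699-aeea-8f3726f9b9e0",
    "name": "Discover running processes",
    "tactic": "discovery",
    "technique": {"attack_id": "T1057", "name": "Process Discovery"},
    "platforms": {
        "darwin": {"sh": {"command": "ps aux"}},
        "linux": {"sh": {"command": "ps -ef"}},
        "windows": {
            "psh": {
                "command": "Get-Process | Select-Object ProcessName, Id, Path | ConvertTo-Json"
            }
        },
    },
}


def select_command(ability, platform, executors):
    """Return (executor, command) for the first executor the agent supports, else None."""
    for name, spec in ability.get("platforms", {}).get(platform, {}).items():
        if name in executors:
            return name, spec["command"]
    return None


if __name__ == "__main__":
    print(select_command(ABILITY, "linux", ["sh"]))     # a Linux agent with sh
    print(select_command(ABILITY, "windows", ["cmd"]))  # no matching executor
```

An agent that reports a platform or executor the ability does not cover simply gets no command, which is why one ability can safely ship with only the platforms it supports.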

What makes this powerful is the facts and rules system. Abilities can define what facts they require as preconditions and what facts they produce as output. For example, a credential dumping ability might require the fact "elevated.privileges=true" and produce facts like "discovered.credential" with specific usernames and passwords. The planning engine uses these relationships to build attack graphs.
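
One way to picture fact preconditions: CALDERA ability commands can reference facts through `#{trait}` placeholders, and a command is only usable once every placeholder can be filled. The sketch below reimplements that idea in a simplified form. The function name `expand_command` and the fact names are assumptions for illustration, and the real engine does considerably more (rules, relationships, scoring).

```python
import itertools
import re

PLACEHOLDER = re.compile(r"#\{([^}]+)\}")


def expand_command(template, facts):
    """Return one concrete command per combination of known fact values.

    `facts` maps a trait name to the list of values discovered so far.
    An empty result means a required fact is missing, so the ability
    is not runnable yet.
    """
    traits = list(dict.fromkeys(PLACEHOLDER.findall(template)))  # unique, in order
    if any(not facts.get(trait) for trait in traits):
        return []
    commands = []
    for combo in itertools.product(*(facts[trait] for trait in traits)):
        values = dict(zip(traits, combo))
        commands.append(PLACEHOLDER.sub(lambda m: values[m.group(1)], template))
    return commands


if __name__ == "__main__":
    known = {"host.user.name": ["alice", "bob"]}  # hypothetical discovered facts
    print(expand_command("net user #{host.user.name}", known))
    print(expand_command("dir \\\\#{remote.host.fqdn}\\C$", known))
```

A command with no placeholders is always runnable, while one that needs an undiscovered trait yields nothing until some earlier ability produces that fact.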

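CALDERA includes several planners with different logic strategies. The default "atomic" planner sends one ability at a time to each agent, working through the adversary profile in order and passing over abilities whose required facts are not yet available. The "batch" planner instead sends every currently applicable ability at once, and the "buckets" planner groups abilities by ATT&CK tactic. Custom planners can implement their own decision logic. Here's how you might start and monitor an operation programmatically through the REST API: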

import asyncio

import aiohttp


async def run_operation():
    headers = {'KEY': 'ADMIN123'}  # API key from the server config (change the default)
    base_url = 'http://localhost:8888'

    # Define operation parameters. The adversary, planner, and source IDs
    # below are illustrative; a server may expect each object's UUID instead.
    operation = {
        'name': 'Purple Team Exercise',
        'adversary': {'adversary_id': 'hunter'},  # Built-in adversary
        'planner': {'id': 'atomic'},  # Planner that orders the abilities
        'source': {'id': 'basic'},  # Fact source
        'group': 'red',  # Agent group to target
        'auto_close': False,
        'state': 'running',
        'autonomous': 1,  # Run without manual approval of each link
        'obfuscator': 'plain-text',
        'jitter': '2/8'  # Random delay between commands
    }

    async with aiohttp.ClientSession(headers=headers) as session:
        # Create the operation and fail fast on an HTTP error
        async with session.post(
            f'{base_url}/api/v2/operations',
            json=operation
        ) as resp:
            resp.raise_for_status()
            op_data = await resp.json()
            op_id = op_data['id']
            print(f"Started operation: {op_id}")

        # Poll every 10 seconds until the operation reports it has finished
        while True:
            await asyncio.sleep(10)
            async with session.get(
                f'{base_url}/api/v2/operations/{op_id}'
            ) as resp:
                resp.raise_for_status()
                status = await resp.json()
                print(f"Status: {status['state']}, "
                      f"Links executed: {len(status['chain'])}")

                if status['state'] == 'finished':
                    break


if __name__ == '__main__':
    asyncio.run(run_operation())
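
Once an operation has finished, the `chain` returned by the API can be reduced to a per-technique summary, which is the raw material for a coverage review. The sketch below is hedged: it assumes each link is a dict carrying `ability.technique_id` and an integer `status` where 0 means success. Verify those field names against your server's responses before relying on them. `summarize_chain` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def summarize_chain(chain):
    """Count link outcomes per ATT&CK technique ID.

    Assumed schema: link["ability"]["technique_id"] is the technique and
    link["status"] == 0 marks a successful execution.
    """
    summary = defaultdict(Counter)
    for link in chain:
        technique = link.get("ability", {}).get("technique_id", "unknown")
        outcome = "success" if link.get("status") == 0 else "other"
        summary[technique][outcome] += 1
    return {technique: dict(counts) for technique, counts in summary.items()}


if __name__ == "__main__":
    sample_chain = [  # hand-written sample, not real server output
        {"ability": {"technique_id": "T1057"}, "status": 0},
        {"ability": {"technique_id": "T1057"}, "status": 1},
        {"ability": {"technique_id": "T1087"}, "status": 0},
    ]
    print(summarize_chain(sample_chain))
```

Comparing this summary with the alerts your detection stack raised during the same window shows which executed techniques went unnoticed.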

The agent architecture deserves attention as well. The default Sandcat agent (written in Go for cross-platform support) communicates over HTTP/S with configurable beaconing intervals and jitter. It's a full-featured C2 agent that receives ability commands, executes them using the appropriate executor (PowerShell, cmd, sh, and so on), and reports the results, from which facts are parsed. The modular design means you can write custom agents in any language, since the protocol is just HTTP with JSON payloads.
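
Jitter is easy to show in isolation. The sketch below assumes a `min/max` string in seconds, like the `'2/8'` value in the operation request above, and picks a random wait inside that range. `parse_jitter` and `next_delay` are hypothetical helper names, not part of Sandcat or the CALDERA API.

```python
import random


def parse_jitter(jitter):
    """Split a 'min/max' string such as '2/8' into integer bounds in seconds."""
    low, high = (int(part) for part in jitter.split("/"))
    if low < 0 or high < low:
        raise ValueError(f"invalid jitter range: {jitter!r}")
    return low, high


def next_delay(jitter, rng=random):
    """Pick a random delay inside the jitter range, inclusive of both ends."""
    low, high = parse_jitter(jitter)
    return rng.randint(low, high)


if __name__ == "__main__":
    # Five example waits; the values differ from run to run.
    print([next_delay("2/8") for _ in range(5)])
```

Randomizing the wait keeps check-ins from landing on a fixed period, which is the kind of regular pattern that network detections for beaconing look for.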

One of CALDERA's most valuable features for defenders is its fact collection and relationship mapping. As operations run, the system builds a knowledge graph of everything discovered: user accounts, network shares, running processes, installed software, lateral movement opportunities. This isn't just logged; it actively influences what happens next. If an ability discovers a domain administrator credential, the planner can automatically schedule privilege escalation or lateral movement techniques that require those credentials. This emergent behavior creates more realistic attack simulations than static playbooks.
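
The way discovered facts unlock later techniques can be sketched as simple forward chaining. This is a toy model under stated assumptions, not CALDERA's planner: each ability is a hypothetical dict with `requires` and `produces` sets of trait names, and an ability runs as soon as everything it requires is known.

```python
def plan(abilities, initial_facts):
    """Greedy forward chaining over abilities.

    Each ability is a dict with 'name', 'requires' (traits that must be
    known) and 'produces' (traits it adds on success). Returns the order
    in which abilities become runnable; unreachable ones are left out.
    """
    known = set(initial_facts)
    remaining = list(abilities)
    order = []
    progressed = True
    while remaining and progressed:
        progressed = False
        for ability in list(remaining):
            if ability["requires"] <= known:
                order.append(ability["name"])
                known |= ability["produces"]
                remaining.remove(ability)
                progressed = True
    return order


if __name__ == "__main__":
    profile = [  # hypothetical adversary profile
        {"name": "lateral_move", "requires": {"remote.host", "domain.credential"}, "produces": set()},
        {"name": "dump_creds", "requires": {"elevated"}, "produces": {"domain.credential"}},
        {"name": "find_hosts", "requires": set(), "produces": {"remote.host"}},
    ]
    print(plan(profile, {"elevated"}))
```

In this toy profile, lateral movement is listed first but runs last, because it only becomes possible after credential dumping and host discovery have produced the facts it needs.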

The plugin architecture extends beyond just abilities. The Response plugin, for instance, flips the framework for incident response scenarios—letting you deploy response actions across compromised hosts. The ICS/OT plugins add industrial control system attack capabilities. The Training plugin includes a complete certification course. This extensibility means CALDERA grows with your needs without requiring core modifications.

Gotcha

CALDERA's comprehensive feature set comes with real operational overhead. The documentation recommends 8GB+ of RAM, but in practice you'll want 16GB or more for stable operations, especially when running multiple concurrent operations or using resource-intensive plugins. The web interface can become sluggish with large operation chains (hundreds of executed abilities), and stored data can grow quickly if you're not periodically cleaning out old operations. This isn't a tool you casually spin up on a laptop for quick tests.

The learning curve is genuinely steep. The built-in training course takes 4-6 hours to complete, and that's practically mandatory before you'll be productive. Understanding the relationship between abilities, adversaries, operations, planners, fact sources, and the various plugin interactions requires hands-on experimentation. The YAML syntax for defining abilities is well-documented but finicky—missing a required field or incorrect indentation breaks things silently. More problematic is that the default agents and many included abilities are well-known to defensive tools. If you're testing against mature security operations, Sandcat will likely be detected immediately, and many Stockpile abilities will trigger alerts. To use CALDERA for realistic red team engagements, you'll need to invest significant time developing custom agents and evasive techniques, which defeats much of the automation benefit. This is fundamentally a purple team and validation tool, not a covert offensive platform out of the box.
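
Because a malformed ability definition can fail silently, a small pre-flight check before loading custom abilities may save debugging time. The sketch below works on an already-parsed ability (a plain dict). The list of required fields is an assumption based on the example ability earlier in this article, not an official schema, and `validate_ability` is a hypothetical helper.

```python
REQUIRED_FIELDS = ("id", "name", "tactic", "technique", "platforms")  # assumed set


def validate_ability(ability):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in ability]
    technique = ability.get("technique")
    if isinstance(technique, dict) and "attack_id" not in technique:
        problems.append("technique has no attack_id")
    platforms = ability.get("platforms")
    if isinstance(platforms, dict):
        for platform, executors in platforms.items():
            if not isinstance(executors, dict) or not executors:
                problems.append(f"{platform}: no executors defined")
                continue
            for executor, spec in executors.items():
                # A mis-indented 'command' key typically leaves the executor empty.
                if not isinstance(spec, dict) or not spec.get("command"):
                    problems.append(f"{platform}/{executor}: no command")
    return problems


if __name__ == "__main__":
    broken = {
        "id": "example-id",  # placeholder, not a real ability ID
        "name": "Discover running processes",
        "technique": {"name": "Process Discovery"},
        "platforms": {"linux": {"sh": None}},
    }
    for problem in validate_ability(broken):
        print(problem)
```

Running a check like this in CI over your custom ability files turns a silent load failure into an explicit list of what to fix.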

Verdict

Use if: You're building a purple team program and need systematic ATT&CK coverage validation, you want autonomous adversary emulation that adapts based on discovered facts rather than static playbooks, you need to justify security tool purchases with concrete detection gap analysis, or you're training security analysts on adversary TTPs with hands-on exercises. CALDERA excels when you have the resources to properly deploy it and the time to learn its complexities; the ROI for organizations serious about detection engineering is substantial.

Skip if: You need lightweight tooling for quick security checks (Atomic Red Team is better), you're conducting covert red team engagements against mature defenses without time for significant customization (commercial C2 frameworks are more suitable), you're working in resource-constrained environments that can't dedicate the RAM and compute, or you need something team members can learn in an afternoon. CALDERA is a power tool that requires commitment; there are simpler alternatives for simpler needs.