Dynamic Risk Assessment: Teaching AI Hackers When to Stop Before They Cause Real Damage

Hook

What happens when your autonomous penetration testing agent discovers a critical vulnerability in production systems—but can’t tell the difference between a test environment and your company’s live database?

Context

The cybersecurity industry is racing toward autonomous offensive agents powered by large language models. Tools like PentestGPT and LLM-augmented Metasploit modules promise to automate the time-consuming work of penetration testing, vulnerability assessment, and exploit development. But there’s a fundamental problem: these agents are remarkably effective at finding and exploiting vulnerabilities, yet they lack the judgment to assess whether executing a particular command will cause catastrophic damage.

Traditional approaches to AI safety in cybersecurity focus on pre-deployment evaluation—testing models in sandboxed environments before letting them loose. But this static assessment fails to capture the dynamic nature of pentesting operations, where each action changes the system state and opens new attack vectors. Princeton’s Polaris Lab addresses this gap with Dynamic-Risk-Assessment, a NeurIPS 2025 framework that continuously evaluates risk during agent operation. Rather than asking “is this model safe?” once before deployment, it asks “is this specific action safe right now?” at every decision point, enabling more nuanced control over autonomous offensive security operations.
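The per-action framing can be illustrated with a toy gate. This is entirely hypothetical (the framework's actual checks are richer and model-driven), but it captures the shift from one-time vetting to a decision at every step: before executing a proposed command, classify its blast radius against the current target.

```python
# Toy per-action risk gate (illustrative only, not the framework's actual logic).
# Given a proposed shell command and whether the target is production,
# decide to allow it, flag it for human review, or block it outright.
RISKY_PATTERNS = ("rm -rf", "mkfs", "dd if=", "DROP TABLE")

def gate_action(command: str, target_is_production: bool) -> str:
    """Return 'allow', 'review', or 'block' for a proposed agent command."""
    if any(pattern in command for pattern in RISKY_PATTERNS):
        return "block" if target_is_production else "review"
    return "allow"

print(gate_action("rm -rf /var/www", target_is_production=True))
print(gate_action("nmap -sV 10.0.0.5", target_is_production=True))
```

The point is the call site: the gate runs on every action, with the current environment state as input, rather than once before deployment.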

Technical Insight

The framework builds on InterCode, leveraging Docker containers to provide isolated CTF (Capture The Flag) environments where agents can execute bash commands without threatening production systems. The architecture separates model hosting from evaluation logic: models run via vLLM API on a host machine, while the evaluation harness spawns multiple Docker containers to test agent behavior in parallel.

The risk assessment mechanism operates through four complementary strategies. First, repeated sampling quantifies uncertainty by running the same task multiple times and measuring consistency—if an agent produces wildly different solutions across runs, that’s a red flag indicating unreliable behavior. Second, iterative prompt refinement adapts instructions based on failure modes observed during evaluation. Third, self-training on successful trajectories creates a feedback loop where agents learn from their own successes in development environments. Fourth, workflow refinement (the most sophisticated approach) breaks complex penetration testing tasks into verifiable sub-steps with explicit risk checkpoints.
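The first strategy, repeated sampling, can be sketched in a few lines. This helper is hypothetical (the paper's actual uncertainty measure is not spelled out here), but it shows the idea: run the task several times and treat low agreement across runs as a warning signal.

```python
# Hypothetical consistency check for repeated sampling: the fraction of runs
# that agree with the most common solution. A score near 1/n is the
# "wildly different solutions" red flag described above.
from collections import Counter

def consistency_score(solutions: list[str]) -> float:
    """Fraction of sampled runs that match the modal solution."""
    if not solutions:
        return 0.0
    modal_count = Counter(solutions).most_common(1)[0][1]
    return modal_count / len(solutions)

runs = ["cat flag.txt", "cat flag.txt", "strings binary | grep flag"]
print(f"{consistency_score(runs):.2f}")  # 0.67
```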

Here’s what a basic evaluation setup looks like:

# Example configuration for running dynamic risk assessment
from intercode_ctf.benchmark import CTFBenchmark
from vllm import LLM, SamplingParams

# Initialize model with specific sampling parameters
model = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=2048,
    n=10  # Generate 10 samples for pass@k evaluation
)

# Run evaluation with risk assessment
benchmark = CTFBenchmark(
    docker_image="ctf-env:latest",
    num_workers=10,  # Parallel Docker containers
    timeout_per_task=300
)

results = benchmark.evaluate(
    model=model,
    sampling_params=sampling_params,
    risk_assessment_mode="iterative",
    max_refinement_iterations=3
)

# Compute pass@k metrics with confidence intervals
for k in [1, 5, 10]:
    success_rate = results.compute_pass_at_k(k)
    confidence = results.bootstrap_confidence_interval(k, n_bootstrap=1000)
    print(f"Pass@{k}: {success_rate:.2%} (95% CI: {confidence})")

The statistical rigor is noteworthy: the framework runs each task 12 times and computes pass@k metrics—the probability that at least one of k attempts succeeds. This approach, borrowed from code generation research, captures both capability (can the agent solve the task at all?) and reliability (how consistently does it succeed?). A model with 50% pass@1 but 90% pass@5 indicates high capability with high variance—exactly the kind of inconsistent behavior that makes autonomous deployment risky.
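Pass@k itself is cheap to compute with the standard unbiased estimator from the code-generation literature (n samples, c successes, budget k). This is a generic sketch, not code from the repository:

```python
# Unbiased pass@k estimator: probability that at least one of k draws
# (without replacement) from n samples containing c successes succeeds.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 12 runs, 6 successes: pass@1 is 0.5, but pass@5 climbs sharply
print(pass_at_k(12, 6, 1), pass_at_k(12, 6, 5))
```

This is exactly the capability-versus-reliability split described above: pass@1 measures how often a single attempt works, while the gap to pass@5 measures how much the agent's variance is masking its capability.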

The self-training pipeline demonstrates the framework’s research-oriented nature. It collects successful agent trajectories from development sets, formats them as training examples, and uses them to fine-tune the base model. This creates a virtuous cycle where agents learn from their own successful penetration testing attempts:

# Self-training workflow (simplified)
# 1. Collect successful trajectories from a dev-set evaluation run
#    (results_dev is the output of an earlier benchmark.evaluate call on the dev split)
successful_trajectories = benchmark.filter_successful(
    results_dev,
    min_confidence=0.8
)

# 2. Format as training examples
training_data = [
    {
        "instruction": trajectory.task_description,
        "context": trajectory.environment_state,
        "response": trajectory.command_sequence,
        "risk_level": trajectory.assessed_risk
    }
    for trajectory in successful_trajectories
]

# 3. Fine-tune model (requires S1 framework setup)
from s1.training import train_model
finetuned_model = train_model(
    base_model=model,
    training_data=training_data,
    training_config="configs/self_train.yaml"
)

The Docker isolation is critical for safety. Each evaluation spawns a fresh container with a pre-configured CTF challenge, executes agent commands within that sandbox, and tears down the environment afterward. This prevents agents from accessing the host system even if they discover container escape vulnerabilities—though the README wisely notes that determined agents with sufficient capability could theoretically break out, requiring additional security hardening for truly adversarial scenarios.
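A minimal sketch of what that per-task isolation might look like, assembling the `docker run` invocation for one throwaway challenge container. The specific flags here are assumptions for illustration; the repository's actual container configuration may differ.

```python
# Hypothetical per-task sandbox invocation: fresh, self-deleting, locked-down
# container for a single CTF challenge. Flags are illustrative hardening choices.
import shlex

def sandbox_cmd(image: str, task_cmd: str, timeout: int = 300) -> list[str]:
    """Build a locked-down `docker run` call for one CTF task."""
    return [
        "docker", "run",
        "--rm",                 # auto-remove: a fresh environment every task
        "--network", "none",    # no network path back to the host or LAN
        "--memory", "2g",       # cap memory consumption
        "--pids-limit", "256",  # blunt fork bombs
        image,
        "timeout", str(timeout), "bash", "-c", task_cmd,
    ]

print(" ".join(shlex.quote(part) for part in sandbox_cmd("ctf-env:latest", "ls /challenge")))
```

Even with these flags, container isolation is a boundary, not a guarantee, which is why the README's caveat about capable agents and container escapes matters.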

Gotcha

This is decidedly a research framework, not production tooling, and the operational complexity reflects that reality. You’ll need substantial computational infrastructure: a machine capable of hosting 70B+ parameter models via vLLM (realistically requiring multiple high-end GPUs), the ability to run 10+ parallel Docker containers simultaneously, and storage for extensive logging and trajectory collection. The documentation assumes familiarity with academic ML workflows and leaves several implementation details as “exercises for the reader.”

The self-training component has particularly sharp edges. It requires installing the S1 framework separately, with its own dependency tree that may conflict with the main evaluation environment. Several scripts reference hardcoded paths that need manual adjustment for your setup. The workflow refinement strategy—the most theoretically interesting approach—has minimal documentation beyond the research paper, meaning you’ll be reading academic prose and reverse-engineering implementation details from sparse code comments. With only 8 GitHub stars and no evidence of active community development, you’re largely on your own for troubleshooting.

Moreover, the framework evaluates on specific CTF benchmarks (InterCode CTF, CyBench, NYU CTF) that may not reflect your actual penetration testing scenarios. Real-world security assessments involve custom applications, unique network topologies, and business context that standardized benchmarks cannot capture. The risk assessment strategies are validated against these academic datasets but may require significant adaptation for operational use.

Verdict

Use if: You’re a security researcher investigating safe deployment of autonomous offensive agents, need rigorous benchmarking methodology for comparing LLM-based pentesting approaches, or want to implement dynamic risk assessment in experimental security automation systems. The statistical evaluation framework alone (pass@k metrics, confidence intervals, and isolated execution environments) provides valuable methodology even if you don’t adopt the full system. It is also valuable if you’re writing papers on AI safety in cybersecurity and need a peer-reviewed reference implementation.

Skip if: You need production-ready penetration testing tools for actual security assessments, lack multi-GPU infrastructure for hosting large language models, or want something with active community support and comprehensive documentation. Also skip if you’re looking for plug-and-play automation rather than a research platform requiring significant customization. For operational pentesting, stick with PentestGPT or traditional Metasploit workflows augmented with LLM assistance through simpler integrations.
