Teaching AI Agents to Think Like CIA Analysts: Inside the Structured Analysis Skill
Hook
The CIA’s 2009 Tradecraft Primer isn’t just for spies anymore: it’s now an executable framework teaching AI agents to avoid the same cognitive biases behind intelligence failures like the Iraq WMD assessment.
Context
Large language models hallucinate, jump to conclusions, and cherry-pick evidence that confirms their initial hunches—exactly like humans do. When GPT-4 analyzes a security incident or evaluates competing business strategies, it exhibits the same confirmation bias that plagued the intelligence community’s assessment of weapons of mass destruction in Iraq. The difference? Human analysts have spent decades developing Structured Analytic Techniques (SATs) to combat these biases through rigorous protocols like Analysis of Competing Hypotheses and Key Assumptions Checks.
structured-analysis-skill bridges this gap by implementing 18 battle-tested intelligence techniques as executable protocols for Claude Code. Rather than letting AI agents freewheel through analysis, it enforces the same disciplined thinking that CIA analysts use: explicit hypothesis generation, systematic evidence evaluation, assumption testing, and multi-layer self-correction. The framework emerged from a recognition that AI-assisted analysis faces identical cognitive traps as human analysis, but can leverage computational advantages like exhaustive evidence tracking and automated bias checks that would be prohibitively tedious for humans.
Technical Insight
The architecture operates as a markdown-based protocol orchestration system that guides Claude through six analytical phases: Problem Assessment, Evidence Collection, Technique Selection, Structured Execution, Self-Correction, and Report Generation. Each phase is defined in detailed markdown files that Claude ingests as procedural instructions, effectively turning declarative documents into executable analytical pipelines.
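A minimal sketch of that orchestration, assuming illustrative handler and state names (the actual skill encodes these phases as markdown protocols, not Python):

```python
from typing import Callable

# The six phases named in the skill's docs; the handler wiring below is illustrative.
PHASES = [
    "Problem Assessment",
    "Evidence Collection",
    "Technique Selection",
    "Structured Execution",
    "Self-Correction",
    "Report Generation",
]

def run_pipeline(state: dict, handlers: dict[str, Callable[[dict], dict]]) -> dict:
    """Thread shared analysis state through each phase in order."""
    for phase in PHASES:
        state = handlers[phase](state)
    return state
```

The point of the linear pipeline is that no phase can be skipped: evidence collection always precedes technique selection, and self-correction always precedes the report.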
The evidence-gathering stage demonstrates the framework’s sophistication. It implements a three-tier source hierarchy: first conversation context, then local files, and finally OSINT sources via optional MCP server integration with Firecrawl or WebSearch. Evidence gets registered with mandatory provenance tracking:
```python
# Evidence Registry Structure (conceptual - actual implementation in markdown protocols)
evidence_registry = {
    "conversation": [
        {"id": "CONV-001", "content": "User mentioned Q3 revenue decline", "tier": 1}
    ],
    "local": [
        {"id": "FILE-001", "source": "./financial_report.pdf", "content": "...", "tier": 2}
    ],
    "osint": [
        {"id": "WEB-001", "url": "https://...", "retrieval_date": "2024-01-15", "tier": 3}
    ]
}

# Every analytical claim must include a citation array
claim = {
    "assertion": "Revenue declined due to supply chain disruption",
    "evidence_ids": ["CONV-001", "FILE-001", "WEB-001"],
    "technique_applied": "Analysis of Competing Hypotheses"
}
```
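Mandatory provenance implies a validation step; here is a hypothetical checker reusing the registry shape above (the function name is an assumption, not part of the skill):

```python
# Hypothetical citation check: every evidence ID cited by a claim must exist
# somewhere in the registry, regardless of tier.
def validate_citations(claim: dict, registry: dict) -> list[str]:
    """Return any cited evidence IDs that are missing from the registry."""
    known = {item["id"] for tier in registry.values() for item in tier}
    return [eid for eid in claim["evidence_ids"] if eid not in known]
```

A non-empty return value would flag a claim resting on unregistered (and therefore unauditable) evidence.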
Technique selection happens adaptively based on problem characteristics parsed from the initial assessment. For problems involving competing explanations, the system defaults to Analysis of Competing Hypotheses (ACH)—the gold standard for reducing confirmation bias. ACH forces the analyst (human or AI) to generate multiple hypotheses upfront, then evaluate evidence against ALL hypotheses simultaneously, flagging which evidence is diagnostic (strongly discriminates between hypotheses) versus non-diagnostic. The protocol file for ACH includes explicit steps:
```markdown
## ACH Protocol Execution
1. Generate 3-7 competing hypotheses (must be mutually exclusive)
2. List all evidence items from registry
3. Create evidence-hypothesis matrix
4. Score each evidence item against each hypothesis:
   - CC (Clearly Consistent): +2
   - SC (Somewhat Consistent): +1
   - N (Neutral): 0
   - SI (Somewhat Inconsistent): -1
   - CI (Clearly Inconsistent): -2
5. Calculate diagnosticity: evidence is diagnostic if scores vary by ≥3 points across hypotheses
6. Identify hypothesis with least inconsistent evidence (not most consistent—this is crucial)
7. Document reasoning in final report with evidence citations
```
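The scoring and selection rules above can be sketched in a few lines. Only the CC/SC/N/SI/CI scale, the ≥3 diagnosticity threshold, and the least-inconsistent rule come from the protocol; all names and the matrix shape are illustrative:

```python
# ACH matrix scoring sketch. Rows are evidence items, columns are hypotheses.
SCORES = {"CC": 2, "SC": 1, "N": 0, "SI": -1, "CI": -2}

def is_diagnostic(row: dict[str, str]) -> bool:
    """Evidence is diagnostic if its scores spread by >= 3 points across hypotheses."""
    vals = [SCORES[label] for label in row.values()]
    return max(vals) - min(vals) >= 3

def least_inconsistent(matrix: dict[str, dict[str, str]]) -> str:
    """Step 6: pick the hypothesis with the least inconsistent evidence,
    i.e. the smallest accumulated negative score, not the biggest positive one."""
    inconsistency: dict[str, int] = {}
    for row in matrix.values():
        for hyp, label in row.items():
            inconsistency[hyp] = inconsistency.get(hyp, 0) + min(SCORES[label], 0)
    return max(inconsistency, key=inconsistency.get)  # closest to zero wins
```

Counting only the negative scores captures why step 6 matters: consistent evidence is cheap (almost any hypothesis accumulates some), while inconsistent evidence is what actually eliminates hypotheses.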
The three-layer self-correction mechanism sets this apart from naive AI workflows. After generating analysis, the skill validates the output against a 14-question rubric covering analytical rigor, evidence quality, assumption documentation, and alternative hypothesis consideration. Issues are severity-ranked (HIGH/MEDIUM/LOW), and HIGH-severity problems trigger automatic remediation: the system rewrites the flawed section before presenting results to the user. This happens transparently:
```markdown
## Self-Correction Layer 1: Analytical Rubric
- Q1: Are assumptions explicitly stated and challenged?
- Q2: Is contrary evidence addressed, not dismissed?
- Q3: Are alternative hypotheses genuinely considered?
...

## Auto-Remediation Logic
IF rubric_score["assumption_documentation"] == HIGH_SEVERITY:
    EXECUTE assumption_check_protocol()
    REGENERATE analysis_section("assumptions")
    REVALIDATE rubric()
```
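A hypothetical Python rendering of that remediation loop, with an assumed retry cap that the skill itself may not impose:

```python
from typing import Callable

MAX_PASSES = 3  # assumed cap to avoid an endless rewrite loop

def self_correct(analysis: dict,
                 validate: Callable[[dict], dict],
                 remediate: Callable[[dict, str], dict]) -> dict:
    """Rewrite HIGH-severity sections and revalidate until the rubric is clean."""
    for _ in range(MAX_PASSES):
        findings = validate(analysis)          # e.g. {"assumptions": "HIGH"}
        flagged = [sec for sec, sev in findings.items() if sev == "HIGH"]
        if not flagged:
            break
        for section in flagged:
            analysis = remediate(analysis, section)
    return analysis
```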
The framework supports three operational modes: full (all 18 techniques available, ~2-hour runtime), guided (technique pre-selected by the user, ~45 minutes), and lean (streamlined evidence collection, ~20 minutes). This acknowledges the practical tension between analytical rigor and time constraints: intelligence analysts face the same tradeoff when deciding whether to run a full Red Team Analysis or a quick Key Assumptions Check.
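The mode tradeoff amounts to a simple time-budget check; here is a sketch with assumed field names and the article's approximate runtimes:

```python
# Assumed configuration; the runtimes are the article's estimates.
MODES = {
    "full":   {"techniques": "all 18",            "approx_minutes": 120},
    "guided": {"techniques": "user pre-selected", "approx_minutes": 45},
    "lean":   {"techniques": "streamlined",       "approx_minutes": 20},
}

def pick_mode(minutes_available: int) -> str:
    """Choose the most rigorous mode that fits the time budget; default to lean."""
    fitting = {m: cfg["approx_minutes"] for m, cfg in MODES.items()
               if cfg["approx_minutes"] <= minutes_available}
    return max(fitting, key=fitting.get) if fitting else "lean"
```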
Context-aware invocation is surprisingly elegant. Users can have a natural conversation about a problem, then simply invoke /analyze without re-explaining context. The skill extracts analytical requirements from conversation history, treating prior messages as tier-1 evidence. This mirrors how human analysts work: gathering information through discussion, then transitioning to formal structured analysis when stakes warrant it.
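A minimal sketch of that handoff, assuming a simple message-list format; the ID scheme mirrors the registry example earlier but is otherwise illustrative:

```python
# Illustrative handoff: on /analyze, prior user turns become tier-1 evidence.
def ingest_conversation(messages: list[dict]) -> list[dict]:
    """Register each prior user message as tier-1 evidence with a sequential ID."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return [
        {"id": f"CONV-{i:03d}", "content": m["content"], "tier": 1}
        for i, m in enumerate(user_turns, start=1)
    ]
```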
Gotcha
The framework is tightly coupled to Claude Code’s specific environment. While the protocols are technically portable markdown files, manually feeding them to other AI assistants means juggling multiple documents as context, losing the seamless workflow integration that makes the skill practical. You’re essentially locked into Anthropic’s ecosystem—there’s no ChatGPT or Gemini version waiting in the wings.
Performance overhead is substantial. The full analysis mode can take hours, and even lean mode requires 20+ minutes. The self-correction loops and comprehensive evidence gathering create latency that’s incompatible with real-time decision support. If you need an answer in the next 5 minutes, this framework will frustrate you. Additionally, with only 4 GitHub stars and no published validation studies, there’s zero empirical evidence that these techniques actually improve AI analysis quality compared to well-prompted standard Claude interactions. The intelligence community developed SATs for human cognition over decades of tradecraft evolution—whether they transfer effectively to transformer architectures is theoretically plausible but practically unproven. You’re betting on first principles (bias mitigation should work similarly for humans and LLMs) without controlled experiments confirming the hypothesis.
Verdict
Use it if you’re conducting high-stakes analysis where errors have serious consequences (security incident investigation, strategic business decisions, geopolitical risk assessment, forensic analysis) and you have hours, not minutes, for rigorous evaluation. The structured techniques genuinely combat cognitive biases that plague both human and AI reasoning, and the mandatory citation system creates the audit trails essential for defensible conclusions. Also use it if you already work in Claude Code and want to augment your analytical workflow with intelligence tradecraft without learning separate tools.

Skip it if you need rapid exploratory analysis, routine decision support, or real-time recommendations where 20+ minute latency kills utility. Skip it if you’re not committed to the Claude ecosystem, since manual protocol management undermines the workflow benefits. Skip it if you’re doing creative ideation or open-ended research where rigid analytical structures constrain rather than enhance thinking.

This is a power tool for analysts who already value structured thinking and want computational assistance, not a magic bullet that transforms casual users into intelligence professionals.