Teaching AI to Think Like a CIA Analyst: Structured Analysis Skills for LLMs
Hook
The CIA spent decades studying how to avoid catastrophic intelligence failures like Pearl Harbor and the Iraq WMD assessment. Now the same techniques for mitigating cognitive bias are being encoded as AI prompts.
Context
Large language models are impressive conversationalists, but their reasoning often resembles that of an overconfident undergraduate: fluent, creative, and occasionally dangerously wrong. They hallucinate sources, fall prey to confirmation bias, and anchor on initial hypotheses without considering alternatives. These aren't just annoying quirks; they're the same cognitive pitfalls that led human analysts to miss critical warnings before Pearl Harbor and to assess with unwarranted confidence that Iraq possessed WMDs.
The intelligence community spent the last 70 years developing Structured Analytic Techniques (SATs)—formal methodologies to counteract these exact failure modes. After the Iraq WMD debacle, these techniques were codified in the CIA's Tradecraft Primer (2009) and became mandatory training. The structured-analysis-skill repository makes a bold claim: these human-tested frameworks can be translated into prompt engineering patterns that guide LLMs through the same rigorous analytical workflows that professional intelligence analysts use. Instead of building yet another agentic framework with Python orchestration layers, it implements the entire system as plain Markdown files that AI assistants can discover and execute.
Technical Insight
The architecture is deceptively simple—a collection of Markdown protocol files—but the execution reveals sophisticated meta-prompting. When you invoke the skill with /analyze, Claude Code doesn't just start generating text. It orchestrates a six-phase analytical workflow that mirrors how intelligence agencies approach complex problems.
The evidence collection layer is particularly clever. Rather than relying solely on the LLM's training data (a recipe for hallucination), the system implements a three-tier evidence hierarchy:
# Evidence Collection Protocol
## Tier 1: Conversation Context
- Review previous 10+ messages for problem framing
- Extract stakeholder claims, constraints, hypotheses
- Document: "User stated X on [timestamp]"
## Tier 2: Local File System
- Search workspace for relevant documentation
- Priority: config files, logs, technical specs
- Citation format: [filename:line_range]
## Tier 3: OSINT (Open Source Intelligence)
- Firecrawl API for structured web scraping
- Fallback: WebSearch + WebFetch tools
- Mandatory: Archive URLs, capture timestamps
This isn't just good practice; the protocol enforces it. Every analytical output must include a citation section, and if the LLM makes a claim without a corresponding evidence pointer, the self-correction layer flags it as a high-severity weakness.
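To make the citation rule concrete, here is a minimal Python sketch of a check that flags claims lacking an evidence pointer. It is not the repository's implementation, which lives in Markdown prompts. The `Tier`, `Evidence`, and `Claim` types, the `find_uncited` helper, and the `deploy.log` citation are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Evidence tiers, mirroring the three levels in the protocol excerpt."""
    CONVERSATION = 1
    LOCAL_FILE = 2
    OSINT = 3


@dataclass(frozen=True)
class Evidence:
    tier: Tier
    citation: str  # e.g. "[filename:line_range]" or an archived URL


@dataclass(frozen=True)
class Claim:
    text: str
    evidence: Optional[Evidence] = None


def find_uncited(claims: list[Claim]) -> list[Claim]:
    """Return claims with no evidence pointer or a blank citation string."""
    return [
        c for c in claims
        if c.evidence is None or not c.evidence.citation.strip()
    ]


# Hypothetical example data: one cited claim, one uncited claim.
claims = [
    Claim("Timeouts began after the deploy",
          Evidence(Tier.LOCAL_FILE, "[deploy.log:120-134]")),
    Claim("The cache layer is the root cause"),  # no evidence attached
]
uncited = find_uncited(claims)
print([c.text for c in uncited])
```

In this sketch the second claim would be reported as uncited, which corresponds to the high-severity "uncited claims" weakness the article describes.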
The technique selection mechanism demonstrates thoughtful prompt engineering. Instead of forcing users to understand 18 different analytical frameworks, the system includes a rubric-based router:
# Conceptual logic (actual implementation is in Markdown)
def select_technique(problem_characteristics):
    if problem_characteristics.has_multiple_plausible_explanations:
        return "Analysis_of_Competing_Hypotheses"  # ACH
    elif problem_characteristics.requires_future_projection:
        return "Alternative_Futures_Analysis"
    elif problem_characteristics.involves_deception_detection:
        return "Deception_Detection"
    # ... 15 more technique mappings
The system infers analytical intent from conversation context. If you've been discussing why a distributed system is failing intermittently, it recognizes this as a diagnostic problem with competing explanations and routes to Analysis of Competing Hypotheses (ACH). If you're evaluating whether to adopt a new technology, it triggers Decision Matrix analysis.
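The routing idea can also be written as an ordered rule table, which keeps priorities explicit and makes new mappings a one-line change. The sketch below is an assumption about how such a router might look in Python, not the repository's code. `ProblemCharacteristics`, `RULES`, and `route` are hypothetical names, and only the three techniques named in this article are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemCharacteristics:
    # Hypothetical flags; the article says the real system infers intent
    # from conversation context rather than from explicit booleans.
    has_multiple_plausible_explanations: bool = False
    requires_future_projection: bool = False
    involves_deception_detection: bool = False


# Ordered (flag name, technique) pairs; earlier rules take priority.
RULES = [
    ("has_multiple_plausible_explanations", "Analysis_of_Competing_Hypotheses"),
    ("requires_future_projection", "Alternative_Futures_Analysis"),
    ("involves_deception_detection", "Deception_Detection"),
]


def route(problem: ProblemCharacteristics) -> Optional[str]:
    """Return the first technique whose flag is set, or None if none match."""
    for flag, technique in RULES:
        if getattr(problem, flag):
            return technique
    return None


# An intermittent distributed-system failure has competing explanations.
intermittent_failure = ProblemCharacteristics(
    has_multiple_plausible_explanations=True
)
print(route(intermittent_failure))
```

Returning `None` when nothing matches is a design choice of this sketch; a real router would need some fallback, and the article does not say what the repository does in that case.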
The self-correction layer is where the intelligence tradecraft really shines. After generating an analytical product, the system doesn't just output results—it validates them against structured rubrics:
## Analytical Quality Rubric
### High-Severity Weaknesses (Auto-Remediate)
- [ ] Uncited claims (evidence tier missing)
- [ ] Single hypothesis considered (confirmation bias)
- [ ] No alternative explanations explored
- [ ] Assumptions not made explicit
### Medium-Severity (Flag for Review)
- [ ] Incomplete evidence search
- [ ] Weak source diversity
- [ ] Limited stakeholder perspectives
### Low-Severity (Document)
- [ ] Minor citation formatting issues
- [ ] Incomplete caveat statements
When high-severity weaknesses are detected, the system doesn't just warn you—it automatically re-runs the relevant analytical phase with corrective prompts. If Analysis of Competing Hypotheses only evaluated two explanations when four were plausible, it forces a re-analysis with explicit instructions to generate and evaluate alternative hypotheses.
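The remediation loop described above can be sketched roughly as follows. This is an illustrative Python model, not the repository's mechanism. The weakness identifiers in `SEVERITY`, the `max_rounds` cap, and the `fake_phase` stand-in for an LLM call are all assumptions made for the example.

```python
from typing import Callable

# Severity per weakness: a subset of the rubric, with hypothetical identifiers.
SEVERITY = {
    "uncited_claims": "high",
    "single_hypothesis": "high",
    "weak_source_diversity": "medium",
    "citation_formatting": "low",
}

# Action the rubric prescribes for each severity level.
ACTION = {"high": "remediate", "medium": "flag", "low": "document"}


def review(weaknesses: list[str]) -> dict[str, list[str]]:
    """Group detected weaknesses by the action the rubric prescribes."""
    actions: dict[str, list[str]] = {"remediate": [], "flag": [], "document": []}
    for weakness in weaknesses:
        actions[ACTION[SEVERITY[weakness]]].append(weakness)
    return actions


def self_correct(
    run_phase: Callable[[list[str]], list[str]],
    max_rounds: int = 3,  # assumed cap so the loop always terminates
) -> dict[str, list[str]]:
    """Re-run a phase with corrective hints while high-severity issues remain."""
    actions = review(run_phase([]))
    rounds = 0
    while actions["remediate"] and rounds < max_rounds:
        # Feed the high-severity weaknesses back in as corrective hints.
        actions = review(run_phase(actions["remediate"]))
        rounds += 1
    return actions


def fake_phase(hints: list[str]) -> list[str]:
    """Stand-in for an LLM call: any weakness named in a hint counts as fixed."""
    found = ["single_hypothesis", "weak_source_diversity"]
    return [w for w in found if w not in hints]


result = self_correct(fake_phase)
print(result)
```

With the stub phase, the single-hypothesis weakness is cleared on the second pass, while the medium-severity item stays flagged for human review instead of triggering another run.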
The protocol files themselves are elegant examples of constrained generation. Here's a simplified excerpt from the ACH protocol:
# Analysis of Competing Hypotheses (ACH)
## Phase 1: Hypothesis Generation
1. List ALL plausible explanations (minimum 3)
2. Include "unlikely but high-impact" scenarios
3. Format: H1, H2, H3...
## Phase 2: Evidence Matrix
Create table:
| Evidence | H1 | H2 | H3 | Source |
|----------|----|----|----|---------|
| [item] | ++ | -- | N/A | [cite] |
Legend: ++ consistent, + somewhat, N/A not applicable, - inconsistent, -- strongly inconsistent
## Phase 3: Refutation Analysis
- Focus on DISCONFIRMING evidence (not confirming)
- Which hypothesis is LEAST refuted?
- Document: "H2 remains viable despite evidence E1, E3"
This inverts the natural (biased) tendency to seek confirming evidence—a technique straight from the Tradecraft Primer. The LLM follows these structural constraints because they're embedded in the protocol file, effectively turning free-form generation into a guided analytical workflow.
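The refutation step can be illustrated with a small scoring sketch. The numeric weights, the evidence labels, and the example matrix are assumptions chosen for illustration, and the protocol excerpt does not prescribe a numeric scheme. The point is only that confirming evidence is ignored and hypotheses are ranked by how much evidence contradicts them.

```python
# Numeric weights for the matrix symbols; the values are illustrative assumptions.
WEIGHTS = {"++": 2, "+": 1, "N/A": 0, "-": -1, "--": -2}


def inconsistency_scores(matrix: dict[str, dict[str, str]]) -> dict[str, int]:
    """Sum only the negative ratings per hypothesis.

    Positive ratings are clamped to zero, so confirming evidence cannot
    offset disconfirming evidence.
    """
    scores: dict[str, int] = {}
    for ratings in matrix.values():
        for hypothesis, symbol in ratings.items():
            penalty = min(WEIGHTS[symbol], 0)
            scores[hypothesis] = scores.get(hypothesis, 0) + penalty
    return scores


def least_refuted(matrix: dict[str, dict[str, str]]) -> str:
    """Return the hypothesis with the smallest total inconsistency.

    Ties resolve to the hypothesis seen first in the matrix.
    """
    scores = inconsistency_scores(matrix)
    return max(scores, key=scores.get)


# Hypothetical evidence matrix: rows are evidence items, columns are hypotheses.
matrix = {
    "E1": {"H1": "++", "H2": "-", "H3": "--"},
    "E2": {"H1": "--", "H2": "+", "H3": "N/A"},
    "E3": {"H1": "+", "H2": "N/A", "H3": "-"},
}
print(inconsistency_scores(matrix))
print(least_refuted(matrix))
```

Here H1 has the most confirming evidence but is strongly contradicted by E2, so H2 comes out as least refuted. That is the inversion the protocol is aiming for.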
Gotcha
The overhead is real and non-negotiable. Even in 'lean' mode, expect 15+ minutes and substantial token consumption for a single analysis. The system makes multiple API calls—evidence gathering, technique execution, self-correction validation, and potentially automated remediation. For quick exploratory questions or brainstorming sessions, this is absurdly over-engineered. You're paying for rigor that you don't need.
The Claude Code dependency creates a significant portability problem. While the protocols are theoretically platform-agnostic Markdown files, the automatic skill discovery, file system access, and OSINT tool integration only work seamlessly in Claude's environment. Using this with ChatGPT or open-source models means manually copying protocol files, orchestrating evidence collection yourself, and losing the self-correction automation. You're left with a manual checklist rather than an integrated system. The Firecrawl requirement for quality OSINT adds another integration hurdle—the fallback WebSearch works but provides noticeably degraded evidence quality, which undermines the entire analytical chain. If you're analyzing a niche technical topic with limited web coverage, expect the evidence tier to be sparse and the resulting analysis to be appropriately hedged (which is honest, but perhaps not what you hoped for).
Verdict
Use if: You're making high-stakes decisions where wrong conclusions are expensive, such as security incident response, architecture reviews for mission-critical systems, vendor evaluation for multi-year commitments, or post-mortem analysis of complex failures. The 15+ minute overhead and token costs are justified when bias-driven errors could cost orders of magnitude more. Also ideal if you're already trained in structured analysis and want LLM augmentation for evidence gathering and hypothesis tracking, or if you're using Claude Code and can leverage the full automation.
Skip if: You need quick answers, are brainstorming, or are working on routine decisions where rough heuristics suffice. Also skip if you're not on Claude Code and aren't willing to manually orchestrate the protocols, because the friction negates most benefits. Finally, skip for truly adversarial scenarios (security assessments where the adversary knows you're using these techniques) or classified contexts where AI augmentation introduces unacceptable risk. This tool augments human judgment; it doesn't replace the need for experienced analysts who understand when tradecraft formalism helps and when it's security theater.