Citadel: Campaign-Persistent Agent Orchestration for Claude Code
Hook
After orchestrating 198 autonomous agents across 32 parallel sessions, one developer built the infrastructure that keeps multi-day AI coding projects from collapsing under their own context weight.
Context
If you’ve used Claude Code or similar AI coding assistants for anything beyond trivial tasks, you’ve hit the fundamental problem: LLM sessions are ephemeral, but real engineering work takes days or weeks. You start a refactoring on Monday, the context window fills up by Tuesday afternoon, and by Wednesday you’re manually re-explaining the architecture because the session state is gone. Scale this to multiple parallel tasks—say, migrating an API across twenty microservices—and you’re drowning in coordination overhead.
Citadel emerged from this exact pain point. Built by Seth Gammon after running 198 autonomous agents and debugging 27 postmortems, it’s a meta-orchestration harness specifically for Claude Code that treats multi-day, multi-agent workflows as a first-class problem. The core insight: persist campaign state as structured markdown in .planning/, route commands through a token-optimized four-tier cascade, and use Git worktrees to isolate parallel agents. This isn’t a general-purpose agent framework trying to solve AGI—it’s production infrastructure for sustained AI-assisted development with Anthropic’s tools.
Technical Insight
Citadel's architecture revolves around intelligent routing that avoids burning tokens on trivial decisions. The four-tier cascade starts with zero-cost pattern matching (regex against commands like `/review src/auth.ts`), then checks for active campaign context (also zero tokens: it just reads `.planning/*.md` files), then attempts skill keyword matching (pattern-based dispatch to six production skills: code review, test generation, documentation, refactoring, debugging, and architecture analysis), and only falls back to LLM classification if all of those fail. Simple requests like `fix the typo in README.md` consume zero API calls for routing, while complex requests like `refactor the authentication system to use OAuth2` get the full LLM treatment.
The campaign persistence system is elegantly simple: markdown files in .planning/ that capture phases, decisions, and continuation state. When you start a multi-day refactoring, Citadel creates a campaign file like this:
```markdown
# Campaign: OAuth2 Migration
Status: IN_PROGRESS
Phase: 2/4

## Phases
1. [COMPLETED] Audit existing auth system
2. [ACTIVE] Replace session tokens in API layer
3. [PENDING] Migrate frontend auth flows
4. [PENDING] Remove legacy auth code

## Decisions
- Using Authorization Code flow (not Implicit)
- Keeping existing user table, adding oauth_tokens relation
- Auth0 as provider (existing enterprise contract)

## Context for Next Session
Completed token replacement in /api/auth/*.ts. Tests passing.
Next: Update frontend AuthContext to use new endpoints.
Watch out: Mobile app still expects old token format.
```
When you return two days later and type `/do continue the OAuth migration`, Citadel reads this file, reconstructs context, and Claude picks up exactly where it left off, with no manual re-explanation needed. The system automatically updates phase status and appends new decisions as work progresses.
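Resuming only requires a small parser over that file. Here is a rough sketch, assuming the field layout of the example above; the actual parser is Citadel's own and may differ:

```javascript
// Hypothetical parser for a campaign file shaped like the example above.
// Naive line-based matching; assumes decisions are "- " bullets and phases
// are numbered "[STATUS] title" entries.
function parseCampaign(markdown) {
  const campaign = { status: null, phases: [], decisions: [] };

  const status = markdown.match(/^Status:\s*(\S+)/m);
  if (status) campaign.status = status[1];

  for (const line of markdown.split('\n')) {
    const phase = line.match(/^\d+\.\s*\[(\w+)\]\s*(.+)/);
    if (phase) campaign.phases.push({ status: phase[1], title: phase[2] });

    const decision = line.match(/^-\s+(.+)/);
    if (decision) campaign.decisions.push(decision[1]);
  }

  // Everything after the final heading becomes the resume prompt.
  const ctx = markdown.split('## Context for Next Session');
  campaign.resumeContext = ctx.length > 1 ? ctx[1].trim() : '';
  return campaign;
}
```

The parsed object is what gets injected into the next session's prompt: status, remaining phases, logged decisions, and the free-form handoff notes.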
Parallel execution via the Fleet tier uses Git worktrees for isolation. When you invoke `/fleet migrate all microservices to TypeScript 5.3`, Citadel spins up isolated worktree checkouts (think lightweight git clones sharing the same .git database) and assigns each microservice to a separate Claude instance. The discovery relay system creates a shared `.fleet/discoveries.jsonl` file where agents log findings:
```jsonl
{"agent": "auth-service", "finding": "TypeScript 5.3 breaks decorators, need experimentalDecorators: false", "timestamp": "2024-01-15T14:32:11Z"}
{"agent": "payment-service", "finding": "@types/node@20.10.0 required for fetch types", "timestamp": "2024-01-15T14:35:22Z"}
```
Each agent reads this file before making architectural decisions, preventing duplicate debugging and spreading solutions across the fleet. After all agents complete, Citadel reconciles changes back to the main branch, handling merge conflicts through a final Claude session with full diff context.
Lifecycle hooks provide quality gates without manual intervention. The system detects your project language (checks for tsconfig.json, package.json, go.mod, Cargo.toml) and automatically runs the appropriate type checker after file edits:
```javascript
// Simplified from Citadel's hook system
const { exec: execCb } = require('child_process');

// Promisified exec that never throws: resolves with exit code and output.
const exec = (cmd) =>
  new Promise((resolve) => {
    execCb(cmd, (err, stdout, stderr) =>
      resolve({ exitCode: err ? err.code ?? 1 : 0, stdout, stderr })
    );
  });

const POST_EDIT_HOOKS = {
  typescript: async (files) => {
    const result = await exec('npx tsc --noEmit');
    if (result.exitCode !== 0) {
      return { status: 'FAILED', feedback: result.stdout || result.stderr };
    }
    return { status: 'PASSED' };
  },
  python: async (files) => {
    const result = await exec('mypy ' + files.join(' '));
    if (result.exitCode !== 0) {
      return { status: 'FAILED', feedback: result.stdout || result.stderr };
    }
    return { status: 'PASSED' };
  },
};
```
If a hook fails, the circuit breaker kicks in: Citadel feeds the error back to Claude with retry instructions, up to three attempts before escalating to human review. This caught issues in production where Claude would make syntactically valid but type-unsafe edits, burning through API credits on cascading failures.
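That retry loop can be sketched as follows. This is an assumption about the control flow, not Citadel's code; `applyEdit`, `runHook`, and `askClaudeToFix` are hypothetical stand-ins for its internals:

```javascript
// Hypothetical circuit breaker: feed hook failures back to the model,
// escalate to a human after three failed attempts.
const MAX_ATTEMPTS = 3;

async function editWithGate(applyEdit, runHook, askClaudeToFix) {
  await applyEdit();
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const result = await runHook();
    if (result.status === 'PASSED') {
      return { outcome: 'accepted', attempts: attempt };
    }
    if (attempt === MAX_ATTEMPTS) break;
    // Hand the type checker's output back as retry instructions.
    await askClaudeToFix(result.feedback);
  }
  return { outcome: 'escalated-to-human', attempts: MAX_ATTEMPTS };
}
```

The bounded loop is what stops the cascading-failure pattern described above: a type-unsafe edit gets at most two corrective round trips before a human sees it.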
The six production skills are pre-prompted Claude sessions optimized for specific tasks. The code review skill, for instance, uses a prompt template that emphasizes security, performance, and maintainability, then outputs structured markdown reports. Skills consume campaign context automatically—if you’re mid-migration and request a review, it knows to check compliance with migration decisions logged in .planning/.
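In this design a skill is little more than a prompt template plus a campaign-context loader. A hypothetical shape, with illustrative field names and template text:

```javascript
// Hypothetical skill definition; the template wording and fields are
// assumptions, not Citadel's actual skill format.
const codeReviewSkill = {
  name: 'code-review',
  keywords: ['review', 'audit'],
  buildPrompt(files, campaign) {
    // Inject logged campaign decisions so the review checks compliance.
    const decisions = campaign
      ? campaign.decisions.map((d) => `- ${d}`).join('\n')
      : '(no active campaign)';
    return [
      'Review the following files for security, performance, and maintainability.',
      'Check compliance with these campaign decisions:',
      decisions,
      `Files: ${files.join(', ')}`,
      'Output a structured markdown report.',
    ].join('\n');
  },
};
```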
Gotcha
The Claude Code dependency is absolute. This harness is tightly coupled to Anthropic’s specific CLI tool, prompt format, and API behaviors. If you’re using Cursor, Aider, or GPT-4 via the OpenAI API, Citadel won’t help you—there’s no abstraction layer for provider swapping. The author made a deliberate trade-off: deep integration with one tool over portability across many. That’s fine if you’re Claude-committed, but it means you can’t A/B test providers or gracefully degrade to cheaper models.
Git worktree parallelism requires careful state management and disk-space planning. Each Fleet agent creates a full working directory (minus the .git database, which is shared), so a monorepo with 20 microservices at 500 MB each means 10 GB of checkouts. The discovery relay is also eventually consistent: if Agent A logs a breaking change at 14:32 but Agent B began its work at 14:30, Agent B won't see that discovery until its next planning phase, which can produce rework or merge conflicts that require human resolution.

Finally, the harness itself has no formal testing infrastructure. It is JavaScript configuration files and shell scripts with no unit tests, which makes it hard to verify correctness or safely contribute improvements without breaking production workflows. The documentation is candid about failure modes ('27 postmortems' suggests hard-won lessons), but candor is not a substitute for test coverage.
Verdict
Use if: You're already invested in Claude Code and running multi-day engineering projects where context preservation across sessions is critical: large refactorings, API migrations across multiple services, or incremental feature development that spans weeks. The campaign persistence and four-tier routing solve real pain for sustained AI-assisted work, and the Fleet tier's parallel execution genuinely accelerates work that can be parallelized. This is battle-tested infrastructure from someone who has debugged the failure modes.

Skip if:
- You're using a different LLM provider or coding assistant (zero portability).
- You're doing quick one-off tasks where setup overhead exceeds the benefit (the .planning/ directory structure and skill configuration require upfront investment).
- You need formal stability guarantees or comprehensive test coverage (this is scripted infrastructure, not a tested library).
- You lack disk space for Git worktree parallelism (each Fleet agent needs a full checkout).

This is opinionated tooling for a specific workflow with Claude Code; that focus is both its strength and its constraint.