How HEARTH Turns GitHub Issues Into a Threat Hunting Knowledge Base
Hook
What if threat hunters could contribute to a shared knowledge base by simply pasting a URL into a GitHub issue, and AI handled the rest?
Context
Threat hunting lives in a documentation desert. Security teams read brilliant CTI reports, mentally note “we should hunt for that,” and then… nothing happens. The hypothesis stays in someone’s head or a Slack thread that’s impossible to find six months later. Even when teams document hunts, they’re scattered across wikis, Google Docs, and proprietary platforms that die when the author leaves.
HEARTH exists because the THOR Collective recognized that threat hunting knowledge needed what open source gave to code: version control, community collaboration, and a single source of truth. But unlike code, threat hunting hypotheses are narrative artifacts—they need structure (MITRE ATT&CK mappings, tactical categorization) without sacrificing readability. The project solves this by treating Markdown files as the database and using AI to handle the tedious parts of hypothesis generation, letting hunters focus on the intellectual work of identifying what to hunt.
Technical Insight
HEARTH’s architecture is deceptively simple: Markdown files are the source of truth, GitHub Actions orchestrate everything, and AI does the heavy lifting. When you submit a CTI report URL via GitHub Issues, a workflow fires that calls Claude’s API (with OpenAI as a fallback) to extract threat intelligence, generate a PEAK-categorized hypothesis, and validate MITRE ATT&CK technique IDs against a 691-technique index.
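The first step of that flow is mundane but essential: the Action has to pull the report URL out of free-form issue text before anything AI-related happens. A minimal sketch, assuming a simple regex-based extraction (the function name and approach are illustrative, not HEARTH’s actual code):

```python
import re

def extract_report_url(issue_body: str):
    """Pull the first http(s) URL out of a GitHub issue body.

    Hypothetical first step of the submission workflow: the Action reads
    the issue text and hands the CTI report URL to the AI extraction step.
    """
    match = re.search(r"https?://[^\s\)>\"']+", issue_body)
    return match.group(0) if match else None

url = extract_report_url("New CTI: https://example.com/apt-report.pdf please hunt")
```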
The duplicate detection system is where things get interesting. The team added SQLite with vector storage, achieving 30-60x speedups over the original approach. The database schema structures hunts with fields for category, title, hypothesis, techniques, embeddings, and file paths. When a new hunt arrives, the system generates an embedding for the hypothesis text, then queries similar entries using cosine similarity. If matches exceed a threshold, the bot flags potential duplicates before creating a pull request. This runs entirely in GitHub Actions—no persistent infrastructure required.
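The duplicate check described above reduces to a small amount of code. A minimal sketch using an in-memory SQLite database and pure-stdlib cosine similarity; the table layout, function names, and 0.9 threshold are illustrative assumptions, not HEARTH’s actual schema:

```python
import math
import sqlite3
import struct

def pack(vec):
    # Serialize a float vector into a BLOB for SQLite storage.
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find_duplicates(db, new_embedding, threshold=0.9):
    """Return (title, similarity) pairs at or above the threshold."""
    rows = db.execute("SELECT title, embedding FROM hunts").fetchall()
    scored = ((t, cosine(new_embedding, unpack(e))) for t, e in rows)
    return [(t, s) for t, s in scored if s >= threshold]

# Usage: seed one hunt, then check a near-identical new embedding.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hunts (title TEXT, embedding BLOB)")
db.execute("INSERT INTO hunts VALUES (?, ?)",
           ("Existing process-injection hunt", pack([1.0, 0.0, 0.0])))
hits = find_duplicates(db, [0.99, 0.05, 0.0])
```

The real embeddings would come from an AI provider and have hundreds of dimensions, but the comparison logic is the same: serialize on insert, deserialize and score on query, flag anything above the threshold before the PR is opened.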
The CTI extraction workflow demonstrates production-grade AI integration. HEARTH supports both Claude and OpenAI APIs with configurable models (defaulting to claude-sonnet-4-5-20250929). The system handles multiple content formats including HTML, PDF, and DOCX, with compression support (Brotli, Zstandard, Gzip) and JavaScript-rendering fallbacks for JS-heavy sites. Configuration is managed via environment variables: AI_PROVIDER, ANTHROPIC_API_KEY, OPENAI_API_KEY, and CLAUDE_MODEL.
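The provider-selection logic implied by those environment variables might look like the following. The variable names and the Claude-first default come from the README; the exact fallback ordering and the OPENAI_MODEL default are assumptions for illustration:

```python
import os

def pick_provider(env=os.environ):
    """Choose an AI backend from configuration, preferring Claude.

    Reads AI_PROVIDER, ANTHROPIC_API_KEY, OPENAI_API_KEY, and
    CLAUDE_MODEL as documented; fallback order is a hypothetical sketch.
    """
    provider = env.get("AI_PROVIDER", "claude").lower()
    if provider == "claude" and env.get("ANTHROPIC_API_KEY"):
        return ("claude", env.get("CLAUDE_MODEL", "claude-sonnet-4-5-20250929"))
    if env.get("OPENAI_API_KEY"):
        # OPENAI_MODEL and its default are illustrative, not from the README.
        return ("openai", env.get("OPENAI_MODEL", "gpt-4o"))
    raise RuntimeError("no usable AI credentials configured")
```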
Each generated hunt follows a strict template with hypothesis statement, PEAK category (Flames for hypothesis-driven with 100+ hunts, Embers for baselining with 17, Alchemy for ML-assisted with 14), MITRE techniques, and data sources. The MITRE validation cross-references technique IDs like T1055.001 against the Enterprise ATT&CK framework, which the README states achieves 99% accuracy.
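Technique validation is essentially a format check plus an index lookup. A minimal sketch with a toy slice of the index standing in for the full 691-entry Enterprise ATT&CK set (the function and set names are illustrative):

```python
import re

# T1234 or T1234.001 — the Enterprise ATT&CK technique/sub-technique format.
TECHNIQUE_RE = re.compile(r"^T\d{4}(?:\.\d{3})?$")

# Toy slice of the index; the real system would load all 691 techniques.
KNOWN_TECHNIQUES = {"T1055", "T1055.001", "T1059", "T1059.001"}

def validate_techniques(ids):
    """Split claimed technique IDs into accepted and rejected lists."""
    accepted, rejected = [], []
    for tid in ids:
        if TECHNIQUE_RE.match(tid) and tid in KNOWN_TECHNIQUES:
            accepted.append(tid)
        else:
            rejected.append(tid)
    return accepted, rejected
```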
The frontend is a static GitHub Pages site built from the Markdown files, providing a searchable, filterable interface. This means the entire system runs on GitHub’s free tier: Actions for automation, Pages for hosting, Issues for collaboration. The only cost is AI API calls. What makes this architecture elegant is its Git-native design. Every hunt is a file. Every contribution is a PR. Forking the entire knowledge base is a single git clone. Organizations can vendor the repository, customize categorization, and contribute improvements upstream—the standard open source flywheel, applied to threat hunting knowledge.
Gotcha
HEARTH’s dependency on external AI APIs is both its superpower and its constraint. Claude and OpenAI costs could become prohibitive as submission volume grows, and if either provider changes pricing or terms of service, the automation pipeline would need adjustment. The README makes no mention of local LLM alternatives, so air-gapped deployments would require modifications.
Quality control still depends on humans. AI generates the drafts, but the README documents a regenerate label for iterating on AI output, a sign that refinement is sometimes needed, and notes that maintainers review submissions before merging. The project currently has 130+ hunts from 29+ contributors.
GitHub Issues as a submission interface lowers barriers for contributors familiar with GitHub workflows, but organizations requiring enterprise features or uncomfortable with public repositories would need to consider self-hosting. The README notes that self-hosted instances require setting up environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, HEARTH_TOKEN) and configuring GitHub Actions workflows, which involves more than simply cloning the repository.
Verdict
Use HEARTH if you’re a security team that needs ready-made threat hunting hypotheses categorized by the PEAK framework (100+ Flames hypothesis-driven hunts, 17 Embers baselining hunts, 14 Alchemy ML-assisted hunts), wants to contribute ideas through GitHub’s collaboration model, or seeks a living knowledge base mapped to MITRE ATT&CK with 99% technique validation accuracy. It’s well-suited for teams comfortable with GitHub workflows and those willing to work within the AI-powered automation that handles CTI extraction and hypothesis generation. The 130+ existing hunts provide immediate value even without contributing. Consider alternatives if you require completely offline operation, need deployments without external AI API dependencies, prefer to avoid GitHub-based workflows, or already have a mature internal hunt documentation system. Also evaluate carefully if your organization requires enterprise access controls beyond GitHub’s offerings or has policies against public collaboration platforms.