APTnotes: The Hacker Historian's Database of Advanced Persistent Threat Intelligence
Hook
Since 2008, security vendors have published thousands of threat intelligence reports on nation-state hacking, and a large share of the public ones are indexed in a single GitHub repository managed by volunteers armed with CSV files and Box cloud storage.
Context
Advanced Persistent Threat groups don't advertise their tactics. The main public window into their operations comes from incident response reports published by security vendors after breach investigations. FireEye documents a Chinese espionage campaign. Kaspersky exposes Russian infrastructure. CrowdStrike details North Korean malware. Each report is valuable, but they're scattered across vendor blogs, PDF archives, and conference presentations, and they often disappear when companies restructure or rebrand.
APTnotes emerged to solve this preservation and discovery problem. Without a centralized index, researchers conducting retrospective analysis of APT campaigns faced an archaeological challenge: manually searching vendor sites, conference proceedings, and dead links. The repository provides a chronological metadata catalog spanning 2008 to present, with standardized fields (title, source, date, SHA-1 hash) and persistent links to the actual reports. It's community-curated threat intelligence infrastructure: unglamorous but foundational. Projects like ThreatMiner consume APTnotes data to power their search interfaces, demonstrating how raw metadata can feed larger threat intelligence tooling.
Technical Insight
APTnotes' architecture is deliberately minimal: a CSV file and a JSON file containing identical structured metadata, paired with reports hosted on Box cloud storage. The simplicity is strategic—version control works naturally with text-based formats, GitHub's web interface renders CSV for human browsing, and JSON enables programmatic consumption. Each record contains seven fields: filename, title, source organization, SHA-1 hash, publication date, year, and Box download URL.
The data structure is flat and intentionally unsophisticated. Here's what a single record looks like in practice (values are illustrative; check the exact key names and casing against the actual file before relying on them):
[
  {
    "filename": "APT1.pdf",
    "title": "APT1: Exposing One of China's Cyber Espionage Units",
    "source": "Mandiant",
    "sha1": "7e1a8f3c9b2d4a6e5c8f1a3b7d9e2f4a6c8e1a3b",
    "date": "2013-02-19",
    "year": 2013,
    "url": "https://app.box.com/s/..."
  }
]
This structure optimizes for append-only operations. Contributors add new reports by inserting rows, and the chronological organization keeps git diffs readable. SHA-1 hashes serve two purposes: integrity verification (confirming that a downloaded PDF matches its catalog entry) and duplicate detection (identifying when the same file has been cataloged twice under different filenames or titles). A file hash cannot tell you that two different reports describe the same campaign; that still requires reading them.
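The hash-based duplicate check can be sketched in a few lines. This is a minimal illustration, assuming records are dictionaries with the lowercase `filename` and `sha1` keys used in the example above; the helper name `find_duplicate_hashes` and the sample records are hypothetical.

```python
from collections import defaultdict


def find_duplicate_hashes(records):
    """Group filenames by SHA-1 and return only hashes seen more than once."""
    by_hash = defaultdict(list)
    for record in records:
        # Normalize case and whitespace so "AA11" and "aa11 " compare equal
        sha1 = (record.get("sha1") or "").strip().lower()
        if sha1:  # records without a hash cannot be compared
            by_hash[sha1].append(record.get("filename", "<unknown>"))
    return {h: names for h, names in by_hash.items() if len(names) > 1}


# Hypothetical sample records, not real catalog entries
sample = [
    {"filename": "report_a.pdf", "sha1": "AA11"},
    {"filename": "report_a_copy.pdf", "sha1": "aa11"},
    {"filename": "report_b.pdf", "sha1": "bb22"},
    {"filename": "report_c.pdf", "sha1": ""},
]
duplicates = find_duplicate_hashes(sample)
print(duplicates)
```

Identical hashes mean identical bytes, so this only catches re-uploads of the same file. Two vendors writing separately about one campaign will always produce different hashes.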
The migration from direct storage to Box hosting was architecturally significant. GitHub repositories have practical size limits, and PDFs accumulate quickly. By externalizing binary assets, APTnotes keeps the version control benefits for metadata while delegating storage scalability to Box. The trade-off is dependency: the entire corpus becomes inaccessible if the Box links break. The separate aptnotes-downloader tool repository provides bulk download scripts. A simplified sketch of the approach (not the tool's actual code) looks like this:
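import hashlib
import json
from pathlib import Path

import requests

# Load the APTnotes JSON catalog
with open('APTnotes.json') as f:
    reports = json.load(f)

out_dir = Path('reports')
out_dir.mkdir(exist_ok=True)

# Download each report and verify its SHA-1 before saving it
for report in reports:
    # Assumes the URL serves the file bytes directly; a Box share link
    # may return an HTML page that has to be resolved to a file URL first.
    response = requests.get(report['url'], timeout=30)
    if not response.ok:
        print(f"Download failed for {report['filename']}: HTTP {response.status_code}")
        continue
    downloaded_hash = hashlib.sha1(response.content).hexdigest()
    expected_hash = (report.get('sha1') or '').lower()
    if downloaded_hash != expected_hash:
        print(f"Hash mismatch for {report['filename']}")
        continue
    (out_dir / report['filename']).write_bytes(response.content)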
This pattern enables researchers to build local archives with verified integrity. You can filter by date range, source organization, or year to create focused datasets. For example, analyzing how reporting evolved during specific geopolitical events:
# Extract reports from the 2016 US election period.
# ISO 8601 date strings (YYYY-MM-DD) compare correctly as plain strings.
keywords = ['russia', 'apt28', 'fancy bear', 'dnc']
election_reports = [
    r for r in reports
    if '2016-06-01' <= (r.get('date') or '') <= '2017-01-31'
    and any(keyword in (r.get('title') or '').lower() for keyword in keywords)
]

print(f"Found {len(election_reports)} election-related reports")
for report in election_reports:
    print(f"{report['date']}: {report['title']} ({report['source']})")
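Filtering by year works the same way. As a rough sketch of how reporting volume might be tallied over time, the snippet below counts records per year. It assumes a `year` key as in the example record; the function name `reports_per_year` and the sample data are made up for illustration.

```python
from collections import Counter


def reports_per_year(records):
    """Count catalog records per year, skipping records without a usable year."""
    counts = Counter()
    for record in records:
        try:
            counts[int(record.get("year"))] += 1
        except (TypeError, ValueError):
            continue  # missing or non-numeric year
    return dict(sorted(counts.items()))


# Hypothetical sample records, not real catalog entries
sample_years = [
    {"year": 2013},
    {"year": "2013"},
    {"year": 2016},
    {"year": None},
    {},
]
yearly = reports_per_year(sample_years)
print(yearly)
```

Coercing with `int()` tolerates years stored as either numbers or strings, which is a reasonable precaution for a hand-maintained catalog.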
The contribution model uses multiple channels: a Twitter hashtag (#aptnotes), GitHub issues with templates, and direct maintainer contact. The standardized issue template requests the same seven fields present in the data schema, reducing friction for contributors. This crowd-sourced curation scales better than single-maintainer models—the community collectively monitors vendor publications and conference presentations.
APTnotes doesn't prescribe how to consume the data. You could build a static site generator that renders reports by APT group, create a Slack bot that notifies channels when new reports appear, or integrate it into SIEM correlation rules. The flat structure and stable schema make it an ideal foundation layer for derivative tools.
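As one example of a derivative tool, the core of a "new report" notifier is just a diff between two snapshots of the catalog. The sketch below assumes `filename` is unique per record, which the source does not guarantee; `new_reports` is a hypothetical helper, and delivering the result to Slack or anywhere else is left out.

```python
def new_reports(old_catalog, new_catalog):
    """Return records in new_catalog whose filename is absent from old_catalog."""
    seen = {record.get("filename") for record in old_catalog}
    return [record for record in new_catalog if record.get("filename") not in seen]


# Hypothetical snapshots, e.g. yesterday's and today's copy of the JSON file
yesterday = [{"filename": "a.pdf", "title": "Report A"}]
today = [
    {"filename": "a.pdf", "title": "Report A"},
    {"filename": "b.pdf", "title": "Report B"},
]
added = new_reports(yesterday, today)
for record in added:
    print(f"New report: {record['title']}")
```

Keying on the SHA-1 instead of the filename would be an alternative, but it would miss records that have no hash.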
Gotcha
The Box dependency is a single point of failure masquerading as a solution. While externalizing storage solved GitHub's size constraints, it introduced availability and longevity risks. Box is a commercial service—if the account gets suspended, billing lapses, or Box changes access policies, thousands of reports vanish despite metadata remaining in version control. There's no redundancy or mirror strategy. Researchers building critical infrastructure on APTnotes should maintain local mirrors using the downloader tools.
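A local mirror is only useful if it is periodically checked against the catalog. The following is a minimal sketch of such an audit, assuming the mirror stores each report under its catalog `filename` and that records use the lowercase `filename` and `sha1` keys from the earlier example. The names `sha1_of_file` and `audit_mirror` are illustrative, not part of any APTnotes tooling.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_mirror(records, mirror_dir):
    """Classify each catalog record as ok, missing, mismatch, or unverifiable."""
    results = {"ok": [], "missing": [], "mismatch": [], "unverifiable": []}
    for record in records:
        name = record.get("filename")
        if not name:
            continue  # nothing to look up without a filename
        path = Path(mirror_dir) / name
        expected = (record.get("sha1") or "").strip().lower()
        if not path.is_file():
            results["missing"].append(name)
        elif not expected:
            results["unverifiable"].append(name)  # no hash in the catalog
        elif sha1_of_file(path) == expected:
            results["ok"].append(name)
        else:
            results["mismatch"].append(name)
    return results
```

Calling `audit_mirror(reports, "reports")` after a bulk download would show which files still need fetching and which ones do not match their catalog entry.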
The metadata quality is inconsistent because curation is manual and volunteer-driven. APT group naming lacks standardization—the same Chinese group appears as "APT1," "Comment Crew," and "Unit 61398" across different reports, but APTnotes doesn't normalize or cross-reference these aliases. Date fields occasionally contain partial information (year only, no specific month). Some reports have SHA-1 hashes, others don't. There's no schema validation enforcing completeness. If you're building automated analysis pipelines, expect to write defensive parsing code that handles missing fields and inconsistent naming conventions. The repository is historical documentation, not a structured threat intelligence feed with guaranteed IOCs or machine-readable TTPs.
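Defensive parsing of this kind might look like the sketch below. It assumes the seven lowercase field names from the example record and ISO-style dates; `EXPECTED_FIELDS`, `normalize_record`, and the added `parsed_date` key are hypothetical, and alias normalization for group names is out of scope here.

```python
from datetime import datetime

EXPECTED_FIELDS = ("filename", "title", "source", "sha1", "date", "year", "url")


def normalize_record(raw):
    """Return a record with every expected key present and values cleaned up."""
    record = {}
    for field in EXPECTED_FIELDS:
        value = raw.get(field)
        if isinstance(value, str):
            value = value.strip() or None  # treat blank strings as missing
        record[field] = value

    # Parse full dates only; partial dates (e.g. year only) stay unparsed
    date_text = str(record["date"]) if record["date"] is not None else ""
    try:
        record["parsed_date"] = datetime.strptime(date_text, "%Y-%m-%d").date()
    except ValueError:
        record["parsed_date"] = None

    # Coerce the year to int, falling back to the first four digits of the date
    try:
        record["year"] = int(record["year"])
    except (TypeError, ValueError):
        head = date_text[:4]
        record["year"] = int(head) if len(head) == 4 and head.isdecimal() else None

    if record["sha1"]:
        record["sha1"] = record["sha1"].lower()
    return record


# Hypothetical messy input: padded filename, year-only date, uppercase hash
cleaned = normalize_record({"filename": " x.pdf ", "date": "2014", "sha1": "ABCD"})
print(cleaned)
```

Every downstream step can then rely on all seven keys existing, and can check `parsed_date is None` instead of re-parsing strings.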
Verdict
Use APTnotes if you're conducting historical research on APT campaigns, building threat intelligence aggregation tools, training machine learning models on security reports, or need a chronological catalog of vendor publications. It's perfect for academic research, retrospective incident analysis, and understanding how threat actor TTPs evolved over years. The community maintenance and public accessibility make it ideal for open-source security projects. Skip it if you need real-time threat intelligence (reports are curated days or weeks after publication), require guaranteed document availability without building local mirrors, need structured IOCs or STIX/TAXII-formatted data, or want normalized APT group taxonomies. This is a discovery and preservation layer, not an operational threat feed—treat it as the library catalog, not the detection engine.