Building a Machine Learning Pipeline to Map Threat Reports to MITRE ATT&CK

Hook

Security analysts spend 40% of their time manually tagging threat reports with MITRE ATT&CK tactics and techniques, a task that looks perfect for automation but is surprisingly difficult to get right.

Context

The MITRE ATT&CK framework has become the lingua franca of cybersecurity, providing a common taxonomy for describing adversarial behaviors. Security operations centers, threat intelligence teams, and red teams all speak ATT&CK. But there's a translation problem: most threat intelligence arrives as unstructured text—PDF reports from vendors, blog posts about APT campaigns, incident response writeups. Converting these narratives into structured ATT&CK mappings is tedious, error-prone, and doesn't scale.

rcATT (Report Classification for ATT&CK) emerged from this friction point as a Master's thesis project by vlegoy. It tackles the core challenge: can we train a machine learning model to read a threat report and automatically predict which ATT&CK tactics and techniques it describes? The project takes a pragmatic approach, using classical NLP and scikit-learn rather than heavyweight deep learning, making it accessible for security teams without GPU clusters. More importantly, it implements an active learning loop where analysts can correct the model's mistakes and immediately retrain—acknowledging that perfect prediction is impossible but continuous improvement is achievable.

Technical Insight

rcATT's architecture centers on multi-label classification, which is crucial because threat reports rarely describe a single technique. A typical APT report might discuss initial access via spear-phishing (T1566), credential dumping (T1003), lateral movement with PsExec (T1570), and data exfiltration (T1041) all in one document. The system needs to tag all of them, not pick the "best" match.
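To make the multi-label idea concrete, here is a minimal sketch of how such targets are usually represented: one row per report, one 0/1 column per technique. The three toy reports and their labels are hypothetical, and the code uses plain Python rather than anything from rcATT itself; scikit-learn's MultiLabelBinarizer produces the same kind of matrix.

```python
# Hypothetical toy labels: each report carries a *set* of ATT&CK technique IDs.
report_labels = [
    {"T1566", "T1003"},           # phishing + credential dumping
    {"T1003", "T1041"},           # credential dumping + exfiltration
    {"T1566", "T1003", "T1041"},  # all three in one report
]

# Fix a column order: one column per technique seen anywhere in the corpus.
techniques = sorted(set().union(*report_labels))

# Row i, column j is 1 when report i describes technique j, else 0.
label_matrix = [
    [1 if tech in labels else 0 for tech in techniques]
    for labels in report_labels
]

print(techniques)
for row in label_matrix:
    print(row)
```

A single-label classifier would have to collapse each row to one column; a multi-label classifier predicts the whole row.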

The pipeline starts with classical NLP preprocessing using NLTK. Text gets tokenized, lowercased, stripped of stopwords, and lemmatized, so that inflected forms such as "exploited" and "exploiting" collapse toward a common base form. This normalization is critical because security reports use varied language to describe the same techniques. The processed text then flows through a TF-IDF vectorizer that converts documents into numerical feature vectors, weighting each term by how often it appears in a document and how rare it is across the corpus.
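The preprocessing and weighting steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not rcATT's code: the stopword set is a tiny stand-in for NLTK's list, lemmatization is left out, and the weighting follows the smoothed, sublinear TF-IDF formula documented for scikit-learn's TfidfVectorizer (idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization). The three documents are made up.

```python
import math
import re
from collections import Counter

# Tiny stand-in stopword list (an assumption); NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "to", "and", "of", "was", "were", "is", "in", "by", "with"}

def preprocess(text):
    """Lowercase, tokenize on letters/digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        weights = {}
        for term, count in Counter(tokens).items():
            tf = 1 + math.log(count)  # sublinear term frequency
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed idf
            weights[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = [
    "The attacker dumped credentials with Mimikatz.",
    "The attacker sent a phishing email to the victim.",
    "Credentials were exfiltrated by the attacker.",
]
vectors = tfidf(docs)

# Terms of the first document, most distinctive first.
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.3f}")
```

In the first document, "mimikatz" (found in one report) outweighs "credentials" (two reports), which outweighs "attacker" (all three). That ordering is the point of TF-IDF: terms that appear everywhere carry little signal.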

For classification, rcATT implements a One-vs-Rest strategy wrapping a LinearSVC (Support Vector Classifier). Here's the core training pattern:

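from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Build the classification pipeline
classifier = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,     # keep the 5,000 most frequent terms
        ngram_range=(1, 3),    # unigrams, bigrams, and trigrams
        min_df=2,              # drop terms seen in fewer than 2 reports
        sublinear_tf=True      # use 1 + log(tf) instead of raw counts
    )),
    ('clf', OneVsRestClassifier(
        LinearSVC(class_weight='balanced')
    ))
])

# preprocessed_reports: list of cleaned report strings
# attack_labels: list of label lists, e.g. [["T1003", "T1566"], ...]
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(attack_labels)  # one 0/1 column per technique

# Train on labeled reports
classifier.fit(preprocessed_reports, label_matrix)

# Predict: the pipeline expects a list of documents, even for a single report
predictions = classifier.predict([new_report])
predicted_ids = mlb.inverse_transform(predictions)

# Signed distances to each technique's decision boundary
# (usable as confidence scores, but not calibrated probabilities)
confidence_scores = classifier.decision_function([new_report])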

The One-vs-Rest approach trains a separate binary classifier for each ATT&CK technique, so the model learns "Does this report describe T1003?" independently from "Does it describe T1566?" This lets the system predict several techniques for the same report. The class_weight='balanced' parameter matters because each of these binary problems is skewed: for any one technique, the reports that mention it are outnumbered by the reports that don't, and the skew is far worse for rare techniques (like bootkit attacks) than for common ones (like credential dumping). Without reweighting, the classifiers for rare-but-important techniques would tend to predict "absent" for every report.
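The effect of balancing is easy to see with a little arithmetic. scikit-learn documents the 'balanced' mode as n_samples / (n_classes * class_count); the sketch below applies that formula to one binary "technique present / absent" problem. The report counts are hypothetical.

```python
def balanced_class_weights(n_positive, n_total):
    """Weights for a binary problem using n_samples / (n_classes * class_count)."""
    n_negative = n_total - n_positive
    n_classes = 2
    return {
        "positive": n_total / (n_classes * n_positive),
        "negative": n_total / (n_classes * n_negative),
    }

# Hypothetical counts: 100 labeled reports, two techniques of very different frequency.
common = balanced_class_weights(n_positive=40, n_total=100)
rare = balanced_class_weights(n_positive=5, n_total=100)

print(common)  # positive weight 1.25, negative weight about 0.83
print(rare)    # positive weight 10.0, negative weight about 0.53
```

For the rare technique, a misclassified positive example costs roughly 19 times as much as a misclassified negative one, which is what keeps its classifier from defaulting to "absent".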

What makes rcATT particularly interesting is its active learning implementation. After prediction, analysts can review results through either a CLI or Flask web interface, correct mistakes, and trigger retraining. The corrected examples get appended to the training dataset, and the model rebuilds itself. This feedback loop is critical in cybersecurity where the threat landscape evolves constantly—new techniques get added to ATT&CK, attackers change tradecraft, and terminology shifts. A static model trained once will decay in accuracy over time.
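The correct-and-retrain cycle can be sketched as follows. This is a simplified illustration and not rcATT's own retraining code: the helper name train, the toy corpus, the fixed three-technique label space, and the analyst's corrected labels are all assumptions, and the model is rebuilt from scratch on each pass.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Fixed label space so retraining never changes the column order.
TECHNIQUES = ["T1003", "T1041", "T1566"]

def train(texts, label_sets):
    """Fit a fresh TF-IDF + One-vs-Rest LinearSVC model on the full dataset."""
    mlb = MultiLabelBinarizer(classes=TECHNIQUES)
    y = mlb.fit_transform(label_sets)
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", OneVsRestClassifier(
            LinearSVC(class_weight="balanced", random_state=0)
        )),
    ])
    model.fit(texts, y)
    return model, mlb

# Hypothetical toy corpus (real training sets need far more reports).
texts = [
    "spear phishing email with malicious attachment",
    "mimikatz used to dump lsass credentials",
    "stolen data exfiltrated over the c2 channel",
    "phishing email led to credential dumping with mimikatz",
]
label_sets = [["T1566"], ["T1003"], ["T1041"], ["T1566", "T1003"]]

model, mlb = train(texts, label_sets)

# 1. The model predicts labels for a new report.
new_report = "attackers archived files and exfiltrated them after dumping credentials"
predicted = mlb.inverse_transform(model.predict([new_report]))[0]
print("model says:", predicted)

# 2. An analyst reviews the prediction and supplies the correct labels
#    (assumed here for illustration).
corrected = ["T1003", "T1041"]

# 3. The corrected example joins the training set and the model is rebuilt.
texts.append(new_report)
label_sets.append(corrected)
model, mlb = train(texts, label_sets)
```

Retraining from scratch is the simplest design and is affordable for a linear model on a corpus of this kind; it also means every correction is weighted the same as the original training data.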

The system also exports results to STIX 2.0 (Structured Threat Information Expression), the standard format for sharing cyber threat intelligence. This means predictions can flow directly into threat intelligence platforms, SIEM correlation rules, or security orchestration workflows. A typical export maps the original report to a STIX Report object, creates AttackPattern objects for each predicted technique, and links them with Relationship objects—giving you machine-readable threat intelligence from unstructured text.
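As a rough picture of what such an export contains, the sketch below assembles a STIX 2.0-style bundle from plain dictionaries using only the standard library. It is not rcATT's export code and does not use the stix2 package; the function names, the report title, and the "threat-report" label are assumptions. For brevity the report points at its techniques through object_refs, and Relationship objects are left out.

```python
import json
import uuid
from datetime import datetime, timezone

def stix_id(object_type):
    """STIX identifiers have the form '<type>--<UUID>'."""
    return f"{object_type}--{uuid.uuid4()}"

def build_bundle(report_name, predicted_techniques):
    """Build a STIX 2.0-style bundle as plain dicts.

    predicted_techniques maps ATT&CK IDs to human-readable names.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    patterns = [
        {
            "type": "attack-pattern",
            "id": stix_id("attack-pattern"),
            "created": now,
            "modified": now,
            "name": name,
            "external_references": [
                {"source_name": "mitre-attack", "external_id": technique_id}
            ],
        }
        for technique_id, name in predicted_techniques.items()
    ]
    report = {
        "type": "report",
        "id": stix_id("report"),
        "created": now,
        "modified": now,
        "name": report_name,
        "labels": ["threat-report"],
        "published": now,
        # The report references every predicted technique by ID.
        "object_refs": [pattern["id"] for pattern in patterns],
    }
    return {
        "type": "bundle",
        "id": stix_id("bundle"),
        "spec_version": "2.0",
        "objects": [report] + patterns,
    }

bundle = build_bundle(
    "Example APT report",  # hypothetical report title
    {"T1003": "OS Credential Dumping", "T1041": "Exfiltration Over C2 Channel"},
)
print(json.dumps(bundle, indent=2))
```

Anything that consumes STIX JSON can then match on the external_id values to recover the ATT&CK techniques.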

One architectural choice worth noting: rcATT operates at the report level, not the sentence level. It doesn't try to identify which specific paragraph mentions credential dumping; it tags the entire document. This is simpler to implement and train, but means you lose granular attribution. For many use cases—building a searchable CTI database, triggering defensive playbooks, enriching SIEM alerts—document-level classification is sufficient. If you need sentence-level extraction ("Show me exactly where the report discusses T1003"), you'd need a different approach, likely named entity recognition or sequence labeling.

Gotcha

The biggest limitation is that rcATT hasn't been updated since 2019-2020, evident from its dependencies on older versions of scikit-learn, Flask, and NLTK. The MITRE ATT&CK framework has grown significantly since then—new techniques, sub-techniques, and entire tactics have been added. Training data from that era won't cover recent developments like techniques related to cloud environments, containerization, or modern SaaS attacks. You'll need to build or source an updated training dataset yourself.

Model performance depends entirely on training data quality and coverage. If your training set contains mostly malware-focused reports, the model will struggle with reports about web application attacks or social engineering campaigns. The included dataset is modest, and you'll quickly hit accuracy ceilings without substantial labeled data—which is expensive to create since it requires security analyst time to manually tag reports. The classical ML approach (TF-IDF + LinearSVC) also means the model can't understand context or semantics the way transformer models can. It's looking for keyword patterns, so creative or obfuscated language in reports will confuse it. A report describing "moving laterally through the network using administrative tools" might not get tagged if the training data used different phrasing.
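The phrasing problem is easy to demonstrate. A bag-of-words model only sees tokens, so two descriptions of the same behavior can share no features at all. In the minimal sketch below, both phrases are hypothetical and the tokenizer is a simple regex rather than rcATT's preprocessing.

```python
import re

def tokens(text):
    """Lowercased word tokens, the unit a bag-of-words model sees."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

training_phrase = "lateral movement with PsExec over SMB"
report_phrase = "moving laterally through the network using administrative tools"

shared = tokens(training_phrase) & tokens(report_phrase)
print(shared)  # set(): no token in common, so no TF-IDF feature overlaps
```

Unless normalization happens to map "laterally" and "lateral" to the same form, the classifier has nothing to connect the two sentences, even though an analyst would read them as the same tactic.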

The Flask web interface is bare-bones and not production-hardened. There's no authentication, no proper database (it uses file-based storage), no containerization, and no scalability considerations. It's a prototype suitable for local experimentation, not for deploying as a team service. You'd need significant engineering work to make this production-ready.

Verdict

Use if: You're building a custom CTI pipeline and want a solid foundation for ATT&CK classification that you can extend and modernize; you're researching active learning approaches for cybersecurity ML; you need to understand the fundamentals of multi-label text classification before jumping to transformers; or you're prototyping automation for a security operations workflow and want something lightweight you can run without GPU infrastructure.

Skip if: You need production-ready software with ongoing maintenance and security patches; you want state-of-the-art accuracy and are willing to invest in transformer-based models; you need pre-trained models ready for immediate deployment; or you're looking for a turnkey solution rather than a development starting point.

This is an excellent educational resource and proof-of-concept, but treat it as a reference implementation, not a finished product.
