Dyana: Sandboxing Untrusted ML Models and Executables with eBPF Telemetry

Hook

Loading a pickle file can execute arbitrary Python code. Running an unknown ML model could exfiltrate your training data. Yet teams do both constantly, hoping for the best.

Context

The machine learning supply chain has become a minefield. Models shared on Hugging Face, pickle files from colleagues, pre-trained weights from GitHub: all are potential attack vectors. Unlike traditional software, where source review is standard practice, ML practitioners routinely execute serialized artifacts containing arbitrary code. The pickle format, Python's default serialization mechanism, is particularly notorious: an object's __reduce__ method can name any callable for the unpickler to invoke, so loading a file can spawn shells, delete files, or exfiltrate credentials before your model even loads.
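The __reduce__ mechanism can be demonstrated with the standard library alone. This is a minimal, deliberately harmless sketch: the `Payload` class is invented for illustration, and its "attack" is just arithmetic passed to `eval`, where a real payload would name something like `os.system`. It also shows that `pickletools` can list a pickle's opcodes without executing it.

```python
import pickle
import pickletools


class Payload:
    """Stand-in for a malicious object; the payload here is harmless arithmetic."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A real attack would name os.system or similar instead.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Static inspection: walk the opcodes without running anything.
opcodes = [op.name for op, _arg, _pos in pickletools.genops(blob)]

# Loading invokes the callable; the result is its return value, not a Payload.
result = pickle.loads(blob)

print(result)               # 42
print("REDUCE" in opcodes)  # True
```

A REDUCE opcode is not proof of malice, since ordinary objects use it too, but the callable it targets is what a static scanner would want to inspect before anything is loaded.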

Historically, analyzing untrusted executables required specialized malware sandboxes like Cuckoo, designed primarily for Windows PE files and x86 binaries. Meanwhile, ML security remained an afterthought. Dyana bridges this gap by providing a unified sandbox that treats ML models, pickle files, JavaScript, ELF binaries, and other formats as first-class citizens. Built on Docker for isolation and Tracee for kernel-level telemetry, it captures what matters: filesystem touches, network connections, GPU memory consumption, and security-relevant syscalls, all without modifying the code being analyzed.

Technical Insight

Dyana's architecture revolves around specialized loaders executing inside ephemeral Docker containers, instrumented by Aqua Security's Tracee eBPF tracer. When you submit a file, Dyana selects the appropriate loader (PyTorch, TensorFlow, pickle, ELF, etc.), spins up a container with necessary dependencies, attaches Tracee for syscall monitoring, and executes the payload while collecting telemetry.
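To make the "ephemeral container" step concrete, here is a rough sketch of the kind of `docker run` invocation such a sandbox could assemble. It is not Dyana's actual command line: the image name, the `/loader.py` entry point, and the `/sample` mount point are all hypothetical, and attaching Tracee is left out. Only the Docker flags themselves are real.

```python
import shlex


def build_docker_command(artifact_path, image, gpu=False, timeout_s=60):
    """Return an argv list for a throwaway, network-isolated container."""
    cmd = [
        "docker", "run",
        "--rm",                # delete the container when it exits
        "--network", "none",   # no network access beyond loopback
        "--read-only",         # root filesystem mounted read-only
        "--volume", f"{artifact_path}:/sample:ro",  # artifact is read-only too
    ]
    if gpu:
        cmd += ["--gpus", "all"]  # needs the NVIDIA container toolkit on the host
    # Hypothetical entry point; coreutils `timeout` bounds the run inside the image.
    cmd += [image, "timeout", str(timeout_s), "python", "/loader.py", "/sample"]
    return cmd


cmd = build_docker_command(
    "/tmp/suspicious_model.pkl", "example/loader-image:latest", gpu=True, timeout_s=30
)
print(shlex.join(cmd))
```

Building the command as a list rather than a shell string keeps an attacker-controlled file name from being interpreted by a shell when the list is passed to `subprocess.run`.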

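The core abstraction is simple. Here's a sketch of what analyzing a suspicious pickle file looks like; treat the class and attribute names as illustrative and check the project's README for the current interface: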

from dyana import Sandbox

# Create sandbox with GPU access and network isolation
sandbox = Sandbox(
    enable_gpu=True,
    network_mode="none",
    timeout=60
)

# Run the pickle file and collect telemetry
result = sandbox.run(
    "suspicious_model.pkl",
    loader="pickle"
)

# Examine what it tried to do
print(f"Files accessed: {result.files_touched}")
print(f"Network attempts: {result.network_events}")
print(f"GPU memory peak: {result.gpu_memory_mb}")
print(f"Security events: {result.security_alerts}")

Under the hood, Tracee's eBPF probes hook into kernel tracepoints and kprobes, capturing events at the syscall boundary. This avoids much of the performance overhead of ptrace-based tools like strace and is harder for the traced program to detect. More importantly, the probes run in kernel space, so code confined to the container cannot easily tamper with the instrumentation itself; a kernel exploit or an over-privileged container would weaken that guarantee. Tracee captures events like openat, connect, and execve, along with security-sensitive operations such as ptrace calls or kernel module loading attempts.

The GPU profiling capability is particularly noteworthy for ML use cases. Dyana wraps NVIDIA's NVML library to sample GPU memory consumption throughout execution. This reveals models that unexpectedly allocate massive amounts of VRAM, which may indicate a concealed secondary model used for data exfiltration, or simply a model that will OOM your production servers. Combined with filesystem telemetry, you can correlate GPU allocation spikes with file writes, catching models that render watermarked images or dump embeddings to disk.
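The sampling idea can be sketched as a background poller that records the peak reading while a workload runs. This is an assumption about the general technique, not Dyana's implementation: `peak_memory_during` and `nvml_used_bytes` are names made up here, and only the `pynvml` calls inside the reader are real NVML bindings. The reader is passed in as a function so the sampler can be exercised without a GPU.

```python
import threading


def nvml_used_bytes(index=0):
    """Read used memory on one GPU via pynvml (requires an NVIDIA driver)."""
    import pynvml  # imported lazily so the sampler itself has no GPU dependency

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).used
    finally:
        pynvml.nvmlShutdown()


def peak_memory_during(workload, read_used_bytes, interval_s=0.1):
    """Run workload() while polling read_used_bytes(); return (result, peak_bytes)."""
    peak = read_used_bytes()  # baseline sample before the workload starts
    done = threading.Event()

    def poll():
        nonlocal peak
        # Event.wait returns False on timeout, True once done is set.
        while not done.wait(interval_s):
            peak = max(peak, read_used_bytes())

    sampler = threading.Thread(target=poll, daemon=True)
    sampler.start()
    try:
        result = workload()
    finally:
        done.set()
        sampler.join()
    peak = max(peak, read_used_bytes())  # final sample after the workload ends
    return result, peak
```

On a GPU host this would be called as `peak_memory_during(lambda: load_model(path), nvml_used_bytes)`, where `load_model` is whatever the loader does. Polling can miss allocations shorter than the interval, so the reported peak is a lower bound.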

Dyana's loader system is extensible. Each loader is a Python class implementing a standard interface: prepare() to set up the environment, execute() to run the payload, and cleanup() for teardown. The PyTorch loader, for instance, calls torch.load() with weights_only=True where possible, then optionally runs inference with dummy inputs to trigger execution graphs. The ELF loader uses subprocess.run() with resource limits. Adding support for new formats means implementing this interface and registering the loader—no changes to the core sandbox engine.
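The prepare/execute/cleanup contract described above might look something like the following. This is a hypothetical sketch, not code from the Dyana repository: the `Loader` base class, the `LOADERS` registry, the `register` decorator, and the toy `TextLoader` are all invented to show the shape of the pattern.

```python
from abc import ABC, abstractmethod

LOADERS = {}  # hypothetical registry: loader name -> loader class


def register(name):
    """Class decorator that adds a loader to the registry under `name`."""
    def decorator(cls):
        LOADERS[name] = cls
        return cls
    return decorator


class Loader(ABC):
    def __init__(self, path):
        self.path = path

    @abstractmethod
    def prepare(self):
        """Set up whatever the payload needs before it runs."""

    @abstractmethod
    def execute(self):
        """Run the payload and return a dict of loader-specific results."""

    @abstractmethod
    def cleanup(self):
        """Release resources acquired in prepare()."""


@register("text")
class TextLoader(Loader):
    """Toy loader that only reads a file; a real one would detonate the artifact."""

    def prepare(self):
        self.handle = open(self.path, "rb")

    def execute(self):
        return {"bytes_read": len(self.handle.read())}

    def cleanup(self):
        self.handle.close()


def run_loader(name, path):
    loader = LOADERS[name](path)
    loader.prepare()
    try:
        return loader.execute()
    finally:
        loader.cleanup()  # teardown runs even if execute() raises
```

With a registry like this, supporting a new format is a matter of adding one decorated class; `run_loader` never needs to know which formats exist.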

The telemetry output is structured JSON containing arrays of events with timestamps, process trees, and contextual data. This enables downstream analysis: piping results to SIEM systems, training anomaly detection models on execution patterns, or building automated gatekeeping for ML artifact registries. The event stream includes not just what happened, but process ancestry, container metadata, and correlated GPU metrics—context that transforms raw syscalls into actionable intelligence.
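As an illustration of that downstream analysis, the sketch below condenses a stream of JSON events into a short report. The event lines and their field names (`eventName`, `processName`, `args`) are assumptions made for this example and should not be read as Dyana's or Tracee's documented schema; the same approach applies once the field names are adjusted to the real output.

```python
import json
from collections import Counter

# Hypothetical events, one JSON object per line.
SAMPLE = """\
{"timestamp": 1, "eventName": "openat", "processName": "python", "args": {"pathname": "/sample/model.pkl"}}
{"timestamp": 2, "eventName": "openat", "processName": "python", "args": {"pathname": "/etc/passwd"}}
{"timestamp": 3, "eventName": "connect", "processName": "sh", "args": {"addr": "203.0.113.7:443"}}
{"timestamp": 4, "eventName": "execve", "processName": "sh", "args": {"pathname": "/bin/sh"}}
"""


def summarize(lines, allowed_prefixes=("/sample/",)):
    """Count events and collect file, network, and exec activity worth a look."""
    counts = Counter()
    unexpected_files, connections, executions = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        name, args = event["eventName"], event.get("args", {})
        counts[name] += 1
        if name == "openat" and not args["pathname"].startswith(allowed_prefixes):
            unexpected_files.append(args["pathname"])
        elif name == "connect":
            connections.append(args["addr"])
        elif name == "execve":
            executions.append(args["pathname"])
    return {
        "counts": dict(counts),
        "unexpected_files": unexpected_files,
        "connections": connections,
        "executions": executions,
    }


summary = summarize(SAMPLE.splitlines())
print(json.dumps(summary, indent=2))
```

A gate in front of a model registry could reject any artifact whose summary has a non-empty `connections` or `executions` list, since loading weights should need neither.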

Gotcha

Dyana's Linux-only, Docker-dependent design is a showstopper for many environments. You need a host with Docker, kernel 4.18+ for eBPF support, and sufficient privileges to run containers with required capabilities. Windows and macOS are non-starters unless you're willing to run a Linux VM, which adds another isolation layer and complicates GPU passthrough. Teams on locked-down corporate laptops or restricted cloud environments may find the infrastructure requirements prohibitive.
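Those prerequisites are easy to check before attempting a run. The following preflight sketch uses only the standard library; the function names are made up here, and the 4.18 threshold is taken from the paragraph above rather than from Tracee's documentation, so confirm it against the version you install.

```python
import platform
import re
import shutil


def kernel_at_least(release, minimum=(4, 18)):
    """Compare the major.minor prefix of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def preflight():
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if platform.system() != "Linux":
        problems.append(f"unsupported OS: {platform.system()}")
    elif not kernel_at_least(platform.release()):
        problems.append(f"kernel too old for eBPF tracing: {platform.release()}")
    if shutil.which("docker") is None:
        problems.append("docker not found on PATH")
    return problems


for problem in preflight():
    print("preflight:", problem)
```

This only confirms that the binaries and kernel version are present. It does not verify that the current user may start containers with the capabilities the tracer needs.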

Performance overhead is non-trivial. Containerization adds seconds of startup latency, and eBPF tracing, while lighter than ptrace, still incurs costs, particularly for I/O-heavy workloads that generate millions of syscall events. For large ML models with multi-minute load times, this overhead is negligible. For quick scripts or models under 100MB, expect the relative cost to be far more visible, roughly 20-30%. The system also struggles with long-running analyses: Tracee generates substantial data volumes that must be buffered and parsed, potentially causing memory pressure on the host.

There's no built-in support for distributed tracing or horizontal scaling, so analyzing large model repositories means writing your own orchestration layer. Documentation, while improving, remains sparse on advanced use cases like custom loaders or integrating Tracee signatures for specific threat detection.

Verdict

Use if: You're building ML supply chain security tooling, need to vet user-uploaded models in a platform, operate a malware analysis pipeline that needs to handle diverse file types, or run security research requiring detailed execution telemetry with GPU metrics. Dyana excels when isolation and observability matter more than raw performance: think CI/CD gates for model registries, threat intel collection, or incident response workflows where you must safely detonate suspicious artifacts.

Skip if: You need Windows/macOS support, lack Docker infrastructure or eBPF-capable kernels, require near-native performance, are analyzing only trusted code, or want a batteries-included malware analysis suite with GUI and reporting (Cuckoo is better there). For simple Python sandboxing without ML focus, Firejail or Bubblewrap offer lighter-weight alternatives. Dyana is purpose-built for the intersection of ML and security; if that's not your domain, the complexity isn't justified.