Sysdig Inspect: Forensic Analysis of Container Workloads Through System Call Archaeology
Hook
Most observability tools show you what your application thinks happened. Sysdig Inspect shows you what the kernel actually witnessed—every byte read, every connection opened, every file touched—even if your application never logged it.
Context
Traditional application monitoring relies on what developers remember to instrument: logs, metrics, traces. But when a container crashes without logging, when network requests mysteriously fail, or when a security breach leaves minimal artifacts, you're left reconstructing events from fragmentary evidence. Container orchestration amplifies this problem—ephemeral pods disappear, taking their runtime state with them.
Sysdig Inspect emerged from this observability gap. Rather than asking applications to self-report, it analyzes captures from the sysdig CLI tool, which records system calls at the kernel level using eBPF or kernel modules. Think of it as a flight recorder for Linux systems: it preserves everything the kernel sees in binary .scap files, then provides an Electron-based UI to make sense of that firehose. This matters especially for containers, where traditional debugging tools struggle with namespace isolation, overlay filesystems, and the sheer density of microservice interactions.
Technical Insight
Sysdig Inspect's architecture revolves around transforming low-level system call telemetry into navigable layers of abstraction. The tool doesn't connect to live systems—instead, it operates on pre-recorded .scap files generated by the sysdig CLI. This separation of capture from analysis is deliberate: kernel-level tracing is expensive, so you capture during the problem window, then analyze offline without impacting production.
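To make the capture-then-analyze split concrete, here is a minimal sketch of a helper that assembles a time-bounded sysdig capture command. The helper name, file name, and container name are illustrative assumptions; the flags used are sysdig's `-w` (write to file), `-M` (stop after N seconds), and `-s` (snaplen), with the filter passed as the final argument.

```python
import shlex

def build_capture_command(output_path, duration_s, snaplen=None, capture_filter=None):
    """Return an argv list for a time-bounded sysdig capture (hypothetical helper)."""
    argv = ["sysdig", "-w", output_path, "-M", str(duration_s)]
    if snaplen is not None:
        # -s limits how many bytes of each I/O buffer are recorded
        argv += ["-s", str(snaplen)]
    if capture_filter:
        # a sysdig filter expression goes last on the command line
        argv.append(capture_filter)
    return argv

# Assumed values for illustration only.
cmd = build_capture_command(
    "incident.scap", 60, snaplen=512, capture_filter="container.name=api-server"
)
print(shlex.join(cmd))
```

The resulting file is what you would later open in Sysdig Inspect, on a different machine if needed.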
The UI presents three progressively detailed views. Overview tiles show aggregate metrics across the capture period—CPU by container, network I/O by process, file operations by path. These tiles use microtrend sparklines that reveal sub-second patterns invisible to traditional monitoring. A CPU spike lasting 200 milliseconds might seem insignificant in a one-minute aggregation but could indicate a tight loop or lock contention.
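The arithmetic behind that claim is easy to check. The sketch below uses made-up numbers (a 5% idle baseline and a 100 ms sampling interval are assumptions) to show how a 200 ms burst at full CPU nearly vanishes in a one-minute average.

```python
# 60 seconds of CPU utilization sampled every 100 ms gives 600 samples.
samples = [5.0] * 600      # assumed idle baseline of 5% CPU
samples[300] = 100.0       # a 200 ms burst at 100% CPU
samples[301] = 100.0       # (two consecutive 100 ms samples)

one_minute_avg = sum(samples) / len(samples)
peak = max(samples)

print(f"one-minute average: {one_minute_avg:.2f}%")
print(f"sub-second peak:    {peak:.0f}%")
```

The average moves from 5% to roughly 5.3%, which no dashboard would flag, while the sub-second view shows a clear spike to 100%.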
Clicking a tile drills into time series views where you select specific time ranges and filter by containers, processes, or file descriptors. Here's where container-native features shine. A typical workflow:
// Example: Analyzing mysterious network latency in a Pod
// 1. Filter to specific Kubernetes namespace/pod
filter: k8s.ns.name = "production" and k8s.pod.name contains "api-server"
// 2. Look at network I/O by connection
// Each row shows: remote endpoint, bytes in/out, latency distribution
// Notice one connection to 10.0.45.23:5432 has P99 latency of 850ms
// 3. Drill to system calls for that file descriptor
// filter: fd.name contains "10.0.45.23:5432"  (for sockets, fd.name holds the full client->server tuple, so match with contains)
// View shows: connect() took 2ms, but read() calls stall for 800ms+
// 4. Inspect actual payload bytes
// See the SQL query: "SELECT * FROM users WHERE..."
// followed by 847ms gap before response
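Step 3 boils down to pairing each read() enter event with its exit event and measuring the gap. A minimal sketch of that pairing follows, assuming events have already been exported into simple records; the `Event` type, its fields, and the sample client address are hypothetical, and the logic assumes one thread per file descriptor.

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: float        # seconds since capture start
    fd_name: str     # connection tuple, client->server
    syscall: str     # e.g. "read" or "write"
    direction: str   # ">" for enter, "<" for exit

def read_stalls(events, fd_substring, threshold_s):
    """Return (enter_ts, duration) for read() calls that blocked at least threshold_s."""
    pending = {}   # fd_name -> timestamp of the unmatched read enter event
    stalls = []
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.syscall != "read" or fd_substring not in ev.fd_name:
            continue
        if ev.direction == ">":
            pending[ev.fd_name] = ev.ts
        elif ev.direction == "<" and ev.fd_name in pending:
            start = pending.pop(ev.fd_name)
            duration = ev.ts - start
            if duration >= threshold_s:
                stalls.append((start, duration))
    return stalls

FD = "10.0.1.7:51234->10.0.45.23:5432"   # assumed client address
events = [
    Event(0.0000, FD, "write", ">"),
    Event(0.0001, FD, "write", "<"),
    Event(0.0002, FD, "read", ">"),
    Event(0.8472, FD, "read", "<"),
]
stalls = read_stalls(events, "10.0.45.23:5432", threshold_s=0.5)
```

Here the single read() blocks for about 847 ms, which points at the database rather than at connection setup.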
The deepest layer is the system call stream, a chronological list of every syscall for filtered processes. This is where forensic analysis happens. Each syscall row shows timestamp (microsecond precision), process/thread, call name, arguments, return value, and duration. For I/O syscalls like read(), write(), send(), recv(), you can view the actual payload bytes transmitted—both as hex dump and ASCII interpretation.
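For readers unfamiliar with the dual hex/ASCII presentation, the following is a small generic hex dump routine. It is not Sysdig Inspect's rendering code, only a sketch of the same idea: each row shows an offset, the bytes in hex, and the printable characters with everything else replaced by a dot.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex columns, and an ASCII column."""
    pad = width * 3 - 1   # width of a full hex column, including spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {ascii_part}")
    return "\n".join(lines)

dump = hexdump(b"GET / HTTP/1.1\r\n")
print(dump)
```

The carriage return and line feed at the end of the request line show up as `0d 0a` in the hex column and as two dots in the ASCII column.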
This payload capture is Sysdig Inspect's killer feature for security investigation. Suppose you're investigating a suspected data exfiltration. You load a capture from the suspicious timeframe, filter to the container in question, and examine network I/O. The tool shows every outbound connection, with payload visibility up to the snaplen that was set when the capture was recorded. You might discover:
# System call view reveals:
[09:23:41.384729] app-pod:443 > connect(fd=47, addr=185.220.101.15:443) = 0
[09:23:41.422103] app-pod:443 > write(fd=47, size=245)
Payload: "POST /upload HTTP/1.1\r\nHost: suspicious-domain.com\r\n..."
[09:23:41.422891] app-pod:443 > write(fd=47, size=4096)
Payload: <binary data containing customer records>
No application log captured this: the attacker used a compromised library that opened its own sockets directly, bypassing the application's logging entirely. But the kernel saw everything.
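Once connect() events are exported from a capture, triaging them can be partly automated. The sketch below flags destinations outside an allow-list of internal networks; the record layout, the process name, and the `10.0.0.0/8` range are assumptions for illustration, and the parsing handles IPv4 `host:port` strings only.

```python
import ipaddress

def external_connections(connects, allowed_networks):
    """Return connect records whose destination is outside every allowed network."""
    nets = [ipaddress.ip_network(n) for n in allowed_networks]
    flagged = []
    for ts, proc, addr in connects:
        host, _, _port = addr.rpartition(":")
        ip = ipaddress.ip_address(host)
        if not any(ip in net for net in nets):
            flagged.append((ts, proc, addr))
    return flagged

# Hypothetical records: (timestamp, process, destination)
connects = [
    ("09:23:40.101200", "app", "10.0.45.23:5432"),
    ("09:23:41.384729", "app", "185.220.101.15:443"),
]
flagged = external_connections(connects, ["10.0.0.0/8"])
```

Only the second record is flagged, which narrows the payload review to the connection worth inspecting by hand.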
Under the hood, Sysdig Inspect is an Electron app (for desktop) or containerized web app, both consuming the same JavaScript parsing libraries. The .scap file format is a binary stream of event structures containing syscall metadata and optional payload buffers. The parser reads these events and constructs indexes by container ID, process ID, file descriptor, and timestamp to enable fast filtering. The microtrend visualizations use canvas-based rendering to handle thousands of datapoints—a one-hour capture at default settings might contain millions of syscalls.
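The indexing idea can be illustrated with a toy in-memory structure. This is not Sysdig Inspect's actual implementation, only a sketch of how per-container position lists combined with a sorted timestamp array make "this container, this time window" queries cheap; the class name and record keys are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class EventIndex:
    def __init__(self, events):
        # events: iterable of dicts with "ts", "container_id", and "pid" keys
        self.events = sorted(events, key=lambda e: e["ts"])
        self.timestamps = [e["ts"] for e in self.events]
        self.by_container = defaultdict(list)
        self.by_pid = defaultdict(list)
        for pos, e in enumerate(self.events):
            self.by_container[e["container_id"]].append(pos)
            self.by_pid[e["pid"]].append(pos)

    def query(self, container_id, start_ts, end_ts):
        """Events for one container with start_ts <= ts <= end_ts."""
        lo = bisect_left(self.timestamps, start_ts)
        hi = bisect_right(self.timestamps, end_ts)
        positions = self.by_container.get(container_id, [])
        # positions are ascending, so bisect narrows them to [lo, hi)
        first = bisect_left(positions, lo)
        last = bisect_left(positions, hi)
        return [self.events[p] for p in positions[first:last]]

index = EventIndex([
    {"ts": 1.0, "container_id": "a", "pid": 10},
    {"ts": 2.0, "container_id": "b", "pid": 20},
    {"ts": 3.0, "container_id": "a", "pid": 10},
    {"ts": 4.0, "container_id": "a", "pid": 11},
])
hits = index.query("a", 2.0, 3.5)
```

Both lookups are binary searches, so the cost of a query depends on the number of matching events rather than on the size of the capture.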
The tool's container-awareness comes from parsing cgroup metadata embedded in syscall events. When sysdig captures from a Kubernetes node, it enriches events with labels like pod name, namespace, deployment, and container image. Sysdig Inspect surfaces these as first-class filters, so you can analyze "all containers from deployment/frontend" without manually mapping PIDs to containers.
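The practical benefit of that enrichment is that grouping works on labels rather than PIDs. A minimal sketch follows, assuming events exported as dictionaries keyed by sysdig field names such as `k8s.ns.name` and `k8s.pod.name`; the pod names and PIDs are invented.

```python
from collections import Counter

# Hypothetical enriched events; PIDs differ per pod but are never consulted.
events = [
    {"pid": 4121, "k8s.ns.name": "production", "k8s.pod.name": "frontend-7d9f-abcde", "syscall": "read"},
    {"pid": 4121, "k8s.ns.name": "production", "k8s.pod.name": "frontend-7d9f-abcde", "syscall": "write"},
    {"pid": 9930, "k8s.ns.name": "production", "k8s.pod.name": "frontend-7d9f-fghij", "syscall": "read"},
    {"pid": 5512, "k8s.ns.name": "staging", "k8s.pod.name": "frontend-5c2a-klmno", "syscall": "read"},
]

def syscalls_per_pod(events, namespace, pod_prefix):
    """Count events per pod for pods in a namespace whose name starts with pod_prefix."""
    counts = Counter()
    for e in events:
        if e["k8s.ns.name"] == namespace and e["k8s.pod.name"].startswith(pod_prefix):
            counts[e["k8s.pod.name"]] += 1
    return counts

counts = syscalls_per_pod(events, "production", "frontend-")
```

Matching on a pod-name prefix is a stand-in here for selecting a deployment; a real export would use whatever deployment label the capture carries.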
One architectural limitation: the tool loads the entire capture file into memory for analysis. A 10GB .scap file requires substantial RAM, though the UI implements lazy-loading of payload data. The trade-off enables fast scrubbing through the timeline and responsive filtering, but makes analyzing multi-hour full-system captures challenging on typical development machines.
Gotcha
The biggest gotcha is the two-step workflow: capture with sysdig CLI, then analyze with Sysdig Inspect. If you didn't capture during the incident window, you have no data—unlike time-series databases that continuously ingest telemetry. This makes it poorly suited for "always-on" monitoring. You need to predict when to capture or implement automated capture triggers when anomalies occur.
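An automated trigger can be as simple as a rolling average with a cooldown. The sketch below is one hypothetical design, not a feature of sysdig or Sysdig Inspect: the class name, thresholds, and the injected `start_capture` callable (which in practice might launch a time-bounded sysdig capture) are all assumptions.

```python
from collections import deque

class CaptureTrigger:
    def __init__(self, threshold_ms, window, cooldown_s, start_capture):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)   # most recent latency samples
        self.cooldown_s = cooldown_s
        self.start_capture = start_capture    # callable supplied by the caller
        self.last_fired = None

    def observe(self, now_s, latency_ms):
        """Record a sample; fire a capture if the rolling average is too high."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough data yet
        avg = sum(self.samples) / len(self.samples)
        in_cooldown = (
            self.last_fired is not None
            and now_s - self.last_fired < self.cooldown_s
        )
        if avg > self.threshold_ms and not in_cooldown:
            self.last_fired = now_s
            self.start_capture()
            return True
        return False

fired = []
trigger = CaptureTrigger(threshold_ms=200, window=3, cooldown_s=60,
                         start_capture=lambda: fired.append("capture"))
results = [trigger.observe(t, ms) for t, ms in
           [(0, 50), (1, 60), (2, 900), (3, 900), (70, 900)]]
```

The cooldown keeps a sustained anomaly from spawning overlapping captures. Note that any trigger of this kind starts recording after the anomaly has begun, so the earliest moments of an incident may still be missing.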
Capture overhead is significant with full payload capture enabled. Recording every byte from busy web servers can generate gigabytes per minute and measurably impact application latency. The sysdig CLI offers filters to reduce scope (specific containers, syscalls, or ports), but configuring appropriate filters requires understanding your system's behavior beforehand, which is a chicken-and-egg problem during troubleshooting. Many teams keep the snaplen small (by default sysdig records only the first 80 bytes of each I/O buffer) or sample only a percentage of events, sacrificing completeness for performance.
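A back-of-the-envelope estimate shows why snaplen matters so much. The event rate and average I/O size below are assumptions chosen for illustration, and the estimate counts payload bytes only, ignoring per-event metadata.

```python
def payload_bytes_per_minute(io_events_per_s, avg_io_size, snaplen):
    """Estimate payload bytes recorded per minute for a given snaplen."""
    captured_per_event = min(avg_io_size, snaplen)
    return io_events_per_s * captured_per_event * 60

IO_EVENTS_PER_S = 50_000   # assumed rate for a busy node
AVG_IO_SIZE = 4096         # assumed average bytes per read/write

default_snaplen = payload_bytes_per_minute(IO_EVENTS_PER_S, AVG_IO_SIZE, snaplen=80)
full_payload = payload_bytes_per_minute(IO_EVENTS_PER_S, AVG_IO_SIZE, snaplen=4096)

print(f"snaplen 80:   {default_snaplen / 1e6:.0f} MB/min")
print(f"snaplen 4096: {full_payload / 1e9:.1f} GB/min")
```

Under these assumptions, raising snaplen from 80 to 4096 bytes grows the payload volume from about 240 MB to about 12.3 GB per minute, a roughly fiftyfold increase.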
Storage and retention become operational challenges. A five-minute full-system capture from a 32-core Kubernetes node might produce a 15GB file. Unlike metrics that compress well, syscall streams with payloads are entropy-heavy and resist compression. Teams often capture locally and delete after immediate analysis rather than archiving, which limits long-term forensic capabilities. The tool also lacks collaboration features—sharing analysis means sharing multi-gigabyte files, and there's no way to annotate findings or export investigation results beyond screenshots.
Verdict
Use if: You need post-mortem forensic analysis of containerized systems where traditional logs are insufficient: unexplained crashes, intermittent performance issues, security breaches, or compliance investigations requiring proof of data access. Especially valuable when troubleshooting problems that span application and infrastructure boundaries, like mysterious network errors that could be app bugs, DNS issues, or kernel network stack problems. The payload capture and microsecond-level detail are unmatched for deep investigations.
Skip if: You need real-time alerting, continuous monitoring dashboards, or work primarily outside Linux container environments. The manual capture workflow and storage overhead make it impractical for routine operational monitoring; use Prometheus/Grafana or a commercial APM for that. Also skip if your systems handle sensitive data without strong access controls, since .scap files contain plaintext payloads of everything captured, including credentials and PII.