Skyline: Etsy's Ensemble Voting Approach to Zero-Configuration Anomaly Detection

Hook

Most monitoring systems force you to configure thresholds for every metric you track. Skyline takes the opposite approach: throw all your metrics at it, and let statistical consensus decide what's abnormal.

Context

In the early 2010s, companies like Etsy faced a monitoring paradox. As they instrumented more of their infrastructure with StatsD, they generated hundreds of thousands of high-resolution time series. Traditional monitoring required setting thresholds for each metric—an impossible task at that scale. Even worse, static thresholds broke constantly as traffic patterns evolved. You'd either drown in false positives or miss real incidents.

Etsy's solution became Skyline, part of their Kale stack for real-time monitoring. The core insight was borrowed from ensemble machine learning: if you can't trust a single anomaly detection algorithm, run many algorithms and take a vote. By treating anomaly detection as a consensus problem rather than a threshold problem, Skyline could passively monitor metrics as they appeared, without configuration. It represented a shift from proactive threshold engineering to reactive statistical analysis, letting the data speak for itself.

Technical Insight

[Figure: System architecture (auto-generated). Stages: Ingestion, Storage, Detection, Output. Flow: Application Metrics (metrics) → StatsD/Graphite (publish) → Horizon Service (buffer) → Redis Queue (write windows) → Redis Time Series Lists (read time series) → Analyzer Service, which distributes each series to the Grubbs Test, Histogram Bins, MAD Algorithm, and Std Dev Tests. Each test votes in the Ensemble Voter, and once consensus is reached the result feeds Anomaly Alerts and the Web Dashboard.]

Skyline's architecture consists of three components working in concert. Horizon is the ingestion service, listening for metrics via a Redis queue. As metrics flow in from your application (typically via StatsD or Graphite), Horizon buffers them into Redis lists, creating high-resolution time series windows. The Analyzer component then processes these windows, running each metric through an ensemble of statistical algorithms.
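As a rough illustration of the buffering step, the sketch below keeps a rolling window of datapoints per metric. It uses an in-memory dictionary as a stand-in for the Redis lists, and the names `WindowBuffer`, `append`, and `window`, along with the 60-second window in the usage lines, are assumptions made for this example rather than Skyline's actual code.

```python
from collections import defaultdict, deque


class WindowBuffer:
    """Hypothetical in-memory stand-in for per-metric Redis lists."""

    def __init__(self, window_seconds=86400):
        self.window_seconds = window_seconds
        self._series = defaultdict(deque)

    def append(self, metric, timestamp, value):
        """Store a datapoint, then drop points older than the window.

        Assumes timestamps for a metric arrive in increasing order.
        """
        series = self._series[metric]
        series.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while series and series[0][0] < cutoff:
            series.popleft()

    def window(self, metric):
        """Return the buffered (timestamp, value) pairs, oldest first."""
        return list(self._series[metric])


buffer = WindowBuffer(window_seconds=60)
for ts in range(0, 121, 30):
    buffer.append("web.requests", ts, float(ts))
print(buffer.window("web.requests"))
```

After the last append, only the points from the most recent 60 seconds remain, which is the shape of data an analyzer would read for each metric.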

The core of the system is the voting mechanism. Each algorithm in the ensemble (Grubbs' test, first-hour average, simple standard deviation, mean subtraction cumulation, histogram bins, and more) examines the time series independently. These aren't sophisticated machine learning models; they're straightforward statistical tests that each capture a different anomaly pattern. Grubbs' test catches extreme outliers. Histogram binning detects distribution shifts. The median absolute deviation is more robust to noise and past outliers than the standard deviation.
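To see why the median absolute deviation is the more robust of the two, consider a window that already contains one large spike. The sketch below is illustrative only: the data, the helper names `std_score` and `mad_score`, and the cutoffs mentioned afterwards are assumptions, not values taken from Skyline.

```python
import numpy as np


def std_score(window, candidate):
    """Distance of candidate from the mean, in standard deviations."""
    return abs(candidate - window.mean()) / window.std()


def mad_score(window, candidate):
    """Distance from the median, in MAD units scaled to be comparable to std."""
    median = np.median(window)
    mad = np.median(np.abs(window - median))
    return abs(candidate - median) / (1.4826 * mad)


# Hypothetical window: a steady metric with one earlier spike in its history.
window = np.array([10.0, 11.0, 9.0, 10.0, 250.0, 10.0, 11.0, 9.0, 10.0, 10.0])
candidate = 40.0  # new datapoint, four times the usual level

print(f"std score: {std_score(window, candidate):.2f}")
print(f"MAD score: {mad_score(window, candidate):.2f}")
```

The old spike inflates the standard deviation so much that the new value scores far below a typical cutoff of 3, while the MAD score stays well above a cutoff of 6 because the median and MAD barely move.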

Here's a simplified example of how the ensemble voting works:

import numpy as np
from scipy import stats

def ensemble_detect(timeseries, consensus_threshold=3):
    """
    Run multiple algorithms and vote on anomaly detection.
    Returns True if consensus threshold is met.

    The default of 3 is a majority of the four algorithms in this sketch.
    """
    timeseries = np.asarray(timeseries, dtype=float)
    if len(timeseries) < 2:
        return False  # nothing to compare the latest point against
    datapoint = timeseries[-1]
    reference = timeseries[:-1]

    # Algorithm 1: Grubbs test for outliers
    def grubbs_test():
        n = len(timeseries)
        std = np.std(timeseries, ddof=1)
        if n < 3 or std == 0:
            return False
        z_score = abs(datapoint - np.mean(timeseries)) / std
        # Critical value from the t-distribution (two-sided, alpha = 0.05)
        t_crit = stats.t.isf(0.05 / (2 * n), n - 2)
        critical = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        return z_score > critical

    # Algorithm 2: First hour average deviation
    # (3600 points is one hour only at one-second resolution)
    def first_hour_average():
        if len(timeseries) < 3600:
            return False
        fha = np.mean(timeseries[:3600])
        return abs(datapoint - fha) > 3 * np.std(timeseries)

    # Algorithm 3: Median absolute deviation
    def mad_test():
        median = np.median(reference)
        mad = np.median(np.abs(reference - median))
        if mad == 0:
            return False  # no spread to measure against
        threshold = 6 * 1.4826 * mad  # 1.4826 normalizes MAD to std
        return abs(datapoint - median) > threshold

    # Algorithm 4: Histogram bins (distribution shift)
    def histogram_bins():
        hist, bin_edges = np.histogram(reference, bins=15)
        if datapoint < bin_edges[0] or datapoint > bin_edges[-1]:
            return True  # outside the observed range
        # A value equal to the last edge belongs to the last bin
        bin_idx = min(np.digitize(datapoint, bin_edges) - 1, len(hist) - 1)
        return hist[bin_idx] == 0  # datapoint in empty bin

    # Run all algorithms and count votes
    votes = [
        grubbs_test(),
        first_hour_average(),
        mad_test(),
        histogram_bins()
    ]

    anomalies_detected = sum(votes)
    return anomalies_detected >= consensus_threshold

The consensus threshold (typically 6 out of 9-10 algorithms in production Skyline) is the key parameter. Set it too low and you generate false positives; set it too high and you miss real anomalies. Etsy found that requiring a clear majority worked well across diverse metric types (CPU utilization, request rates, error counts, latency percentiles) without per-metric tuning.
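The trade-off described above is easy to see with a fixed set of votes. In this minimal sketch, the vote list and the helper name `is_anomalous` are made up for illustration; only the idea of comparing a vote count to a threshold comes from the article.

```python
def is_anomalous(votes, consensus):
    """Return True when at least `consensus` algorithms voted anomalous."""
    return sum(bool(v) for v in votes) >= consensus


# Hypothetical votes from nine algorithms for one metric: five say "anomalous".
votes = [True, True, False, True, False, True, False, True, False]

for consensus in (3, 5, 6, 9):
    print(consensus, is_anomalous(votes, consensus))
```

With five of nine votes, thresholds of 3 and 5 raise an alert while 6 and 9 stay silent, so the same evidence becomes either a false positive or a missed anomaly depending on where the bar sits.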

The web application component provides visualization, letting you drill into which algorithms voted for an anomaly and inspect the time series context. This transparency is crucial because anomaly detection systems need human override. Sometimes what looks statistically anomalous is actually expected behavior—a planned deployment, a marketing campaign, or legitimate traffic growth.

Skyline's single-server architecture is both a strength and limitation. Everything runs on one box: Redis stores the time series windows in memory, Horizon handles ingestion, and Analyzer spawns worker processes to parallelize algorithm execution across metrics. This simplicity eliminates distributed system complexity but caps throughput. Etsy reported handling 500,000 metrics at one-minute resolution on commodity hardware, but scaling beyond that requires either sampling metrics or reducing resolution—both lossy compromises.
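One simple way to spread metrics across worker processes is to give each worker a contiguous slice of the metric names. The sketch below shows only that partitioning step; `assign_metrics` is a hypothetical helper, and how the real Analyzer divides its work is not something this example claims to reproduce.

```python
from math import ceil


def assign_metrics(metric_names, worker_count):
    """Split metric names into contiguous, near-equal slices, one per worker."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    per_worker = ceil(len(metric_names) / worker_count)
    return [
        metric_names[i * per_worker:(i + 1) * per_worker]
        for i in range(worker_count)
    ]


metrics = [f"metric.{i}" for i in range(10)]
for worker_id, chunk in enumerate(assign_metrics(metrics, 4)):
    print(worker_id, chunk)
```

Each slice could then be handed to its own process (for example via the standard library's `multiprocessing` module). Because every worker sits on the same machine, adding workers helps only until that machine's cores and memory are exhausted, which is the ceiling the paragraph above describes.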

The extensibility model deserves attention. Adding custom algorithms is straightforward—drop a new Python function into the algorithms directory following the established signature. This made Skyline a laboratory for experimenting with domain-specific detection logic. For instance, you might add an algorithm that understands business hour patterns or seasonal trends specific to your application. The ensemble absorbs these additions naturally; if your custom algorithm votes with the majority, it reinforces confidence. If it votes alone, it's outvoted.
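A minimal sketch of that extension pattern might look like the following. The registry list `ALGORITHMS`, the `run_ensemble` helper, and the `below_floor` check are all hypothetical, and the functions take a plain list of values, which may differ from the signature the project actually uses.

```python
import numpy as np


def stddev_from_average(timeseries):
    """Generic check: last value is more than 3 std from the mean."""
    values = np.asarray(timeseries, dtype=float)
    std = values.std()
    return bool(std > 0 and abs(values[-1] - values.mean()) > 3 * std)


def below_floor(timeseries, floor=5.0):
    """Hypothetical domain-specific check: last value fell below a known floor."""
    return bool(timeseries[-1] < floor)


ALGORITHMS = [stddev_from_average, below_floor]  # hypothetical registry


def run_ensemble(timeseries, consensus):
    """Collect one vote per registered algorithm and apply the threshold."""
    votes = {algo.__name__: algo(timeseries) for algo in ALGORITHMS}
    return sum(votes.values()) >= consensus, votes


print(run_ensemble([3, 8, 4, 9, 2, 7, 4], consensus=2))
print(run_ensemble([10] * 20 + [1], consensus=2))
```

In the first series only the custom check fires, so it is outvoted; in the second both checks agree and the ensemble reports an anomaly.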

Gotcha

The Python scientific stack dependency is Skyline's operational Achilles heel. Installing numpy, scipy, pandas, and statsmodels sounds simple until you encounter version conflicts, missing BLAS/LAPACK libraries, or compilation failures. These libraries require C extensions and specific system dependencies that vary across Linux distributions. In containerized environments, you'll spend time building robust images that don't break on platform updates. The conda ecosystem helps but adds another layer of tooling complexity.

Skyline also suffers from the cold start problem. Algorithms need historical context to establish baselines, but new metrics have no history. The system handles this conservatively: new metrics won't trigger anomalies until sufficient data accumulates. This leaves a blind spot during the first few hours of monitoring a new service or metric. Similarly, metrics with high variance or non-stationary behavior (think metrics that trend upward as your business grows) generate false positives because the algorithms assume stationarity. You end up excluding these metrics from analysis, reducing coverage.

The repository's maintenance status is concerning. With minimal stars and activity, this appears to be a fork of the original Etsy project without ongoing development. The original Skyline dates from 2013, and while the statistical principles remain sound, the ecosystem has moved on. Modern observability platforms offer integrated anomaly detection with better scalability, and the monitoring landscape has matured with tools like Prometheus that have massive community support.
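To make the cold start and stationarity caveats from this section concrete, the sketch below abstains when history is too short and shows a steadily growing metric tripping a baseline comparison. The function `baseline_vote`, its parameters, and the synthetic data are assumptions for illustration. First differencing, used in the last line, is a common general remedy for trends and is not something attributed to Skyline here.

```python
import numpy as np


def baseline_vote(values, baseline_len=20, min_points=30):
    """Compare the last value with an early baseline; None means 'not enough data'."""
    values = np.asarray(values, dtype=float)
    if len(values) < min_points:
        return None  # cold start: abstain instead of guessing
    baseline = values[:baseline_len]
    spread = baseline.std()
    if spread == 0:
        return bool(values[-1] != baseline[0])
    return bool(abs(values[-1] - baseline.mean()) > 3 * spread)


# Hypothetical metric that grows steadily with small alternating noise.
trend = np.array([i + (0.5 if i % 2 else -0.5) for i in range(100)])

print(baseline_vote(trend[:10]))      # too little history
print(baseline_vote(trend))           # growth alone looks anomalous
print(baseline_vote(np.diff(trend)))  # differencing removes the trend
```

The short series yields no verdict, the raw trending series is flagged even though nothing unusual happened, and the differenced series is not flagged.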

Verdict

Use if: you're operating a high-cardinality metrics environment (thousands of time series) where manual threshold configuration is infeasible, you're already invested in the StatsD/Graphite/Redis stack and want to add anomaly detection without rearchitecting, you have Python operational expertise and can manage the scientific library dependencies, and you value algorithmic transparency over black-box machine learning. Skyline excels at passively monitoring diverse metric types without per-metric configuration, and the ensemble voting approach provides explainable results.

Skip if: you need distributed processing beyond single-server capacity limits, you're starting fresh and want modern tooling with active maintenance and community support, you require sophisticated time series forecasting rather than simple statistical anomaly detection, or you lack the operational maturity to maintain legacy Python dependencies. Modern alternatives like Prometheus with Alertmanager, or even commercial solutions like Datadog's anomaly detection, provide better long-term maintainability with comparable or superior detection capabilities.
