
Building Shazam from Scratch: Inside SeekTune's Audio Fingerprinting Pipeline


Hook

Shazam can identify a song from a 3-second clip recorded in a noisy bar, matching it against millions of tracks in under a second. The algorithm that makes this possible has been public knowledge since 2003, yet open-source implementations remain surprisingly rare.

Context

Audio fingerprinting solves a deceptively hard problem: how do you match a degraded, partial recording against a massive database when you can’t rely on exact byte matching? Traditional approaches like waveform comparison fail catastrophically with background noise, different recording qualities, or even slight pitch variations. Shazam’s breakthrough, detailed in Avery Wang’s 2003 paper “An Industrial-Strength Audio Search Algorithm,” was recognizing that certain frequency-time constellation patterns remain remarkably stable across different recordings of the same song.

Before this approach, audio recognition systems relied on metadata matching (error-prone), acoustic fingerprinting that required high-quality samples (impractical for real-world use), or melody-based matching (required isolated vocals). Wang’s insight was that robust recognition doesn’t need to capture everything about a song—it just needs features that survive noise, compression, and distortion while remaining unique enough to distinguish between tracks. SeekTune brings this algorithm into the open-source Go ecosystem, providing not just a library but a complete working system with database backends, API integrations, and a web interface.

Technical Insight

System architecture (auto-generated diagram): Audio Input → FFmpeg Processor (mono PCM) → FFT Engine (spectrogram) → Constellation Mapper (peak points) → Hash Generator (fingerprints + time offsets) → Database (SQLite/MongoDB). Query Audio follows the same path; its query hashes are matched against stored hashes, and the Score Calculator uses time-offset patterns to produce the Match Result. The Spotify API supplies metadata, YT-DLP handles downloads, and the React Frontend talks to the Go Backend's API.

The core of SeekTune’s architecture is a five-stage pipeline that transforms audio into searchable fingerprints. First, FFmpeg converts incoming audio files into a consistent format (mono channel, specific sample rate). The raw audio data then passes through a Fast Fourier Transform implementation that generates spectrograms—visual representations where time runs along one axis, frequency along another, and intensity represents energy at that frequency-time point.

The critical innovation happens in the constellation mapping phase. Rather than trying to fingerprint every frequency at every moment, SeekTune identifies local maxima—peaks in the spectrogram where energy is higher than surrounding frequencies. These become anchor points. The system then pairs each anchor with target points that appear shortly after it in time, creating what Wang called “combinatorial hashing.” Here’s a simplified example of how the fingerprint generation works:

// targetZone bounds how far ahead (in spectrogram frames) a target point
// may sit from its anchor; the exact value is a tunable parameter.
const targetZone = 100

type Peak struct {
    Frequency int
    Time      int
    Amplitude float64
}

type Fingerprint struct {
    Hash       string
    TimeOffset int
    SongID     string
}

func GenerateFingerprints(peaks []Peak, songID string) []Fingerprint {
    fingerprints := []Fingerprint{}

    for i, anchor := range peaks {
        // Pair each anchor with target points inside the look-ahead window
        for j := i + 1; j < len(peaks) && peaks[j].Time-anchor.Time < targetZone; j++ {
            target := peaks[j]

            // Hash the triplet: anchor_freq | target_freq | delta_time
            hash := fmt.Sprintf("%d|%d|%d",
                anchor.Frequency,
                target.Frequency,
                target.Time-anchor.Time)

            fingerprints = append(fingerprints, Fingerprint{
                Hash:       hash,
                TimeOffset: anchor.Time,
                SongID:     songID,
            })
        }
    }
    return fingerprints
}
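The peaks fed into that function come from the constellation mapping step. A minimal sketch of peak extraction: keep spectrogram cells that dominate their 3x3 neighborhood and clear a magnitude floor. SeekTune's actual frequency bands, thresholds, and density limits are not shown here, so the neighborhood size and `minMag` are assumptions.

```go
package main

import "fmt"

// Peak marks a spectrogram cell whose magnitude exceeds its neighbors.
type Peak struct {
	Time, Frequency int
	Amplitude       float64
}

// extractPeaks scans a magnitude spectrogram (spec[time][freqBin]) and keeps
// cells that dominate their 3x3 neighborhood and clear a minimum magnitude.
// Real systems also cap peak density per region; this sketch omits that.
func extractPeaks(spec [][]float64, minMag float64) []Peak {
	var peaks []Peak
	for t := 1; t < len(spec)-1; t++ {
		for f := 1; f < len(spec[t])-1; f++ {
			m := spec[t][f]
			if m < minMag {
				continue
			}
			isMax := true
			for dt := -1; dt <= 1 && isMax; dt++ {
				for df := -1; df <= 1; df++ {
					if (dt != 0 || df != 0) && spec[t+dt][f+df] >= m {
						isMax = false
						break
					}
				}
			}
			if isMax {
				peaks = append(peaks, Peak{Time: t, Frequency: f, Amplitude: m})
			}
		}
	}
	return peaks
}

func main() {
	spec := [][]float64{
		{0, 0, 0, 0},
		{0, 9, 0, 0},
		{0, 0, 0, 7},
		{0, 0, 0, 0},
	}
	fmt.Println(extractPeaks(spec, 1.0)) // only the 9 at (t=1, f=1) survives
}
```

Sparsity is the point: a few hundred peaks per song, not millions of spectrogram cells, is what keeps the hash database tractable.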

The genius of this approach is in what gets hashed: the frequencies of two points and their time separation. This triplet remains consistent even when a song is played in a noisy environment or recorded at different volumes, because relative frequencies and time relationships don’t change. SeekTune stores these fingerprints in either SQLite (for development and smaller databases) or MongoDB (for scale), with the SongID and TimeOffset preserved.

During recognition, the same pipeline processes the incoming audio snippet, generating fingerprints that are queried against the database. The matching algorithm counts hash collisions per song, but crucially, it also checks for consistency in time offsets. If multiple hashes from the query audio match a particular song, and they all suggest the query audio started at the same point in the original track (consistent time offset delta), confidence skyrockets. A true match typically produces thousands of aligned hash matches, while random collisions from different songs produce only scattered, inconsistent matches.
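The offset-consistency check can be sketched as a histogram over (song, offset delta) pairs. The exact bucketing and types are assumptions; SeekTune's scoring code may differ in detail, but the principle is the same: aligned matches pile into one bucket, random collisions scatter.

```go
package main

import "fmt"

// match pairs a query fingerprint's time with a stored fingerprint's
// song ID and time offset, as returned by a hash lookup.
type match struct {
	songID              string
	dbOffset, queryTime int
}

// scoreMatches histograms matches by (song, dbOffset - queryTime) and
// returns the song with the tallest bucket. A true match concentrates
// in one delta; collisions from other songs spread thinly.
func scoreMatches(matches []match) (bestSong string, bestScore int) {
	type key struct {
		songID string
		delta  int
	}
	counts := map[key]int{}
	for _, m := range matches {
		k := key{m.songID, m.dbOffset - m.queryTime}
		counts[k]++
		if counts[k] > bestScore {
			bestScore = counts[k]
			bestSong = m.songID
		}
	}
	return bestSong, bestScore
}

func main() {
	matches := []match{
		{"songA", 100, 0}, {"songA", 105, 5}, {"songA", 112, 12}, // all delta 100
		{"songB", 40, 0}, {"songB", 90, 5}, // scattered deltas
	}
	song, score := scoreMatches(matches)
	fmt.Println(song, score) // songA 3
}
```

Note that the delta, not the raw offset, is what's compared: the query clip can start anywhere in the song, and the delta cancels that out.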

SeekTune’s database schema reflects this approach. For SQLite, it creates an index on the hash column for fast lookups, with additional columns for song_id and time_offset. MongoDB users get similar indexing with additional benefits for horizontal scaling. The scoring function is straightforward: group matches by song_id and time_offset_delta, then count occurrences. The song with the highest count of consistent offsets wins, often by orders of magnitude.

The architecture also demonstrates thoughtful integration patterns. Rather than bundling audio download and metadata retrieval into the core algorithm, SeekTune cleanly separates these concerns. The YT-DLP integration handles YouTube audio extraction, the Spotify API provides rich metadata (artist, album, cover art), and the fingerprinting engine remains agnostic to audio sources. This separation makes the codebase more testable and allows users to swap out components—you could easily replace YT-DLP with direct file uploads or streaming sources without touching the recognition pipeline.
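That swap-ability comes down to a small interface seam. The `AudioSource` interface below is hypothetical, not SeekTune's actual API, but it illustrates the pattern: any backend that yields an audio stream can feed the pipeline, so a YT-DLP wrapper, a file upload handler, or a test stub are interchangeable.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// AudioSource is a hypothetical seam (not SeekTune's actual interface):
// anything that can produce an audio stream for an ID qualifies.
type AudioSource interface {
	Fetch(id string) (io.Reader, error)
}

// fileSource serves pre-loaded audio; a ytdlpSource or uploadSource
// would satisfy the same interface without touching the pipeline.
type fileSource struct{ data map[string]string }

func (f fileSource) Fetch(id string) (io.Reader, error) {
	raw, ok := f.data[id]
	if !ok {
		return nil, fmt.Errorf("unknown id %q", id)
	}
	return strings.NewReader(raw), nil
}

// ingest stands in for the fingerprinting pipeline: it only needs a
// Reader, so it never learns where the audio came from.
func ingest(src AudioSource, id string) (int, error) {
	r, err := src.Fetch(id)
	if err != nil {
		return 0, err
	}
	b, err := io.ReadAll(r)
	return len(b), err
}

func main() {
	src := fileSource{data: map[string]string{"track1": "pcm-bytes-here"}}
	n, err := ingest(src, "track1")
	fmt.Println(n, err) // 14 <nil>
}
```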

Gotcha

The external dependency chain is SeekTune’s biggest operational challenge. You need FFmpeg compiled with the right codecs, Node.js for the React frontend, YT-DLP (which itself requires Python) for YouTube integration, and either SQLite or MongoDB. On a fresh Linux server, you’re looking at 30+ minutes of setup before you can process your first song. Containerization helps, but the multi-language stack (Go, Python, JavaScript) makes the Docker image relatively heavy and complicates debugging when something breaks in the pipeline.

Performance characteristics remain largely undocumented, which matters for practical deployment. How many songs can SQLite handle before query times degrade? At what database size should you migrate to MongoDB? What’s the memory footprint when fingerprinting a 5-minute song versus a 45-minute DJ mix? The repository includes a working demo but lacks the benchmarks and capacity planning guidance you’d need for anything beyond personal experimentation. The YouTube integration via YT-DLP is also inherently fragile—YouTube regularly updates their site to break scrapers, and while YT-DLP maintainers are responsive, you’re essentially depending on a cat-and-mouse game continuing in your favor. For production use, you’d want to decouple from YouTube entirely or add robust fallback mechanisms.

Verdict

Use SeekTune if you’re building custom audio recognition features and want full control over the pipeline, studying audio fingerprinting algorithms with a production-quality implementation, or need a self-hosted solution where data privacy matters (processing audio locally rather than sending to third-party APIs). It’s also excellent for educational purposes—the codebase is clean enough to read through in an afternoon, and watching the constellation mapping generate fingerprints builds genuine intuition for how Shazam-style recognition works. Skip it if you need production-grade recognition at scale without investing in infrastructure (commercial APIs like Shazam or AudD are more appropriate), require mobile or embedded deployment (the FFmpeg and multi-runtime dependencies make this impractical), or want something that just works out of the box (the setup complexity demands comfortable systems administration). This is a tool for developers who want to understand the magic, not just consume it.
