Dablooms: When Bloom Filters Need to Forget

Hook

Traditional Bloom filters can only answer 'maybe yes' or 'definitely no'—they can never forget. Dablooms solved this with a deceptively simple trick: 4-bit counters that turn probabilistic sets into probabilistic multisets with deletion support.

Context

Bloom filters are elegant: a probabilistic data structure that tells you if an element might be in a set, using a fraction of the memory a hash table would require. But they have a fatal flaw for many real-world applications—once you add an element, you can never remove it. Setting bits is a one-way operation.

Bitly faced this limitation when building systems that needed to track millions of URLs with the ability to remove entries as links expired or were deleted. They needed the space efficiency of Bloom filters but with deletion support and the ability to scale beyond initially allocated capacity. The result was dablooms, a counting Bloom filter implementation that uses 4-bit counters instead of single bits, stores data in memory-mapped files for persistence, and automatically scales through cascading filter layers. While the project is now unmaintained, its architecture demonstrates sophisticated approaches to crash recovery and capacity scaling that remain relevant for anyone building probabilistic data structures.

Technical Insight

System architecture — auto-generated

The core innovation in dablooms is replacing each bit in a traditional Bloom filter with a 4-bit counter. When you add an element, you hash it to k positions and increment the counters at those positions. When you remove an element, you decrement those same counters. An element is 'probably present' only if all k counters are non-zero.

Why 4 bits? It's a carefully considered trade-off. Four bits give you a maximum count of 15, which means up to 15 hash collisions at a single position before the counter saturates. In practice, with properly tuned parameters, hitting this limit is rare enough that it doesn't significantly impact false positive rates. Meanwhile, 4 bits uses only 4x the memory of a traditional Bloom filter—far less than the alternatives like 8-bit or 16-bit counters, and dramatically less than a full hash table.

Here's what basic usage looks like:

#include <dablooms.h>

// Create a scaling Bloom filter
// capacity: initial 100k elements
// error_rate: 0.05 (5% false positive rate)
// filepath: persisted to disk via mmap
scaling_bloom_t *bloom = new_scaling_bloom(
    100000,    // initial capacity
    0.05,      // error rate
    "urls.bloom"
);

// Add elements
scaling_bloom_add(bloom, "https://bit.ly/example", 23, 1);
scaling_bloom_add(bloom, "https://bit.ly/test", 20, 1);

// Check membership (returns 1 if probably present, 0 if definitely not)
int present = scaling_bloom_check(bloom, "https://bit.ly/example", 23);

// Remove elements (the key feature)
scaling_bloom_remove(bloom, "https://bit.ly/example", 23, 1);

// Clean up
free_scaling_bloom(bloom);

The memory-mapped file approach is where dablooms gets interesting for production systems. Instead of manual serialization, the entire filter lives in a file mapped directly into the process's address space. Additions and removals modify the mapped memory, and the OS handles flushing changes to disk asynchronously. This gives you persistence with minimal overhead—no explicit save/load operations needed.

But memory mapping introduces a critical problem: what happens if your process crashes between a memory write and the OS flushing that page to disk? Dablooms solves this with a dual sequence number system. The structure maintains both mem_seqnum (incremented on every write to memory) and disk_seqnum (the sequence number of the last successful flush). On startup, if these numbers don't match, you know the filter experienced an unclean shutdown and may be inconsistent.

The scaling mechanism uses cascading filters. When the initial filter reaches capacity, dablooms creates a new, larger filter and chains it to the first. Lookups check all filters in the chain. This means capacity can grow from the initial 100k to ~500k elements without requiring a complete rebuild. Each new filter in the cascade is sized to accommodate the total number of elements added so far, creating logarithmic growth in the number of filters needed.

The implementation detail that reveals real C craftsmanship is how counters are packed. Since 4 bits doesn't align with byte boundaries, dablooms packs two counters per byte. Accessing a counter requires bit masking operations:

// Simplified version of counter access
uint8_t get_counter(uint8_t *array, size_t index) {
    size_t byte_index = index / 2;
    if (index % 2 == 0) {
        return array[byte_index] & 0x0F;  // Lower 4 bits
    } else {
        return (array[byte_index] >> 4) & 0x0F;  // Upper 4 bits
    }
}

This bit-level packing is why dablooms is implemented in C rather than a higher-level language—you need precise control over memory layout to achieve the space efficiency that makes counting Bloom filters practical.

Gotcha

The elephant in the room is that dablooms is explicitly unmaintained. The repository README states this clearly, and the last meaningful commit was years ago. This isn't just a theoretical concern—it means no security patches, no bug fixes, and no support for modern toolchains. You're inheriting technical debt the moment you add it as a dependency.

The thread-safety issue is equally significant. Dablooms provides zero concurrency guarantees—if you need multiple threads accessing the same filter, you must implement your own locking. The memory-mapped approach makes this trickier than you might expect, because even read operations can trigger page faults that modify process state. You'll need coarse-grained locking around all filter operations, which can become a bottleneck in high-concurrency scenarios. The Python bindings only support Python 2.x, which reached end-of-life in January 2020, and the 4-bit counter limit means you can't safely use this for cases where individual positions might see more than 15 collisions without degrading accuracy.

Verdict

Use if: You're building a C-based system that needs space-efficient element tracking with deletion support, you're comfortable forking and maintaining legacy code yourself, or you're studying counting Bloom filter implementations to understand production-tested approaches to crash recovery and capacity scaling. The architecture is genuinely clever and the sequence number approach to consistency is textbook-worthy. Skip if: You're starting a new project and need active maintenance, require Python 3+ support, need thread-safe concurrent access out of the box, or work in environments where taking on unmaintained dependencies creates compliance or security review issues. Modern alternatives like Redis with RedisBloom or Cuckoo filters give you similar capabilities with active communities behind them.

Dablooms: When Bloom Filters Need to Forget

Dablooms: When Bloom Filters Need to Forget

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]

Dablooms: When Bloom Filters Need to Forget

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// RELATED

Open Interpreter: Running GPT-4 with Root Access to Your Machine

Accomplish: Why Wrapping OpenCode Instead of Building an Agent Runtime Was the Right Bet

NVIDIA Cosmos: A Case Study in Strategic Repository Deprecation

How Ripgrep Makes Searching 10x Faster Than Grep: A Deep Dive Into Rust-Powered Text Search

Open Interpreter: Running GPT-4 with Root Access to Your Machine

Accomplish: Why Wrapping OpenCode Instead of Building an Agent Runtime Was the Right Bet

NVIDIA Cosmos: A Case Study in Strategic Repository Deprecation

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]