
Why Most MAC Address Parsers Are Wrong: Deep Dive into coolbho3k/manuf

Hook

Your network scanner thinks Apple owns that device, but it may actually belong to a small IoT manufacturer that was assigned a /36 address block. Many MAC address parsers get this wrong. Here's why.

Context

Every network interface on Earth has a MAC address, and the first portion typically identifies the manufacturer. This mapping is managed by IEEE, which assigns Organizationally Unique Identifiers (OUIs) to vendors. Wireshark maintains the most comprehensive public database of these mappings, used by millions of network engineers, security researchers, and sysadmins to identify devices on their networks.

But there's a hidden complexity that catches most developers off-guard: not every assignment is the standard 24-bit prefix (a /24 netmask). The 24-bit block is what IEEE calls MAC Address Block Large (MA-L). IEEE also assigns MAC Address Block Medium (MA-M, a 28-bit prefix, /28) and MAC Address Block Small (MA-S, a 36-bit prefix, /36), and some entries in the database even carry custom netmasks like /45. Most naive parsers simply match the first three bytes, which returns the wrong vendor whenever a longer prefix is nested inside a /24 block. The coolbho3k/manuf library solves this by implementing the complete netmask-aware lookup algorithm that even Wireshark's own web tool sometimes gets wrong.
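To make the failure mode concrete, here is a minimal sketch of the naive three-byte approach. The table contents, the vendor label, and the `naive_lookup` helper are placeholders invented for illustration, not real registry data or part of any library.

```python
from typing import Optional

# Hypothetical table keyed on the first three bytes only. The prefix and the
# vendor label are placeholders for illustration, not real registry data.
NAIVE_TABLE = {"ACDE48": "BroadVendor"}


def naive_lookup(mac: str) -> Optional[str]:
    """Match on the first 24 bits only, ignoring any longer assignments."""
    digits = mac.replace(":", "").replace("-", "").upper()
    return NAIVE_TABLE.get(digits[:6])


# Both addresses share the first three bytes, so the naive parser cannot tell
# them apart even if a /36 block inside this range belongs to someone else.
print(naive_lookup("AC:DE:48:00:01:22"))
print(naive_lookup("AC:DE:48:12:34:56"))
```

Because the key is truncated to six hex digits, any information carried by a longer prefix is discarded before the lookup even happens.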

Technical Insight

[Figure: System architecture (auto-generated). Database path: Download from Wireshark, MacParser Initialization, Parse manuf file, Build Dictionary (store key: bits_left, shifted_mac), In-Memory Cache; the Update Command refreshes the manuf file. Lookup path: MAC Address Input, Convert to Integer, Iterate Netmasks 0-48 bits, Right-shift MAC, Generate Key, Dictionary Lookup, then Return Vendor Info on a match or No Match.]

The brilliance of manuf lies in its preprocessing strategy. Instead of parsing the database file on every lookup, it builds an in-memory dictionary on initialization where keys are tuples of (bits_left, shifted_mac). The bits_left represents 48 minus the netmask length, and shifted_mac is the MAC address right-shifted by that amount. This transforms variable-length prefix matching into exact dictionary lookups.
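The key construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's actual code: `make_key` is a hypothetical helper name, and padding short prefixes to 48 bits is one plausible way to normalize them.

```python
from typing import Tuple


def make_key(prefix: str, netmask: int) -> Tuple[int, int]:
    """Turn a prefix and netmask length into a (bits_left, shifted_mac) key."""
    digits = prefix.replace(":", "").replace("-", "")
    digits = digits.ljust(12, "0")  # pad short prefixes such as 00:1B:63 to 48 bits
    mac_int = int(digits, 16)
    bits_left = 48 - netmask  # number of low-order bits the entry does not fix
    return bits_left, mac_int >> bits_left


print(make_key("00:1B:63", 24))           # standard 24-bit prefix
print(make_key("AC:DE:48:00:00:00", 36))  # 36-bit prefix
```

Two addresses fall in the same block exactly when they produce the same key for that block's bits_left, which is what lets a plain dictionary stand in for prefix matching.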

Here's how you'd use it in a real-world network scanning scenario:

from manuf import MacParser

# Initialize once - loads the entire database into memory
parser = MacParser()

# Lookup various MAC addresses
devices = [
    '00:00:00:00:00:00',
    '00:1B:63:84:45:E6',  # Apple
    'AC:DE:48:00:11:22',  # Specific MA-S assignment
]

for mac in devices:
    result = parser.get_all(mac)
    print(f"{mac} -> Vendor: {result.manuf}, Comment: {result.comment}")

# Update the database periodically
parser.update()

The lookup algorithm is deceptively clever. For a given MAC address, it iterates through possible netmask lengths from most specific (48 bits) to least specific (0 bits). At each step, it right-shifts the MAC address by the corresponding number of bits and checks if that key exists in the dictionary:

# Simplified version of the core lookup logic
def lookup(self, mac_int):
    for bits_left in range(49):  # 0 to 48
        # Right shift MAC address by bits_left
        shifted = mac_int >> bits_left
        key = (bits_left, shifted)
        
        if key in self.vendors:
            return self.vendors[key]
    
    return None  # No match found

Each probe is an average-case O(1) dictionary lookup, and the loop runs at most 49 times (bits_left from 0 to 48), so the total cost is bounded by a constant. Note that an address matching on the standard /24 prefix is found at bits_left=24, which means 25 probes rather than a handful, and a completely unrecognized address costs all 49. That is still effectively constant time for practical purposes.

The preprocessing stage is where the magic happens. When parsing the manuf file, each entry, such as 00:1B:63 Apple, is converted into a single dictionary key. If an entry has a /36 netmask (AC:DE:48:00:00:00/36), the library calculates bits_left as 48-36=12, right-shifts the MAC address by 12 bits, and stores the vendor information under the key (12, shifted_value). A plain three-byte entry such as 00:1B:63 corresponds to bits_left=24. Overlapping ranges are therefore handled correctly: the more specific match (fewer bits_left) is checked first during lookup.
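The overlapping-range behavior is easiest to see with a toy table. The sketch below assumes two made-up entries (the vendor names are placeholders) and uses hypothetical helpers `mac_to_int` and `lookup`; it mirrors the idea, not the library's internals.

```python
from typing import Dict, Optional, Tuple


def mac_to_int(mac: str) -> int:
    """Strip separators, pad to 12 hex digits, and parse as a 48-bit integer."""
    return int(mac.replace(":", "").replace("-", "").ljust(12, "0"), 16)


# Hypothetical entries: (prefix, netmask length, vendor). Names are placeholders.
ENTRIES = [
    ("AC:DE:48", 24, "BroadVendor"),
    ("AC:DE:48:00:00:00", 36, "NarrowVendor"),
]

# Preprocessing: one (bits_left, shifted_mac) key per entry.
table: Dict[Tuple[int, int], str] = {}
for prefix, netmask, vendor in ENTRIES:
    bits_left = 48 - netmask
    table[(bits_left, mac_to_int(prefix) >> bits_left)] = vendor


def lookup(mac: str) -> Optional[str]:
    """Probe from the most specific mask (bits_left=0) to the least (48)."""
    mac_int = mac_to_int(mac)
    for bits_left in range(49):
        vendor = table.get((bits_left, mac_int >> bits_left))
        if vendor is not None:
            return vendor
    return None


print(lookup("AC:DE:48:00:01:22"))  # inside the /36 block
print(lookup("AC:DE:48:12:34:56"))  # only the /24 block covers this address
```

The first address hits the /36 key at bits_left=12 before the loop ever reaches bits_left=24, so the narrower assignment wins without any explicit priority logic.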

The library also handles the subtle distinction between short names (manuf), longer names (manuf_long), and comments. Wireshark's database format includes all three fields, and manuf preserves this information:

# Get detailed vendor information
result = parser.get_all('00:1B:63:84:45:E6')
print(result.manuf)       # 'Apple'
print(result.manuf_long)  # 'Apple, Inc.'
print(result.comment)     # Specific product line info, if available

One architectural decision worth noting is the dual-licensing under LGPLv3/Apache 2.0, deliberately chosen to avoid GPL contamination despite using Wireshark's GPLv2 database. The author argues that the database itself is factual data (IEEE assignments) and the parsing code is independent, making the library safe for commercial use. This pragmatic licensing choice makes manuf viable for proprietary network monitoring tools and commercial security products.

Gotcha

The entire database loads into memory during initialization, which takes a few hundred milliseconds and consumes several megabytes of RAM. For applications that need to perform a single lookup, this overhead is wasteful—you're loading 40,000+ entries to resolve one MAC address. If you're building a serverless function or memory-constrained embedded system, this approach may not scale well.
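For the one-off lookup case, a streaming scan avoids building the dictionary at all. The sketch below is not part of manuf; `stream_lookup` is a hypothetical helper, and it assumes tab-separated lines of the form prefix[/netmask], short name, optional long name, with `#` starting a comment. It also assumes the queried address has all 12 hex digits.

```python
from typing import Iterable, Optional


def stream_lookup(lines: Iterable[str], mac: str) -> Optional[str]:
    """Scan manuf-style lines once and return the most specific matching vendor."""
    mac_int = int(mac.replace(":", "").replace("-", ""), 16)
    best_mask, best_vendor = -1, None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not an entry in the assumed format
        prefix, _, mask_text = fields[0].partition("/")
        digits = prefix.replace(":", "").replace("-", "")
        # Without an explicit netmask, use the written prefix length in bits.
        netmask = int(mask_text) if mask_text else len(digits) * 4
        bits_left = 48 - netmask
        prefix_int = int(digits.ljust(12, "0"), 16)
        if netmask > best_mask and (mac_int >> bits_left) == (prefix_int >> bits_left):
            best_mask, best_vendor = netmask, fields[1]
    return best_vendor


sample = [
    "# sample data, vendor names are placeholders",
    "AC:DE:48\tBroad\tBroad Vendor Inc.",
    "AC:DE:48:00:00:00/36\tNarrow\tNarrow Vendor Ltd.",
]
print(stream_lookup(sample, "AC:DE:48:00:01:22"))
```

This trades lookup speed for memory: every query rereads the whole file, so it only makes sense when lookups are rare.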

The bundled manuf database also goes stale quickly. Wireshark updates their database weekly as IEEE assigns new OUIs, but manuf ships with a snapshot from whenever the package was last released. The library includes an update() method to fetch the latest database from Wireshark's git repository, but this requires manual intervention or scheduled automation. If you're identifying newly released devices, lookups may come back with no vendor match until you update. Additionally, the update mechanism pulls directly from Wireshark's repository, meaning your application needs outbound HTTPS access and depends on their infrastructure availability. There's no fallback or mirror system.
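One way to automate the refresh while tolerating network failures is a small wrapper that checks the age of the local file and keeps the old snapshot if the update fails. This is a sketch under assumptions: `refresh_if_stale` is a hypothetical helper, and the `update` argument is any zero-argument callable, for example a bound `parser.update`.

```python
import os
import time
from typing import Callable


def refresh_if_stale(db_path: str, max_age_seconds: float,
                     update: Callable[[], None]) -> bool:
    """Call update() when db_path is missing or older than max_age_seconds.

    Returns True if the refresh ran and succeeded. Any failure is swallowed so
    the caller simply keeps using the existing snapshot.
    """
    try:
        age = time.time() - os.path.getmtime(db_path)
    except OSError:
        age = float("inf")  # a missing file counts as stale
    if age < max_age_seconds:
        return False  # fresh enough, nothing to do
    try:
        update()
    except Exception:
        return False  # network or parse failure: keep the old data
    return True
```

Calling this at startup or from a scheduled job, with a threshold of about a week to match the upstream cadence mentioned above, keeps the outbound dependency off the lookup path.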

Verdict

Use if: You're building network analysis tools, security monitoring systems, or device discovery applications where accurate MAC-to-vendor mapping matters and you'll perform multiple lookups per session. The memory-speed tradeoff is reasonable for the small database size, and correct netmask handling is essential for identifying modern IoT devices with non-standard assignments. Also use this if you need a GPL-free solution for commercial products, where the dual licensing is a significant advantage.

Skip if: You only need occasional lookups (cloud APIs like macvendors.com are simpler), you're in extremely memory-constrained environments (a streaming parser would be better), or you only care about the most common vendors with standard /24 prefixes (matching on the first three bytes suffices). Also skip if you need real-time database freshness without managing updates yourself, since cloud-based solutions handle that automatically.
