
PIGEON: How CLIP Embeddings Predict GPS Coordinates from Photos (And Why the Code Won't Help You Build It)


Hook

A CVPR 2024 paper can pinpoint where you took a photo with disturbing accuracy, yet its authors deliberately crippled their own open-source release. The reason reveals everything wrong—and right—about publishing cutting-edge computer vision research.

Context

Image geolocation has haunted privacy researchers for years. Give a model a photograph, and it predicts GPS coordinates—no EXIF metadata required. Early approaches relied on landmark databases or image retrieval against geotagged collections, but these methods struggled with non-iconic locations. You could find the Eiffel Tower, but not a random street in suburban Tokyo.

The breakthrough came from reframing the problem. Instead of searching databases or regressing coordinates directly, PIGEON treats Earth as a massive multi-class classification problem. The planet gets divided into hierarchical geocells—think geographic quadtrees—and the model learns to predict, at each zoom level, which cell the photo was taken in. This approach, combined with CLIP’s powerful visual representations trained on internet-scale data, achieved state-of-the-art results across standard benchmarks. But here’s the twist: the authors released the paper, the methodology description, and inference code—but deliberately withheld the trained weights, the training data, and even the geocell definitions. They published the research, then immediately neutered its reproducibility. The repository exists as academic documentation, not a usable artifact.

Technical Insight

System architecture (auto-generated diagram, summarized): an input image is encoded by a frozen CLIP ViT-L/14 vision encoder into 768-dim image embeddings. Four classification heads consume the embedding in parallel (Level 0: continent scale; Level 1: country scale; Level 2: region scale; Level 3: locality scale), and hierarchical geocell fusion combines their outputs into final GPS coordinates (lat/lon). The geocell grid itself is generated dynamically from training-data density via adaptive partitioning.

The elegance of PIGEON’s architecture lies in its simplicity. At its core, it’s a frozen CLIP ViT-L/14 vision encoder feeding into a cascade of classification heads. Each head predicts geocell membership at a different hierarchical level—coarse continent-level predictions narrow to country, then region, then precise locality. The model never learns to output latitude/longitude directly. Instead, it learns: ‘This image probably belongs to geocell 42 at zoom level 3, which encompasses central Paris.’

The hierarchical geocell partitioning strategy is critical. Unlike uniform grid approaches that waste capacity on oceans or deserts, PIGEON uses adaptive partitioning based on training data density. Regions with many geotagged photos (cities, tourist destinations) get fine-grained cells; sparse regions get coarser divisions. This prevents class imbalance from destroying performance and focuses model capacity where it matters.
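The paper’s actual geocells are more sophisticated than a plain grid, and the repository withholds their definitions, but the density principle is easy to illustrate with a toy quadtree. Everything below (`GeoCell`, `build_geocells`, the thresholds) is a hypothetical sketch, not the repository’s API:

```python
from dataclasses import dataclass, field

@dataclass
class GeoCell:
    bounds: tuple  # (min_lat, max_lat, min_lon, max_lon)
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

def build_geocells(points, bounds=(-90, 90, -180, 180),
                   max_points=1000, max_depth=8, depth=0):
    """Recursively split a cell into quadrants wherever training
    density exceeds max_points, so dense regions get fine cells
    and sparse regions stay coarse."""
    cell = GeoCell(bounds, points)
    if len(points) <= max_points or depth == max_depth:
        return cell
    lat0, lat1, lon0, lon1 = bounds
    mid_lat, mid_lon = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    quadrants = [
        (lat0, mid_lat, lon0, mid_lon), (lat0, mid_lat, mid_lon, lon1),
        (mid_lat, lat1, lon0, mid_lon), (mid_lat, lat1, mid_lon, lon1),
    ]
    for q in quadrants:
        # Half-open intervals so each point lands in exactly one child
        sub = [p for p in points
               if q[0] <= p[0] < q[1] and q[2] <= p[1] < q[3]]
        cell.children.append(
            build_geocells(sub, q, max_points, max_depth, depth + 1))
    cell.points = []  # points now live in the children
    return cell

def leaf_cells(cell):
    """Flatten the tree into its leaf cells (the actual classes)."""
    if not cell.children:
        return [cell]
    return [leaf for c in cell.children for leaf in leaf_cells(c)]
```

Running this on a dataset with a dense cluster (say, Paris) plus scattered global points yields many small cells around the cluster and a handful of huge cells covering empty ocean, which is exactly the class-balance effect the paragraph above describes.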

Here’s how the inference pipeline combines predictions across scales:

```python
import torch
import clip
from PIL import Image
from pigeon.model import GeoClassifier
from pigeon.geocells import HierarchicalGeocells

# Load CLIP encoder (frozen, never fine-tuned)
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-L/14", device=device)
clip_model.eval()

# Classification heads for each zoom level
# (Note: actual weights/geocell definitions withheld by authors)
geo_heads = [
    GeoClassifier(input_dim=768, num_classes=100),    # Level 0: Continental
    GeoClassifier(input_dim=768, num_classes=1000),   # Level 1: Country
    GeoClassifier(input_dim=768, num_classes=10000),  # Level 2: Regional
    GeoClassifier(input_dim=768, num_classes=50000),  # Level 3: Local
]

def predict_location(image_path, top_k=5):
    # Extract CLIP features and L2-normalize them
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

    with torch.no_grad():
        # Cast to fp32: CLIP returns fp16 features on CUDA
        features = clip_model.encode_image(image).float()
        features = features / features.norm(dim=-1, keepdim=True)

    # Hierarchical prediction: top-k geocells at every zoom level
    predictions = []
    for level, head in enumerate(geo_heads):
        logits = head(features)
        probs = torch.softmax(logits, dim=-1)
        top_cells = torch.topk(probs, k=top_k)
        predictions.append(top_cells)

    # Combine predictions: fine-grained cells must be children
    # of coarse-grained predictions (both helpers depend on the
    # withheld geocell definitions)
    valid_cells = filter_hierarchical_consistency(predictions)

    # Convert final geocells to lat/lon coordinates
    coordinates = geocells_to_coordinates(valid_cells)
    return coordinates
```
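Because the geocell definitions are withheld, helpers like `geocells_to_coordinates` have nothing to load. A hypothetical stand-in (not the authors’ implementation, which refines the guess within the winning cell) is to return the centroid of each predicted cell’s training coordinates:

```python
def cell_centroid(points):
    """Mean lat/lon of a cell's training coordinates, used as a
    crude proxy for the cell's representative location."""
    lats, lons = zip(*points)
    return (sum(lats) / len(lats), sum(lons) / len(lons))

def geocells_to_coordinates(cell_ids, cell_points):
    """cell_points: {cell_id: [(lat, lon), ...]} built from the
    training set. Returns one (lat, lon) guess per predicted cell."""
    return [cell_centroid(cell_points[c]) for c in cell_ids]
```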

The hierarchical consistency filtering is where PIGEON shows sophistication. If the coarse classifier predicts ‘Europe’ with high confidence, but the fine classifier predicts a cell in ‘Asia,’ the system recognizes the inconsistency and down-weights that prediction. This prevents the common failure mode where models confidently hallucinate precise coordinates in completely wrong continents.
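The repository doesn’t ship the cell hierarchy either, so here is one hedged sketch of what `filter_hierarchical_consistency` could look like, assuming a simple `parent_of` map from fine cells to coarse parents. Multiplying by a fixed `penalty` factor is an illustrative choice, not the authors’ exact rule:

```python
def filter_hierarchical_consistency(level_preds, parent_of, penalty=0.1):
    """level_preds: list over zoom levels (coarse to fine) of
    {cell_id: probability} dicts, each holding the top-k cells.
    Fine cells whose parent was not predicted at the coarser level
    get down-weighted by `penalty` rather than dropped outright."""
    filtered = [dict(level_preds[0])]
    for level in range(1, len(level_preds)):
        coarser = filtered[level - 1]
        kept = {}
        for cell, prob in level_preds[level].items():
            parent = parent_of.get(cell)
            # Penalize predictions inconsistent with the parent level
            kept[cell] = prob if parent in coarser else prob * penalty
        filtered.append(kept)
    return filtered
```

With a top-1 coarse prediction of ‘Europe’, a fine-grained cell in Tokyo keeps only a tenth of its probability mass, which is the hallucination-suppression behavior described above.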

Training uses a weighted cross-entropy loss across all hierarchical levels simultaneously. The loss function balances contributions from each zoom level—coarse predictions get lower weight since they’re easier, while fine-grained cells receive higher weight. This multi-task learning setup means a single forward pass trains all classification heads together, sharing the burden of learning geographically discriminative features.
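A minimal sketch of such a loss, assuming per-level logits and integer geocell targets; the actual level weights are not published, so the defaults below are placeholders:

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(logits_per_level, targets_per_level,
                      level_weights=(0.25, 0.5, 1.0, 2.0)):
    """Weighted sum of per-level cross-entropies computed in a
    single forward pass. Finer levels get larger weights because
    they are harder and more informative."""
    total = 0.0
    for logits, target, w in zip(logits_per_level, targets_per_level,
                                 level_weights):
        total = total + w * F.cross_entropy(logits, target)
    return total
```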

The frozen CLIP encoder is the secret sauce. PIGEON never backpropagates through the vision transformer. Instead, it treats CLIP as a fixed feature extractor and only trains the lightweight classification heads on top. This design choice brings three advantages: (1) drastically reduced training compute—you’re only updating a few million parameters instead of hundreds of millions, (2) leverages CLIP’s existing knowledge of geographic landmarks, architectural styles, vegetation patterns, and cultural artifacts learned from internet-scale data, and (3) enables rapid experimentation with different geocell partitioning strategies without retraining the expensive encoder.
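That design is simple to set up in PyTorch. The helper below is a generic sketch (not from the repository) of freezing an encoder and handing only the head parameters to the optimizer:

```python
def freeze_encoder_and_collect_params(encoder, heads):
    """Freeze every encoder parameter and return only the head
    parameters, so the optimizer never updates the encoder."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    encoder.eval()  # also fixes dropout/normalization behavior
    return [p for h in heads for p in h.parameters()]

# With CLIP loaded as in the snippet above, training would then use:
# optimizer = torch.optim.AdamW(
#     freeze_encoder_and_collect_params(clip_model, geo_heads), lr=1e-4)
```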

What makes this particularly clever is that CLIP was never explicitly trained for geolocation. Its vision-language pretraining just happened to encode geographic information as a side effect of learning to match images with text descriptions mentioning places, landmarks, and locations. PIGEON exploits this emergent capability.

Gotcha

The repository’s biggest limitation isn’t technical—it’s deliberate. The authors provide no trained weights, no training data splits, and no geocell definitions. The code structure exists, but you cannot reproduce the paper’s results or deploy the system without recreating everything from scratch: assembling millions of geotagged images (likely from YFCC100M or similar datasets), defining your own geocell partitioning strategy, and training for thousands of GPU-hours on high-end hardware.

This wasn’t an oversight. After their paper demonstrated state-of-the-art geolocation accuracy, NPR and privacy advocates highlighted the dystopian implications: authoritarian regimes could geolocate dissidents from photos, stalkers could find victims from social media images, journalists’ sources could be exposed. The authors made the ethical choice to publish the methodology for academic scrutiny while preventing easy weaponization. But it means this repository serves as documentation, not a usable tool.

Even with full access, PIGEON has inherent limitations. Performance degrades catastrophically on indoor scenes lacking windows or distinctive architectural features—the model has no geographic cues to latch onto. Generic office interiors could be anywhere. Similarly, extreme close-ups of objects, heavily filtered images, or nighttime photos without landmarks confuse the system. The model also inherits CLIP’s Western/tourist-destination bias; obscure rural locations in underrepresented regions likely see worse performance due to training data sparsity. Images near geocell boundaries create ambiguity—a photo could legitimately belong to adjacent cells, but the classification framework forces a discrete choice.

Verdict

Use if: You’re an academic researcher studying geolocation methods and need to understand current state-of-the-art architectural approaches. The hierarchical geocell classification framework and frozen CLIP encoder strategy offer valuable insights for related vision tasks (building recognition, landmark retrieval, cultural heritage classification). Use this as a reference architecture if you’re designing similar geographic prediction systems and have access to appropriate training data and computational resources. The codebase demonstrates a clean separation between feature extraction, classification, and hierarchical reasoning that is worth studying.

Skip if: You need a production geolocation system or want to experiment with image-based coordinate prediction without massive infrastructure. The deliberately withheld components mean you’re looking at an empty shell. For practical geolocation, use commercial APIs (Google Vision, Amazon Rekognition) or wait for more open alternatives like GeoCLIP that may release full weights. Also skip if ethical concerns about enabling surveillance bother you: even recreating this for ‘harmless’ uses like geotagging personal photos contributes to dual-use technology that can be weaponized. The authors’ choice to limit release speaks volumes about the responsibility that comes with publishing powerful computer vision research.
