
PIGEON: Teaching CLIP to Geolocate Photos Through Hierarchical Geocell Classification


Hook

PIGEON is a computer vision model so accurate at determining where photos were taken that its creators intentionally withheld the training data and model weights over privacy concerns, and NPR ran a story about it.

Context

Photo geolocation has historically been a challenging computer vision problem. Traditional approaches like IM2GPS relied on matching visual features against geotagged reference datasets, essentially performing sophisticated image similarity searches. These methods struggled with generalization: they worked reasonably well for distinctive landmarks but failed on generic landscapes, street scenes, or interior shots that could exist anywhere.

The fundamental challenge is that geolocation requires correlating subtle visual cues (architectural styles, vegetation patterns, infrastructure design, weather conditions) with specific geographic regions. Early CNN-based methods like GeoEstimation improved on feature matching by learning representations end to end from geotagged images, but they still required massive amounts of training data and struggled with the long-tail distribution of global geography. PIGEON, developed by Lukas Haas, Michal Skreta, Silas Alberti, and Chelsea Finn and presented at CVPR 2024, neither regresses latitude and longitude directly nor relies on image retrieval alone: it frames geolocation as hierarchical classification over discretized geographic cells, and it leverages CLIP's multimodal pretraining to supply geographic priors that vision-only models lack.

Technical Insight

PIGEON's core innovation is treating the Earth as a hierarchical grid of geocells and fine-tuning CLIP to classify images into these regions. The hierarchical structure is critical: rather than forcing the model to distinguish between all possible locations globally at once, it learns to make progressively finer geographic distinctions. At the coarsest level, it might distinguish continents; at finer levels, countries, states, and eventually local regions.
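To make the coarse-to-fine idea concrete, here is a minimal sketch that assigns a coordinate to nested cells. It assumes a regular latitude/longitude quadtree, which is only a stand-in: the repository withholds PIGEON's actual geocell definitions, and the function name `geocell_path` is hypothetical.

```python
# Minimal sketch: a regular lat/lon quadtree standing in for a geocell
# hierarchy. Illustrative assumption only, not PIGEON's geocell scheme.


def geocell_path(lat: float, lon: float, depth: int = 5) -> list[str]:
    """Return cell IDs from coarsest to finest for one coordinate.

    Each level splits the current cell into four quadrants (digits 0-3),
    so every ID is a prefix of the ID at the next, finer level.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinate out of range")

    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    cell_id = ""
    path = []
    for _ in range(depth):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:  # northern half of the current cell
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:  # eastern half of the current cell
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        cell_id += str(quadrant)
        path.append(cell_id)
    return path


paris = geocell_path(48.8566, 2.3522)
berlin = geocell_path(52.5200, 13.4050)
print(paris)   # cells containing Paris, coarsest first
print(berlin)  # shares the coarse cells with Paris, differs at the finest
```

In this toy grid, Paris and Berlin fall into the same cell for the first four levels and separate only at the fifth, which is the kind of progressive distinction the hierarchy is meant to capture.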

The architecture builds on OpenAI's CLIP, which was pretrained on 400 million image-text pairs to learn aligned vision-language representations. PIGEON's insight is that CLIP's understanding of concepts like "European architecture," "tropical vegetation," or "desert landscape" implicitly encodes geographic information. By fine-tuning CLIP's vision encoder on geocell classification, PIGEON transforms these implicit associations into explicit location predictions.

While the repository doesn't include training code or model weights, the inference structure reveals the approach. The system generates image embeddings through CLIP's vision encoder, then matches these embeddings against learned geocell representations:

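# Conceptual inference flow (adapted from repository structure).
# The repository ships no weights or geocell index, so the file paths and the
# centroid lookup table below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

class PigeonGeolocator:
    def __init__(self, model_path, geocell_index, geocell_centroids):
        # Load the fine-tuned CLIP model and switch it to inference mode
        self.model = CLIPModel.from_pretrained(model_path).eval()
        self.processor = CLIPProcessor.from_pretrained(model_path)

        # Geocell embeddings learned during training, shape (num_geocells, dim);
        # L2-normalized so the dot product below is a cosine similarity
        embeddings = torch.load(geocell_index)
        self.geocell_embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

        # Hypothetical lookup table: one (lat, lon) centroid per geocell,
        # shape (num_geocells, 2)
        self.geocell_centroids = torch.load(geocell_centroids)

    def geocells_to_coordinates(self, geocell_indices):
        # Map geocell indices to their centroid coordinates
        return self.geocell_centroids[geocell_indices]

    def predict_location(self, image_path, k=5):
        # Process image through CLIP vision encoder
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt")

        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

        # Match against geocell embeddings
        similarities = image_features @ self.geocell_embeddings.T
        top_geocells = similarities.topk(k=k, dim=-1)

        # Convert the top-k geocells for the single input image to coordinates
        return self.geocells_to_coordinates(top_geocells.indices[0])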

The training process involves creating geocell labels from geotagged images, then fine-tuning CLIP with a classification objective. The hierarchical aspect means that during training, images are labeled with multiple geocells at different granularities—a photo from Paris might be labeled as {Europe, France, Île-de-France, Paris}, allowing the model to learn features at multiple geographic scales simultaneously.
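The multi-granularity labeling described above could be trained with one classification loss per level. The sketch below sums softmax cross-entropies across levels for a single image; it is an assumption made for illustration and does not reproduce the loss function the paper describes. The level names, logits, and the helper `hierarchical_loss` are hypothetical.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def hierarchical_loss(logits_per_level, targets_per_level, weights=None):
    """Weighted sum of per-level cross-entropy losses for one image."""
    if weights is None:
        weights = {level: 1.0 for level in logits_per_level}
    return sum(
        weights[level]
        * cross_entropy(logits_per_level[level], targets_per_level[level])
        for level in logits_per_level
    )


# Toy example: one photo labeled at two granularities.
logits = {
    "continent": [2.0, 0.1, -1.0],        # scores for 3 hypothetical continents
    "country": [0.3, 1.5, 0.2, -0.4],     # scores for 4 hypothetical countries
}
targets = {"continent": 0, "country": 1}  # index of the correct class per level

print(hierarchical_loss(logits, targets))
```

Weighting the levels differently (for example, emphasizing the finest level) is one of the design choices a real training setup would have to make.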

What makes PIGEON particularly effective is CLIP's language supervision, even though final predictions require no text input. During pretraining, CLIP learned that certain visual features correlate with captions containing geographic information: a photo with red phone boxes and Georgian architecture was associated with texts mentioning "London" or "Britain." PIGEON inherits these correlations through the fine-tuned vision encoder, so no explicit text is needed at inference time.

The geocell classification approach also provides natural uncertainty quantification. Instead of outputting a single coordinate, PIGEON can return a probability distribution over geocells, indicating confidence levels. A distinctive landmark might produce a sharp distribution over a small geographic area, while a generic forest scene might produce a broad distribution spanning multiple regions with similar vegetation.
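One simple way to read uncertainty off a distribution over geocells is its entropy. The sketch below turns per-geocell scores into probabilities with a softmax and compares a sharp and a broad distribution; the scores are made-up numbers, and nothing here is taken from the repository.

```python
import math


def softmax(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Convert raw per-geocell scores into probabilities."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; higher means a more spread-out prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores over four geocells.
landmark = softmax([9.0, 1.0, 0.5, 0.2])  # one geocell clearly dominates
forest = softmax([2.0, 1.9, 2.1, 2.0])    # several geocells look alike

print(f"landmark entropy: {entropy(landmark):.3f}")
print(f"forest entropy:   {entropy(forest):.3f}")
```

The landmark-style scores yield a much lower entropy than the forest-style scores, so a downstream consumer could use a threshold on entropy to decide when to report a region rather than a point.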

According to the paper, PIGEON, which is trained on street-level imagery from the game GeoGuessr, places over 40% of its guesses within 25 kilometers of the true location worldwide. A sibling model, PIGEOTTO, trained on images from Flickr and Wikipedia, reports state-of-the-art results across standard image geolocalization benchmarks, beating the previous best by up to 7.7 percentage points at the city level and up to 38.8 percentage points at the country level.
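Geolocation results like these are usually reported as the share of predictions that land within a distance threshold of the true location. The sketch below computes that metric with the haversine great-circle distance; the threshold names and values (1, 25, 200, 750, and 2,500 km) follow the convention common in this benchmark literature, and the sample coordinates are illustrative rather than taken from any dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Distance thresholds commonly used in image geolocalization benchmarks.
THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200, "country": 750, "continent": 2500}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(predictions, targets, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each distance threshold of its target."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal length")
    errors = [haversine_km(*p, *t) for p, t in zip(predictions, targets)]
    return {
        name: sum(err <= km for err in errors) / len(errors)
        for name, km in thresholds.items()
    }


# Two guesses for photos taken in Paris: one exact, one placed in London.
targets = [(48.8566, 2.3522), (48.8566, 2.3522)]
predictions = [(48.8566, 2.3522), (51.5074, -0.1278)]
print(accuracy_at_thresholds(predictions, targets))
```

The London guess misses by several hundred kilometers, so it counts as correct only at the country and continent thresholds.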

Gotcha

The repository's most significant limitation is intentional: it's effectively a reference implementation without the components needed for actual use. The geocell definitions, training datasets, validation data, and model weights are all withheld. The authors explicitly state this is due to privacy concerns—the technology works too well. As NPR's coverage highlighted, a system that can accurately determine where a photo was taken creates serious de-anonymization risks. Someone posting what they think is an anonymous photo of their backyard could inadvertently reveal their home address.

This ethical stance means the repository is valuable primarily for understanding the approach rather than deploying it. You can study the code structure, understand the inference pipeline, and learn about hierarchical geocell classification, but you cannot run the actual trained model or reproduce the results without significant independent work. Even if you wanted to recreate PIGEON, you'd need to source your own geotagged training data (millions of images with reliable coordinates), define your own geocell hierarchy, and invest in substantial GPU compute for fine-tuning CLIP at scale.

Beyond availability issues, PIGEON's performance is geographically uneven. The model learns from training data distribution, so regions with more geotagged photos (North America, Western Europe, urban areas) get better predictions than underrepresented regions. The system also struggles with images that could plausibly exist anywhere—generic indoor photos, close-ups of objects, or minimalist landscapes without distinctive features. While the hierarchical approach handles uncertainty better than regression models, it can still make confidently wrong predictions when visual features mislead it (architecture transplanted across regions, internationally standardized infrastructure, etc.).

Verdict

Use if: You're researching geolocation methods and want to understand how vision-language models can be adapted for spatial reasoning, or you're exploring hierarchical classification for other geographic or spatial tasks. The conceptual framework (discretizing a continuous space into hierarchical cells and building on CLIP's pretrained representations) transfers to adjacent problems like altitude estimation, climate zone classification, or urban/rural categorization. It is also a useful case study if you're studying responsible AI development and how to balance open research with privacy considerations.

Skip if: You need a production-ready geolocation system. The intentionally incomplete release makes deployment impractical, and the privacy implications make it ethically questionable for most use cases involving user-generated content. Also skip it if you want to reproduce the academic results but lack large-scale geotagged datasets and significant compute. For practical geolocation needs, reverse image search APIs or commercial street-view matching services are more responsible alternatives, even if they're less technically sophisticated.
