GEIA: Why Your Sentence Embeddings Are Leaking Secrets
Hook
That 384-dimensional vector you're using to cache user queries? A trained GPT-2 model can reconstruct the original sentence with alarming accuracy. Sentence embeddings aren't as safe as you think.
Context
Sentence embeddings have become the backbone of modern NLP infrastructure. Whether you're building semantic search, recommendation systems, or retrieval-augmented generation pipelines, you're probably converting text into dense vectors using models like sentence-BERT or SimCSE. The implicit assumption: these high-dimensional vectors are abstract representations that capture semantic meaning while obscuring the original content.
That assumption is dangerously wrong. Researchers at HKUST's KnowComp group developed GEIA (Generative Embedding Inversion Attack) to demonstrate that sentence embeddings leak far more information than previously understood. Published in Findings of ACL 2023, their work shows that an attacker with access to embeddings can train a generative model to reconstruct the original sentences with high fidelity. This matters because many systems treat embeddings as semi-public data: they store them in vector databases, transmit them across service boundaries, or cache them for performance. If you're building systems that handle sensitive data using sentence embeddings, you need to understand this vulnerability.
Technical Insight
GEIA's architecture is elegantly simple but devastatingly effective. The attack works in two stages: training an attacker model on embedding-sentence pairs, then using that model to decode new embeddings back into text. The repository supports multiple victim models (sentence-BERT variants, SimCSE) and attacker architectures (GPT-2, OPT, T5, DialoGPT), allowing comprehensive evaluation of the vulnerability.
The core innovation is treating embedding inversion as a conditional text generation problem. Here's how the training pipeline works for a GPT-2 attacker:
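    # Simplified from the GEIA codebase
    # Train attacker to generate sentences conditioned on embeddings
    import torch
    import torch.nn as nn

    class EmbeddingProjector(nn.Module):
        def __init__(self, embedding_dim, hidden_dim, num_layers=2):
            super().__init__()
            # num_layers is kept from the original signature; this sketch always uses two linear layers
            # Project victim embedding to GPT-2's embedding space
            self.projection = nn.Sequential(
                nn.Linear(embedding_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )

        def forward(self, victim_embedding):
            # Map e.g. a 384-dim sentence-BERT vector to GPT-2's hidden size
            return self.projection(victim_embedding)

    # During training (victim_model, projector, gpt2_model and target_ids are defined elsewhere;
    # encode() is assumed to follow the sentence-transformers API):
    victim_embedding = victim_model.encode([original_sentence], convert_to_tensor=True)  # (1, embedding_dim)
    projected_embedding = projector(victim_embedding).unsqueeze(1)  # (1, 1, hidden_dim)
    text_embeddings = gpt2_model.get_input_embeddings()(target_ids)  # (1, seq_len, hidden_dim)
    # Prepend projected embedding as "context" for GPT-2
    inputs_embeds = torch.cat([projected_embedding, text_embeddings], dim=1)
    # Labels must match the input length; -100 marks the prefix position as ignored by the loss
    prefix_labels = torch.full_like(target_ids[:, :1], -100)
    labels = torch.cat([prefix_labels, target_ids], dim=1)
    outputs = gpt2_model(inputs_embeds=inputs_embeds, labels=labels)
    loss = outputs.loss  # Standard cross-entropy for next-token prediction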
The projection network learns to map the victim model's embedding space into the attacker model's input space. Once trained, the attacker can generate text from embeddings using beam search or sampling. Building the embedding-sentence training pairs requires query access to the victim model, but at inference time the attacker needs only the embedding itself.
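One detail worth making explicit when a projected embedding is prepended as an extra input position: the label sequence needs a matching placeholder so that inputs and labels stay the same length. The following is a minimal sketch using plain Python lists instead of tensors. `build_prefix_labels` and `count_supervised_positions` are hypothetical helper names, and the `-100` ignore value follows the PyTorch/Hugging Face cross-entropy convention rather than anything confirmed from the GEIA code.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_prefix_labels(target_ids, prefix_len=1, ignore_index=IGNORE_INDEX):
    """Pad labels so they line up with [prefix positions] + [token positions]."""
    return [ignore_index] * prefix_len + list(target_ids)


def count_supervised_positions(labels, ignore_index=IGNORE_INDEX):
    """Count the label positions that carry a real target token."""
    return sum(1 for label in labels if label != ignore_index)


target_ids = [15496, 995, 50256]  # arbitrary example token ids
labels = build_prefix_labels(target_ids)
print(labels)
print(count_supervised_positions(labels))
```

With one prefix position, the labels are one element longer than the target ids, and every real token is still supervised; only the slot that corresponds to the projected embedding is masked out.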
What makes this particularly concerning is the attack's generalization capability. The researchers trained attackers on datasets like PersonaChat and evaluated them on completely different domains (QNLI question-sentence pairs, MNLI natural language inference, SST-2 sentiment data). The results show that attackers trained on conversational data can still recover substantial information from embeddings of classification or question-answering data.
The codebase separates attacker implementations by architecture. GPT-2 and OPT are decoder-only models that generate autoregressively from left to right, while T5 uses an encoder-decoder architecture. Each requires different handling:
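    # From the evaluation pipeline (simplified)
    import torch
    from transformers.modeling_outputs import BaseModelOutput

    # GPT-2/OPT style (autoregressive)
    # Assumes a transformers version whose generate() accepts inputs_embeds for decoder-only models
    with torch.no_grad():
        # Shape (1, 1, hidden_dim): one prefix position for a single embedding
        projected = projector(victim_embedding.unsqueeze(0)).unsqueeze(1)
        outputs = model.generate(
            inputs_embeds=projected,
            max_new_tokens=50,
            num_beams=5,
            early_stopping=True,
        )
    recovered_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # T5 style (encoder-decoder)
    with torch.no_grad():
        # generate() expects encoder_outputs as a model output object, not a raw tensor
        hidden = encoder_projection(victim_embedding.unsqueeze(0)).unsqueeze(1)
        outputs = t5_model.generate(
            encoder_outputs=BaseModelOutput(last_hidden_state=hidden),
            max_new_tokens=50,
            num_beams=5,
        )
    recovered_text = tokenizer.decode(outputs[0], skip_special_tokens=True)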
The repository includes evaluation metrics beyond simple BLEU scores. It measures exact match recovery, semantic similarity between original and recovered sentences, and task-specific metrics. For example, on PersonaChat dialogues with sentence-BERT embeddings, GPT-2 attackers achieve BLEU-4 scores above 60 and ROUGE-L scores above 70—meaning the recovered sentences are often nearly identical to the originals.
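To make those metrics concrete, here is a minimal, dependency-free sketch of exact match and ROUGE-L F1 over whitespace tokens. It is a simplified illustration and not the repository's evaluation code: the function names and the example sentences are made up for this post, and standard ROUGE implementations add tokenization and stemming options that are omitted here.

```python
def exact_match(reference, candidate):
    """True when both sentences match after case and whitespace normalization."""
    return reference.lower().split() == candidate.lower().split()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercase whitespace tokens (no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical original/recovered pair that differs in a single word
original = "i love hiking with my dog on weekends"
recovered = "i love hiking with my dog on sundays"
print(exact_match(original, recovered))
print(round(rouge_l_f1(original, recovered), 3))
```

For this pair the exact match is `False` while ROUGE-L F1 is 0.875: seven of eight tokens survive in order. This is why overlap metrics are reported alongside exact match: a recovered sentence can miss one word and still expose nearly everything the user wrote.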
What's particularly insidious is that the attack works even when the GPT-2 attacker is randomly initialized and trained from scratch. A pre-trained language model is not a prerequisite; the vulnerability is inherent to how sentence embeddings compress information. The embedding dimensionality (typically 384 to 768) is large enough to preserve recoverable information about the original sequence, especially for the shorter sentences common in applications like search queries or chat messages.
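A back-of-the-envelope calculation shows why the dimensionality argument is plausible. The sketch below compares the raw storage size of an embedding with a generous upper bound on the bits needed to identify a short token sequence. It assumes float32 storage and GPT-2's 50,257-token vocabulary; the function names are illustrative. Raw capacity is only a loose ceiling on what an encoder could retain, not a measure of what it actually encodes.

```python
import math


def embedding_capacity_bits(dim, bits_per_value=32):
    """Raw storage size of a dense vector in bits (float32 assumed)."""
    return dim * bits_per_value


def sentence_upper_bound_bits(num_tokens, vocab_size=50257):
    """Bits needed to index any token sequence of this length (uniform upper bound)."""
    return num_tokens * math.log2(vocab_size)


for dim in (384, 768):
    capacity = embedding_capacity_bits(dim)
    for num_tokens in (10, 20, 50):
        needed = sentence_upper_bound_bits(num_tokens)
        print(
            f"dim={dim} tokens={num_tokens} "
            f"capacity={capacity} bits, sentence<={needed:.0f} bits, "
            f"ratio={capacity / needed:.1f}x"
        )
```

Even a 50-token sentence needs fewer than 800 bits under this bound, while a 384-dimensional float32 vector occupies 12,288 bits. The gap leaves room for a well-trained decoder to recover much of the sequence, which is consistent with the attack being most effective on short inputs.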
Gotcha
GEIA is research code, not production software, and it shows. The repository has minimal documentation, with hardcoded paths scattered throughout that you'll need to manually configure. Want to run experiments? You'll be editing config files and script arguments across multiple Python files. The ABCD dataset requires manual download from Google Drive, and there's no automated setup script to handle dependencies or data preparation.
More importantly, the repository provides no pre-trained attacker models. If you want to demonstrate the vulnerability on your own embeddings, you'll need to train attackers from scratch, which requires substantial compute and training data. The researchers used datasets like PersonaChat with 17,878 conversations and WMT16 with millions of translation pairs. Training a GPT-2 attacker on this scale isn't trivial—expect to need GPU resources and several hours to days of training depending on your setup. The code also lacks comprehensive error handling and assumes you're familiar with the research paper's methodology. If you're not comfortable reading ACL papers and translating experimental setups into code configurations, you'll struggle to get meaningful results.
Verdict
Use if: You're a security researcher auditing NLP systems for privacy vulnerabilities, a privacy engineer evaluating the safety of embedding-based architectures, or an academic studying information leakage in neural representations. This codebase provides the definitive demonstration that sentence embeddings are not privacy-preserving and gives you the tools to prove it empirically. It's invaluable for threat modeling, responsible disclosure of vulnerabilities, or convincing stakeholders that embedding-based systems need additional privacy protections.
Skip if: You're building production NLP applications and just want secure embeddings. This shows you the problem but doesn't solve it, so you'll need to look at differential privacy mechanisms, homomorphic encryption, or secure multi-party computation approaches instead. Also skip if you're looking for polished, well-documented tools or don't have the compute resources and ML expertise to train generative models. This is a research artifact that proves a point brilliantly but requires significant effort to operationalize.