Representation Engineering: Controlling AI by Rewriting Its Internal Thoughts
Hook
What if you could make a language model more truthful not by prompting it differently, but by literally reaching into its neural activations and adjusting the "truthfulness direction" like turning a dial?
Context
Traditional approaches to AI interpretability have focused on two extremes: either inspecting individual neurons (which number in the millions in large models and resist clear interpretation) or treating models as black boxes and analyzing only their inputs and outputs. Prompt engineering has become the default method for controlling model behavior, but it's unreliable: the same prompt can produce wildly different results across models or even across queries. Meanwhile, mechanistic interpretability researchers painstakingly trace individual circuits neuron by neuron, a process that doesn't scale easily to modern LLMs with hundreds of billions of parameters.
Representation Engineering (RepE) introduces a middle path inspired by cognitive neuroscience, where researchers don't track individual neurons but instead measure population-level activity patterns. The framework, developed by researchers from institutions including the Center for AI Safety, Carnegie Mellon University, and UC Berkeley, operates on a deceptively simple insight: high-level concepts like "truthfulness" or "political bias" aren't encoded in individual neurons but in directions through the high-dimensional activation space of hidden layers. By identifying these direction vectors and manipulating them during inference, you can steer model behavior directly, with no fine-tuning and no prompt hacking, just intervention in the model's internal representations.
Technical Insight
RepE consists of two complementary pipelines built on Hugging Face Transformers: RepReading for extracting and analyzing internal representations, and RepControl for steering model behavior. Both rest on the same principle: extracting meaningful direction vectors from hidden-layer activations across contrastive datasets.
The process begins with dataset construction. You create pairs of prompts that differ only along the dimension you care about. For truthfulness, you might pair "Pretend you're an accurate person" with "Pretend you're an inaccurate person," each followed by the same question template. The model processes both versions, and you extract the hidden-state activations at specific layers (typically middle-to-late layers, where high-level concepts tend to form). In the simplest variant, the difference between the mean activations of the two sets gives you a direction vector: essentially, the "truthfulness axis" in the model's representation space. The paper's default method instead applies PCA to the per-pair activation differences and keeps the first principal component, which serves the same purpose.
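The difference-of-means step described above can be sketched without any model or library. This is a minimal illustration only: the three-dimensional "activations" are made-up numbers, and `mean_vector` and `difference_of_means` are hypothetical helper names, not part of the RepE codebase.

```python
import math
from statistics import fmean


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [fmean(component) for component in zip(*vectors)]


def difference_of_means(positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - n for p, n in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("both sets have the same mean; no direction to extract")
    return [x / norm for x in diff]


# Toy hidden states (hypothetical values, not taken from a real model).
truthful_acts = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1], [1.2, 0.3, -0.1]]
deceptive_acts = [[-1.0, 0.2, 0.0], [-0.8, 0.1, 0.1], [-1.2, 0.3, -0.1]]

direction = difference_of_means(truthful_acts, deceptive_acts)
print(direction)
```

In this toy data the two sets differ only in the first coordinate, so the extracted direction lies along that axis. With real activations the vectors have thousands of dimensions and one direction is computed per layer.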
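# Sketch modeled on the honesty example in the RepE repository. Names and
# signatures are assumptions to verify against the version you install.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from repe import repe_pipeline_registry

# Register the "rep-reading" and "rep-control" pipelines with transformers
repe_pipeline_registry()

# Load model and tokenizer (device_map="auto" requires the accelerate package)
model_name = "meta-llama/Llama-2-13b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token_id = 0  # Llama ships without a pad token; batching needs one

# Create contrastive pairs: same question, opposite framing
template = "Consider the question '{question}'. If you are {persona}, you would say:"
questions = ["What is the capital of France?"]  # ... more examples
train_data, train_labels = [], []
for question in questions:
    train_data.append(template.format(question=question, persona="truthful"))
    train_data.append(template.format(question=question, persona="deceptive"))
    train_labels.append([True, False])  # marks the truthful member of each pair

# Target middle-to-late layers; negative indices count back from the last layer,
# so this covers decoder blocks 10 through 29 of the model's 40
layers = list(range(-11, -31, -1))

# Extract the truthfulness direction for each targeted layer
rep_reading = pipeline("rep-reading", model=model, tokenizer=tokenizer)
rep_reader = rep_reading.get_directions(
    train_data,
    rep_token=-1,
    hidden_layers=layers,
    n_difference=1,
    train_labels=train_labels,
    direction_method="pca",
    batch_size=16,
)

# Now use RepControl to steer model outputs
rep_control = pipeline(
    "rep-control",
    model=model,
    tokenizer=tokenizer,
    layers=layers,
    control_method="reading_vec",
)

# Generate text with enhanced truthfulness
coefficient = 2.0  # Positive coefficient increases truthfulness
activations = {
    layer: torch.tensor(
        coefficient * rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    ).to(model.device).half()
    for layer in layers
}
output = rep_control(
    ["The Earth is"], activations=activations, max_new_tokens=50, do_sample=False
)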
The coefficient is where the magic happens: it controls the strength of the intervention. Positive values push the model's representations toward the target direction (more truthful), while negative values push away from it (less truthful). With the reading-vector control method, the framework adds coefficient * direction_vector to the hidden activations at each targeted layer during the forward pass, effectively rewriting the model's internal state as it generates each token.
What makes this approach powerful is its layer selectivity. Early transformer layers tend to handle low-level features such as token identity and syntax, while deeper layers encode semantic meaning and high-level concepts. RepE lets you target specific layer ranges. As a rule of thumb, interventions at layers 15-25 of a 40-layer model (such as Llama-2-13B) often work best for semantic concepts, while earlier layers affect surface-level features like formality or verbosity. Treat these ranges as starting points to verify on your own model rather than as fixed rules.
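The additive intervention and the layer targeting described above can be illustrated with a toy example. This is a sketch under simplifying assumptions: `steer` and `projection` are hypothetical helpers, the hidden states are invented two-dimensional vectors, and each layer is treated independently, whereas in a real transformer a change at one layer also propagates to every later layer.

```python
def steer(hidden_states, directions, coefficient, target_layers):
    """Return a copy of hidden_states with coefficient * direction added at target layers."""
    steered = {}
    for layer, hidden in hidden_states.items():
        if layer in target_layers:
            direction = directions[layer]
            steered[layer] = [h + coefficient * d for h, d in zip(hidden, direction)]
        else:
            steered[layer] = list(hidden)  # untargeted layers are copied unchanged
    return steered


def projection(vector, direction):
    """Scalar projection of vector onto a unit-length direction."""
    return sum(v * d for v, d in zip(vector, direction))


# Hypothetical hidden states for a 4-layer toy model.
hidden_states = {0: [0.5, 1.0], 1: [0.0, 1.0], 2: [-0.5, 1.0], 3: [0.25, 1.0]}
# One unit-length concept direction per layer (made up for illustration).
directions = {layer: [1.0, 0.0] for layer in hidden_states}

steered = steer(hidden_states, directions, coefficient=2.0, target_layers={1, 2})

for layer in sorted(hidden_states):
    before = projection(hidden_states[layer], directions[layer])
    after = projection(steered[layer], directions[layer])
    print(f"layer {layer}: projection {before:+.2f} -> {after:+.2f}")
```

Because the directions are unit length, the projection at each targeted layer shifts by exactly the coefficient, while untargeted layers are left alone. A negative coefficient would shift the projection the other way.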
The framework also includes RepE_eval, a probing system that uses these extracted representations for zero-shot classification. Instead of reading the model's generated answer, you extract a representation for each answer choice and pick the choice whose representation scores highest along the concept direction. On benchmarks like BBH and MMLU, this representation-based probing sometimes outperforms few-shot prompting, revealing that models often "know" more internally than they express in their outputs, a phenomenon the authors call the representation-output gap.
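The selection step in representation-based probing can be sketched as follows. This is an assumption-laden toy, not the evaluation code from the repository: `pick_answer` is a hypothetical helper, and the per-choice representations are invented two-dimensional vectors standing in for hidden states.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pick_answer(choice_reps, direction):
    """Return (best_choice, scores), scoring each choice by its projection onto direction."""
    scores = {choice: dot(rep, direction) for choice, rep in choice_reps.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical representations of "question + candidate answer" at one layer.
choice_reps = {
    "A": [0.1, 0.9],
    "B": [0.7, 0.2],
    "C": [-0.3, 0.4],
}
direction = [1.0, 0.0]  # made-up concept direction for this layer

best, scores = pick_answer(choice_reps, direction)
print(best, scores)
```

No text is generated here. The decision comes entirely from where each candidate sits along the concept direction, which is how a probe can disagree with what the model would actually output.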
The linear assumption underlying RepE—that concepts correspond to linear directions in representation space—is surprisingly robust. While neural networks are highly nonlinear, research in both AI and neuroscience suggests that high-level features often occupy approximately linear subspaces. This doesn't mean all concepts are perfectly linear, but it means linear interventions provide a useful first-order approximation that works remarkably well in practice for controlling attributes like sentiment, formality, political bias, and even more abstract concepts like "power-seeking tendency" in AI safety contexts.
Gotcha
RepE's biggest limitation is its sensitivity to dataset construction and hyperparameter selection—and the documentation provides limited guidance for either. The effectiveness of extracted direction vectors depends heavily on your contrastive prompts. Too subtle a contrast and the direction captures noise; too extreme and it captures confounds (like formality differences instead of truthfulness). There's no automated way to validate that your extracted direction actually represents what you think it does, beyond generating outputs and manually inspecting them.
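One partial sanity check is to hold out some contrastive pairs and measure how often the direction ranks the positive example above the negative one. The sketch below uses invented vectors and a hypothetical `pairwise_accuracy` helper. A score near 0.5 suggests the direction is mostly noise, but a high score still does not rule out a confound, so it complements manual inspection rather than replacing it.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pairwise_accuracy(direction, held_out_pairs):
    """Fraction of (positive, negative) pairs where the positive projects higher."""
    if not held_out_pairs:
        raise ValueError("need at least one held-out pair")
    correct = sum(
        1 for pos, neg in held_out_pairs if dot(pos, direction) > dot(neg, direction)
    )
    return correct / len(held_out_pairs)


direction = [1.0, 0.0]  # made-up direction extracted from the training pairs

# Hypothetical held-out activations as (truthful, deceptive) pairs.
held_out_pairs = [
    ([0.9, 0.1], [-0.7, 0.2]),
    ([0.4, -0.3], [-0.5, 0.0]),
    ([-0.1, 0.5], [0.2, 0.4]),  # the direction ranks this pair the wrong way round
    ([1.1, 0.0], [-1.0, 0.1]),
]

accuracy = pairwise_accuracy(direction, held_out_pairs)
print(f"held-out pairwise accuracy: {accuracy:.2f}")
```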
Layer selection is equally finicky. The optimal layer range varies by model architecture, model size, and the concept you're targeting. The repository's examples mostly use Llama-2 models with hardcoded layer ranges, leaving you to experiment blindly if you're working with different architectures. The computational cost of this experimentation adds up. Extracting directions takes a forward pass over your entire contrastive dataset (a single pass can return hidden states for every layer), and each candidate layer range and coefficient then needs its own round of generations to evaluate. For large models, this means significant GPU time just to find the right configuration.
The intervention strength (coefficient) requires manual tuning for each use case. Too weak and you see no behavioral change; too strong and you get degenerate outputs or mode collapse where the model just repeats the same phrases. There's no principled way to set this value—you're essentially doing hyperparameter search by eyeballing generated text. Additionally, interventions can have unexpected side effects. Pushing strongly toward "truthfulness" might also make outputs more formal or hedged, because these attributes are correlated in the training data and thus in representation space. Disentangling these confounds requires careful dataset design or more sophisticated methods not provided in the framework.
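One way to look for the kind of entanglement described above is to compare the target direction with a direction extracted for a suspected confound, then remove the shared component. The sketch below is a generic linear-algebra illustration (a single Gram-Schmidt step), not a feature of the framework. The vectors are invented, and `project_out` is a hypothetical helper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def project_out(direction, confound):
    """Remove the component of direction that lies along confound."""
    scale = dot(direction, confound) / dot(confound, confound)
    return [d - scale * c for d, c in zip(direction, confound)]


# Made-up directions: the "truthfulness" vector partly overlaps with "formality".
truth_dir = [0.8, 0.6, 0.0]
formality_dir = [0.0, 1.0, 0.0]

overlap = cosine(truth_dir, formality_dir)
cleaned = project_out(truth_dir, formality_dir)

print(f"cosine before: {overlap:.2f}")
print(f"cosine after:  {cosine(cleaned, formality_dir):.2f}")
```

This only removes the linear overlap with one known confound direction. It presumes you can extract a reasonable direction for the confound in the first place, which brings back the dataset-design problem.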
Verdict
Use if: You're doing AI safety research and need interpretable methods to detect or control high-level model behaviors like deception, bias, or memorization; you're experimenting with steering model outputs beyond what prompt engineering can achieve; you want to probe what models internally represent versus what they output; or you're comfortable working with research-grade code and can invest time in dataset creation and hyperparameter tuning. RepE offers a kind of direct behavioral control that prompt-level tools do not.
Skip if: You need production-ready tooling with robust APIs and documentation; you're working with non-transformer architectures or very small models (where the high-dimensional representation space assumptions may break down); you require guaranteed, consistent behavioral changes (interventions are probabilistic and context-dependent); or you lack the GPU resources for extensive experimentation. For production use cases, stick with prompt engineering and fine-tuning until representation engineering matures beyond its current research stage.