
Control Your LLM's Personality in Under 60 Seconds with Representation Engineering


Hook

What if you could make a language model more honest, creative, or formal—not by retraining it, not by prompt engineering, but by doing vector arithmetic on its internal thoughts?

Context

Traditional approaches to controlling LLM behavior fall into two camps: prompt engineering and fine-tuning. Prompt engineering is fast but unreliable—the model might ignore your instructions, and you burn context window tokens on every request. Fine-tuning with RLHF or LoRA gives robust control but demands substantial compute, high-quality datasets, and hours of training time. You’re stuck choosing between fragile zero-shot methods and expensive retraining cycles.

Representation engineering offers a third path. The core insight is that high-level concepts like honesty, sentiment, and writing style appear to have linear representations in transformer activation space. If ‘honesty’ is a direction in the model’s hidden states, you can measure that direction by comparing activations when the model processes contrasting prompts. Once you’ve extracted this direction as a vector, you can add or subtract it during inference to steer behavior. The repeng library implements this technique as a practical Python tool, wrapping HuggingFace transformers to make activation steering accessible without diving into model internals. The library derives from the andyzoujm/representation-engineering repository.

Technical Insight

System architecture (auto-generated diagram). Training path: Contrastive Prompt Pairs → Forward Pass Collection → Layer Activations → Mean Difference Computation → Control Vector. Inference path: User Input → Inference Engine (the HuggingFace model inside the ControlModel wrapper) → Modified Activations → Steered Output.

The library’s architecture centers on the ControlModel wrapper, which intercepts forward passes to inject control vectors into specified layers. You provide a layer range (typically negative indices counting backward from the output), and repeng modifies activations at those positions during both training and inference.
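The interception idea can be sketched with toy layers. This is a minimal stand-in for the mechanism the wrapper implements, not repeng's internals; all class and method names here are illustrative:

```python
# Toy sketch of activation injection: a wrapper stores a scaled vector and
# adds it to the output of selected layers during the forward pass.
class ToyLayer:
    def __init__(self, weight):
        self.weight = weight
        self.injected = None  # control vector to add, if any

    def forward(self, hidden):
        out = [h * self.weight for h in hidden]
        if self.injected is not None:
            out = [o + v for o, v in zip(out, self.injected)]
        return out


class ToyControlModel:
    def __init__(self, layers, layer_ids):
        self.layers = layers
        self.layer_ids = layer_ids  # e.g. negative indices from the end

    def set_control(self, vector, strength=1.0):
        # scale once, then register on every targeted layer
        scaled = [strength * v for v in vector]
        for i in self.layer_ids:
            self.layers[i].injected = scaled

    def forward(self, hidden):
        for layer in self.layers:
            hidden = layer.forward(hidden)
        return hidden


layers = [ToyLayer(1.0) for _ in range(3)]
steered = ToyControlModel(layers, [-1, -2])  # target the last two layers
steered.set_control([1.0, 0.0], strength=2.0)
```

The key design point this illustrates: the base model is untouched, and steering can be enabled, rescaled, or removed per layer without recomputing anything.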

Training control vectors is remarkably fast—the README claims you can “train a vector in less than sixty seconds.” The method appears to use contrastive activation collection rather than gradient-based optimization. You create dataset pairs where each entry has a positive example (“Act as if you’re extremely honest”) and a negative example (“Act as if you’re extremely deceptive”). The library runs forward passes for both, extracts hidden state activations at your target layers, then computes the mean difference. That difference vector captures the directional shift in activation space.
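The mean-difference step described above can be sketched in a few lines of plain Python. Toy 4-dimensional lists stand in for real hidden states, and the function names are illustrative, not the library's API:

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def train_control_vector(positive_acts, negative_acts):
    """Direction = mean(positive activations) - mean(negative activations)."""
    pos, neg = mean(positive_acts), mean(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


# Two toy activations collected per persona at one layer
honest = [[1.0, 0.0, 2.0, 1.0], [3.0, 0.0, 2.0, 1.0]]
deceptive = [[0.0, 0.0, 1.0, 1.0], [2.0, 0.0, 1.0, 1.0]]
direction = train_control_vector(honest, deceptive)
```

Because this is just averaging and subtraction over already-collected activations, no gradients or backward passes are needed, which is why training finishes in seconds rather than hours.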

Here’s the core workflow, adapted from the README (make_dataset and truncated_output_suffixes are helpers defined in the repo’s example notebook, not exported by the library):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel

# Load the tokenizer and model, then wrap with a layer specification
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model = ControlModel(model, list(range(-5, -18, -1)))

# Create contrastive pairs (positive vs. negative personas);
# make_dataset and truncated_output_suffixes come from the example notebook
trippy_dataset = make_dataset(
    "Act as if you're extremely {persona}.",
    ["high on psychedelic drugs"],
    ["sober from psychedelic drugs"],
    truncated_output_suffixes,
)

# Train in under a minute
trippy_vector = ControlVector.train(model, tokenizer, trippy_dataset)

# Apply at inference time with strength scaling
model.set_control(trippy_vector, 2.2)

The layer range list(range(-5, -18, -1)) targets layers -5 through -17 (13 layers counting backward from the output). The choice of which layers to target appears significant for behavioral steering, though the README doesn’t explain the underlying theory.
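The indices resolve like any Python negative index. Assuming a 32-layer decoder stack (Mistral-7B's layer count), the range maps to absolute layers 15 through 27, i.e. a mid-to-late band of the network:

```python
# The README's layer specification, expanded
layer_range = list(range(-5, -18, -1))
assert layer_range == [-5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17]
assert len(layer_range) == 13

# Resolve negative indices against a 32-layer decoder stack (Mistral-7B)
num_layers = 32
absolute = sorted(i % num_layers for i in layer_range)
assert absolute == list(range(15, 28))  # decoder layers 15 through 27
```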

The strength parameter scales the control vector linearly. At strength=1, you add the raw trained vector to activations. At strength=2.2, you amplify its magnitude. Negative strengths flip the direction—using -2.2 with a ‘psychedelic’ vector produces the opposite effect. The README example demonstrates this spectrum, from conservative output at -2.2 to highly psychedelic (and eventually incoherent) output at +2.2.
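Because the injection is linear, sign and magnitude behave predictably (toy numbers below; this arithmetic is the concept, not the library's API):

```python
def scale(vector, strength):
    """Linear scaling of a trained control vector."""
    return [strength * v for v in vector]


vector = [1.0, -2.0]
assert scale(vector, 1.0) == vector          # raw trained vector
assert scale(vector, -1.0) == [-1.0, 2.0]    # direction flipped
assert scale(vector, 0.0) == [0.0, 0.0]      # steering disabled
```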

One powerful feature is GGUF export for quantized deployment. After training a control vector, you can export it with trippy_vector.export_gguf('trippy.gguf') and load it into llama.cpp with any quantization level. This means you can train vectors on GPU with fp16, then deploy them on CPU with 4-bit quants—the control mechanism involves adding a vector to activations, which works with quantized models.

The library exposes three classes (ControlVector, ControlModel, and DatasetEntry) that structure the training and inference workflow.

Gotcha

The library has sharp edges that aren’t immediately obvious. First, it explicitly does not support Mixture-of-Experts architectures like Mixtral. The README states: “Vector training currently does not work with MoE models (such as Mixtral).” The note mentions this is “theoretically fixable with some work” but offers no timeline.

Second, extreme control strengths produce degenerate outputs. The README’s own example at strength=2.2 shows catastrophic collapse: repeated ‘o’ tokens and gibberish like ‘��psy����������oodle����psy��oooooooooo…’. This suggests that large strength values push activations outside the model’s stable operating range, causing attention patterns or token probabilities to break down. You’ll need to manually tune strength values for your use case through experimentation.

Third, some example notebooks require accelerate, which the README says “must be manually installed with pip install accelerate.” The core library doesn’t need it, but its absence can trip up users running the examples.

Fourth, effectiveness likely depends heavily on dataset quality. If your contrastive pairs aren’t truly opposite along a single axis, the resulting vector may capture mixed or unclear directions. ‘Honest’ versus ‘deceptive’ works because they’re clear antonyms. Less well-defined contrasts might yield less useful vectors, though the README doesn’t provide guidance on dataset construction beyond the basic example.
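A minimal sketch of the kind of pairing the example notebook's make_dataset helper performs. The DatasetEntry dataclass here is a local stand-in mirroring the library's class rather than an import, and the helper logic is a plausible reconstruction, not the notebook's exact code:

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    positive: str
    negative: str


def make_dataset(template, positive_personas, negative_personas, suffixes):
    """Cross each contrastive persona pair with output suffixes, so every
    entry differs only along the intended axis."""
    dataset = []
    for suffix in suffixes:
        for pos, neg in zip(positive_personas, negative_personas):
            dataset.append(DatasetEntry(
                positive=template.format(persona=pos) + " " + suffix,
                negative=template.format(persona=neg) + " " + suffix,
            ))
    return dataset


entries = make_dataset(
    "Act as if you're extremely {persona}.",
    ["honest"], ["deceptive"],
    ["I think", "The truth is"],
)
```

The design constraint to notice: positive and negative prompts share everything except the persona word, so the mean activation difference isolates (as much as possible) the one axis you care about.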

Verdict

Use repeng if you need fast, interpretable behavior steering for transformer models in research or prototyping, especially when you can define clear contrastive axes (honest vs. deceptive, formal vs. casual, optimistic vs. pessimistic). It excels when you want to avoid fine-tuning costs but need more control than prompt engineering alone. The GGUF export makes it practical for deploying modified models on consumer hardware with quantization. Skip it if you’re working with Mixtral or other MoE architectures (explicitly unsupported), if you need multi-dimensional control beyond single-axis steering, or if your application can’t tolerate occasional degenerate outputs at high strength values. Also consider alternatives if you’re unsure how to craft high-quality contrastive datasets, as the effectiveness of trained vectors depends on dataset quality.
