Heretic: Using Multi-Objective Optimization to Automatically Uncensor Language Models
Hook
What if removing a language model’s safety guardrails was as simple as identifying a single direction vector in high-dimensional space and subtracting it? That’s the insight behind abliteration—and Heretic automates the entire process using the same optimization techniques that power AutoML systems.
Context
Language models ship with safety alignment that makes them refuse certain requests. While this prevents obvious harm, it also creates frustration: researchers studying model behavior hit walls, developers building legitimate applications face arbitrary restrictions, and the models often refuse benign requests due to overzealous filtering. The breakthrough came in 2024 when Arditi et al. published ‘Refusal in Language Models Is Mediated by a Single Direction,’ showing that safety behaviors aren’t diffused throughout billions of parameters but concentrated in specific direction vectors within the model’s latent space. Remove those directions, and the refusal behavior disappears.
But there’s a catch: manual abliteration requires expertise in transformer architectures, careful selection of layer ranges, and tedious parameter tuning. Set the ablation coefficient too high and you damage the model’s general capabilities. Too low and refusals remain. The hyperparameter space is vast—which layers to ablate, how strongly, which prompt pairs to use for computing refusal directions. Heretic solves this by treating abliteration as an optimization problem, using Tree-structured Parzen Estimator (TPE) algorithms via Optuna to automatically search this space. It balances two competing objectives: minimize refusal rate while keeping KL divergence low to preserve the model’s intelligence. The result is abliterated models that match or exceed manually-tuned versions while requiring zero expertise.
Technical Insight
At its core, Heretic implements directional ablation through a surprisingly elegant mathematical operation. The technique computes a ‘refusal direction’ by measuring how model activations differ when processing harmful versus harmless prompts. For each layer in the transformer, Heretic collects residual stream activations (the hidden states carried between transformer blocks along the skip-connection pathway) and calculates the mean difference vector between the two prompt sets. This vector points in the direction of ‘refusing behavior’ in the model’s latent space.
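The difference-of-means computation can be sketched as follows (a minimal illustration; `compute_refusal_dirs` and the activation-dictionary shapes are assumptions for this sketch, not Heretic’s actual API):

```python
import torch

def compute_refusal_dirs(harmful_acts, harmless_acts):
    """Per-layer difference-of-means refusal directions.

    Both arguments map a layer index to a [num_prompts, hidden_dim]
    tensor of residual-stream activations (illustrative shapes).
    """
    dirs = {}
    for layer_idx, harmful in harmful_acts.items():
        # Mean activation on harmful prompts minus mean on harmless prompts
        diff = harmful.mean(dim=0) - harmless_acts[layer_idx].mean(dim=0)
        dirs[layer_idx] = diff / (diff.norm() + 1e-8)  # unit-length direction
    return dirs
```

The normalization matters: downstream, only the direction is used, while the ablation coefficient controls the strength.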
The ablation itself happens through a modified forward pass. Heretic implements this as a PyTorch hook that intercepts residual stream outputs:
def make_ablation_hook(refusal_dir, coeff):
    """Create a forward hook that ablates the refusal direction from the residual stream."""
    refusal_dir = refusal_dir / (refusal_dir.norm() + 1e-8)  # normalize to unit length

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple; the hidden states come first
        hidden = output[0] if isinstance(output, tuple) else output
        # Project hidden states onto the refusal direction and subtract
        projection = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
        ablated = hidden - coeff * projection
        # Preserve any extra tuple elements (e.g. attention weights, caches)
        if isinstance(output, tuple):
            return (ablated,) + output[1:]
        return ablated

    return hook

# Register hooks on the chosen range of transformer layers
for layer_idx in range(start_layer, end_layer):
    layer = model.model.layers[layer_idx]
    layer.register_forward_hook(
        make_ablation_hook(refusal_dirs[layer_idx], ablation_coeff)
    )
The genius is in what Heretic automates around this core operation. Instead of manually guessing which layers to ablate (early layers? middle? all of them?), it uses Optuna to explore the hyperparameter space. Each trial ablates different layer ranges with different coefficients, then evaluates the result on two metrics: refusal rate (how often the model still refuses harmful prompts) and KL divergence (how much the output distribution has changed from the original). Optuna’s TPE algorithm learns from previous trials to suggest better hyperparameters, converging on Pareto-optimal solutions that balance both objectives.
The evaluation framework is equally sophisticated. Heretic generates refusal rate scores by testing the ablated model on harmful instruction datasets like JailbreakBench or HarmBench, looking for tell-tale refusal phrases (‘I cannot assist with that,’ ‘I’m not able to provide,’ etc.). KL divergence gets measured on a separate harmless instruction dataset to ensure the model hasn’t lost general capabilities. This separation is critical—you want low refusal on harmful prompts but minimal distribution shift on benign ones.
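Both metrics are simple to sketch. Below, a toy refusal detector with an illustrative phrase list (not Heretic’s actual one), plus a KL divergence computed from the original and ablated models’ logits:

```python
import torch
import torch.nn.functional as F

# Illustrative subset of refusal markers; a real list is much larger
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm not able to", "i am not able to")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    return sum(map(is_refusal, responses)) / len(responses)

def mean_kl(orig_logits, ablated_logits):
    """KL(original ‖ ablated) averaged over positions, from raw logits."""
    p_log = F.log_softmax(orig_logits, dim=-1)
    q_log = F.log_softmax(ablated_logits, dim=-1)
    # kl_div(input=log q, target=log p, log_target=True) computes KL(p ‖ q)
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
```

An unmodified model scores zero KL against itself; the optimization tries to keep the ablated model’s divergence on harmless prompts close to that baseline.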
What makes this approach production-ready is the hardware-aware batching system. Heretic automatically detects available VRAM and batches prompt processing to maximize throughput without OOM errors. For models like Gemma-3-12B, it achieved 0.16 KL divergence compared to 0.45-1.04 for manually-ablated versions from the community, while matching refusal suppression rates. The optimization typically finds optimal parameters within 50-100 trials, taking 2-4 hours on consumer GPUs.
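The batching idea reduces to a simple heuristic (`pick_batch_size` and its parameters are invented for illustration; Heretic’s real logic is more involved):

```python
import torch

def pick_batch_size(bytes_per_prompt, safety_frac=0.8, minimum=1):
    """Pick a batch size from currently free VRAM (illustrative heuristic)."""
    if not torch.cuda.is_available():
        return minimum  # CPU path: fall back to a conservative default
    free_bytes, _total = torch.cuda.mem_get_info()
    # Leave headroom so activation spikes don't trigger OOM
    return max(minimum, int(free_bytes * safety_frac) // bytes_per_prompt)
```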
The tool also includes interpretability features for researchers. It can generate PaCMAP projections showing how ‘harmful’ and ‘harmless’ prompt activations cluster in the reduced latent space, visualizing how the refusal direction separates these clusters. Watching these projections across layers reveals which transformer depths encode the strongest refusal behaviors—usually concentrated in middle layers where semantic understanding peaks before output generation begins.
Gotcha
Heretic’s architecture assumptions create hard boundaries around what it can process. The tool fundamentally expects homogeneous transformer stacks—models where every layer has the same structure with consistent residual stream shapes. This breaks for state-space models like Mamba, hybrid architectures that mix attention with recurrence, or models with variable-width layers. If your model uses novel attention mechanisms that significantly alter residual stream geometry, you’ll need to fork the codebase and reimplement the hooking logic.
Reproducibility is another genuine concern. While Heretic produces consistent results on the same hardware with the same PyTorch version, moving across platforms introduces variance. The documentation honestly notes that KL divergence metrics shift between CUDA versions, different GPU architectures (Ampere vs Ada), and even CPU versus GPU execution. This isn’t a bug—it’s an inherent property of floating-point operations in deep learning—but it means you can’t treat the published benchmark numbers as guarantees for your setup. If you’re publishing research that requires exact metric reproduction, you’ll need to document your entire software and hardware stack.
The research features, particularly PaCMAP residual plotting, carry a steep computational cost. Generating these visualizations for a 12B parameter model takes over an hour on CPU and requires substantial memory for dimensionality reduction. This is fine for one-off analysis but impractical for iterative experimentation. And finally, the ethical elephant in the room: removing safety mechanisms is powerful but dangerous. Heretic provides no guardrails around what you do with uncensored models. The responsibility for downstream use cases—and potential misuse—falls entirely on the user.
Verdict
Use Heretic if you’re a researcher studying model behavior who needs to isolate safety alignment from other capabilities, a developer whose legitimate use cases (creative writing tools, uncensored roleplay, medical or legal research) are blocked by model refusals, or anyone who wants to understand what a model ‘really knows’ beneath its safety training. It’s particularly valuable if you lack deep expertise in transformer internals but need production-quality abliterated models; the automation genuinely delivers. Skip it if you’re working with non-transformer architectures, need exact metric reproduction across hardware platforms for academic publication, have ethical or legal concerns about deploying uncensored models, or work in a domain where safety mechanisms genuinely protect users from harm. The tool does exactly what it promises, with impressive engineering, but that makes it powerful enough to be dangerous in the wrong contexts.