What is Abliteration?
Abliteration (a portmanteau of “ablation” and “obliterate”) is a technique for removing specific behavioral patterns from language models by identifying and suppressing directional components in the model’s internal representations. In the context of censorship removal, abliteration targets refusal directions: vectors in activation space that correspond to the model refusing to answer prompts.

Directional ablation was introduced by Arditi et al. (2024) and further refined by Jim Lai in his work on projected abliteration and norm-preserving biprojected abliteration.
Refusal Directions
Computing Refusal Directions
Refusal directions are computed per-layer as the difference between the mean residual vectors for “harmful” and “harmless” prompts:

Generate Residuals
For each prompt in both datasets:
- Tokenize and pass through the model
- Extract hidden states (residual vectors) at the first output token position
- Store residuals for each layer:
residuals[prompt, layer, component]
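The difference-of-means computation above can be sketched as follows (a minimal NumPy illustration with random stand-in residuals; the array shapes and function name are assumptions for this example, not Heretic's actual internals):

```python
import numpy as np

def refusal_directions(harmful, harmless):
    """Per-layer refusal directions from stacked residuals.

    harmful, harmless: float32 arrays of shape
    (n_prompts, n_layers, hidden_dim), each row holding the residual
    vector at the first output token position for one prompt.
    """
    # Difference of per-layer means: result shape (n_layers, hidden_dim)
    return harmful.mean(axis=0) - harmless.mean(axis=0)

rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 4, 16)).astype(np.float32)
harmless = rng.normal(size=(8, 4, 16)).astype(np.float32)

dirs = refusal_directions(harmful, harmless)
print(dirs.shape)  # one direction vector per layer: (4, 16)
```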
Why First Token Position?
Heretic analyzes residuals at the first generated token rather than at the prompt tokens because:
- The first output token is where the model “decides” whether to refuse
- It captures the model’s immediate response to the prompt
- It’s consistent across prompts of different lengths
Implementation Detail: Residuals are upcast to torch.float32 to avoid precision issues with bfloat16 or range problems with float16 during vector operations (see model.py:650-652).

Geometric Properties
Refusal directions have interesting geometric properties that can be analyzed using the --print-residual-geometry flag:
- Cosine Similarity: Measures alignment between good/bad/refusal directions
- L2 Norms: Shows magnitude of residuals and refusal directions across layers
- Silhouette Coefficient: Quantifies separation between “harmful” and “harmless” clusters
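The first and third diagnostics can be sketched in NumPy (cosine similarity plus a straightforward two-cluster silhouette over Euclidean distance; this is an illustrative re-implementation, not Heretic's exact code):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_silhouette(x, y):
    """Mean silhouette coefficient for two clusters (rows = points)."""
    def dists(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    scores = []
    for own, other in ((x, y), (y, x)):
        n = len(own)
        a = dists(own, own).sum(axis=1) / (n - 1)  # mean intra-cluster distance
        b = dists(own, other).mean(axis=1)         # mean distance to other cluster
        scores.append((b - a) / np.maximum(a, b))
    return float(np.concatenate(scores).mean())

rng = np.random.default_rng(0)
harmful = rng.normal(loc=5.0, size=(20, 8))   # toy, well-separated clusters
harmless = rng.normal(loc=0.0, size=(20, 8))

refusal = harmful.mean(axis=0) - harmless.mean(axis=0)
print(cosine(refusal, harmful.mean(axis=0)))   # alignment of directions
print(mean_silhouette(harmful, harmless))      # separation of clusters
```

Well-separated clusters yield a silhouette close to 1; heavily overlapping clusters push it toward 0.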
Matrix Orthogonalization
Basic Directional Ablation
Directional ablation works by orthogonalizing weight matrices with respect to the refusal direction. This prevents the refusal direction from being expressed in the output of matrix multiplications. For a weight matrix W and refusal direction v, the goal is to compute a delta ΔW = -λ · v · (v^T · W), where:
- λ (lambda) is the ablation weight (typically 0.8-1.5)
- v is the normalized refusal direction
- v^T · W projects W onto v
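A minimal NumPy sketch of this rank-1 update (the convention that W maps inputs via `W @ x` is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 16, 12
W = rng.normal(size=(d_out, d_in))   # a stand-in weight matrix

v = rng.normal(size=d_out)
v /= np.linalg.norm(v)               # normalized refusal direction
lam = 1.0                            # ablation weight λ

# Rank-1 delta: ΔW = -λ · v · (vᵀ W)
delta = -lam * np.outer(v, v @ W)
W_abl = W + delta

x = rng.normal(size=d_in)
# With λ = 1, the output of W_abl has no component along v
print(abs(v @ (W_abl @ x)))
```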
LoRA-Based Implementation
Heretic implements abliteration using LoRA adapters rather than directly modifying weights. This has several advantages:
- Fast Reset: Can reset to identity by zeroing adapter weights
- Memory Efficient: Only stores low-rank deltas
- Quantization Compatible: Works with 4-bit quantized models
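The ablation delta is rank-1, so it maps directly onto a LoRA adapter: one factor holds -λ·v and the other holds vᵀW. A hedged NumPy sketch (Heretic's actual adapter layout may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, lam = 16, 12, 0.9
W = rng.normal(size=(d_out, d_in))
v = rng.normal(size=d_out)
v /= np.linalg.norm(v)

# LoRA factors for the ablation delta ΔW = B @ A
B = (-lam * v)[:, None]    # shape (d_out, 1)
A = (v @ W)[None, :]       # shape (1, d_in)

full = W - lam * np.outer(v, v @ W)   # directly ablated weights
lora = W + B @ A                      # base weights + low-rank adapter
print(np.allclose(full, lora))        # the two formulations agree

# "Fast reset": zeroing the adapter restores the original weights
B[:] = 0
print(np.allclose(W + B @ A, W))
```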
Which Components Are Modified?
Heretic abliterates two types of transformer components:

1. Attention Output Projection (attn.o_proj)
The output projection that combines attention heads.
2. MLP Down Projection (mlp.down_proj)
The second linear layer in the MLP block.
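Selecting these components typically amounts to filtering module names; a sketch over hypothetical layer names (the real module paths depend on the model architecture):

```python
# Hypothetical module names in the style of a Hugging Face transformer
names = [
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.self_attn.o_proj",
]

# Keep only attention output and MLP down projections
targets = [n for n in names if n.endswith((".o_proj", ".down_proj"))]
print(targets)
```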
Advanced Techniques
Projected Abliteration
Implements the technique from Jim Lai’s blog post (see main.py:448-457).
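Per the reference list below, projected abliteration orthogonalizes the refusal direction relative to the “harmless” direction before ablating. A minimal NumPy sketch (variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
refusal = rng.normal(size=16)    # stand-in refusal direction
harmless = rng.normal(size=16)   # stand-in mean "harmless" residual

h = harmless / np.linalg.norm(harmless)
# Remove the component of the refusal direction that lies along the
# harmless direction, then renormalize
projected = refusal - (refusal @ h) * h
projected /= np.linalg.norm(projected)

print(abs(projected @ h))  # ~0: orthogonal to the harmless direction
```

Intuitively, this keeps the ablation from also suppressing behavior shared with harmless completions.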
Row Normalization
Preserves the magnitude of weight matrix rows during abliteration:
- none: Basic abliteration (default)
- pre: Compute LoRA relative to row-normalized weights
- full: Approximate norm-preserving biprojected abliteration
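The norm-preservation idea can be illustrated by rescaling each row of the ablated matrix back to its original L2 norm (a simplified sketch, not Heretic's exact biprojected formulation):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(16, 12))
v = rng.normal(size=16)
v /= np.linalg.norm(v)

ablated = W - np.outer(v, v @ W)   # basic directional ablation

# Rescale each row to its pre-ablation magnitude
orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
new_norms = np.linalg.norm(ablated, axis=1, keepdims=True)
preserved = ablated * (orig_norms / new_norms)

print(np.allclose(np.linalg.norm(preserved, axis=1),
                  np.linalg.norm(W, axis=1)))  # True: row magnitudes kept
```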
Winsorization
Clamps extreme activation values to reduce the impact of “massive activations” (see model.py:654-664).
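Winsorization can be sketched as clamping activations to chosen percentiles (a NumPy illustration; the percentile bounds here are assumptions, not Heretic's defaults):

```python
import numpy as np

def winsorize(x, lower=1.0, upper=99.0):
    """Clamp values outside the given percentiles."""
    lo, hi = np.percentile(x, [lower, upper])
    return np.clip(x, lo, hi)

rng = np.random.default_rng(4)
acts = rng.normal(size=1000)
acts[:5] = 1e4                   # a few "massive activations"

clamped = winsorize(acts)
print(clamped.max() < 1e4)       # extreme outliers are tamed
```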
Academic References
Heretic’s abliteration implementation is based on:
- Arditi et al. (2024) - Refusal in Language Models Is Mediated by a Single Direction
  - Original abliteration paper
  - Identifies refusal as a directional phenomenon
- Lai (2025) - Projected Abliteration
  - Blog Post 1: Projected Abliteration
  - Orthogonalizes the refusal direction relative to the “harmless” direction
- Lai (2025) - Norm-Preserving Biprojected Abliteration
  - Blog Post 2: Norm-Preserving
  - Preserves row magnitudes for better model preservation
Related Topics
How Heretic Works
System architecture and workflow overview
Optimization Process
Parameter optimization and weight kernels
