What is Abliteration?

Abliteration (portmanteau of “ablation” + “obliterate”) refers to a technique for removing specific behavioral patterns from language models by identifying and suppressing directional components in the model’s internal representations. In the context of censorship removal, abliteration targets refusal directions — vectors in activation space that correspond to the model refusing to answer prompts.
Directional ablation was introduced by Arditi et al. (2024) and further refined by Jim Lai in his work on projected abliteration and norm-preserving biprojected abliteration.

Refusal Directions

Computing Refusal Directions

Refusal directions are computed per-layer as the difference between mean residual vectors for “harmful” and “harmless” prompts:

Step 1: Generate Residuals

For each prompt in both datasets:
  • Tokenize and pass through the model
  • Extract hidden states (residual vectors) at the first output token position
  • Store residuals for each layer: residuals[prompt, layer, component]

Step 2: Compute Mean Residuals

Calculate the mean residual vector for each layer:
good_means = good_residuals.mean(dim=0)  # [layer, hidden_dim]
bad_means = bad_residuals.mean(dim=0)    # [layer, hidden_dim]

Step 3: Calculate Refusal Directions

Compute normalized difference as the refusal direction:
refusal_directions = F.normalize(
    bad_means - good_means, 
    p=2, 
    dim=1
)
This produces one refusal direction vector per layer.
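The steps above can be sketched end-to-end on synthetic residuals. The shapes and names below are illustrative, not Heretic's actual API:

```python
import torch
import torch.nn.functional as F

n_layers, hidden_dim = 4, 64
n_good, n_bad = 32, 32

# Synthetic stand-ins for per-prompt residuals captured at the
# first output token position: shape [prompt, layer, hidden_dim]
torch.manual_seed(0)
good_residuals = torch.randn(n_good, n_layers, hidden_dim)
bad_residuals = torch.randn(n_bad, n_layers, hidden_dim) + 0.5  # shifted cluster

# Step 2: mean residual per layer -> [layer, hidden_dim]
good_means = good_residuals.mean(dim=0)
bad_means = bad_residuals.mean(dim=0)

# Step 3: normalized difference -> one unit-norm direction per layer
refusal_directions = F.normalize(bad_means - good_means, p=2, dim=1)

print(refusal_directions.shape)  # torch.Size([4, 64])
```

Each row of `refusal_directions` is a unit vector pointing from the "harmless" cluster mean toward the "harmful" cluster mean for that layer.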

Why First Token Position?

Heretic analyzes residuals at the first generated token rather than the prompt tokens because:
  • The first output token is where the model “decides” whether to refuse
  • It captures the model’s immediate response to the prompt
  • It’s consistent across prompts of different lengths
Implementation Detail: Residuals are upcast to torch.float32 to avoid precision issues with bfloat16 or range problems with float16 during vector operations (see model.py:650-652).
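A quick illustration of why the upcast matters: bfloat16 keeps only about 8 significant bits, so nearby values silently collapse before any averaging happens.

```python
import torch

# bfloat16 stores ~8 significant bits, so integers above 256 are not exact:
x = torch.tensor(257.0, dtype=torch.bfloat16)
print(x.item())  # 256.0 -- the value rounds away silently

# float32 represents the same value exactly:
x32 = torch.tensor(257.0, dtype=torch.float32)
print(x32.item())  # 257.0
```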

Geometric Properties

Refusal directions have interesting geometric properties that can be analyzed using the --print-residual-geometry flag:
  • Cosine Similarity: Measures alignment between good/bad/refusal directions
  • L2 Norms: Shows magnitude of residuals and refusal directions across layers
  • Silhouette Coefficient: Quantifies separation between “harmful” and “harmless” clusters
heretic Qwen/Qwen3-4B-Instruct-2507 --print-residual-geometry
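The first two diagnostics are ordinary tensor operations. A sketch on synthetic per-layer means (not Heretic's actual implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_layers, hidden_dim = 4, 64
good_means = torch.randn(n_layers, hidden_dim)
bad_means = good_means + 0.3 * torch.randn(n_layers, hidden_dim)

refusal_directions = F.normalize(bad_means - good_means, p=2, dim=1)

# Cosine similarity between mean "harmless" and "harmful" residuals, per layer
cos_good_bad = F.cosine_similarity(good_means, bad_means, dim=1)

# L2 norms of the mean residuals, per layer
good_norms = good_means.norm(dim=1)
bad_norms = bad_means.norm(dim=1)

print(cos_good_bad.shape, good_norms.shape)
```

High cosine similarity between the good and bad means with a clearly separated difference direction is what makes single-direction ablation viable.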

Matrix Orthogonalization

Basic Directional Ablation

Directional ablation works by orthogonalizing weight matrices with respect to the refusal direction. This prevents the refusal direction from being expressed in the output of matrix multiplications. For a weight matrix W of shape [d_out, d_in] and a unit refusal direction v in the output space, the goal is to compute a delta ΔW such that the output of the modified matrix has no component along v:
v^T · (W + ΔW) ≈ 0
The solution is:
ΔW = -λ · v · (v^T · W)
Where:
  • λ (lambda) is the ablation weight (typically 0.8-1.5)
  • v is the normalized refusal direction
  • v^T · W projects the rows of W onto v
With λ = 1 the cancellation is exact: v^T · (W + ΔW) = v^T · W - (v^T · v) · (v^T · W) = 0, since v^T · v = 1.
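A numerical check that the ablated matrix writes nothing along v: with λ = 1, every output (W + ΔW) · x is orthogonal to v.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_out, d_in = 64, 48

W = torch.randn(d_out, d_in)
v = F.normalize(torch.randn(d_out), p=2, dim=0)  # unit refusal direction

lam = 1.0
# delta_W = -lambda * v * (v^T W), shape [d_out, d_in]
delta_W = -lam * torch.outer(v, v @ W)

# The output of the ablated matrix has no component along v,
# regardless of the input:
x = torch.randn(d_in)
out = (W + delta_W) @ x
print(torch.dot(v, out).abs().item())  # ~0 (float rounding only)
```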

LoRA-Based Implementation

Heretic implements abliteration using LoRA adapters rather than directly modifying weights. This has several advantages:
  • Fast Reset: Can reset to identity by zeroing adapter weights
  • Memory Efficient: Only stores low-rank deltas
  • Quantization Compatible: Works with 4-bit quantized models
# LoRA abliteration: delta W = -lambda * v * (v^T W)
# Decompose as: lora_B = -lambda * v, lora_A = v^T W

v = layer_refusal_direction.to(module.weight.device)

# Get W, dequantizing if necessary (quant_state is the
# bitsandbytes quantization state of the 4-bit base layer)
base_weight = module.base_layer.weight
if quant_state is None:
    W = base_weight.to(torch.float32)
else:
    # 4-bit quantization support
    W = bnb.functional.dequantize_4bit(
        base_weight.data, quant_state
    ).to(torch.float32)

# Flatten to (out_features, in_features)
W = W.view(W.shape[0], -1)

# Calculate lora_A = v^T W (v is [d_out,], W is [d_out, d_in])
lora_A = (v @ W).view(1, -1)

# Calculate lora_B = -lambda * v
# (`weight` here is the per-layer ablation weight, i.e. lambda)
lora_B = (-weight * v).view(-1, 1)

# Assign to adapters, casting to each adapter's dtype
module.lora_A["default"].weight.data = lora_A.to(module.lora_A["default"].weight.dtype)
module.lora_B["default"].weight.data = lora_B.to(module.lora_B["default"].weight.dtype)
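The rank-1 factorization can be checked against the dense delta on random data; this is a standalone sketch, not Heretic's code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_out, d_in = 32, 24
W = torch.randn(d_out, d_in)
v = F.normalize(torch.randn(d_out), p=2, dim=0)
ablation_weight = 1.0  # lambda

# Dense delta: -lambda * v * (v^T W)
delta_W = -ablation_weight * torch.outer(v, v @ W)

# LoRA factors, as in the snippet above
lora_A = (v @ W).view(1, -1)                  # [1, d_in]
lora_B = (-ablation_weight * v).view(-1, 1)   # [d_out, 1]

print(torch.allclose(lora_B @ lora_A, delta_W, atol=1e-6))  # True
```

Because ΔW is an outer product of two vectors, a rank-1 LoRA adapter represents it exactly; no approximation is involved at this stage.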

Which Components Are Modified?

Heretic abliterates two types of transformer components:

1. Attention Output Projection (attn.o_proj)

The output projection that combines attention heads:
# Standard self-attention (most models)
layer.self_attn.o_proj

# Linear attention (Qwen3.5 MoE hybrid layers)
layer.linear_attn.out_proj

2. MLP Down Projection (mlp.down_proj)

The second linear layer in the MLP block:
# Dense models (Llama, Qwen, Gemma, etc.)
layer.mlp.down_proj

# MoE models - Qwen3 style
for expert in layer.mlp.experts:
    expert.down_proj

# MoE models - Phi-3.5-MoE style
for expert in layer.block_sparse_moe.experts:
    expert.w2

# Granite MoE Hybrid - attention layers
layer.shared_mlp.output_linear

# Granite MoE Hybrid - MoE layers
for expert in layer.moe.experts:
    expert.output_linear
Heretic modifies output projections (where information is combined/reduced) rather than input projections. This is critical because:
  • Output projections are where the model “writes” to the residual stream
  • Abliteration prevents refusal information from being written
  • Input projections would require different mathematical treatment

Advanced Techniques

Projected Abliteration

Heretic implements the technique from Jim Lai's blog post (main.py:448-457):
if settings.orthogonalize_direction:
    # Remove the component parallel to the good direction,
    # keeping only the part orthogonal to it
    good_directions = F.normalize(good_means, p=2, dim=1)
    projection_vector = torch.sum(
        refusal_directions * good_directions, dim=1
    )
    refusal_directions = (
        refusal_directions 
        - projection_vector.unsqueeze(1) * good_directions
    )
    refusal_directions = F.normalize(refusal_directions, p=2, dim=1)
Enabled via the corresponding config setting:
orthogonalize_direction = true
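On synthetic directions, the projection step can be verified to leave the refusal direction orthogonal to the good direction:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_layers, hidden_dim = 4, 64
good_means = torch.randn(n_layers, hidden_dim)
refusal_directions = F.normalize(torch.randn(n_layers, hidden_dim), p=2, dim=1)

# Project out the component parallel to the good direction, per layer
good_directions = F.normalize(good_means, p=2, dim=1)
projection = torch.sum(refusal_directions * good_directions, dim=1)
refusal_directions = refusal_directions - projection.unsqueeze(1) * good_directions
refusal_directions = F.normalize(refusal_directions, p=2, dim=1)

# The projected directions are orthogonal to the good directions
dots = torch.sum(refusal_directions * good_directions, dim=1)
print(dots.abs().max().item())  # ~0
```

Ablating only this orthogonal component avoids also suppressing behavior shared with the "harmless" direction.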

Row Normalization

Preserves the magnitude of weight matrix rows during abliteration:
if self.settings.row_normalization != RowNormalization.NONE:
    # Keep original norms
    W_org = W
    W_row_norms = LA.vector_norm(W, dim=1, keepdim=True)
    W = F.normalize(W, p=2, dim=1)

if self.settings.row_normalization == RowNormalization.PRE:
    # Scale LoRA to work with original magnitudes
    lora_B = W_row_norms * lora_B
    
elif self.settings.row_normalization == RowNormalization.FULL:
    # Apply abliteration and renormalize
    W = W + lora_B @ lora_A
    W = F.normalize(W, p=2, dim=1)
    W = W * W_row_norms  # Restore original norms
    W = W - W_org
    
    # Low-rank SVD approximation
    r = self.peft_config.r
    U, S, V = torch.svd_lowrank(W, q=2*r+4, niter=6)
    U = U[:, :r]
    S = S[:r]
    Vh = V[:, :r].T
    
    # Split singular values between components
    sqrt_S = torch.sqrt(S)
    lora_B = U @ torch.diag(sqrt_S)
    lora_A = torch.diag(sqrt_S) @ Vh
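The low-rank SVD split at the end can be sanity-checked on a matrix of exact rank r, where the factorization should recover the original almost perfectly (standalone sketch, illustrative dimensions):

```python
import torch

torch.manual_seed(0)
r = 4
d_out, d_in = 64, 48

# Build a matrix that is exactly rank r
W = torch.randn(d_out, r) @ torch.randn(r, d_in)

# torch.svd_lowrank returns (U, S, V) with W ~= U @ diag(S) @ V^T
U, S, V = torch.svd_lowrank(W, q=2 * r + 4, niter=6)
U, S, Vh = U[:, :r], S[:r], V[:, :r].T

# Split singular values evenly between the two LoRA factors
sqrt_S = torch.sqrt(S)
lora_B = U @ torch.diag(sqrt_S)
lora_A = torch.diag(sqrt_S) @ Vh

print(torch.allclose(lora_B @ lora_A, W, atol=1e-2))  # True
```

Splitting sqrt(S) across both factors keeps the two adapter matrices at comparable scales, which is numerically friendlier than putting all of S on one side.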

Winsorization

Clamps extreme activation values to reduce the impact of “massive activations”:
model.py:654-664
if 0 <= self.settings.winsorization_quantile < 1:
    abs_residuals = torch.abs(residuals)
    thresholds = torch.quantile(
        abs_residuals,
        self.settings.winsorization_quantile,
        dim=2,
        keepdim=True,
    )
    return torch.clamp(residuals, -thresholds, thresholds)
# Example: clamp to the 95th percentile (winsorization is disabled by default)
winsorization_quantile = 0.95
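A standalone sketch of the clamping on synthetic residuals with one injected outlier (shapes follow the residuals[prompt, layer, component] layout described above):

```python
import torch

torch.manual_seed(0)
# Residuals: [prompt, layer, component], with one "massive activation"
residuals = torch.randn(8, 4, 64)
residuals[0, 0, 0] = 100.0

q = 0.95
abs_residuals = torch.abs(residuals)
# Per-(prompt, layer) magnitude threshold at the q-th quantile
thresholds = torch.quantile(abs_residuals, q, dim=2, keepdim=True)
clamped = torch.clamp(residuals, -thresholds, thresholds)

# The outlier is pulled down to its (prompt, layer) threshold
print(clamped.abs().max().item() <= thresholds.max().item())  # True
```

Each (prompt, layer) pair gets its own threshold, so a single extreme component cannot dominate the mean residual used to compute the refusal direction.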

Academic References

Heretic’s abliteration implementation is based on:
  1. Arditi et al. (2024) - Refusal in Language Models Is Mediated by a Single Direction
    • Original abliteration paper
    • Identifies refusal as a directional phenomenon
  2. Lai (2025) - Projected Abliteration
  3. Lai (2025) - Norm-Preserving Biprojected Abliteration

