
Overview

Heretic combines an advanced implementation of directional ablation (also known as “abliteration”) with TPE-based parameter optimization powered by Optuna. This approach enables Heretic to work completely automatically, finding high-quality abliteration parameters without requiring manual tuning or deep understanding of transformer internals.
Key Innovation: Heretic automatically co-minimizes the number of refusals AND the KL divergence from the original model, resulting in decensored models that retain as much of the original model’s intelligence as possible.

High-Level Workflow

The abliteration process follows this workflow:
1. Model Loading & Preparation

  • Load the target model with LoRA adapters initialized to identity transformation
  • Determine optimal batch size through automated benchmarking
  • Detect common response prefixes (e.g., `<think></think>` for CoT models)
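The "identity transformation" relies on standard LoRA initialization: PEFT zero-initializes `lora_B`, so the adapter's weight delta starts at exactly zero and the adapted model initially matches the base model. A minimal numeric sketch:

```python
import torch

# PEFT's default LoRA init: lora_A is random, lora_B is zeros, so the
# low-rank delta B @ A is zero and the adapter acts as the identity.
d, r = 8, 1                 # hidden size and LoRA rank (illustrative)
W = torch.randn(d, d)       # frozen base weight
lora_A = torch.randn(r, d)  # random initialization
lora_B = torch.zeros(d, r)  # zero initialization

W_effective = W + lora_B @ lora_A
assert torch.allclose(W_effective, W)  # no change until optimization begins
```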
2. Refusal Direction Calculation

  • Generate first-token residual vectors for “harmless” prompts (e.g., from mlabonne/harmless_alpaca)
  • Generate first-token residual vectors for “harmful” prompts (e.g., from mlabonne/harmful_behaviors)
  • Compute per-layer refusal directions as the normalized difference between harmful and harmless residual means
  • Optionally orthogonalize directions relative to the “harmless” direction (projected abliteration)
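The optional orthogonalization step is a standard vector projection: remove from the refusal direction its component along the "harmless" direction. A minimal illustration (not Heretic's exact code):

```python
import torch

def orthogonalize(r: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Remove from r its component along g (projected abliteration sketch)."""
    g_hat = g / g.norm()
    return r - (r @ g_hat) * g_hat

# Tiny 2-D illustration: after projection, r_perp has no component along g.
r = torch.tensor([1.0, 1.0])
g = torch.tensor([1.0, 0.0])
r_perp = orthogonalize(r, g)  # -> tensor([0., 1.])
```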
3. Parameter Optimization

  • Run 200 trials (60 random startup trials + 140 TPE-guided trials)
  • For each trial:
    • Sample abliteration parameters (direction index, weight kernel shape)
    • Apply directional ablation via LoRA adapters
    • Evaluate using KL divergence (model preservation) and refusal count (censorship removal)
  • Build Pareto front of optimal solutions
4. Model Export

  • User selects preferred trial from Pareto-optimal set
  • Merge LoRA adapters into base model weights
  • Save locally or upload to Hugging Face
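Numerically, merging folds the low-rank update into the base weight so the exported model needs no adapter at inference time (Heretic does this through the PEFT library; the sketch below shows the arithmetic with plain tensors):

```python
import torch

# Merging a LoRA adapter means folding the low-rank update B @ A into the
# frozen base weight. The merged matrix reproduces the base + adapter path.
d, r = 8, 1
W = torch.randn(d, d)
lora_A = torch.randn(r, d)
lora_B = torch.randn(d, r)

W_merged = W + lora_B @ lora_A              # merged weight
x = torch.randn(d)
y_adapter = W @ x + lora_B @ (lora_A @ x)   # base + adapter path
y_merged = W_merged @ x                     # single merged matmul
assert torch.allclose(y_adapter, y_merged, atol=1e-5)
```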

Key Components

Heretic’s architecture consists of three main components:

Analyzer (analyzer.py)

Computes and analyzes the geometric properties of residual vectors:
  • Residual Extraction: Generates hidden states at the first output token position for each layer
  • Direction Computation: Calculates refusal directions as normalize(bad_mean - good_mean) for each layer
  • Geometric Analysis: Computes cosine similarities, norms, and silhouette coefficients
  • Visualization: Creates PaCMAP projections showing how “harmful” and “harmless” residuals separate across layers
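As a sketch of the kind of geometric analysis involved, one useful diagnostic is the cosine similarity between refusal directions of adjacent layers; a run of near-1.0 values indicates a direction that stays stable through the network. (`directions` below is random stand-in data for the per-layer unit-norm refusal directions.)

```python
import torch
import torch.nn.functional as F

# Stand-in for per-layer refusal directions: (num_layers + 1, hidden_size),
# each row normalized to unit length.
directions = F.normalize(torch.randn(33, 4096), dim=-1)

# Cosine similarity between each pair of adjacent layers' directions.
adjacent_cos = (directions[:-1] * directions[1:]).sum(dim=-1)  # shape (32,)
```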

Model (model.py)

Handles model operations and abliteration:
  • LoRA Integration: Uses PEFT library to apply abliteration as low-rank adapters (rank 1 by default, rank 3 for full normalization)
  • Directional Ablation: Orthogonalizes weight matrices with respect to refusal directions
  • Component Support: Modifies attn.o_proj and mlp.down_proj (including MoE architectures)
  • Response Generation: Produces model outputs for evaluation and chat
Technical Detail: LoRA abliteration computes delta W = -lambda * v * (v^T W), where v is the unit-norm refusal direction, implemented as lora_B = -lambda * v and lora_A = v^T W.
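This rank-1 factorization can be checked directly. The sketch below uses a random weight and a unit-norm direction; with lambda = 1, the refusal component of every output is removed entirely:

```python
import torch

# Rank-1 LoRA factorization of directional ablation:
#   delta_W = -lambda * v @ (v^T W),  with v the unit-norm refusal direction.
d = 8
W = torch.randn(d, d)
v = torch.randn(d, 1)
v = v / v.norm()          # refusal direction, unit norm
lam = 1.0                 # ablation weight (lambda)

lora_B = -lam * v         # shape (d, 1)
lora_A = v.T @ W          # shape (1, d)
W_ablated = W + lora_B @ lora_A   # equals (I - lam * v v^T) @ W

# With lambda = 1, outputs of W_ablated have no component along v:
assert torch.allclose(v.T @ W_ablated, torch.zeros(1, d), atol=1e-5)
```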

Evaluator (evaluator.py)

Measures abliteration quality:
  • KL Divergence: Compares first-token probability distributions between base and abliterated models on “harmless” prompts
  • Refusal Detection: Scans responses for refusal markers ("sorry", "I cannot", "unethical", etc.)
  • Multi-Objective Scoring: Returns (kl_divergence, refusals) tuple for Optuna’s Pareto optimization
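A sketch of both metrics with random logits and a toy marker list (the real marker set, batching, and generation live in evaluator.py):

```python
import torch
import torch.nn.functional as F

# First-token logits (batch x vocab) from the base and abliterated models;
# random stand-ins here for illustration.
base_logits = torch.randn(4, 100)
ablated_logits = base_logits + 0.1 * torch.randn(4, 100)

# KL divergence between the two first-token distributions.
kl = F.kl_div(
    F.log_softmax(ablated_logits, dim=-1),
    F.log_softmax(base_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)

# Toy refusal detection via substring markers.
REFUSAL_MARKERS = ("sorry", "i cannot", "unethical")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

responses = ["I cannot help with that.", "Sure, here you go."]
refusals = sum(is_refusal(r) for r in responses)  # -> 1
```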
For example, the direction computation aggregates the per-layer residuals by both mean and geometric median:

```python
# Compute refusal directions from residuals
g = self.good_residuals.mean(dim=0)
g_star = torch.stack([
    compute_geometric_median(
        self.good_residuals[:, layer_index, :].detach().cpu()
    ).median
    for layer_index in range(len(self.model.get_layers()) + 1)
])
b = self.bad_residuals.mean(dim=0)
b_star = torch.stack([
    compute_geometric_median(
        self.bad_residuals[:, layer_index, :].detach().cpu()
    ).median
    for layer_index in range(len(self.model.get_layers()) + 1)
])
r = b - g  # Refusal direction for means
r_star = b_star - g_star  # Refusal direction for medians
```

Optimization Process

Heretic uses Tree-structured Parzen Estimator (TPE) sampling from Optuna to efficiently explore the parameter space:
  1. Startup Phase (60 trials): Random sampling for exploration
  2. TPE Phase (140 trials): Guided sampling based on promising regions
  3. Multivariate TPE: Models correlations between parameters for faster convergence
  4. Pareto Optimization: Maintains non-dominated solutions across the refusals/KL-divergence tradeoff
Checkpoints are automatically saved to checkpoints/<model-name>.jsonl using Optuna’s journal storage, allowing you to resume interrupted runs.

Performance

On an RTX 3090 with default configuration, decensoring Llama-3.1-8B-Instruct takes approximately 45 minutes. Performance can be improved by:
  • Using 4-bit quantization (quantization = "bnb_4bit")
  • Reducing trial count (n_trials = 100)
  • Increasing batch size (auto-detected by default)
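For example, the first two of these can be set in the configuration (keys as named above; values illustrative):

```toml
# Illustrative overrides using the option names mentioned above
quantization = "bnb_4bit"  # 4-bit quantization via bitsandbytes
n_trials = 100             # fewer optimization trials
```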

Abliteration Deep Dive

Learn about directional ablation and refusal directions

Optimization Process

Understand parameter optimization and weight kernels
