Overview
Heretic combines an advanced implementation of directional ablation (also known as “abliteration”) with TPE-based parameter optimization powered by Optuna. This approach enables Heretic to work completely automatically, finding high-quality abliteration parameters without requiring manual tuning or a deep understanding of transformer internals.

Key Innovation: Heretic automatically co-minimizes the number of refusals AND the KL divergence from the original model, resulting in decensored models that retain as much of the original model’s intelligence as possible.
High-Level Workflow
The abliteration process follows this workflow:

Model Loading & Preparation
- Load the target model with LoRA adapters initialized to identity transformation
- Determine optimal batch size through automated benchmarking
- Detect common response prefixes (e.g., `<think></think>` for CoT models)
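Initializing the LoRA adapters to an identity transformation means the adapted model starts out behaving exactly like the base model. A minimal numpy sketch of why (illustrative only, not Heretic’s actual code): with the standard LoRA initialization, `lora_B` is all zeros, so the low-rank update contributes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))        # base weight matrix

# Rank-1 LoRA factors at initialization: B is all zeros, so the
# adapter contributes nothing and the adapted model equals the base.
lora_A = rng.normal(size=(1, d))   # shape (r, d) with r = 1
lora_B = np.zeros((d, 1))          # shape (d, r), zero-initialized

W_eff = W + lora_B @ lora_A        # effective weight with adapter applied
assert np.allclose(W_eff, W)       # identity transformation: no change yet
```

Optimization then only has to learn a non-zero `lora_B`/`lora_A` pair to move away from the base model.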
Refusal Direction Calculation
- Generate first-token residual vectors for “harmless” prompts (e.g., from `mlabonne/harmless_alpaca`)
- Generate first-token residual vectors for “harmful” prompts (e.g., from `mlabonne/harmful_behaviors`)
- Compute per-layer refusal directions as the normalized difference between the harmful and harmless residual means
- Optionally orthogonalize directions relative to the “harmless” direction (projected abliteration)
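The steps above can be sketched in numpy. The helper name `refusal_directions` and its exact signature are hypothetical; Heretic’s `analyzer.py` differs in detail, but the math follows the description: normalized mean difference per layer, optionally orthogonalized against the normalized “harmless” direction.

```python
import numpy as np

def refusal_directions(harmful, harmless, projected=False):
    """Per-layer refusal directions from first-token residuals.

    harmful, harmless: arrays of shape (n_layers, n_prompts, d_model).
    Hypothetical helper; a sketch of the computation, not Heretic's code.
    """
    good_mean = harmless.mean(axis=1)                     # (n_layers, d_model)
    bad_mean = harmful.mean(axis=1)
    dirs = bad_mean - good_mean
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # normalize(bad_mean - good_mean)
    if projected:
        # Projected abliteration: remove the component parallel to the
        # normalized "harmless" direction, then re-normalize.
        g = good_mean / np.linalg.norm(good_mean, axis=-1, keepdims=True)
        dirs -= (dirs * g).sum(-1, keepdims=True) * g
        dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    return dirs
```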
Parameter Optimization
- Run 200 trials (60 random startup trials + 140 TPE-guided trials)
- For each trial:
- Sample abliteration parameters (direction index, weight kernel shape)
- Apply directional ablation via LoRA adapters
- Evaluate using KL divergence (model preservation) and refusal count (censorship removal)
- Build Pareto front of optimal solutions
Key Components
Heretic’s architecture consists of three main components:

Analyzer (analyzer.py)
Computes and analyzes the geometric properties of residual vectors:
- Residual Extraction: Generates hidden states at the first output token position for each layer
- Direction Computation: Calculates refusal directions as `normalize(bad_mean - good_mean)` for each layer
- Geometric Analysis: Computes cosine similarities, norms, and silhouette coefficients
- Visualization: Creates PaCMAP projections showing how “harmful” and “harmless” residuals separate across layers
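A crude stand-in for the geometric analysis: score each layer by how far apart the two prompt classes sit along that layer’s refusal direction. This is a simplified illustration (Heretic’s analyzer uses silhouette coefficients and more); the function name and scoring rule are assumptions.

```python
import numpy as np

def layer_separation(harmful, harmless):
    """Mean cosine-similarity gap between harmful and harmless residuals
    along each layer's refusal direction. Shapes: (n_layers, n_prompts, d).
    A simple stand-in for analyzer.py's silhouette analysis.
    """
    diff = harmful.mean(1) - harmless.mean(1)
    dirs = diff / np.linalg.norm(diff, axis=-1, keepdims=True)

    def mean_cos(x):
        x = x / np.linalg.norm(x, axis=-1, keepdims=True)
        return (x * dirs[:, None, :]).sum(-1).mean(1)

    # A large positive gap means the two classes separate cleanly there.
    return mean_cos(harmful) - mean_cos(harmless)
```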
Model (model.py)
Handles model operations and abliteration:
- LoRA Integration: Uses PEFT library to apply abliteration as low-rank adapters (rank 1 by default, rank 3 for full normalization)
- Directional Ablation: Orthogonalizes weight matrices with respect to refusal directions
- Component Support: Modifies `attn.o_proj` and `mlp.down_proj` (including MoE architectures)
- Response Generation: Produces model outputs for evaluation and chat
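The directional-ablation update can be sanity-checked numerically: building the rank-1 factors `lora_B = -lambda * v` and `lora_A = v^T W` and adding their product to `W` removes (for `lambda = 1`) every output component along the refusal direction `v`. A minimal numpy sketch, with `W` standing in for any targeted weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
W = rng.normal(size=(d_out, d_in))   # e.g. an o_proj / down_proj weight
v = rng.normal(size=(d_out, 1))
v /= np.linalg.norm(v)               # unit refusal direction
lam = 1.0                            # ablation weight

# delta W = -lambda * v * (v^T W), expressed as rank-1 LoRA factors
lora_B = -lam * v                    # (d_out, 1)
lora_A = v.T @ W                     # (1, d_in)
W_abliterated = W + lora_B @ lora_A

# With lambda = 1, no output component remains along v:
assert np.allclose(v.T @ W_abliterated, 0.0)
```

Algebraically, `v^T (W + lora_B @ lora_A) = (1 - lambda) * v^T W`, which vanishes exactly at `lambda = 1`.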
Technical Detail: LoRA abliteration computes `delta W = -lambda * v * (v^T W)`, where `v` is the refusal direction, implemented as `lora_B = -lambda * v` and `lora_A = v^T W`.

Evaluator (evaluator.py)
Measures abliteration quality:
- KL Divergence: Compares first-token probability distributions between base and abliterated models on “harmless” prompts
- Refusal Detection: Scans responses for refusal markers (`"sorry"`, `"I cannot"`, `"unethical"`, etc.)
- Multi-Objective Scoring: Returns a `(kl_divergence, refusals)` tuple for Optuna’s Pareto optimization
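Both measurements are simple in isolation. A simplified sketch (the marker list is abbreviated, and Heretic averages the KL term over many harmless prompts rather than scoring a single distribution):

```python
import math

REFUSAL_MARKERS = ("sorry", "i cannot", "unethical")  # abbreviated list

def first_token_kl(p, q, eps=1e-12):
    """KL(p || q) between two first-token probability distributions
    (base model vs. abliterated model) over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def count_refusals(responses):
    """Number of responses containing any refusal marker."""
    return sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)

def score(p_base, p_abliterated, responses):
    # The (kl_divergence, refusals) tuple handed to Optuna; minimize both.
    return first_token_kl(p_base, p_abliterated), count_refusals(responses)
```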
Optimization Process
Heretic uses Tree-structured Parzen Estimator (TPE) sampling from Optuna to efficiently explore the parameter space:

- Startup Phase (60 trials): Random sampling for exploration
- TPE Phase (140 trials): Guided sampling based on promising regions
- Multivariate TPE: Models correlations between parameters for faster convergence
- Pareto Optimization: Maintains non-dominated solutions across the refusals/KL-divergence tradeoff
Checkpoints are automatically saved to `checkpoints/<model-name>.jsonl` using Optuna’s journal storage, allowing you to resume interrupted runs.

Performance
On an RTX 3090 with the default configuration, decensoring Llama-3.1-8B-Instruct takes approximately 45 minutes. Performance can be improved by:

- Using 4-bit quantization (`quantization = "bnb_4bit"`)
- Reducing the trial count (`n_trials = 100`)
- Increasing the batch size (auto-detected by default)
Related Concepts
Abliteration Deep Dive
Learn about directional ablation and refusal directions
Optimization Process
Understand parameter optimization and weight kernels
