Overview
Heretic uses Tree-structured Parzen Estimator (TPE) optimization from Optuna to automatically find high-quality abliteration parameters. Unlike manual abliteration approaches, Heretic explores the parameter space intelligently, co-minimizing refusals and KL divergence to achieve optimal censorship removal while preserving model capabilities.
Default Configuration: 200 trials total (60 random startup + 140 TPE-guided), taking ~45 minutes for Llama-3.1-8B on an RTX 3090.
TPE-Based Optimization
What is TPE?
Tree-structured Parzen Estimator is a Bayesian optimization algorithm that:
- Models the parameter space using two distributions:
  - Good distribution `l(x)`: parameters that led to good scores
  - Bad distribution `g(x)`: parameters that led to poor scores
- Samples new parameters by maximizing the ratio `l(x) / g(x)`:
  - High ratio → likely to improve on best-seen results
  - Balances exploration (trying new regions) vs. exploitation (refining known good regions)
- Adapts over time as more trials complete:
  - Early trials: random exploration (startup phase)
  - Later trials: focused search around promising regions
Heretic’s TPE Configuration
Multivariate TPE
Heretic uses multivariate TPE, which models correlations between parameters:
- Recognizes that `max_weight` and `min_weight` are related
- Understands that `direction_index` affects the optimal `max_weight_position`
- Converges faster than independent parameter sampling
Ablation Parameters
Heretic optimizes several parameters that control the abliteration process:
Direction Scope
Controls whether to use a single global refusal direction or per-layer directions:
main.py:480-486
- `global`: uses one interpolated direction for all layers
  - More consistent across layers
  - Requires optimizing `direction_index`
- `per layer`: uses each layer's computed refusal direction
  - Adapts to per-layer geometry
  - Sets `direction_index = None`
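The branching between the two scopes can be sketched as follows (function and parameter names here are hypothetical, not Heretic's actual code):

```python
# Hypothetical sketch of the scope decision inside an Optuna objective.
def suggest_direction_scope(trial, num_layers):
    scope = trial.suggest_categorical("direction_scope", ["global", "per layer"])
    if scope == "global":
        # One interpolated direction shared by all layers; the float
        # index itself becomes an optimizable parameter.
        direction_index = trial.suggest_float("direction_index", 0.0, num_layers - 1)
    else:
        # Per-layer scope: each layer keeps its own refusal direction.
        direction_index = None
    return scope, direction_index
```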
Direction Index
For `global` scope, specifies which layer’s refusal direction to use (with interpolation):
main.py:497-501
Innovation: Direction index is a float rather than an integer. Non-integral values linearly interpolate between adjacent layer directions, vastly expanding the search space.
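The interpolation can be sketched in plain Python (vectors shown as lists; a real implementation would operate on tensors and likely re-normalize the result):

```python
def interpolate_direction(directions, index):
    """Blend adjacent layers' refusal directions for a fractional index.

    directions: per-layer direction vectors, e.g. directions[17] for layer 17
    index: float layer index; 17.3 mixes 70% of layer 17 with 30% of layer 18
    """
    lower = int(index)
    frac = index - lower
    if frac == 0.0:
        return list(directions[lower])
    return [
        (1.0 - frac) * a + frac * b
        for a, b in zip(directions[lower], directions[lower + 1])
    ]

# interpolate_direction([[0.0, 1.0], [1.0, 0.0]], 0.5) → [0.5, 0.5]
```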
Component Parameters
For each transformer component (attn.o_proj, mlp.down_proj), Heretic optimizes:
1. Max Weight (max_weight)
The peak ablation strength applied at max_weight_position:
main.py:512-516
- Values < 1.0: Partial suppression of refusal direction
- Value = 1.0: Complete orthogonalization (theoretical)
- Values > 1.0: Over-correction (sometimes beneficial)
2. Max Weight Position (max_weight_position)
Which layer receives the maximum ablation weight:
main.py:517-521
Observation: Refusal behavior is typically strongest in the later layers (roughly the 60-100% depth range), which is why the search range focuses there (based on Arditi et al. 2024).
3. Min Weight (min_weight)
The minimum ablation strength at the edges of the kernel:
main.py:525-529
4. Min Weight Distance (min_weight_distance)
How many layers away from max_weight_position to apply ablation:
main.py:530-534
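Drawing the four parameters for one component inside an Optuna objective might look like the following sketch (ranges and parameter names are illustrative, not Heretic's actual ones):

```python
def suggest_kernel_params(trial, component, num_layers):
    # Separate parameter names per component, e.g. "attn.o_proj.max_weight".
    p = f"{component}."
    max_weight = trial.suggest_float(p + "max_weight", 0.5, 1.5)
    # Concentrate the peak in the later layers, where refusal behavior
    # is typically strongest.
    max_weight_position = trial.suggest_float(
        p + "max_weight_position", 0.6 * (num_layers - 1), num_layers - 1
    )
    min_weight = trial.suggest_float(p + "min_weight", 0.0, 1.0)
    min_weight_distance = trial.suggest_float(
        p + "min_weight_distance", 1.0, float(num_layers)
    )
    return max_weight, max_weight_position, min_weight, min_weight_distance
```

Prefixing each parameter name with the component keeps the two components' search spaces independent within a single study.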
Weight Kernel Shape
The four component parameters define a weight kernel that specifies ablation strength across layers:
Kernel Computation
model.py:422-438
Example Kernels
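As a concrete illustration, the sketch below builds a kernel that falls off linearly from `max_weight` at the peak to `min_weight` at the kernel's edge, with no ablation beyond `min_weight_distance`. The exact falloff shape and edge behavior are assumptions here; the real computation is in model.py:422-438.

```python
def weight_kernel(num_layers, max_weight, max_weight_position,
                  min_weight, min_weight_distance):
    """Per-layer ablation weights (linear-falloff sketch)."""
    weights = []
    for layer in range(num_layers):
        distance = abs(layer - max_weight_position)
        if distance >= min_weight_distance:
            weights.append(0.0)  # outside the kernel: layer untouched
        else:
            t = distance / min_weight_distance  # 0 at peak → 1 at edge
            weights.append((1.0 - t) * max_weight + t * min_weight)
    return weights

# Example: 12 layers, peak 1.0 at layer 8, fading toward 0.2 over 4 layers.
kernel = weight_kernel(12, 1.0, 8.0, 0.2, 4.0)
# kernel[8] == 1.0, kernel[6] ≈ 0.6, kernel[0] == 0.0
```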
Per-Component Optimization
Innovation: Heretic optimizes parameters separately for each component (attn.o_proj vs. mlp.down_proj). This allows:
- Different ablation strengths (MLP typically requires gentler treatment)
- Different layer targeting (attention and MLP may have different refusal geometry)
Multi-Objective Evaluation
Co-Minimization Goals
Heretic minimizes two objectives simultaneously:
- KL Divergence: measures how much the model’s behavior changes on harmless prompts
- Refusals: counts how many harmful prompts still trigger refusals
KL Divergence Calculation
Measures the divergence between first-token probability distributions:
evaluator.py:98-103
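A self-contained sketch of the metric (Heretic's real implementation is in evaluator.py and works on model logits; here the two first-token distributions are given directly):

```python
import math

def first_token_kl(p_original, p_abliterated, eps=1e-12):
    """KL(original || abliterated) over first-token probabilities."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_original, p_abliterated)
        if p > 0.0
    )

# Identical distributions diverge by exactly zero:
# first_token_kl([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]) → 0.0
```

In practice the per-prompt divergences are aggregated over a set of harmless prompts to produce the single score the optimizer sees.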
Why First Token? First-token KL divergence is a sensitive measure of model damage because:
- It captures immediate behavioral changes
- It’s fast to compute (no full generation needed)
- It correlates well with overall model quality
Refusal Detection
Classifies responses using keyword matching:
evaluator.py:47-65
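A minimal version of such a classifier might look like this (the marker list below is illustrative; the actual phrases are defined in evaluator.py):

```python
# Hypothetical refusal markers; Heretic's real list differs.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "as an ai",
)

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any known marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```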
Pareto Front
Optuna maintains a Pareto front of non-dominated solutions:
main.py:634-647
A solution is dominated when another solution achieves both lower refusals AND lower KL divergence; only non-dominated solutions remain on the front.
Users can choose from multiple Pareto-optimal solutions based on their preference for compliance (low refusals) vs. preservation (low KL divergence).
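Non-domination over the two objectives can be sketched directly (in Optuna, `study.best_trials` returns this set for a multi-objective study):

```python
def pareto_front(results):
    """Keep (refusals, kl) pairs not dominated by any other pair."""
    front = []
    for i, (r_i, kl_i) in enumerate(results):
        dominated = any(
            r_j <= r_i and kl_j <= kl_i and (r_j < r_i or kl_j < kl_i)
            for j, (r_j, kl_j) in enumerate(results)
            if j != i
        )
        if not dominated:
            front.append((r_i, kl_i))
    return front

# pareto_front([(3, 0.16), (3, 1.04), (5, 0.50)]) → [(3, 0.16)]
```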
Convergence and Trials
Trial Count
Default: 200 trials (60 random startup + 140 TPE-guided)
- Startup trials (random): build an initial model of the parameter space
- TPE trials (guided): Refine search around promising regions
Checkpointing
Optuna automatically saves progress after each trial:
main.py:237-247
This makes it possible to:
- Resume interrupted runs (press Ctrl+C anytime)
- Review previous results without re-running
- Run additional trials later if unsatisfied
Convergence Behavior
Typical optimization trajectory:
- Trials 1-20: wide exploration, high variance in scores
- Trials 20-60: Identify promising parameter regions
- Trials 60-100: TPE focuses on best regions, rapid improvement
- Trials 100-200: Fine-tuning, diminishing returns
For most models, 100-150 trials are sufficient to find excellent solutions. The default 200 provides additional refinement and robustness.
Performance Optimization
Batch Size Auto-Detection
main.py:332-376
Auto-detection finds the largest batch size that fits in VRAM, maximizing throughput. Typically finds batch sizes of 16-128 depending on GPU and model size.
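The doubling strategy can be sketched independently of any particular framework (`try_batch` stands in for a real forward pass; actual code would catch a CUDA out-of-memory error rather than `MemoryError`):

```python
def find_max_batch_size(try_batch, start=1, limit=1024):
    """Double the batch size until it no longer fits, keep the last success."""
    best = None
    size = start
    while size <= limit:
        try:
            try_batch(size)
        except MemoryError:
            break
        best = size
        size *= 2
    return best

# With a budget that fits at most 48 examples, this settles on 32.
```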
4-Bit Quantization
Drastically reduces VRAM requirements:
config.default.toml
- 4x memory reduction (e.g., 70B model fits in 24GB VRAM)
- Minimal quality degradation for abliteration
- Slightly slower inference (~10-20%)
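Heretic's own quantization settings live in config.default.toml; with Hugging Face Transformers and bitsandbytes, 4-bit loading looks roughly like the following (model name and compute dtype are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    quantization_config=quant_config,
    device_map="auto",
)
```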
Results and Model Quality
Benchmark Comparison
From the Heretic README (google/gemma-3-12b-it):
| Model | Refusals | KL Divergence |
|---|---|---|
| Original | 97/100 | 0 (baseline) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic (Heretic) | 3/100 | 0.16 |
Heretic achieves 2.8x lower KL divergence than the best manual abliteration, indicating significantly better preservation of original model capabilities.
Interpreting KL Divergence
- < 0.3: Excellent preservation, minimal behavior change
- 0.3 - 1.0: Good preservation, some capability loss possible
- > 1.0: Significant damage, noticeable quality degradation
Related Topics
How Heretic Works
System architecture and workflow overview
Directional Ablation
Learn about refusal directions and orthogonalization
