General Questions
What is abliteration?
Abliteration (or “directional ablation”) is a technique for removing censorship from language models by modifying their internal weight matrices.
How It Works
The technique was introduced by Arditi et al. (2024) and works by:
- Identifying refusal directions: Computing the difference in hidden states between “harmful” and “harmless” prompts
- Orthogonalizing weight matrices: Modifying specific transformer components (attention out-projection and MLP down-projection) to inhibit these refusal directions
- Preserving model capabilities: Making surgical changes that remove censorship while minimizing damage to the model’s general intelligence
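The orthogonalization step above can be sketched in a few lines of NumPy. This is a toy illustration of the underlying math (projecting the refusal direction out of a weight matrix), not Heretic's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Toy stand-ins for a refusal direction and an output-projection matrix.
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)                   # unit-norm refusal direction
W = rng.normal(size=(d_model, d_model))  # e.g. an attention out-projection

# Orthogonalize: W_abl = (I - r r^T) W removes the refusal component
# from everything the matrix can output.
W_abl = W - np.outer(r, r) @ W

# Any output of the ablated matrix now has (numerically) zero
# component along the refusal direction.
x = rng.normal(size=d_model)
print(abs(float(r @ (W_abl @ x))))  # ~0
```

Because the change is a rank-one update, the rest of the matrix's behavior is left largely intact, which is why the technique can suppress refusals without retraining.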
Why “Abliteration”?
The term combines:
- Ablation: Removing or inhibiting specific model behaviors
- Obliteration: Completely eliminating censorship responses
Heretic’s Innovations
Heretic advances the original abliteration technique:
- Automatic parameter optimization: Uses TPE (Tree-structured Parzen Estimator) to find optimal ablation parameters
- Flexible weight kernels: Adjusts ablation strength across different layers
- Direction interpolation: Uses fractional direction indices to explore directions beyond individual layers
- Component-specific ablation: Applies different parameters to attention vs. MLP components
Unlike fine-tuning or RLHF, abliteration doesn’t require training data or GPU-intensive optimization loops. It’s a direct weight modification that takes minutes to hours rather than days.
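The direction-interpolation idea can be pictured as blending the refusal directions computed at adjacent layers, so that a fractional index like 5.3 lands "between" layers 5 and 6. A minimal NumPy sketch of that idea (the indexing scheme here is illustrative, not Heretic's exact implementation):

```python
import numpy as np

def interpolated_direction(directions, index):
    """Blend per-layer refusal directions at a fractional layer index.

    directions: array of shape (n_layers, d_model), one direction per layer.
    index: fractional value, e.g. 5.3 blends the directions of layers 5 and 6.
    """
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    frac = index - lo
    d = (1 - frac) * directions[lo] + frac * directions[hi]
    return d / np.linalg.norm(d)  # renormalize to unit length

rng = np.random.default_rng(1)
dirs = rng.normal(size=(12, 16))          # 12 layers, hidden size 16
d = interpolated_direction(dirs, 5.3)
print(np.linalg.norm(d))                  # ~1.0 (unit direction)
```

Treating the direction index as a continuous parameter lets the optimizer explore candidate directions that no single layer produces on its own.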
How long does it take to process a model?
Processing time depends on several factors:
Benchmark: Llama-3.1-8B-Instruct takes approximately 45 minutes on an RTX 3090 with default configuration.
By Model Size (RTX 3090, default settings)
| Model Size | Approximate Time |
|---|---|
| 1-4B parameters | 15-30 minutes |
| 7-8B parameters | 45-60 minutes |
| 13B parameters | 1-2 hours |
| 20-30B parameters | 2-3 hours |
| 70B+ parameters | 4-8 hours |
Factors Affecting Speed
Hardware:
- GPU model (faster GPUs = faster processing)
- VRAM capacity (affects batch size)
- CPU speed (for CPU-based operations like PaCMAP)

Configuration:
- Number of trials (`--n-trials`, default ~20)
- Batch size (auto-determined or manual)
- Quantization (slightly slower but uses less VRAM)

Model architecture:
- Number of layers
- Hidden dimension size
- MoE models take longer than dense models
Phases of Processing
- Model loading: 1-5 minutes (depends on model size and quantization)
- Batch size benchmarking: 1-2 minutes
- Computing refusal directions: 5-15 minutes
- Optimization trials: 30 minutes to several hours (the bulk of the time)
- Model merging (if saving): 5-20 minutes
Do I need a GPU?
Technically no, but practically yes. CPU processing is significantly slower.
Without a GPU
Heretic can run on CPU, but:
- An 8B model might take 12-24 hours instead of 45 minutes
- Larger models (20B+) could take days
- System RAM requirements are higher
Supported Accelerators
Heretic supports multiple hardware accelerators:
- CUDA (NVIDIA GPUs) - Most tested and recommended
- XPU (Intel GPUs)
- MLU (Cambricon)
- SDAA (SD Technology)
- MUSA (Moore Threads)
- NPU (Ascend)
- MPS (Apple Metal on M-series Macs)
Recommended Hardware
| Model Size | Minimum VRAM | Recommended VRAM |
|---|---|---|
| 7-8B | 16 GB (with quantization) | 24 GB |
| 13B | 24 GB (with quantization) | 40 GB |
| 20-30B | 40 GB (with quantization) | 80 GB |
| 70B | 80 GB (with quantization) | 160 GB |
Can I use quantization?
Yes! Quantization is highly recommended for large models or limited VRAM.
Enabling Quantization
Enable quantization on the Heretic command line, or set it in config.toml.
Benefits
- ~4x VRAM reduction: Loads models in 4-bit precision instead of 16-bit
- Enables larger models: Run 13B models on 16GB VRAM, 70B on 48GB
- Quality preservation: Quantization affects loading, but final model can be full precision
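The "~4x VRAM reduction" is straightforward arithmetic: 16-bit weights take 2 bytes per parameter, while 4-bit weights take 0.5 bytes. A quick back-of-the-envelope sketch (weights only; activations and quantization overhead add some headroom on top):

```python
def weight_vram_gb(n_params_b: float, bits: int) -> float:
    """Approximate VRAM needed for model weights alone, in GB."""
    return n_params_b * 1e9 * (bits / 8) / 1e9

for size in (8, 13, 70):
    fp16 = weight_vram_gb(size, 16)
    q4 = weight_vram_gb(size, 4)
    print(f"{size}B params: {fp16:.0f} GB at fp16 -> {q4:.1f} GB at 4-bit")
```

This is why a 13B model that would not fit in 16 GB at full precision becomes workable once loaded in 4-bit.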
Important Considerations
The Trade-off
During processing:
- Low VRAM usage
- Slightly slower than full precision
- Fully functional optimization

When saving the final model:
- Must reload base model at full precision
- Merge LoRA adapter into base model
- Requires substantial system RAM
Best Practices
- Use quantization during processing
- Test the model using the built-in chat feature
- Only save/upload if you have sufficient RAM, or use a high-RAM instance specifically for the merge step
How do results compare to manual abliteration?
Heretic produces comparable or better results than manual abliteration by experts.
Benchmark: Gemma-3-12B-IT
Comparison with manually abliterated versions:

| Model | Creator | Refusals | KL Divergence | Notes |
|---|---|---|---|---|
| Original | — | 97/100 | 0.00 | Baseline |
| Abliterated v2 | mlabonne | 3/100 | 1.04 | Manual parameters |
| Abliterated | huihui-ai | 3/100 | 0.45 | Manual parameters |
| Heretic | Automatic | 3/100 | 0.16 | Automatic optimization |
What This Means
Refusals (lower is better):
- All abliteration methods achieve similar refusal suppression
- Heretic: 3/100 (97% refusal removal)
- Comparable to best manual efforts

KL divergence (lower is better):
- Measures how much the model’s behavior has changed
- Lower KL divergence = better preservation of original capabilities
- Heretic achieves 0.16 vs. 0.45-1.04 for manual methods
- Values above 1.0 typically indicate significant capability damage
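The KL-divergence numbers above compare the modified model's next-token probability distributions against the original's; near-zero means the ablated model still predicts almost the same tokens. A minimal NumPy sketch of the metric itself (toy three-token distributions, not the actual benchmark harness):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

original = np.array([0.7, 0.2, 0.1])   # toy next-token distribution
identical = original.copy()            # behavior fully preserved
shifted = np.array([0.1, 0.2, 0.7])    # behavior substantially changed

print(kl_divergence(original, identical))  # ~0.0
print(kl_divergence(original, shifted))    # much larger
```

An abliterated model with low KL divergence against its base model answers ordinary prompts almost identically, which is exactly the "capability preservation" half of the objective.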
User Feedback
From the community:
“I just downloaded GPT-OSS 20B Heretic model and holy shit. It gives properly formatted long responses to sensitive topics, using the exact uncensored words that you would expect from an uncensored model, produces markdown format tables with details and whatnot. Looks like this is the best abliterated version of this model so far…”
“Heretic GPT 20b seems to be the best uncensored model I have tried yet. It doesn’t destroy the model’s intelligence and it is answering prompts normally would be rejected by the base model.”
“Qwen3-4B-Instruct-2507-heretic has been the best unquantized abliterated model that I have been able to run on 16gb vram.”
Why Automatic Is Better
Manual abliteration challenges:
- Requires deep understanding of transformer internals
- Trial-and-error parameter tuning
- Time-intensive experimentation
- Hard to find optimal balance between refusal suppression and capability preservation

Heretic’s automatic approach:
- Explores thousands of parameter combinations
- Uses sophisticated optimization (TPE algorithm)
- Optimizes for both refusal reduction AND capability preservation simultaneously
- Anyone can achieve expert-level results
The benchmarks were compiled using PyTorch 2.8 on an RTX 5090. Exact values may vary slightly by platform and hardware, but relative differences remain consistent.
What models can I use?
See the Supported Models page for comprehensive information. If a model is incompatible, you’ll get an error during loading or abliteration.
Quick Summary
✅ Supported:
- Most dense transformer models
- Many multimodal (vision-language) models
- Several MoE (Mixture of Experts) architectures
- Standard decoder-only transformers

❌ Not supported:
- State Space Models (SSMs) like Mamba
- Hybrid architectures (e.g., Jamba)
- Models with inhomogeneous layers
- Novel attention mechanisms that deviate from standard transformers
Finding Models to Use
Community models:
- Over 1,000 Heretic models on Hugging Face
- Search for models tagged with `heretic`
- The Bestiary: hand-picked, high-quality examples
Testing Compatibility
The easiest way to check compatibility is simply to run Heretic on the model and see whether it loads.
How do I share my model?
When a run finishes, Heretic can save the finished model locally or upload it to Hugging Face. Tag shared models with `heretic` so others can find them.
Can I resume interrupted processing?
Yes! Heretic automatically saves progress and can resume from interruptions.
How Checkpointing Works
Heretic uses Optuna’s journal storage to save:
- Completed trials and their results
- Configuration parameters
- Best found parameters so far
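Conceptually, journal storage is an append-only log: each finished trial is written as one record, and resuming simply replays the log. A stdlib-only sketch of that pattern (illustrative only; Heretic delegates the real bookkeeping to Optuna):

```python
import json
import os
import tempfile

def log_trial(path, trial):
    """Append one completed trial as a JSON line (append-only journal)."""
    with open(path, "a") as f:
        f.write(json.dumps(trial) + "\n")

def load_trials(path):
    """Replay the journal to recover all completed trials."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

journal = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
log_trial(journal, {"trial": 0, "refusals": 5, "kl": 0.9})
log_trial(journal, {"trial": 1, "refusals": 3, "kl": 0.2})

# After an interruption, a fresh process sees everything already finished:
resumed = load_trials(journal)
print(len(resumed))  # 2
```

Because writes are append-only, an interruption mid-trial can lose at most the trial in progress; everything already logged survives.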
Interrupting Safely
Press Ctrl+C at any time:
- Current trial will be marked as pruned
- All previous completed trials are saved
- You can resume later from the same point
- All previous completed trials are saved
- You can resume later from the same point
Resuming a Run
Run the same command again; Heretic detects the checkpoint and prompts you to continue. If you continue, Heretic:
- Resumes optimization from where it stopped
- Uses the original configuration (ignores new CLI arguments)
- Completes remaining trials

If you decline and start fresh instead, Heretic:
- Deletes the checkpoint file
- Begins fresh optimization
- Uses the current configuration
Completed Runs
If you finished all trials and run again, you can:
- Export models from completed runs
- Run additional trials for better results
- Test different trials from the Pareto front
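Because Heretic optimizes two objectives at once (refusal count and KL divergence), a finished run leaves a Pareto front: the trials where neither objective can improve without worsening the other. A small pure-Python sketch of selecting that front from hypothetical trial results:

```python
def pareto_front(trials):
    """Keep trials not dominated on (refusals, kl); lower is better for both."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"] and o["kl"] <= t["kl"] and o != t
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front

# Hypothetical trial results, shaped like the metrics Heretic reports.
trials = [
    {"refusals": 3, "kl": 0.16},   # strong on both objectives
    {"refusals": 2, "kl": 0.90},   # fewer refusals, more capability damage
    {"refusals": 10, "kl": 0.50},  # dominated by the first trial
]
print(pareto_front(trials))
```

Each surviving trial represents a different trade-off, which is why it is worth testing more than one before deciding which model to export.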
Adding More Trials
After viewing the results, you can request additional trials; this extends the optimization without starting over.
Getting More Help
Discord Community
Join the community for real-time help and discussion
GitHub Repository
Report issues, request features, or browse the source code
