Purpose
In addition to its primary function of removing model censorship, Heretic provides features designed to support research into the semantics of model internals (interpretability). These tools help researchers and practitioners understand how refusal mechanisms work in language models by visualizing and analyzing residual vectors across transformer layers.
Installation
To use the research features, install Heretic with the optional research extra.
Research Dependencies
The research extra installs the following additional packages:
- geom-median (~0.1) - For computing geometric medians of residual vectors
- imageio (~2.37) - For generating animated GIFs from residual plots
- matplotlib (~3.10) - For creating visualizations of residual vectors
- numpy (~2.2) - For numerical operations on residual data
- pacmap (~0.8) - For PaCMAP dimensionality reduction projections
- scikit-learn (~1.7) - For computing clustering metrics like silhouette coefficients
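A quick way to confirm that the packages listed above are importable is sketched below. It is not part of Heretic; the function name `missing_research_packages` is hypothetical, and the import names in the mapping are assumptions (a distribution name and its import name can differ, as with scikit-learn and `sklearn`).

```python
from importlib.util import find_spec

# Distribution name -> import name. The import names are assumptions;
# check each package's documentation if a lookup fails unexpectedly.
RESEARCH_MODULES = {
    "geom-median": "geom_median",
    "imageio": "imageio",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "pacmap": "pacmap",
    "scikit-learn": "sklearn",
}


def missing_research_packages():
    """Return the distribution names whose top-level module cannot be found."""
    return [
        dist
        for dist, module in RESEARCH_MODULES.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_research_packages()
    if missing:
        print("Missing research packages:", ", ".join(missing))
    else:
        print("All research packages are importable.")
```

The check only looks for the modules and does not compare installed versions against the ranges listed above.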
Use Cases
Heretic’s research features are valuable for:
- Interpretability Research - Understanding how models represent “harmful” vs “harmless” concepts in their hidden states
- Direction Analysis - Examining the geometric properties of refusal directions across layers
- Ablation Validation - Visualizing how residual spaces change before and after directional ablation
- Model Comparison - Comparing refusal mechanisms across different model architectures
- Layer Behavior - Studying how representations evolve through transformer layers
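To make the idea of directional ablation mentioned in the list above concrete, here is a minimal NumPy sketch on synthetic data. It is an illustration, not Heretic's implementation: the helper `ablate_direction` is hypothetical, the random arrays stand in for one layer's residual vectors, and using the difference of cluster means as the direction is an assumption made for the example.

```python
import numpy as np


def ablate_direction(residuals, direction):
    """Remove the component of each residual vector along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Project each row onto the unit direction and subtract that projection.
    return residuals - np.outer(residuals @ unit, unit)


rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer's residuals, shape (prompts, hidden_size).
harmless = rng.normal(size=(32, 16))
harmful = rng.normal(size=(32, 16)) + 2.0  # shifted cluster

# Candidate direction: difference of the cluster means (an assumption here).
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
unit = direction / np.linalg.norm(direction)

ablated = ablate_direction(harmful, direction)

# After ablation, no residual has a component left along the direction.
print(np.abs(ablated @ unit).max())  # close to zero
```

Comparing `harmful @ unit` with `ablated @ unit` shows the kind of before-and-after change that a residual plot would make visible.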
Available Research Tools
Heretic provides two primary research capabilities:
Residual Vector Plots
Generate PaCMAP projections showing how residual vectors cluster for “harmful” and “harmless” prompts across layers
Geometry Analysis
Print detailed metrics about cosine similarities, norms, and clustering quality of residual vectors
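The kind of metrics described here can be approximated with a few lines of NumPy. The sketch below is not Heretic's code or output format: the function `layer_geometry`, the metric names, and the synthetic per-layer data are all assumptions for illustration. Clustering quality is left out to keep the example NumPy-only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def layer_geometry(harmless, harmful):
    """Summarize one layer; both inputs have shape (prompts, hidden_size)."""
    good_mean = harmless.mean(axis=0)
    bad_mean = harmful.mean(axis=0)
    direction = bad_mean - good_mean  # difference-of-means direction
    return {
        "cos_means": cosine_similarity(good_mean, bad_mean),
        "cos_good_direction": cosine_similarity(good_mean, direction),
        "norm_good_mean": float(np.linalg.norm(good_mean)),
        "norm_bad_mean": float(np.linalg.norm(bad_mean)),
        "norm_direction": float(np.linalg.norm(direction)),
    }


rng = np.random.default_rng(0)
n_layers, n_prompts, hidden_size = 4, 32, 16

for layer in range(n_layers):
    # Synthetic data: the "harmful" cluster drifts away in deeper layers.
    harmless = rng.normal(size=(n_prompts, hidden_size)) + 1.0
    harmful = rng.normal(size=(n_prompts, hidden_size)) + 1.0
    harmful[:, 0] += 0.5 * layer
    metrics = layer_geometry(harmless, harmful)
    summary = ", ".join(f"{name}={value:.3f}" for name, value in metrics.items())
    print(f"layer {layer}: {summary}")
```

With real residuals, each layer's arrays would come from the model's hidden states rather than from a random generator.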
Research features require additional computational resources. PaCMAP projections for larger models can take an hour or more to compute on CPU.
