What are Residual Vectors?
Residual vectors (also called hidden states or activations) represent the internal state of a transformer model at each layer. By analyzing how these vectors differ between “harmful” and “harmless” prompts, we can identify and remove the directions associated with refusal behavior.
Heretic computes residual vectors for the first output token at each transformer layer, capturing the model’s initial response before generation begins.
Enabling Residual Plots
To generate residual vector visualizations, run Heretic with the --plot-residuals flag.
How It Works
When you run Heretic with --plot-residuals, the following process occurs:
Compute Residual Vectors
Extract hidden states for the first output token at each transformer layer for both “harmful” and “harmless” prompts from your configured datasets.
PaCMAP Projection
Perform dimensionality reduction from high-dimensional residual space (typically thousands of dimensions) to 2D space using PaCMAP (Pairwise Controlled Manifold Approximation).
Geometric Alignment
Left-right align the projections of “harmful”/“harmless” residuals by their geometric medians to make projections for consecutive layers more similar. Additionally, PaCMAP is initialized with the previous layer’s projections for each new layer, minimizing disruptive transitions.
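The alignment step can be illustrated with a small, dependency-free sketch. It is not Heretic's actual code: the function names are hypothetical, the geometric median is approximated with Weiszfeld's algorithm, and placing the “harmless” cluster on the left is an assumption made for illustration.

```python
import math


def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate the geometric median of 2D points (Weiszfeld's algorithm)."""
    # Start from the centroid and iteratively re-weight by inverse distance.
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iterations):
        wsum = xsum = ysum = 0.0
        for px, py in points:
            dist = math.hypot(px - x, py - y)
            w = 1.0 / max(dist, eps)  # guard against division by zero
            wsum += w
            xsum += w * px
            ysum += w * py
        x, y = xsum / wsum, ysum / wsum
    return x, y


def align_left_right(harmless, harmful):
    """Rotate both clusters so the harmless-to-harmful median vector points along +x."""
    hx, hy = geometric_median(harmless)
    bx, by = geometric_median(harmful)
    angle = math.atan2(by - hy, bx - hx)
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    def rotate(points):
        # Standard 2D rotation about the origin by -angle.
        return [(cos_a * px - sin_a * py, sin_a * px + cos_a * py) for px, py in points]

    return rotate(harmless), rotate(harmful)
```

Because a rotation preserves distances, the clusters keep their shape and only the orientation changes, which is what makes consecutive layers visually comparable.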
PaCMAP Projection Explained
PaCMAP (Pairwise Controlled Manifold Approximation) is a dimensionality reduction technique that preserves both local and global structure better than alternatives like t-SNE or UMAP. Heretic uses PaCMAP to project high-dimensional residual vectors into 2D space while maintaining the geometric relationships between “harmful” and “harmless” clusters.
Key features of Heretic’s PaCMAP implementation:
- Sequential initialization - Each layer’s projection is initialized with the previous layer’s projection, creating smooth transitions in the animation
- Consistent orientation - Projections are rotated so the geometric medians of the two clusters align horizontally across all layers
- CPU computation - PaCMAP runs on CPU and can be computationally expensive for larger models
For larger models, computing PaCMAP projections for all layers can take an hour or more. Heretic displays progress during the computation so you can monitor it.
Generated Outputs
Residual plots are saved to a directory structure based on your configuration.
PNG Files per Layer
Each layer gets its own PNG file showing:
- Blue dots (royalblue by default) - Residual vectors for “harmless” prompts
- Orange dots (darkorange by default) - Residual vectors for “harmful” prompts
- Layer number - Displayed in bottom-right corner
- Model name - Displayed in bottom-left corner
- Title - Configurable title at the top
Animated GIF
The animation.gif file shows how residual vectors evolve through the transformer layers:
- Each layer is displayed for 1 second
- Smooth transitions between layers use 20 interpolated frames (50ms each)
- Animation loops continuously
- Helps visualize how the separation between “harmful” and “harmless” clusters changes across layers
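The timing described above can be sketched as a frame schedule. This is an illustrative assumption about how such an animation might be assembled, not Heretic's implementation: `build_frames` is a hypothetical helper, it uses plain linear interpolation, and it assumes the 20 in-between frames exclude both endpoint layers.

```python
def build_frames(layer_points, hold_ms=1000, steps=20, step_ms=50):
    """Return a list of (points, duration_ms) frames.

    layer_points: one list of (x, y) tuples per layer, with points in the
    same order in every layer so they can be interpolated pairwise.
    """
    frames = []
    for i, current in enumerate(layer_points):
        # Hold the layer itself for hold_ms.
        frames.append((current, hold_ms))
        if i + 1 < len(layer_points):
            nxt = layer_points[i + 1]
            for s in range(1, steps + 1):
                t = s / (steps + 1)  # strictly between 0 and 1
                blended = [
                    ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
                    for (x0, y0), (x1, y1) in zip(current, nxt)
                ]
                frames.append((blended, step_ms))
    return frames
```

Under these assumptions, a model with N layers yields N + 20 × (N − 1) frames, and each transition adds one second (20 × 50 ms) to the loop. If the frames were rendered to images, a library such as Pillow could write them as a looping GIF via `Image.save(..., save_all=True, append_images=..., duration=..., loop=0)`.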
Example Visualization
The example residual plot from the README shows the PaCMAP projection for a specific layer. It illustrates the following:
- Clear separation between “harmless” (blue) and “harmful” (orange) clusters
- The refusal direction corresponds to the vector connecting the centers of these clusters
- The quality of separation varies by layer, which informs ablation strategy
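The idea of a direction connecting the two cluster centers can be sketched as a normalized difference of means. This is a minimal illustration on plain Python lists, with hypothetical function names; it operates on full residual vectors rather than on the 2D projection, and it is not taken from Heretic's source.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("cluster means coincide; no direction can be derived")
    return [d / norm for d in diff]
```

A layer where the two means nearly coincide produces an unstable direction, which is one reason the quality of separation per layer matters.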
Configuration Options
You can customize residual plots using options in config.toml or command-line flags:
Output Path
{residual_plot_path}/{model_name}/
Plot Title
Matplotlib Style
- dark_background (default)
- classic
- seaborn-v0_8
- Any other Matplotlib style sheet
Dataset Colors and Labels
Customize the appearance of each dataset.
Performance Considerations
Residual plotting involves significant computational overhead:
- Memory - All residual vectors for all prompts and layers must be stored in memory
- CPU time - PaCMAP is CPU-bound and runs sequentially through layers
- Storage - PNG files are approximately 100-200 KB each; expect 5-20 MB per model depending on layer count
The plotting process shows two progress bars:
- Computing PaCMAP projections (slower, CPU-bound)
- Generating plots (faster, I/O-bound)
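For a rough sense of the memory and storage figures above, the arithmetic can be written out. The helper names, the 4-bytes-per-value default (float32), and the example sizes are all assumptions for illustration; actual usage depends on the model, the datasets, and the data type in use.

```python
def residual_memory_bytes(num_prompts, num_layers, hidden_size, bytes_per_value=4):
    """Bytes needed to hold one residual vector per prompt per layer."""
    return num_prompts * num_layers * hidden_size * bytes_per_value


def png_storage_range_kb(num_layers, kb_low=100, kb_high=200):
    """Low and high estimate (in KB) for one PNG per layer; the GIF is extra."""
    return num_layers * kb_low, num_layers * kb_high


# Hypothetical example: 800 prompts, 32 layers, hidden size 4096, float32.
mem = residual_memory_bytes(800, 32, 4096)
print(f"{mem / 2**20:.0f} MiB of residuals")       # 400 MiB
print(png_storage_range_kb(32), "KB of PNG files")  # (3200, 6400)
```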
Use Cases
Interpretability Research
Visualize how refusal mechanisms emerge and evolve through transformer layers:
- Early layers may show little separation
- Middle layers often show the clearest clustering
- Late layers may show different patterns
Model Comparison
Compare residual spaces across different models:
- Architecture differences
- Training procedure effects
- Alignment technique impacts
Ablation Validation
Generate plots before and after ablation to verify that:
- Refusal directions are successfully suppressed
- Overall residual structure remains intact
- No unexpected clustering artifacts appear
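The first two checks can also be expressed numerically. The sketch below is a simplified stand-in: it removes a direction from the residual vectors themselves, whereas real ablation modifies the model, and all function names are hypothetical. It assumes `direction` is a unit vector.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(vector, direction):
    """Remove the component of `vector` along the unit vector `direction`."""
    scale = dot(vector, direction)
    return [v - scale * d for v, d in zip(vector, direction)]


def separation_along(harmful, harmless, direction):
    """Difference between the mean projections of the two groups onto `direction`."""
    mean_harmful = sum(dot(v, direction) for v in harmful) / len(harmful)
    mean_harmless = sum(dot(v, direction) for v in harmless) / len(harmless)
    return mean_harmful - mean_harmless
```

A separation close to zero after projection, with the remaining components unchanged, is the numerical counterpart of "direction suppressed, structure intact" in the plots.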
Layer Selection
Identify which layers have the strongest refusal signals:
- Layers with clear separation are good ablation candidates
- Layers with overlapping clusters may be less critical
- Animation reveals where refusal representations first emerge
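Visual inspection can be complemented with a simple per-layer score. The metric below (distance between cluster centroids divided by the average within-cluster spread) is a hypothetical heuristic chosen for illustration, not a measure that Heretic is known to compute.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def spread(vectors, center):
    """Mean Euclidean distance of the vectors from their center."""
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


def separation_score(harmful, harmless):
    """Centroid gap relative to average cluster spread (higher = clearer split)."""
    c_harmful, c_harmless = centroid(harmful), centroid(harmless)
    gap = math.dist(c_harmful, c_harmless)
    noise = (spread(harmful, c_harmful) + spread(harmless, c_harmless)) / 2
    return gap / noise if noise > 0 else math.inf


def rank_layers(per_layer):
    """per_layer: list of (harmful, harmless) pairs, one per layer.

    Returns layer indices sorted from strongest to weakest separation.
    """
    scores = [separation_score(h, g) for h, g in per_layer]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Layers near the top of the ranking are the ones where the clusters are far apart relative to their size, matching the "clear separation" criterion above.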
