Overview
The --print-residual-geometry flag prints a quantitative analysis of how residual vectors for “harmful” and “harmless” prompts relate to each other. It generates a table of per-layer metrics that help in understanding refusal mechanisms in transformer models.
Enabling Geometry Analysis
To print residual geometry metrics, use the --print-residual-geometry flag:
Example Output
Here is the geometry analysis table for gemma-3-270m-it:
Understanding the Metrics
The geometry analysis table includes the following vectors and metrics:
Vectors
- g = Mean of residual vectors for good (harmless) prompts
- g* = Geometric median of residual vectors for good prompts
- b = Mean of residual vectors for bad (harmful) prompts
- b* = Geometric median of residual vectors for bad prompts
- r = Refusal direction for means (i.e., b - g)
- r* = Refusal direction for geometric medians (i.e., b* - g*)
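As a rough illustration of the mean-based definitions above, the following sketch builds g, b, and r from toy data. The helper names and the 3-dimensional residuals are hypothetical, chosen only for readability; real residuals have the model's hidden size, and this is not Heretic's own code.

```python
def mean_vector(vectors):
    """Component-wise arithmetic mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(x, y):
    """Component-wise difference x - y."""
    return [a - b for a, b in zip(x, y)]


# Hypothetical residuals for two harmless and two harmful prompts.
good_residuals = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
bad_residuals = [[2.0, 4.0, 2.0], [2.0, 6.0, 2.0]]

g = mean_vector(good_residuals)  # [2.0, 0.0, 2.0]
b = mean_vector(bad_residuals)   # [2.0, 5.0, 2.0]
r = subtract(b, g)               # [0.0, 5.0, 0.0]
```

The starred variants follow the same pattern with the geometric median in place of the mean.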
Similarity Metrics
S(x,y) = Cosine similarity of vectors x and y
Cosine similarity ranges from -1 to 1:
- 1.0 = Vectors point in exactly the same direction
- 0.0 = Vectors are orthogonal (perpendicular)
- -1.0 = Vectors point in opposite directions
- S(g,b) - How similar are the mean vectors for good/bad prompts?
  - High values (close to 1.0) indicate the residuals are very similar overall
  - This is typically high in early layers
- S(g*,b*) - Same as S(g,b) but using geometric medians
  - Geometric medians are more robust to outliers than means
  - Usually very similar to S(g,b)
- S(g,r) - How aligned is the good direction with the refusal direction?
  - Positive values mean r has a component along g, so moving along the refusal direction increases the projection onto g
  - Negative values mean r has a component opposing g, so moving along the refusal direction decreases the projection onto g
- S(g*,r*) - Same as S(g,r) but using geometric medians
- S(b,r) - How aligned is the bad direction with the refusal direction?
  - Should typically be positive and relatively high
  - Indicates the refusal direction captures the harmful prompt representation
- S(b*,r*) - Same as S(b,r) but using geometric medians
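The similarity metrics above can be sketched in plain Python as follows. The vectors are hypothetical toy values and the function names are illustrative; this only shows the formula S(x,y) = (x · y) / (|x| |y|), not the tool's implementation.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """S(x, y) = (x . y) / (|x| |y|); undefined if either vector is all zeros."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


# Hypothetical means of good and bad residuals.
g = [2.0, 0.0, 2.0]
b = [2.0, 5.0, 2.0]
r = [bi - gi for bi, gi in zip(b, g)]  # refusal direction b - g

s_gb = cosine_similarity(g, b)  # roughly 0.49
s_gr = cosine_similarity(g, r)  # 0.0: r is orthogonal to g in this toy case
s_br = cosine_similarity(b, r)  # roughly 0.87
```

In this toy case the refusal direction is orthogonal to the good mean but strongly aligned with the bad mean.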
Norm Metrics
|x| = L2 norm (Euclidean magnitude) of vector x
The L2 norm measures the “size” or “magnitude” of a vector:
- |g| and |g*| - Magnitude of good prompt representations
- |b| and |b*| - Magnitude of bad prompt representations
- |r| and |r*| - Magnitude of refusal directions
Norms typically increase through the layers as each layer adds its output to the residual stream, and may then decrease in the final layers.
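For reference, the L2 norm is the square root of the sum of squared components. A minimal sketch, using hypothetical toy vectors and an illustrative helper name:

```python
import math


def l2_norm(x):
    """|x| = square root of the sum of squared components."""
    return math.sqrt(sum(component * component for component in x))


g = [2.0, 0.0, 2.0]  # hypothetical mean of good residuals
b = [2.0, 5.0, 2.0]  # hypothetical mean of bad residuals
r = [bi - gi for bi, gi in zip(b, g)]

norm_g = l2_norm(g)  # sqrt(8)
norm_r = l2_norm(r)  # 5.0
```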
Clustering Metrics
Silh = Mean silhouette coefficient of residuals for good/bad clusters
The silhouette coefficient measures how well the residuals cluster into two distinct groups:
- Range: -1 to 1
- > 0.5 = Strong, well-separated clusters
- 0.2 - 0.5 = Moderate separation (weak structure)
- < 0.2 = Weak separation (overlapping clusters)
- < 0 = Points may be assigned to wrong clusters
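To make the silhouette ranges concrete, here is a small pure-Python sketch of the mean silhouette coefficient for exactly two clusters with Euclidean distance. The function name and the 2-dimensional toy points are assumptions made for illustration; the point is only to show how the score is built from within-cluster and between-cluster distances.

```python
import math


def mean_silhouette(cluster_a, cluster_b):
    """Mean silhouette coefficient for exactly two clusters (Euclidean distance)."""
    def score(index, own, other):
        point = own[index]
        rest = own[:index] + own[index + 1:]
        if not rest:
            return 0.0  # common convention for a single-point cluster
        a = sum(math.dist(point, p) for p in rest) / len(rest)    # cohesion
        b = sum(math.dist(point, p) for p in other) / len(other)  # separation
        largest = max(a, b)
        return 0.0 if largest == 0 else (b - a) / largest

    scores = [score(i, cluster_a, cluster_b) for i in range(len(cluster_a))]
    scores += [score(i, cluster_b, cluster_a) for i in range(len(cluster_b))]
    return sum(scores) / len(scores)


# Hypothetical well-separated clusters: score is about 0.90.
separated = mean_silhouette([[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]])
# Hypothetical interleaved clusters: score is -0.25.
overlapping = mean_silhouette([[0.0, 0.0], [2.0, 0.0]], [[1.0, 0.0], [3.0, 0.0]])
```

The first pair falls in the "strong" band above, while the interleaved pair scores below zero.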
Interpreting the Output
Understanding Geometric Medians vs Means
Heretic computes both means and geometric medians for residual vectors:
Means (g, b, r)
- Standard arithmetic average across all residual vectors
- Sensitive to outliers - extreme values can skew the result
- Computationally simple and fast
- Traditional choice for difference-of-means refusal directions
Geometric Medians (g*, b*, r*)
- The point that minimizes the sum of distances to all residual vectors
- Robust to outliers - extreme values have less influence
- More expensive to compute
- Can provide more stable refusal directions for noisy datasets
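The outlier behavior described above can be seen on toy data. The sketch below approximates the geometric median with Weiszfeld's iteration, a classic scheme used here purely for illustration; it is not necessarily what the geom-median library does internally, and the points and function names are hypothetical.

```python
import math


def mean_vector(points):
    """Component-wise arithmetic mean."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def geometric_median(points, iterations=200, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration."""
    estimate = mean_vector(points)
    for _ in range(iterations):
        # Weight each point by the inverse of its distance to the estimate;
        # eps avoids division by zero if the estimate lands on a data point.
        weights = [1.0 / max(math.dist(estimate, p), eps) for p in points]
        total = sum(weights)
        estimate = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(estimate))
        ]
    return estimate


# Four points on a unit square plus one extreme outlier.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
mean = mean_vector(points)         # pulled out to about (20.4, 20.4)
median = geometric_median(points)  # stays inside the unit square
```

The single outlier drags the mean far from the bulk of the data, while the geometric median stays close to the four clustered points.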
Refusal Direction Analysis
The refusal direction r (or r*) is the key vector that directional ablation removes. Here’s how to analyze it:
Strong Refusal Signals
Layers with strong refusal signals typically show:
- High |r| - Large refusal direction magnitude (e.g., > 100 in the example)
- High Silh - Good cluster separation (e.g., > 0.15)
- High S(b,r) - Bad prompts align with refusal direction (e.g., > 0.5)
- Moderate to high |S(g,r)| - Good prompts either align or anti-align with refusal
For example, one layer in the example table shows:
- Refusal magnitude |r| = 743.95 (large)
- Silhouette = 0.2863 (highest in the table)
- S(b,r) = 0.8579 (very high alignment)
- S(g,r) = 0.8189 (also high, indicating refusal affects both directions)
Weak Refusal Signals
Layers with weak refusal signals typically show:
- Low |r| - Small refusal direction magnitude
- Low Silh - Poor cluster separation (e.g., < 0.1)
- Low |S(g,r)| and Low |S(b,r)| - Neither direction strongly aligns
For example, one layer in the example table shows:
- Refusal magnitude |r| = 1.19 (very small)
- Silhouette = 0.0480 (very low)
- S(g,b) = 1.0000 (clusters are nearly identical)
Directional Consistency
Compare S(g,r) and S(b,r) across layers:
- Same sign = Refusal direction is “in between” good and bad representations
- Opposite signs = Refusal direction clearly separates good from bad
- Sign changes across layers = Refusal mechanism evolves through the network
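The sign comparison can be automated once the table values are copied into Python. This is a minimal sketch with hypothetical helper names; the per-layer list in the second call is invented for illustration, while the first call uses the S(g,r) and S(b,r) values quoted in the strong-signal example.

```python
def sign_relation(s_gr, s_br):
    """Classify whether S(g,r) and S(b,r) share a sign."""
    if s_gr == 0 or s_br == 0:
        return "undetermined"
    return "same sign" if (s_gr > 0) == (s_br > 0) else "opposite signs"


def sign_change_layers(values):
    """Indices at which the sign flips relative to the previous layer."""
    return [i for i in range(1, len(values)) if values[i - 1] * values[i] < 0]


relation = sign_relation(0.8189, 0.8579)  # "same sign"

# Hypothetical per-layer S(g,r) values, not taken from any real model.
flips = sign_change_layers([-0.2, -0.1, 0.3, 0.5, -0.4])  # [2, 4]
```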
Layer-by-Layer Patterns
Common patterns observed in the geometry analysis:
Early Layers (e.g., layers 1-3)
- Very high S(g,b) (≈ 1.0) - Representations are still very similar
- Small |r| - Refusal direction is weak
- Low Silh - Poor cluster separation
- Interpretation: The model hasn’t yet differentiated harmful from harmless
Middle Layers (e.g., layers 8-12)
- Decreasing S(g,b) - Representations diverge
- Growing |r| - Refusal direction strengthens
- Higher Silh - Better cluster separation
- Interpretation: Core refusal mechanisms are active here
Late Layers (e.g., layer 18)
- May show different patterns depending on model architecture
- Sometimes |r| is still large (refusal persists)
- Sometimes patterns reverse (refusal resolved)
- Interpretation: Model prepares final output representation
Configuration
The geometry analysis uses the same datasets configured for the main abliteration process:
Use Cases
Identifying Optimal Ablation Layers
Use the silhouette coefficient and refusal magnitude to identify which layers have the strongest refusal signals:
- Sort layers by Silh (descending)
- Look for layers with both high Silh and high |r|
- These layers are prime candidates for targeted ablation
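The selection steps above can be sketched as a filter followed by a sort. The row labels, the third row, and the function name are hypothetical; the first two rows reuse the |r| and Silh values quoted earlier, and the thresholds mirror the "> 100" and "> 0.15" examples, which are rules of thumb rather than fixed cutoffs.

```python
# Rows of (label, |r|, Silh), copied by hand from the printed table.
rows = [
    ("strong example", 743.95, 0.2863),
    ("weak example", 1.19, 0.0480),
    ("hypothetical row", 120.0, 0.1700),  # invented for illustration
]


def candidate_layers(rows, min_norm, min_silh):
    """Keep rows with both high |r| and high Silh, sorted by Silh descending."""
    kept = [row for row in rows if row[1] >= min_norm and row[2] >= min_silh]
    return sorted(kept, key=lambda row: row[2], reverse=True)


candidates = candidate_layers(rows, min_norm=100.0, min_silh=0.15)
labels = [row[0] for row in candidates]
# labels == ["strong example", "hypothetical row"]
```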
Validating Ablation Parameters
Compare geometry before and after ablation:
- |r| should decrease in ablated layers
- Silh should decrease (clusters should be less separated)
- S(g,b) should increase (representations should be more similar)
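These three expectations can be checked mechanically for a given layer. The sketch below assumes the metrics were copied by hand into dictionaries; the keys, the function name, and all numbers are hypothetical.

```python
def ablation_checks(before, after):
    """Compare one layer's metrics before and after ablation."""
    return {
        "|r| decreased": after["r_norm"] < before["r_norm"],
        "Silh decreased": after["silh"] < before["silh"],
        "S(g,b) increased": after["s_gb"] > before["s_gb"],
    }


# Hypothetical values for a single layer, not measured on any model.
before = {"r_norm": 120.0, "silh": 0.17, "s_gb": 0.90}
after = {"r_norm": 15.0, "silh": 0.03, "s_gb": 0.99}

checks = ablation_checks(before, after)
all_passed = all(checks.values())  # True for these values
```

A failed check on an ablated layer suggests the ablation parameters for that layer deserve a second look.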
Understanding Model Architecture
Different architectures show different geometric patterns:
- Some models have refusal concentrated in specific layers
- Others distribute refusal across many layers
- MoE models may show different patterns per expert
Research and Interpretability
The metrics provide quantitative data for research questions:
- How do refusal directions evolve during training?
- How do different alignment techniques affect geometric properties?
- Are geometric medians more effective than means for ablation?
- What is the relationship between Silh and ablation effectiveness?
Implementation Details
From analyzer.py, the geometry analysis:
- Computes means using Tensor.mean(dim=0)
- Computes geometric medians using the geom-median library
- Calculates cosine similarities using F.cosine_similarity()
- Calculates L2 norms using LA.vector_norm()
- Computes silhouette coefficients using sklearn.metrics.silhouette_score()
Geometric median computation is performed on CPU and may take a few seconds for large models with many layers.
Combining with Residual Plots
For comprehensive analysis, use both tools together:
- Quantitative metrics from geometry analysis
- Visual intuition from residual plots
- Temporal evolution from animated GIFs
