Residual Geometry Analysis

Overview

The --print-residual-geometry flag prints a quantitative analysis of how residual vectors for “harmful” and “harmless” prompts relate to each other. The output is a per-layer table of similarity, norm, and clustering metrics that helps in understanding refusal mechanisms in transformer models.

Enabling Geometry Analysis

To print residual geometry metrics, use the --print-residual-geometry flag:

heretic Qwen/Qwen3-4B-Instruct-2507 --print-residual-geometry

You must install Heretic with the research extra for this feature:

pip install -U heretic-llm[research]

Example Output

Here is the geometry analysis table for gemma-3-270m-it:

┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Layer ┃ S(g,b) ┃ S(g*,b*) ┃  S(g,r) ┃ S(g*,r*) ┃  S(b,r) ┃ S(b*,r*) ┃      |g| ┃     |g*| ┃      |b| ┃     |b*| ┃     |r| ┃    |r*| ┃   Silh ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│     1 │ 1.0000 │   1.0000 │ -0.4311 │  -0.4906 │ -0.4254 │  -0.4847 │   170.29 │   170.49 │   169.78 │   169.85 │    1.19 │    1.31 │ 0.0480 │
│     2 │ 1.0000 │   1.0000 │  0.4297 │   0.4465 │  0.4365 │   0.4524 │   768.55 │   768.77 │   771.32 │   771.36 │    6.39 │    5.76 │ 0.0745 │
│     3 │ 0.9999 │   1.0000 │ -0.5699 │  -0.5577 │ -0.5614 │  -0.5498 │  1020.98 │  1021.13 │  1013.80 │  1014.71 │   12.70 │   11.60 │ 0.0920 │
│     4 │ 0.9999 │   1.0000 │  0.6582 │   0.6553 │  0.6659 │   0.6627 │  1356.39 │  1356.20 │  1368.71 │  1367.95 │   18.62 │   17.84 │ 0.0957 │
│     5 │ 0.9987 │   0.9990 │ -0.6880 │  -0.6761 │ -0.6497 │  -0.6418 │   766.54 │   762.25 │   731.75 │   732.42 │   51.97 │   45.24 │ 0.1018 │
│     6 │ 0.9998 │   0.9998 │ -0.1983 │  -0.2312 │ -0.1811 │  -0.2141 │  2417.35 │  2421.08 │  2409.18 │  2411.40 │   43.06 │   43.47 │ 0.0900 │
│     7 │ 0.9998 │   0.9997 │ -0.5258 │  -0.5746 │ -0.5072 │  -0.5560 │  3444.92 │  3474.99 │  3400.01 │  3421.63 │   86.94 │   94.38 │ 0.0492 │
│     8 │ 0.9990 │   0.9991 │  0.8235 │   0.8312 │  0.8479 │   0.8542 │  4596.54 │  4615.62 │  4918.32 │  4934.20 │  384.87 │  377.87 │ 0.2278 │
│     9 │ 0.9992 │   0.9992 │  0.5335 │   0.5441 │  0.5678 │   0.5780 │  5322.30 │  5316.96 │  5468.65 │  5466.98 │  265.68 │  267.28 │ 0.1318 │
│    10 │ 0.9974 │   0.9973 │  0.8189 │   0.8250 │  0.8579 │   0.8644 │  5328.81 │  5325.63 │  5953.35 │  5985.15 │  743.95 │  779.74 │ 0.2863 │
│    11 │ 0.9977 │   0.9978 │  0.4262 │   0.4045 │  0.4862 │   0.4645 │  9644.02 │  9674.06 │  9983.47 │  9990.28 │  743.28 │  726.99 │ 0.1576 │
│    12 │ 0.9904 │   0.9907 │  0.4384 │   0.4077 │  0.5586 │   0.5283 │ 10257.40 │ 10368.50 │ 11114.51 │ 11151.21 │ 1711.18 │ 1664.69 │ 0.1890 │
│    13 │ 0.9867 │   0.9874 │  0.4007 │   0.3680 │  0.5444 │   0.5103 │ 12305.12 │ 12423.75 │ 13440.31 │ 13432.47 │ 2386.43 │ 2282.47 │ 0.1293 │
│    14 │ 0.9921 │   0.9922 │  0.3198 │   0.2682 │  0.4364 │   0.3859 │ 16929.16 │ 17080.37 │ 17826.97 │ 17836.03 │ 2365.23 │ 2301.87 │ 0.1282 │
│    15 │ 0.9846 │   0.9850 │  0.1198 │   0.0963 │  0.2913 │   0.2663 │ 16858.58 │ 16949.44 │ 17496.00 │ 17502.88 │ 3077.08 │ 3029.60 │ 0.1611 │
│    16 │ 0.9686 │   0.9689 │ -0.0029 │  -0.0254 │  0.2457 │   0.2226 │ 18912.77 │ 19074.86 │ 19510.56 │ 19559.62 │ 4848.35 │ 4839.75 │ 0.1516 │
│    17 │ 0.9782 │   0.9784 │ -0.0174 │  -0.0381 │  0.1908 │   0.1694 │ 27098.09 │ 27273.00 │ 27601.12 │ 27653.12 │ 5738.19 │ 5724.21 │ 0.1641 │
│    18 │ 0.9184 │   0.9196 │  0.1343 │   0.1430 │  0.5155 │   0.5204 │   190.16 │   190.35 │   219.91 │   220.62 │   87.82 │   87.59 │ 0.1855 │
└───────┴────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┴──────────┴──────────┴─────────┴─────────┴────────┘

Understanding the Metrics

The geometry analysis table includes the following vectors and metrics:

Vectors

g = Mean of residual vectors for good (harmless) prompts
g* = Geometric median of residual vectors for good prompts
b = Mean of residual vectors for bad (harmful) prompts
b* = Geometric median of residual vectors for bad prompts
r = Refusal direction for means (i.e., b - g)
r* = Refusal direction for geometric medians (i.e., b* - g*)
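
The definitions above can be illustrated with a minimal, dependency-free sketch. The toy residuals are made-up numbers, and the helper names mean_vector and l2_norm are hypothetical; this is not Heretic's actual code.

```python
from math import sqrt

def mean_vector(vectors):
    """Component-wise arithmetic mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def l2_norm(v):
    """Euclidean length of a vector."""
    return sqrt(sum(x * x for x in v))

# Toy 3-dimensional "residuals" (made-up numbers, not real model activations).
good_residuals = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
bad_residuals = [[2.0, 3.0, 2.0], [2.0, 5.0, 2.0]]

g = mean_vector(good_residuals)          # [2.0, 0.0, 2.0]
b = mean_vector(bad_residuals)           # [2.0, 4.0, 2.0]
r = [bi - gi for bi, gi in zip(b, g)]    # b - g = [0.0, 4.0, 0.0]

print("g =", g, "b =", b, "r =", r, "|r| =", l2_norm(r))
```

In this toy case the two groups differ only along the second coordinate, so r points purely along that axis.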

Similarity Metrics

S(x,y) = Cosine similarity of vectors x and y

Cosine similarity ranges from -1 to 1:

1.0 = Vectors point in exactly the same direction
0.0 = Vectors are orthogonal (perpendicular)
-1.0 = Vectors point in opposite directions
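
As a rough illustration of the three reference values, here is a plain-Python cosine similarity. The function name is chosen for this sketch; according to the Implementation Details section, Heretic itself uses F.cosine_similarity().

```python
from math import sqrt

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0, and -1. Vector length does not matter, only direction.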

Key similarity columns:

S(g,b) - How similar are mean vectors for good/bad prompts?
- High values (close to 1.0) indicate the residuals are very similar overall
- This is typically high in early layers
S(g*,b*) - Same as S(g,b) but using geometric medians
- Geometric medians are more robust to outliers than means
- Usually very similar to S(g,b)
S(g,r) - How aligned is the good mean with the refusal direction?
- Positive values mean r has a component along g: moving from g to b increases the projection onto g
- Negative values mean r has a component opposite to g: moving from g to b decreases the projection onto g
S(g*,r*) - Same as S(g,r) but using geometric medians
S(b,r) - How aligned is the bad direction with the refusal direction?
- Should typically be positive and relatively high
- Indicates the refusal direction captures the harmful prompt representation
S(b*,r*) - Same as S(b,r) but using geometric medians

Norm Metrics

|x| = L2 norm (Euclidean magnitude) of vector x

The L2 norm measures the “size” or “magnitude” of a vector:

|g| and |g*| - Magnitude of good prompt representations
|b| and |b*| - Magnitude of bad prompt representations
|r| and |r*| - Magnitude of refusal directions

Norms typically grow through the layers and may drop again in the final layers. In the example table, |g| rises from about 170 at layer 1 to about 27,000 at layer 17, then falls to about 190 at layer 18.

Clustering Metrics

Silh = Mean silhouette coefficient of residuals for good/bad clusters

The silhouette coefficient measures how well the residuals cluster into two distinct groups:

Range: -1 to 1
> 0.5 = Strong, well-separated clusters
0.2 - 0.5 = Moderate separation (weak structure)
< 0.2 = Weak separation (overlapping clusters)
< 0 = Points may be assigned to wrong clusters

Higher silhouette scores indicate clearer separation between “harmful” and “harmless” representations, suggesting that layer is a good candidate for ablation.
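
To make the score concrete, the following sketch computes the mean silhouette coefficient for exactly two clusters in plain Python. It is an illustrative reimplementation with a hypothetical function name and toy points. Heretic itself relies on sklearn.metrics.silhouette_score(), as noted under Implementation Details.

```python
from math import dist

def mean_silhouette(cluster_a, cluster_b):
    """Mean silhouette coefficient for two clusters, using Euclidean distance."""
    scores = []
    for own, other in ((cluster_a, cluster_b), (cluster_b, cluster_a)):
        for i, point in enumerate(own):
            rest = own[:i] + own[i + 1:]
            if not rest:
                scores.append(0.0)  # convention: a singleton cluster scores 0
                continue
            a = sum(dist(point, q) for q in rest) / len(rest)    # mean distance within own cluster
            b = sum(dist(point, q) for q in other) / len(other)  # mean distance to the other cluster
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy 2-D points (made up for illustration).
separated = mean_silhouette([[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]])
interleaved = mean_silhouette([[0.0, 0.0], [2.0, 0.0]], [[1.0, 0.0], [3.0, 0.0]])

print(f"separated: {separated:.3f}, interleaved: {interleaved:.3f}")
```

The well-separated pair scores close to 1, while the interleaved pair scores below zero because several points sit closer to the other cluster than to their own.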

Interpreting the Output

Understanding Geometric Medians vs Means

Heretic computes both means and geometric medians for residual vectors:

Means (g, b, r)

Standard arithmetic average across all residual vectors
Sensitive to outliers - extreme values can skew the result
Computationally simple and fast
Traditional choice for difference-of-means refusal directions

Geometric Medians (g*, b*, r*)

The point that minimizes the sum of distances to all residual vectors
Robust to outliers - extreme values have less influence
More computationally expensive to compute
Can provide more stable refusal directions for noisy datasets

If mean and geometric median metrics differ significantly, it suggests the presence of outlier residual vectors. The geometric median is generally more reliable in such cases.
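
The effect of an outlier on the two estimators can be sketched as follows. This example uses Weiszfeld's algorithm, a standard iterative method for approximating the geometric median. It is a stand-in chosen for illustration, not the geom-median library that Heretic uses, and the points are made up.

```python
from math import dist

def mean_vector(points):
    """Component-wise arithmetic mean."""
    return [sum(c) / len(points) for c in zip(*points)]

def geometric_median(points, iterations=200, eps=1e-12):
    """Approximate the geometric median with Weiszfeld's algorithm."""
    estimate = mean_vector(points)  # start from the arithmetic mean
    for _ in range(iterations):
        # Weight each point by the inverse of its distance to the current estimate.
        weights = [1.0 / max(dist(estimate, p), eps) for p in points]
        total = sum(weights)
        estimate = [
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(len(estimate))
        ]
    return estimate

# Four points forming a unit square plus one extreme outlier.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]

mean = mean_vector(points)
median = geometric_median(points)
print("mean:", mean)
print("geometric median:", median)
```

The mean is dragged to roughly (20.4, 20.4) by the single outlier, whereas the geometric median stays inside the unit square where most of the points lie.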

Refusal Direction Analysis

The refusal direction r (or r*) is the key vector that directional ablation removes. Here’s how to analyze it:

Strong Refusal Signals

Layers with strong refusal signals typically show:

High |r| - Large refusal direction magnitude (e.g., > 100 in the example)
High Silh - Good cluster separation (e.g., > 0.15)
High S(b,r) - Bad prompts align with refusal direction (e.g., > 0.5)
Moderate to high |S(g,r)| - Good prompts either align or anti-align with refusal

Example from the table above: Layer 10

│    10 │ 0.9974 │   0.9973 │  0.8189 │   0.8250 │  0.8579 │   0.8644 │  5328.81 │  5325.63 │  5953.35 │  5985.15 │  743.95 │  779.74 │ 0.2863 │

Refusal magnitude |r| = 743.95 (large)
Silhouette = 0.2863 (highest in the table)
S(b,r) = 0.8579 (very high alignment)
S(g,r) = 0.8189 (also high: g and b are nearly parallel here, S(g,b) = 0.9974, so r aligns with both)

Weak Refusal Signals

Layers with weak refusal signals typically show:

Low |r| - Small refusal direction magnitude
Low Silh - Poor cluster separation (e.g., < 0.1)
Low |S(g,r)| and Low |S(b,r)| - Neither direction strongly aligns

Example from the table above: Layer 1

│     1 │ 1.0000 │   1.0000 │ -0.4311 │  -0.4906 │ -0.4254 │  -0.4847 │   170.29 │   170.49 │   169.78 │   169.85 │    1.19 │    1.31 │ 0.0480 │

Refusal magnitude |r| = 1.19 (very small)
Silhouette = 0.0480 (very low)
S(g,b) = 1.0000 (the two mean vectors point in almost exactly the same direction)

Directional Consistency

Compare S(g,r) and S(b,r) across layers:

Same sign = Refusal direction is “in between” good and bad representations
Opposite signs = S(g,r) is negative and S(b,r) is positive, so the refusal direction clearly separates good from bad (as in layers 16 and 17 of the example)
Sign changes across layers = Refusal mechanism evolves through the network
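
The two similarity columns are linked to the norm columns. Because r = b - g, it follows that b·r - g·r = |r|^2, and dividing by |r| gives S(b,r)·|b| - S(g,r)·|g| = |r|. The sketch below checks this identity against rows of the example table. The function name is hypothetical, and small deviations are expected because the table values are rounded.

```python
def implied_refusal_norm(s_g_r, norm_g, s_b_r, norm_b):
    """Recover |r| from table columns via S(b,r)*|b| - S(g,r)*|g| = |r|."""
    return s_b_r * norm_b - s_g_r * norm_g

# Arguments: S(g,r), |g|, S(b,r), |b|, copied from the example table (mean-based columns).
layer_1 = implied_refusal_norm(-0.4311, 170.29, -0.4254, 169.78)
layer_10 = implied_refusal_norm(0.8189, 5328.81, 0.8579, 5953.35)
layer_16 = implied_refusal_norm(-0.0029, 18912.77, 0.2457, 19510.56)

# Compare against the |r| column: 1.19, 743.95, and 4848.35 respectively.
print(f"{layer_1:.2f} {layer_10:.2f} {layer_16:.2f}")
```

One consequence is that S(b,r)·|b| can never be smaller than S(g,r)·|g|, which is why opposite signs always appear as a negative S(g,r) paired with a positive S(b,r).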

Layer-by-Layer Patterns

Common patterns observed in the geometry analysis:

Early Layers (e.g., layers 1-3)

Very high S(g,b) (≈ 1.0) - Representations are still very similar
Small |r| - Refusal direction is weak
Low Silh - Poor cluster separation
Interpretation: The model hasn’t yet differentiated harmful from harmless

Middle Layers (e.g., layers 8-12)

Decreasing S(g,b) - Representations diverge
Growing |r| - Refusal direction strengthens
Higher Silh - Better cluster separation
Interpretation: Core refusal mechanisms are active here

Late Layers (e.g., layer 18)

May show different patterns depending on model architecture
Sometimes |r| is still large (refusal persists)
Sometimes patterns reverse (refusal resolved)
Interpretation: Model prepares final output representation

Configuration

The geometry analysis uses the same datasets configured for the main abliteration process:

# Whether to print detailed information about residuals and refusal directions
print_residual_geometry = false

# Dataset configurations
[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:400]"
column = "text"

[bad_prompts]
dataset = "mlabonne/harmful_behaviors" 
split = "train[:400]"
column = "text"

You can enable the analysis either by setting print_residual_geometry = true in the configuration file or by passing the command-line flag:

heretic Qwen/Qwen3-4B-Instruct-2507 --print-residual-geometry

Use Cases

Identifying Optimal Ablation Layers

Use the silhouette coefficient and refusal magnitude to identify which layers have the strongest refusal signals:

Sort layers by Silh (descending)
Look for layers with both high Silh and high |r|
These layers are prime candidates for targeted ablation
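
The selection steps above could be scripted roughly as follows. The rows are a subset copied from the example table, the thresholds reuse the illustrative values from the Strong Refusal Signals section (Silh > 0.15, |r| > 100), and the function name is hypothetical. Treat the thresholds as starting points rather than recommended settings.

```python
# (layer, |r|, Silh) copied from the example table; only a few rows are listed.
rows = [
    (1, 1.19, 0.0480),
    (8, 384.87, 0.2278),
    (10, 743.95, 0.2863),
    (12, 1711.18, 0.1890),
    (16, 4848.35, 0.1516),
    (18, 87.82, 0.1855),
]

def candidate_layers(rows, min_silh=0.15, min_r_norm=100.0):
    """Keep layers above both thresholds, ordered by silhouette score (descending)."""
    kept = [row for row in rows if row[2] >= min_silh and row[1] >= min_r_norm]
    return sorted(kept, key=lambda row: row[2], reverse=True)

for layer, r_norm, silh in candidate_layers(rows):
    print(f"layer {layer:>2}: Silh={silh:.4f}, |r|={r_norm:.2f}")
```

With these rows, layer 10 ranks first, and layer 18 is filtered out despite its silhouette score because its |r| falls below the threshold.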

Validating Ablation Parameters

Compare geometry before and after ablation:

|r| should decrease in ablated layers
Silh should decrease (clusters should be less separated)
S(g,b) should increase (representations should be more similar)

Understanding Model Architecture

Different architectures show different geometric patterns:

Some models have refusal concentrated in specific layers
Others distribute refusal across many layers
MoE models may show different patterns per expert

Research and Interpretability

The metrics provide quantitative data for research questions:

How do refusal directions evolve during training?
How do different alignment techniques affect geometric properties?
Are geometric medians more effective than means for ablation?
What is the relationship between Silh and ablation effectiveness?

Implementation Details

From analyzer.py, the geometry analysis:

Computes means using Tensor.mean(dim=0)
Computes geometric medians using the geom-median library
Calculates cosine similarities using F.cosine_similarity()
Calculates L2 norms using LA.vector_norm()
Computes silhouette coefficients using sklearn.metrics.silhouette_score()

Geometric median computation is performed on CPU and may take a few seconds for large models with many layers.

Combining with Residual Plots

For comprehensive analysis, use both tools together:

heretic Qwen/Qwen3-4B-Instruct-2507 \
  --print-residual-geometry \
  --plot-residuals

This provides:

Quantitative metrics from geometry analysis
Visual intuition from residual plots
Temporal evolution from animated GIFs

Together, these tools give you a complete picture of how refusal mechanisms work in the model.
