HERETIC_ prefix) or in a config.toml file.
## Model Loading
HuggingFace model ID or path to a model on disk. If provided as the last positional argument without the `--model` flag, it is automatically recognized as the model parameter.

Model ID or path to evaluate against the main model instead of performing abliteration. This compares the refusal counts and KL divergence of the evaluated model relative to the base model.
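As a sketch, these might appear in `config.toml` as follows (the key names and model IDs here are assumptions, not verified against Heretic's source):

```toml
# Hypothetical config.toml fragment; key names are assumed
model = "org/some-model"              # HuggingFace ID or local path
# evaluate_model = "org/other-model"  # compare against the base model instead of abliterating
```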
List of PyTorch dtypes to try when loading model tensors. If loading with a dtype fails, the next dtype in the list is tried.
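A possible config fragment (the key name `dtypes` is an assumption):

```toml
# Hypothetical fragment; dtypes are tried in order until loading succeeds
dtypes = ["bfloat16", "float16", "float32"]
```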
Quantization method to use when loading the model. Options:

- `none`: No quantization (full precision)
- `bnb_4bit`: 4-bit quantization using bitsandbytes

4-bit quantization can reduce VRAM requirements by ~75% with minimal quality impact, enabling larger models to be processed on consumer GPUs.
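For example (key name assumed, not confirmed):

```toml
# Hypothetical fragment
quantization = "bnb_4bit"  # or "none" for full precision
```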
Device map to pass to Accelerate when loading the model.

Maximum memory to allocate per device. Useful for multi-GPU setups or when sharing a GPU with other processes. Setting this option requires a config file.
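A sketch of both options in `config.toml` (key names assumed; the memory limits are placeholders, in Accelerate's device-to-limit mapping format):

```toml
# Hypothetical fragment; key names are assumed
device_map = "auto"

# Per-device memory caps (device -> limit)
[max_memory]
0 = "20GiB"     # GPU 0
cpu = "64GiB"
```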
Whether to trust remote code when loading the model. Some models require custom code that must be explicitly trusted.
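For instance (key name assumed):

```toml
# Hypothetical fragment
trust_remote_code = true
```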
## Performance & Optimization
Number of input sequences to process in parallel. Set to 0 for automatic determination.

Maximum batch size to try when automatically determining the optimal batch size.

Maximum number of tokens to generate for each response during evaluation. Longer responses take more time but may improve refusal detection accuracy.
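A combined sketch of these performance settings (key names and values are assumptions used for illustration):

```toml
# Hypothetical fragment; key names are assumed
batch_size = 0             # 0 = determine automatically
max_batch_size = 64        # ceiling for the automatic batch-size search
max_response_length = 100  # tokens generated per response during evaluation
```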
## Optimization Parameters
Number of abliteration trials to run during optimization. More trials increase the chance of finding better parameters but take longer; 200 is a good balance for most use cases.

Number of trials that use random sampling for exploration before switching to TPE (Tree-structured Parzen Estimator) optimization. Higher values improve initial exploration but delay focused optimization.

Directory to save study progress to and load it from. Checkpoints enable resuming interrupted runs and reviewing previous results.
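These might be set together like so (key names and the checkpoint path are assumptions):

```toml
# Hypothetical fragment; key names are assumed
trials = 200
startup_trials = 50             # random exploration before TPE takes over
checkpoint_dir = "checkpoints"  # resume interrupted runs from here
```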
Assumed “typical” value of the Kullback-Leibler divergence for abliterated models. Used to ensure balanced co-optimization of KL divergence and refusal count.

KL divergence target threshold. Below this value, optimization focuses on the refusal count. This prevents exploring parameters that have no effect.
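A sketch of both thresholds (key names and values are illustrative assumptions, not Heretic's defaults):

```toml
# Hypothetical fragment; values are placeholders
typical_kl = 1.0   # assumed "typical" KL divergence of abliterated models
kl_target = 0.01   # below this, focus optimization on refusal count
```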
## Abliteration Method
Whether to adjust refusal directions so that only the component orthogonal to the “good” direction is subtracted during abliteration. Implements projected abliteration; may improve capability retention in some models.
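For example (key name assumed):

```toml
# Hypothetical fragment
orthogonalize = true  # projected abliteration
```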
How to apply row normalization of the weights. Implements norm-preserving abliteration. Options:

- `none`: No normalization
- `pre`: Compute the LoRA adapter relative to row-normalized weights
- `full`: Like `pre`, but renormalizes to preserve the original row magnitudes

Rank of the LoRA adapter when `full` row normalization is used. Higher ranks provide a better approximation but increase file size and evaluation time.

Symmetric winsorization quantile for per-prompt, per-layer residual vectors (between 0 and 1). Disabled by default (1.0). This clamps residual magnitudes to the specified quantile, taming “massive activations” in some models. A value of 0.95 means components are clamped to the 95th-percentile magnitude.
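Taken together, these method parameters might look like this (key names and values are illustrative assumptions):

```toml
# Hypothetical fragment; key names are assumed
normalization = "full"     # none | pre | full
lora_rank = 64             # only relevant with full normalization
winsorize_quantile = 0.95  # clamp residual components to the 95th percentile
```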
## Evaluation & Datasets
Strings whose presence in a response (case-insensitive) identifies it as a refusal. The default includes markers such as `sorry`, `i cannot`, `as an ai`, `harmful`, and `unethical`.

System prompt to use when prompting the model.
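A possible config-file sketch (key names and values are assumptions for illustration):

```toml
# Hypothetical fragment; key names are assumed
refusal_markers = ["sorry", "i cannot", "as an ai"]
system_prompt = "You are a helpful assistant."
```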
### Dataset Configuration
Heretic uses four datasets for training and evaluation. Each dataset can be configured with these sub-parameters:

- Dataset of prompts that tend NOT to result in refusals (used for calculating refusal directions).
- Dataset of prompts that tend to result in refusals (used for calculating refusal directions).
- Dataset of harmless prompts used for evaluating model performance (KL divergence measurement).
- Dataset of harmful prompts used for evaluating model performance (refusal counting).
Datasets can be HuggingFace dataset IDs or local file paths. The split parameter uses HuggingFace slice notation.
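One of the four dataset entries might be configured like this (the table name, key names, and dataset ID are assumptions, not Heretic's actual defaults):

```toml
# Hypothetical fragment; names are assumed
[harmful_prompts]
dataset = "some-org/harmful-prompts"  # HF dataset ID or local file path
split = "train[:100]"                 # HuggingFace slice notation
```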
## Research Features
Whether to print prompt/response pairs when counting refusals. Useful for debugging refusal detection or understanding model behavior.

Whether to print detailed information about residuals and refusal directions. Outputs a detailed table with per-layer metrics, including:
- Cosine similarities between good/bad/refusal directions
- L2 norms of direction vectors
- Silhouette coefficients for clustering quality
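Both debugging switches might be enabled like so (key names assumed):

```toml
# Hypothetical fragment; key names are assumed
print_responses = true  # show prompt/response pairs during refusal counting
print_residuals = true  # per-layer table of direction metrics
```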
Whether to generate plots showing PaCMAP projections of residual vectors. Generates:
- PNG image for each transformer layer
- Animated GIF showing transformation between layers
Base path to save plots of residual vectors.

Title placed above plots of residual vectors.

Matplotlib style sheet to use for plots of residual vectors. See the Matplotlib style sheets documentation for available options.
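A sketch of the plotting options together (key names and the output path are assumptions; `dark_background` is a standard Matplotlib style name):

```toml
# Hypothetical fragment; key names are assumed
plot_residuals = true
plot_path = "plots/residuals"        # base path for per-layer images
plot_title = "Residual projections"
plot_style = "dark_background"       # any Matplotlib style sheet name
```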
## Configuration File Example
Instead of long command lines, you can create a `config.toml` file in your working directory.
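A minimal sketch of such a file, combining several of the options described above (all key names and values here are assumptions based on the parameter descriptions, not verified against Heretic's source):

```toml
# Hypothetical config.toml; key names are assumed
model = "org/some-model"
quantization = "bnb_4bit"
trials = 200
max_response_length = 100
```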
## Environment Variables
Any option can be set via an environment variable with the `HERETIC_` prefix.
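For example, the following sketch sets the number of optimization trials via the environment (the variable name follows the documented `HERETIC_` prefix convention; the mapping to a `trials` option is an assumption):

```shell
# Hypothetical example: equivalent to trials = 200 in config.toml,
# assuming that option name
export HERETIC_TRIALS=200
```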
