This page documents all command-line options available in Heretic. Options can also be set via environment variables (with the HERETIC_ prefix) or in a config.toml file.

Model Loading

model
string
required
HuggingFace model ID or path to a model on disk.
Examples:
heretic meta-llama/Llama-3.1-8B-Instruct
heretic /path/to/local/model
heretic --model Qwen/Qwen3-4B-Instruct-2507
If provided as the last argument without the --model flag, it is automatically recognized as the model parameter.
evaluate-model
string
default:"null"
Model ID or path to evaluate against the main model instead of performing abliteration.
Example:
heretic --model google/gemma-3-12b-it \
  --evaluate-model p-e-w/gemma-3-12b-it-heretic
This compares the refusals and KL divergence of the evaluated model relative to the base model.
dtypes
list[string]
List of PyTorch dtypes to try when loading model tensors. If loading with a dtype fails, the next dtype in the list will be tried.
Example:
heretic --dtypes auto float16 MODEL_NAME
quantization
string
default:"none"
Quantization method to use when loading the model.
Options:
  • none: No quantization (full precision)
  • bnb_4bit: 4-bit quantization using bitsandbytes
Example:
heretic --quantization bnb_4bit MODEL_NAME
4-bit quantization can reduce VRAM requirements by ~75% with minimal quality impact, enabling processing of larger models on consumer GPUs.
device-map
string | dict
default:"auto"
Device map to pass to Accelerate when loading the model.
Examples:
# Automatic device mapping
heretic --device-map auto MODEL_NAME

# Manual device mapping (use config file)
# config.toml:
# device_map = {"model.embed": 0, "model.layers": 1}
max-memory
dict
default:"null"
Maximum memory to allocate per device. Useful for multi-GPU setups or when sharing a GPU with other processes.
Example (requires config file):
max_memory = {"0": "20GB", "1": "20GB", "cpu": "64GB"}
trust-remote-code
boolean
default:"null"
Whether to trust remote code when loading the model. Some models require custom code that must be explicitly trusted.
Example:
heretic --trust-remote-code MODEL_NAME
Only enable for models from trusted sources, as remote code can execute arbitrary Python.

Performance & Optimization

batch-size
integer
default:"0"
Number of input sequences to process in parallel. Set to 0 for automatic determination.
Example:
heretic --batch-size 8 MODEL_NAME
Automatic batch size detection (default) is recommended. It benchmarks your hardware to find the optimal throughput.
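The automatic detection described above can be pictured as a simple throughput benchmark. This is an illustrative sketch only, not Heretic's actual implementation: `run_batch` is a hypothetical callable that processes `n` sequences, and the doubling search is an assumption.

```python
import time

def pick_batch_size(run_batch, max_batch_size=128):
    """Illustrative sketch: benchmark doubling batch sizes and keep the
    one with the best throughput (sequences per second)."""
    best_size, best_throughput = 1, 0.0
    size = 1
    while size <= max_batch_size:
        start = time.perf_counter()
        run_batch(size)
        elapsed = time.perf_counter() - start
        throughput = size / elapsed
        if throughput > best_throughput:
            best_size, best_throughput = size, throughput
        size *= 2
    return best_size
```

The point is that per-batch overhead is amortized across more sequences as the batch grows, until memory or compute limits flatten the curve.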
max-batch-size
integer
default:"128"
Maximum batch size to try when automatically determining the optimal batch size.
Example:
heretic --max-batch-size 64 MODEL_NAME
max-response-length
integer
default:"100"
Maximum number of tokens to generate for each response during evaluation.
Example:
heretic --max-response-length 150 MODEL_NAME
Longer responses take more time but may improve refusal detection accuracy.

Optimization Parameters

n-trials
integer
default:"200"
Number of abliteration trials to run during optimization.
Example:
heretic --n-trials 300 MODEL_NAME
More trials increase the chance of finding better parameters but take longer. 200 is a good balance for most use cases.
n-startup-trials
integer
default:"60"
Number of trials that use random sampling for exploration before switching to TPE (Tree-structured Parzen Estimator) optimization.
Example:
heretic --n-startup-trials 80 MODEL_NAME
Higher values improve initial exploration but delay focused optimization.
study-checkpoint-dir
string
default:"checkpoints"
Directory to save and load study progress to/from.
Example:
heretic --study-checkpoint-dir ./my-checkpoints MODEL_NAME
Checkpoints enable resuming interrupted runs and reviewing previous results.
kl-divergence-scale
float
default:"1.0"
Assumed “typical” value of the Kullback-Leibler divergence for abliterated models. Used to ensure balanced co-optimization of KL divergence and refusal count.
Example:
heretic --kl-divergence-scale 0.5 MODEL_NAME
kl-divergence-target
float
default:"0.01"
KL divergence target threshold. Below this value, optimization focuses on refusal count. This prevents exploring parameters that have no effect.
Example:
heretic --kl-divergence-target 0.02 MODEL_NAME
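To illustrate how these two parameters interact, here is one plausible way to fold refusal count and KL divergence into a single minimization score. This is a hedged sketch, not Heretic's actual objective: the function name and the exact combination are assumptions, but it shows the documented roles of the scale (balancing the two terms) and the target (below which KL is ignored).

```python
def trial_score(refusals, kl_divergence, kl_scale=1.0, kl_target=0.01):
    """Hypothetical combined objective (lower is better).
    KL below kl_target contributes nothing, so optimization
    focuses purely on reducing refusals there."""
    kl_penalty = max(0.0, kl_divergence - kl_target) / kl_scale
    return refusals + kl_penalty
```

With kl_scale = 1.0, a KL excess of 1.0 above the target weighs as much as one additional refusal; lowering kl-divergence-scale makes the optimizer more KL-averse.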

Abliteration Method

orthogonalize-direction
boolean
default:"false"
Whether to adjust refusal directions so that only the component orthogonal to the “good” direction is subtracted during abliteration.
Example:
heretic --orthogonalize-direction MODEL_NAME
Implements projected abliteration. May improve capability retention in some models.
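The projection step can be sketched in a few lines. This is an illustrative Gram-Schmidt-style computation using plain Python lists for clarity; Heretic operates on PyTorch tensors, and the function name is an assumption.

```python
def orthogonalize(refusal_dir, good_dir):
    """Remove from the refusal direction its component along the
    "good" direction, leaving only the orthogonal part to subtract
    during abliteration."""
    dot = sum(r * g for r, g in zip(refusal_dir, good_dir))
    norm_sq = sum(g * g for g in good_dir)
    scale = dot / norm_sq
    return [r - scale * g for r, g in zip(refusal_dir, good_dir)]
```

After this step, subtracting the adjusted direction from the weights no longer removes any component shared with the "good" direction, which is why it may help retain capabilities.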
row-normalization
string
default:"none"
How to apply row normalization of the weights.
Options:
  • none: No normalization
  • pre: Compute LoRA adapter relative to row-normalized weights
  • full: Like pre, but renormalizes to preserve original row magnitudes
Example:
heretic --row-normalization pre MODEL_NAME
Implements norm-preserving abliteration.
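As a rough picture of what pre and full modes share, the sketch below normalizes each weight row to unit L2 norm and keeps the original norms, which full mode would use to restore row magnitudes afterward. This is an assumption-laden illustration using plain lists, not Heretic's tensor implementation.

```python
def row_normalize(matrix):
    """Normalize each row to unit L2 norm. Returns the normalized
    rows plus the original norms ("full" mode renormalizes with
    these to preserve original row magnitudes)."""
    norms = [sum(x * x for x in row) ** 0.5 for row in matrix]
    normalized = [[x / n for x in row] for row, n in zip(matrix, norms)]
    return normalized, norms
```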
full-normalization-lora-rank
integer
default:"3"
Rank of the LoRA adapter when full row normalization is used. Higher ranks provide better approximation but increase file size and evaluation time.
Example:
heretic --row-normalization full \
  --full-normalization-lora-rank 5 \
  MODEL_NAME
winsorization-quantile
float
default:"1.0"
Symmetric winsorization quantile for per-prompt, per-layer residual vectors (between 0 and 1). Disabled by default (1.0).
Example:
heretic --winsorization-quantile 0.95 MODEL_NAME
This clamps residual magnitudes to the specified quantile, taming “massive activations” in some models. A value of 0.95 means components are clamped to the 95th-percentile magnitude.
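The clamping described above can be sketched as follows. This is illustrative only: Heretic applies winsorization per prompt and per layer to residual vectors, while this sketch operates on a flat list of components, and the nearest-rank quantile choice is an assumption.

```python
def winsorize(values, quantile):
    """Clamp absolute values to the given quantile of their
    magnitudes. A quantile of 1.0 disables winsorization."""
    if quantile >= 1.0:
        return list(values)
    magnitudes = sorted(abs(v) for v in values)
    # nearest-rank estimate of the quantile magnitude
    index = round(quantile * (len(magnitudes) - 1))
    cap = magnitudes[index]
    return [max(-cap, min(cap, v)) for v in values]
```

Outlier components (the "massive activations") get pulled down to the cap while the sign and the bulk of the vector are preserved.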

Evaluation & Datasets

refusal-markers
list[string]
default:"[see config.default.toml]"
Strings whose presence in a response (case-insensitive) identifies it as a refusal.
Default includes: sorry, i cannot, as an ai, harmful, unethical, etc.
Example (config file):
refusal_markers = [
  "i cannot",
  "i'm unable",
  "inappropriate",
  "against my guidelines",
]
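The case-insensitive matching described above amounts to a substring check. A minimal sketch, assuming a hypothetical `is_refusal` helper; Heretic's actual detection logic may differ:

```python
def is_refusal(response, markers):
    """Return True if any marker occurs in the response,
    compared case-insensitively."""
    lowered = response.lower()
    return any(marker.lower() in lowered for marker in markers)
```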
system-prompt
string
default:"You are a helpful assistant."
System prompt to use when prompting the model.
Example:
heretic --system-prompt "You are a helpful AI." MODEL_NAME

Dataset Configuration

Heretic uses four datasets for training and evaluation. Each dataset can be configured with these sub-parameters:
good-prompts
object
Dataset of prompts that tend to NOT result in refusals (used for calculating refusal directions).
Default:
[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:400]"
column = "text"
prefix = ""
suffix = ""
system_prompt = null  # Uses global system_prompt if null
bad-prompts
object
Dataset of prompts that tend to result in refusals (used for calculating refusal directions).
Default:
[bad_prompts]
dataset = "mlabonne/harmful_behaviors"
split = "train[:400]"
column = "text"
good-evaluation-prompts
object
Dataset of harmless prompts used for evaluating model performance (KL divergence measurement).
Default:
[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:100]"
column = "text"
bad-evaluation-prompts
object
Dataset of harmful prompts used for evaluating model performance (refusal counting).
Default:
[bad_evaluation_prompts]
dataset = "mlabonne/harmful_behaviors"
split = "test[:100]"
column = "text"
Custom Dataset Example:
[bad_prompts]
dataset = "my-org/custom-harmful-prompts"
split = "train[:500]"
column = "prompt_text"
prefix = "[INST] "
suffix = " [/INST]"
system_prompt = "You are an AI assistant."
Datasets can be HuggingFace dataset IDs or local file paths. The split parameter uses HuggingFace slice notation.

Research Features

Research features require the research extra: pip install heretic-llm[research]
print-responses
boolean
default:"false"
Whether to print prompt/response pairs when counting refusals.
Example:
heretic --print-responses MODEL_NAME
Useful for debugging refusal detection or understanding model behavior.
print-residual-geometry
boolean
default:"false"
Whether to print detailed information about residuals and refusal directions.
Example:
heretic --print-residual-geometry MODEL_NAME
Outputs a detailed table with per-layer metrics including:
  • Cosine similarities between good/bad/refusal directions
  • L2 norms of direction vectors
  • Silhouette coefficients for clustering quality
plot-residuals
boolean
default:"false"
Whether to generate plots showing PaCMAP projections of residual vectors.
Example:
heretic --plot-residuals MODEL_NAME
Generates:
  • PNG image for each transformer layer
  • Animated GIF showing transformation between layers
PaCMAP projection is CPU-intensive and can take over an hour for large models.
residual-plot-path
string
default:"plots"
Base path to save plots of residual vectors.
Example:
heretic --plot-residuals \
  --residual-plot-path ./visualizations \
  MODEL_NAME
residual-plot-title
string
Title placed above plots of residual vectors.
Example:
heretic --plot-residuals \
  --residual-plot-title "My Model Analysis" \
  MODEL_NAME
residual-plot-style
string
default:"dark_background"
Matplotlib style sheet to use for plots of residual vectors.
Example:
heretic --plot-residuals \
  --residual-plot-style seaborn-v0_8 \
  MODEL_NAME
See Matplotlib style sheets for available options.

Configuration File Example

Instead of long command lines, create config.toml in your working directory:
# Model settings
quantization = "bnb_4bit"
max_memory = {"0": "22GB", "cpu": "64GB"}

# Optimization settings
n_trials = 300
n_startup_trials = 100
max_response_length = 150

# Advanced abliteration
orthogonalize_direction = true
row_normalization = "pre"

# Custom system prompt
system_prompt = "You are a helpful, knowledgeable assistant."

# Custom harmful prompts dataset
[bad_prompts]
dataset = "my-org/adversarial-prompts"
split = "train[:600]"
column = "text"

[bad_evaluation_prompts]
dataset = "my-org/adversarial-prompts"
split = "test[:150]"
column = "text"
Then run:
heretic my-model-name
All settings from the config file will be applied automatically.

Environment Variables

Any option can be set via environment variable with the HERETIC_ prefix:
export HERETIC_QUANTIZATION=bnb_4bit
export HERETIC_N_TRIALS=300
heretic MODEL_NAME
Environment variables are useful for containerized deployments or when you want to override config file settings temporarily.

Help Command

For a quick reference of all options:
heretic --help
This displays a summary of all available command-line flags with their descriptions and default values.
