Overview
Evaluation configuration controls how Heretic measures model behavior during optimization. This includes the datasets used to calculate refusal directions, the markers that identify refusals, and settings for response generation.

Dataset Configuration
Heretic uses four datasets during abliteration:
- good_prompts - Prompts that shouldn’t trigger refusals (for computing refusal directions)
- bad_prompts - Prompts that do trigger refusals (for computing refusal directions)
- good_evaluation_prompts - Harmless prompts for evaluating model performance
- bad_evaluation_prompts - Harmful prompts for measuring refusal suppression
The first two datasets are used to calculate refusal directions. The last two evaluate how well the abliteration worked.
Dataset Specification Format
Each dataset is configured as a table in the TOML config file.

Dataset Fields
- Dataset: Hugging Face dataset ID (e.g., "mlabonne/harmless_alpaca") or path to a local dataset on disk.
- Split: portion of the dataset to use, in Hugging Face dataset slice notation:
  - "train[:400]" - first 400 examples from the train split
  - "test[100:200]" - examples 100-199 from the test split
  - "train" - the entire train split
- Column: name of the column containing the prompt text.
- Prefix: text to prepend to each prompt. Useful for adding instructions or context.
- Suffix: text to append to each prompt.
- System prompt: system prompt to use with these prompts. Overrides the global system_prompt if set.
- Plot label: label for this dataset in residual vector plots (only used with --plot-residuals).
- Plot color: Matplotlib color for this dataset in plots (e.g., "royalblue", "darkorange").

Default Datasets
Heretic comes with sensible defaults for censorship removal. Notice that the training prompts use the train split (first 400 examples) and evaluation prompts use the test split (first 100 examples). This prevents data leakage.

Custom Use Case: Removing “Slop”
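One way such a configuration might look, sketched with hypothetical dataset IDs and assumed key names:

```toml
# Hypothetical sketch -- the dataset IDs below do not refer to real
# datasets, and key names (dataset, split, column, refusal_markers)
# are assumptions.
[good_prompts]
dataset = "example/plain-prose"   # writing free of purple prose
split = "train[:400]"
column = "text"

[bad_prompts]
dataset = "example/purple-prose"  # writing saturated with slop
split = "train[:400]"
column = "text"

# Treat characteristic slop phrases as the "refusals" to suppress
refusal_markers = ["shivers down", "a testament to", "tapestry of"]
```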
Heretic can remove any kind of systematic bias, not just censorship. For example, it can be configured to remove purple prose (“slop”) from creative writing models: point the good/bad datasets at slop-free versus slop-heavy writing, and replace the refusal markers with characteristic slop phrases.

Refusal Markers
Refusal markers are strings that identify when a model is refusing to respond. Heretic counts responses containing these markers (case-insensitive) as refusals.

Default Refusal Markers
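The authoritative default list lives in Heretic’s configuration; the fragment below is only illustrative of its flavor, not the verbatim defaults, and the key name refusal_markers is an assumption.

```toml
# Illustrative markers -- not the verbatim default list.
refusal_markers = [
    "i cannot",
    "i can't",
    "i'm sorry",
    "as an ai",
]
```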
Custom Refusal Markers
You can customize refusal markers for different use cases:
- Censorship Removal
- Slop Removal
- Language-Specific
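Hedged sketches of what each variant might look like; the key name and every marker string here are illustrative assumptions, not values taken from Heretic:

```toml
# Censorship removal: classic refusal phrases
refusal_markers = ["i cannot", "i can't", "i won't", "i'm sorry"]

# Slop removal: characteristic purple-prose phrases
# refusal_markers = ["shivers down", "a testament to", "tapestry of"]

# Language-specific (e.g., German-language refusals)
# refusal_markers = ["ich kann nicht", "es tut mir leid"]
```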
System Prompt
The system prompt sets the context for how the model should behave. Individual datasets can override the global system prompt by setting system_prompt in their dataset configuration.

Response Generation Settings
These settings control how Heretic generates responses during evaluation.

max_response_length
Maximum number of tokens to generate for each response.
- Longer: More likely to detect refusals that appear after preamble, but slower
- Shorter: Faster evaluation, but might miss refusals that appear later in response
- Default (100): Good balance for most models
print_responses
Whether to print prompt/response pairs during evaluation.

Enable it for:
- Debugging refusal marker detection
- Manually inspecting model outputs
- Verifying that your prompts produce expected behaviors

Disable it for:
- Normal optimization runs (cleaner output)
- Automated experiments
- When using many evaluation examples
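Both settings together look like this; max_response_length = 100 is the documented default, while false as the print_responses default is an assumption:

```toml
max_response_length = 100   # default: balances speed and refusal detection
print_responses = false     # enable temporarily when debugging markers
```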
Complete Example Configurations
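A full configuration might combine the pieces above as follows. This is a sketch: apart from max_response_length and print_responses, which the text above names, the key names and the harmful-prompts dataset ID are assumptions.

```toml
# Hedged sketch of a complete evaluation configuration.
system_prompt = "You are a helpful assistant."

max_response_length = 100
print_responses = false

refusal_markers = ["i cannot", "i can't", "i'm sorry"]

[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:400]"
column = "text"

[bad_prompts]
dataset = "mlabonne/harmful_behaviors"  # assumed counterpart dataset
split = "train[:400]"
column = "text"

[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:100]"
column = "text"
```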
Using Custom Datasets
You can use any Hugging Face dataset or local dataset.

From Hugging Face Hub
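For a Hub dataset, set the ID and point the column at the prompt text (key names are assumptions, as above):

```toml
[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"  # any Hugging Face dataset ID
split = "test[:100]"
column = "text"
```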
From Local Files
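For a dataset saved on disk, the ID is replaced by a path (the path and column name here are hypothetical, and key names are assumptions):

```toml
[bad_prompts]
dataset = "./data/my_refused_prompts"  # hypothetical local path
split = "train"
column = "prompt"
```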
Evaluation Best Practices
Choosing dataset sizes
Training datasets (good_prompts, bad_prompts):
- 200-500 examples per dataset is usually sufficient
- Larger datasets provide more robust refusal directions but take longer
- Default of 400 examples works well for most models
Evaluation datasets (good_evaluation_prompts, bad_evaluation_prompts):
- 50-100 examples is sufficient for optimization feedback
- Use 200-500 examples for final evaluation and comparison
- Smaller datasets speed up optimization trials
Avoid data leakage
Ensure training and evaluation datasets don’t overlap:
- Use different splits (e.g., train for directions, test for evaluation)
- Use different index ranges (e.g., [:400] for training, [1000:1100] for eval)
- Never use the exact same examples for both purposes
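As a fragment (key names assumed), the two roles would draw from disjoint splits:

```toml
# Disjoint splits keep direction-finding and evaluation data separate.
[good_prompts]
split = "train[:400]"   # computes refusal directions

[good_evaluation_prompts]
split = "test[:100]"    # scores the result; never overlaps with train
```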
Crafting refusal markers
Good practices:
- Use partial words to catch variations ("violat" → violate, violating, violation)
- Include common variations ("i cant", "i can't", "i cannot")
- Keep markers short (2-3 words max)
- Test markers on actual model outputs

Avoid:
- Complete sentences (too specific)
- Common words that appear in normal responses
- Too many markers (increases false positives)
Balancing speed vs accuracy
For faster optimization:
- Smaller datasets (100-200 examples)
- Shorter max_response_length (50-75 tokens)
- Fewer refusal markers (focus on the most common ones)

For more accurate evaluation:
- Larger datasets (500-1000 examples)
- Longer max_response_length (150-200 tokens)
- Comprehensive refusal markers
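As config fragments, the two profiles differ only in the values chosen (a sketch):

```toml
# Fast optimization trials
max_response_length = 50

# Thorough final evaluation (swap in instead)
# max_response_length = 200
```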
Related Configuration
- Optimization Settings - Control how Heretic uses evaluation results to optimize parameters
- Model Loading - Configure batch processing for evaluation
