Simple Decensoring
The most basic usage requires only a model identifier:
heretic Qwen/Qwen3-4B-Instruct-2507
Heretic will:
- Download the model from HuggingFace (if not already cached)
- Detect your hardware and optimize batch size
- Run 200 optimization trials (default)
- Present results and allow you to save/upload the model
Using with Different Model Sizes
Small Models (< 8B parameters)
Small models typically fit comfortably in VRAM:
heretic Qwen/Qwen3-4B-Instruct-2507
Medium Models (8B-30B parameters)
For medium models, consider using quantization to reduce VRAM usage:
heretic --quantization bnb_4bit meta-llama/Llama-3.1-8B-Instruct
4-bit quantization via bitsandbytes can reduce VRAM requirements by approximately 75% with minimal quality impact.
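The savings follow directly from weight width: bnb_4bit stores weights in 4 bits instead of 16. A rough back-of-the-envelope sketch (weights only; activations, KV cache, and per-block quantization constants add overhead):

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GiB: parameters x bits, converted to bytes.
    Ignores activations, KV cache, and quantization constants."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1024**3

fp16 = weight_memory_gb(8, 16)  # Llama-3.1-8B at 16-bit
q4 = weight_memory_gb(8, 4)     # the same model at 4-bit
print(f"16-bit: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB "
      f"({100 * (1 - q4 / fp16):.0f}% smaller)")
```

This is where the "approximately 75%" figure comes from: 4/16 of the original width, leaving real-world savings slightly lower once overhead is counted.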
Large Models (> 30B parameters)
Large models require quantization and may need explicit memory management:
heretic --quantization bnb_4bit \
--max-memory '{"0": "20GB", "cpu": "64GB"}' \
meta-llama/Llama-3.1-70B-Instruct
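The --max-memory value is a JSON object mapping device identifiers (GPU index or "cpu") to per-device caps, in the style of the max_memory dicts accepted by Hugging Face libraries. A minimal sketch of the parsing, assuming numeric GPU keys are normalized to integers (an assumption for illustration, not Heretic's code):

```python
import json

raw = '{"0": "20GB", "cpu": "64GB"}'
caps = json.loads(raw)
# Normalize numeric keys to ints, matching the max_memory dicts that
# Hugging Face libraries expect (assumed normalization, for illustration).
max_memory = {int(k) if k.isdigit() else k: v for k, v in caps.items()}
print(max_memory)
```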
Understanding Progress Output
Initial Setup
When you run Heretic, you’ll see:
█░█░█▀▀░█▀▄░█▀▀░▀█▀░█░█▀▀ v1.x.x
█▀█░█▀▀░█▀▄░█▀▀░░█░░█░█░░
▀░▀░▀▀▀░▀░▀░▀▀▀░░▀░░▀░▀▀▀ https://github.com/p-e-w/heretic
Detected 1 CUDA device(s) (24.00 GB total VRAM):
* GPU 0: NVIDIA GeForce RTX 3090 (24.00 GB)
Batch Size Determination
Determining optimal batch size...
* Trying batch size 1... Ok (245 tokens/s)
* Trying batch size 2... Ok (412 tokens/s)
* Trying batch size 4... Ok (623 tokens/s)
* Trying batch size 8... Ok (789 tokens/s)
* Trying batch size 16... Failed (CUDA out of memory)
* Chosen batch size: 8
Heretic automatically finds the largest batch size that fits in memory.
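The probing shown above is a doubling search: try 1, 2, 4, ... until a size fails, then keep the last success. A sketch of that logic (the real implementation also measures throughput at each step):

```python
def find_batch_size(fits, start=1, limit=4096):
    """Double the batch size until `fits` fails (e.g. CUDA OOM) or the
    limit is reached; return the last size that succeeded, or None."""
    best = None
    size = start
    while size <= limit:
        if not fits(size):
            break
        best = size
        size *= 2
    return best

# Simulated probe matching the log above: sizes up to 8 fit, 16 does not.
print(find_batch_size(lambda s: s <= 8))  # 8
```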
Optimization Trials
Running trial 1 of 200...
* Parameters:
* direction_scope = per layer
* attn_out.max_weight = 1.23
* attn_out.max_weight_position = 28.4
* attn_out.min_weight = 0.45
* attn_out.min_weight_distance = 8.2
* mlp_down.max_weight = 1.15
* mlp_down.max_weight_position = 30.1
* mlp_down.min_weight = 0.38
* mlp_down.min_weight_distance = 7.5
* Resetting model...
* Abliterating...
* Evaluating...
Elapsed time: 2m 15s
Estimated remaining time: 7h 28m
Each trial tests different abliteration parameters.
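For each targeted component (attn_out, mlp_down), the trial parameters describe how strongly each layer is ablated. One plausible reading, shown purely as an illustrative sketch (Heretic's actual kernel may differ): the weight peaks at max_weight at layer max_weight_position and falls off linearly to min_weight over min_weight_distance layers:

```python
def ablation_weight(layer, max_w, max_pos, min_w, min_dist):
    """Hypothetical per-layer ablation weight: peak max_w at max_pos,
    linear falloff to min_w at min_dist layers away, flat beyond that.
    Illustrative only; not Heretic's exact parameterization."""
    d = abs(layer - max_pos)
    if d >= min_dist:
        return min_w
    return max_w - (max_w - min_w) * d / min_dist

# attn_out values from the trial printout above
for layer in (20, 28, 36):
    print(layer, round(ablation_weight(layer, 1.23, 28.4, 0.45, 8.2), 3))
```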
Results Selection
After optimization:
Optimization finished!
The following trials resulted in Pareto optimal combinations of refusals and KL divergence.
After selecting a trial, you will be able to save the model, upload it to Hugging Face,
or chat with it to test how well it works.
Which trial do you want to use?
[Trial 42] Refusals: 3/100, KL divergence: 0.1623
[Trial 87] Refusals: 1/100, KL divergence: 0.5841
[Trial 134] Refusals: 0/100, KL divergence: 1.2456
Run additional trials
Exit program
For the best quality, prefer trials with a KL divergence below 1.0. Fewer refusals paired with a higher KL divergence mean more compliance but potentially degraded capabilities.
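KL divergence here measures how far the decensored model's next-token distribution has drifted from the original model's, evaluated on harmless prompts. A toy sketch of the underlying quantity:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: the closer the decensored model (q)
# stays to the original (p), the smaller the divergence.
p = [0.7, 0.2, 0.1]
q_close = [0.65, 0.25, 0.10]
q_far = [0.2, 0.3, 0.5]
print(round(kl_divergence(p, q_close), 4), round(kl_divergence(p, q_far), 4))
```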
Post-Processing Options
After selecting a trial, you have several options:
Save to Local Folder
What do you want to do with the decensored model?
> Save the model to a local folder
Path to the folder: /path/to/output
Saving merged model...
Model saved to /path/to/output.
For quantized models, you’ll be asked whether to merge or save as adapter:
Model was loaded with quantization. Merging requires reloading the base model.
WARNING: CPU merging requires dequantizing the entire model to system RAM.
This can lead to system freezes if you run out of memory.
Estimated RAM required (excluding overhead): ~27.50 GB
How do you want to proceed?
> Merge LoRA into full model (requires sufficient RAM)
Cancel
Merging a quantized model requires loading the full unquantized model into RAM; for a 27B model, that can mean roughly 80 GB. Ensure you have sufficient memory, or your system may freeze.
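The RAM estimate scales with parameter count and weight precision. A rough sketch of the weights-only floor (peak usage during a merge can run well above this, since the base and merged copies may coexist in memory):

```python
def dequantized_weights_gb(n_params_billion, bytes_per_param=2):
    """Weights-only memory floor in GiB for a dequantized model at
    16-bit (2 bytes/parameter). Peak merge usage runs higher."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

print(f"27B model, 16-bit weights: ~{dequantized_weights_gb(27):.0f} GiB")
```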
Upload to HuggingFace
What do you want to do with the decensored model?
> Upload the model to Hugging Face
Hugging Face access token: ************************************
Logged in as John Doe ([email protected])
Name of repository: username/model-name-heretic
Should the repository be public or private?
> Public
Private
Uploading merged model...
Model uploaded to username/model-name-heretic.
Heretic automatically:
- Creates or updates the repository
- Uploads the model files
- Updates the model card with abliteration details
- Adds appropriate tags (heretic, uncensored, abliterated)
Chat with the Model
Test the model interactively:
What do you want to do with the decensored model?
> Chat with the model
Press Ctrl+C at any time to return to the menu.
> User: Tell me about machine learning
Assistant: Machine learning is a subset of artificial intelligence...
> User: [Press Ctrl+C]
This allows you to verify the model’s behavior before committing to save or upload.
Real-World Examples
Example 1: Quick Decensoring with Defaults
heretic Qwen/Qwen3-4B-Instruct-2507
Best for: First-time users, small to medium models, systems with ample VRAM.
Example 2: Quantized Decensoring
heretic --quantization bnb_4bit \
--n-trials 100 \
meta-llama/Llama-3.1-8B-Instruct
Best for: Limited VRAM, faster iteration during experimentation.
Example 3: Large Model with Custom Settings
heretic --quantization bnb_4bit \
--n-trials 300 \
--n-startup-trials 100 \
--max-memory '{"0": "22GB", "1": "22GB", "cpu": "96GB"}' \
meta-llama/Llama-3.1-70B-Instruct
Best for: Multi-GPU systems, production deployments requiring thorough optimization.
Example 4: Local Model with Configuration File
Create config.toml:
quantization = "bnb_4bit"
n_trials = 250
n_startup_trials = 75
max_response_length = 150
system_prompt = "You are a helpful, uncensored AI assistant."
Then run:
heretic /local/path/to/model
Best for: Repeated experiments, custom datasets, research workflows.
Example 5: Evaluation Only
heretic --model google/gemma-3-12b-it \
--evaluate-model p-e-w/gemma-3-12b-it-heretic
Output:
Evaluating model...
Refusals: 3/100
KL Divergence: 0.1623
Best for: Comparing different decensored variants, benchmarking.
Tips for Success
Start small: Test Heretic on a small model first (< 8B parameters) to understand the workflow before moving to larger models.
Monitor KL divergence: Values below 0.5 typically indicate minimal capability loss. Values above 1.0 may indicate significant degradation.
Use chat testing: Always test a trial with the interactive chat before saving to ensure the model behaves as expected.
More trials = better results: The default of 200 trials is a good starting point, but increasing to 300-500 trials can sometimes find better parameter combinations.
Pressing Ctrl+C during optimization gracefully stops the current trial and lets you view results. The checkpoint is saved automatically.