What is Quantization?
Quantization reduces the precision of model weights from their original format (typically bfloat16 or float32) to lower-bit representations. This significantly reduces memory requirements while maintaining acceptable model performance. Heretic uses bitsandbytes 4-bit quantization (NF4) with double quantization, which provides an excellent balance between memory savings and model quality.
Why Use Quantization?
Reduced VRAM
Process models up to 4x larger on the same hardware
Faster Experimentation
Run more trials with limited GPU resources
Same Quality
Produces decensored models comparable to those from full precision
Cost Savings
Use smaller, cheaper GPU instances for processing
Enabling Quantization
Quantization can be enabled via the configuration file or the command line.
Memory Requirements
Without Quantization
Full precision models require approximately:
- bfloat16/float16: ~2 bytes per parameter
- Rule of thumb: Parameter count (in billions) × 2 = GB of VRAM needed
- 7B model: ~14 GB VRAM
- 13B model: ~26 GB VRAM
- 70B model: ~140 GB VRAM
With 4-bit Quantization
Quantized models require approximately:
- 4-bit NF4: ~0.5-0.6 bytes per parameter
- VRAM savings: Up to 4x reduction
- 7B model: ~4 GB VRAM (instead of 14 GB)
- 13B model: ~7 GB VRAM (instead of 26 GB)
- 70B model: ~40 GB VRAM (instead of 140 GB)
The exact memory usage will vary based on model architecture, batch size, and sequence length. These are approximate values for the model weights only.
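The rules of thumb above can be expressed as a small helper. This is just an illustration of the arithmetic (the function name is made up here, and the bytes-per-parameter figures are the approximations quoted above):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed for model weights only (excludes KV cache,
    activations, and framework overhead)."""
    return params_billion * bytes_per_param

# bfloat16/float16 weights: ~2 bytes per parameter
print(weight_vram_gb(7, 2.0))    # ~14 GB
# 4-bit NF4 weights: ~0.5-0.6 bytes per parameter
print(weight_vram_gb(7, 0.55))   # ~3.85 GB
```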
Implementation Details
Heretic uses the following quantization configuration:
- load_in_4bit=True: Loads model weights in 4-bit precision
- bnb_4bit_compute_dtype: Computation dtype (matched to your hardware)
- bnb_4bit_quant_type="nf4": Normal Float 4-bit quantization
- bnb_4bit_use_double_quant=True: Quantizes the quantization constants for additional savings
Performance Impact
Quantization affects both speed and quality:
Processing Speed
- Inference: Slightly slower due to dequantization overhead
- Loading: Significantly faster due to smaller model size
- Overall: The VRAM savings often enable larger batch sizes, improving throughput
Model Quality
- Decensored models produced with quantization are comparable to those produced at full precision
- KL divergence and refusal metrics remain similar
- The optimization process accounts for any quantization effects
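For reference, the KL divergence used to compare output distributions between the original and modified model can be computed as follows (a minimal stdlib sketch; `kl_divergence` is an illustrative helper, not Heretic's internal function):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) between two discrete probability distributions,
    e.g. next-token distributions from the original and quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give a divergence of 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```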
Hardware Considerations
Merging Quantized Models
If you load a model with quantization enabled, merging requires reloading the base model in full precision.
RAM Requirements for Merging
When merging a quantized model, you need sufficient system RAM (not VRAM):
- Estimated RAM needed: ~3x the parameter count in GB
- Example: A 27B model requires ~80 GB RAM
- Example: A 70B model requires ~200 GB RAM
The merge process:
- Loads the base model on CPU in full precision
- Applies the LoRA adapters
- Merges and saves the final model
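The merge step itself is mathematically simple: the LoRA update is folded into the base weights as W' = W + (alpha/r) * B @ A, after which the adapter can be discarded. A toy numpy illustration of that arithmetic (not Heretic's actual merge code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4               # hidden size, LoRA rank, LoRA alpha
W = rng.standard_normal((d, d))     # full-precision base weight
A = rng.standard_normal((r, d))     # LoRA down-projection
B = rng.standard_normal((d, r))     # LoRA up-projection

# Fold the low-rank update into the base weight
W_merged = W + (alpha / r) * (B @ A)
```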
Example Workflow
Process a large model on a consumer GPU by enabling quantization as described above.
Best Practices
Start with Quantization
For models >13B, enable quantization by default on consumer GPUs
Monitor Memory
Watch VRAM usage during batch size detection to optimize throughput
Plan for Merging
Ensure adequate system RAM if you plan to merge the final model
Test Quality
Compare quantized vs full precision results on smaller models first
Troubleshooting
Out of Memory During Loading
If the model still doesn't fit with quantization:
- Reduce max_batch_size to limit memory for batch size detection
- Use max_memory to restrict allocation per device
- Consider offloading to CPU with device_map = "auto"
Slow Performance
If quantized inference is too slow:
- Increase batch size (quantization leaves more VRAM available)
- Ensure bitsandbytes is properly installed with CUDA support
- Check that compute dtype matches your hardware (bfloat16 for Ampere+)
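The last point can be encoded in a tiny helper (illustrative only; `pick_compute_dtype` is not part of Heretic, and the threshold reflects the fact that bfloat16 requires CUDA compute capability 8.0, i.e. Ampere, or newer):

```python
def pick_compute_dtype(sm_major: int) -> str:
    """Choose a bnb_4bit_compute_dtype based on the GPU's CUDA
    compute capability major version (8 = Ampere, 9 = Hopper)."""
    return "bfloat16" if sm_major >= 8 else "float16"

print(pick_compute_dtype(8))  # bfloat16 (e.g. RTX 30xx, A100)
print(pick_compute_dtype(7))  # float16  (e.g. Turing, Volta)
```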
Merge Failures
If merging fails due to insufficient RAM, revisit the RAM Requirements for Merging section above: free up system memory or move to a machine with more RAM.
Related Configuration
Other memory optimization options that work well with quantization can be set in config.toml.
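A config.toml fragment might combine them along these lines. The option names max_batch_size and max_memory are the ones referenced in Troubleshooting above, but the exact keys and values here are illustrative, so verify them against your Heretic version's documentation:

```toml
# Illustrative only -- verify key names against your Heretic version
max_batch_size = 16     # cap memory used during batch size detection
max_memory = "20GiB"    # restrict allocation per device
```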
