Overview
Model loading configuration controls how Heretic loads and initializes language models. These settings affect VRAM usage, loading speed, and numerical precision.

Data Types (dtypes)
The dtypes option specifies a list of PyTorch data types to try when loading model tensors. If loading with one dtype fails, Heretic automatically tries the next one in the list.
Configuration
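A sketch of what this might look like in config.toml (the dtypes key and its list form are described on this page; the exact default is an assumption):

```toml
# Try each dtype in order; fall back to the next one if loading fails
dtypes = ["auto"]

# Or an explicit fallback chain:
# dtypes = ["bfloat16", "float16", "float32"]
```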
Available dtypes
"auto" - Automatic selection (recommended)
Lets the transformers library choose the appropriate dtype for your hardware. In practice, this almost always resolves to bfloat16 on modern GPUs (Ampere and newer).

Best for: Most users, modern GPUs (RTX 30/40 series, A100, H100)
"float16" - Half precision
Uses 16-bit floating point numbers, reducing memory usage by 50% compared to float32. Works on older GPUs that don’t support bfloat16.

Best for: Pre-Ampere GPUs (RTX 20 series, GTX 1000 series, V100)

Caution: Has a smaller range than bfloat16, which can cause numerical issues in some models.
"bfloat16" - Brain float
Google’s 16-bit format that maintains the same range as float32 but with lower precision. More numerically stable than float16 for most models.

Best for: Modern GPUs with bfloat16 support (Ampere and newer)

Note: Not supported on pre-Ampere hardware.
"float32" - Full precision
Standard 32-bit floating point. Highest precision but requires double the VRAM of half-precision formats.

Best for: When precision is critical and you have sufficient VRAM (rare)
Quantization
Quantization reduces model size by using lower-precision integers to represent weights. This dramatically reduces VRAM requirements but may slightly impact quality.

Configuration
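A minimal sketch in config.toml, using the quantization key and values described below:

```toml
# "none" (default) or "bnb_4bit"
quantization = "none"
```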
Available methods
"none" (default)
No quantization. Models are loaded in their native precision (determined by dtypes).

Best for: When you have sufficient VRAM and want maximum quality
"bnb_4bit"
4-bit quantization using bitsandbytes. Reduces VRAM usage by approximately 75% compared to float16.

Best for: Running large models on consumer GPUs (e.g., 70B models on 24GB VRAM)
Quantization requires the bitsandbytes library. Install it with pip install bitsandbytes.

Example: Loading a 70B model on 24GB VRAM
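A hedged sketch of such a configuration, combining keys described on this page (memory values are illustrative, not tuned):

```toml
# 4-bit quantization shrinks a ~140 GB float16 70B model to roughly 35 GB
quantization = "bnb_4bit"
dtypes = ["bfloat16", "float16"]
device_map = "auto"
max_memory = {"0": "22GB", "cpu": "64GB"}
```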
Device Mapping
The device_map option controls how the model is distributed across available devices (GPUs, CPU).
Configuration
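A sketch of the setting in config.toml (key name as used on this page):

```toml
# Let Accelerate place layers across available GPUs and CPU
device_map = "auto"
```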
Device map strategies
- "auto" (recommended)
- Manual mapping
Automatically distributes the model across available GPUs and CPU. Accelerate analyzes the model architecture and available memory to make optimal placement decisions.

This is the recommended setting for most users.
Memory Limits
The max_memory option sets maximum memory allocation per device. This is useful for multi-GPU setups or when you want to reserve memory for other processes.
Configuration
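A sketch of the setting, using the same syntax as the example shown under Troubleshooting below:

```toml
# Cap GPU 0 at 20 GB and allow up to 64 GB of CPU RAM for offloading
max_memory = {"0": "20GB", "cpu": "64GB"}
```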
Understanding memory limits
- GPU memory: Specified by device index ("0", "1", etc.)
- CPU memory: Specified as "cpu"
- Units: Use "GB" (gigabytes) or "MB" (megabytes)
Example: Dual GPU setup
config.toml
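A hedged sketch for two GPUs (values are illustrative; leave a few GB free on each GPU for activations):

```toml
# Two 24 GB GPUs, capped below their totals, with CPU RAM as overflow
max_memory = {"0": "22GB", "1": "22GB", "cpu": "64GB"}
```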
Trust Remote Code
Some models on Hugging Face Hub require custom code to run. The trust_remote_code option controls whether to execute this code.
Configuration
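A sketch of the setting in config.toml (the disabled default is an assumption, in line with the guidance below):

```toml
# Assumed default: do not execute model-supplied code
trust_remote_code = false
```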
When to enable
- Required for some models: Models like Microsoft Phi, Qwen, and others with custom architectures
- You trust the source: Official model releases from reputable organizations
- You’ve reviewed the code: Check the model’s modeling_*.py files on Hugging Face
When to keep disabled
- Unknown sources: Models from unverified publishers
- Security-critical environments: Production systems, shared infrastructure
- Standard architectures: Most Llama, Mistral, and Gemma models don’t need it
Complete Example
Here’s a comprehensive model loading configuration for a large model on limited VRAM:

Troubleshooting
Out of memory (OOM) errors
Solutions:
- Enable 4-bit quantization: quantization = "bnb_4bit"
- Set max_memory to leave headroom for activations
- Reduce batch_size (see Optimization Configuration)
- Use CPU offloading with max_memory = {"0": "20GB", "cpu": "64GB"}
Model fails to load with 'auto' dtype
Solution:
Try specifying dtypes explicitly:
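For instance, a full fallback chain (key name as used elsewhere on this page):

```toml
dtypes = ["bfloat16", "float16", "float32"]
```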
bfloat16 not supported on my GPU
Solution:
Your GPU is pre-Ampere. Use float16 instead:
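For instance:

```toml
dtypes = ["float16"]
```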
Model requires trust_remote_code
Solution:
Enable it if you trust the model source, and review the model’s code on Hugging Face before proceeding.
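A sketch of the setting (key name as used elsewhere on this page):

```toml
trust_remote_code = true
```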
Related Configuration
Optimization Settings
Configure batch sizes and performance tuning
Evaluation Settings
Set up prompts and refusal detection
