
Overview

Model loading configuration controls how Heretic loads and initializes language models. These settings affect VRAM usage, loading speed, and numerical precision.

Data Types (dtypes)

The dtypes option specifies a list of PyTorch data types to try when loading model tensors. If loading with one dtype fails, Heretic automatically tries the next one in the list.
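The fallback behavior amounts to a simple try-in-order loop. The sketch below is illustrative only, not Heretic's actual implementation; `load_model` is a hypothetical callable standing in for whatever loader is used (e.g. transformers' `from_pretrained`):

```python
def load_with_dtype_fallback(load_model, dtypes):
    """Try each dtype in order and return the first successful load.

    `load_model` is a hypothetical callable that raises on failure,
    standing in for the real model loader.
    """
    errors = {}
    for dtype in dtypes:
        try:
            return dtype, load_model(dtype)
        except Exception as exc:  # e.g. dtype unsupported on this GPU
            errors[dtype] = exc
    raise RuntimeError(f"All dtypes failed: {errors}")

# Example with a stub loader that only supports float16:
def stub_loader(dtype):
    if dtype != "float16":
        raise ValueError(f"{dtype} not supported")
    return "model"

dtype, model = load_with_dtype_fallback(stub_loader, ["auto", "float16"])
print(dtype)  # float16
```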

Configuration

dtypes = [
    "auto",      # Let transformers choose (usually bfloat16)
    "float16",   # Half precision (pre-Ampere GPUs)
    "bfloat16",  # Brain float (better range than float16)
    "float32",   # Full precision (requires more VRAM)
]

Available dtypes

"auto"

Lets transformers choose the dtype from the model's configuration (usually bfloat16 for modern models).
Best for: Most situations; this is the recommended first choice.

"float16"

Uses 16-bit floating point numbers, reducing memory usage by 50% compared to float32. Works on older GPUs that don't support bfloat16.
Best for: Pre-Ampere GPUs (RTX 20 series, GTX 1000 series, V100)
Caution: Has a smaller range than bfloat16, which can cause numerical issues in some models.

"bfloat16"

Google's 16-bit format that maintains the same range as float32 but with lower precision. More numerically stable than float16 for most models.
Best for: Modern GPUs with bfloat16 support (Ampere and newer)
Note: Not supported on pre-Ampere hardware.

"float32"

Standard 32-bit floating point. Highest precision, but requires double the VRAM of the half-precision formats.
Best for: When precision is critical and you have sufficient VRAM (rare)

The default list tries "auto" first, then falls back to the other formats if loading fails. This works well for most scenarios, so you rarely need to change it.
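The range difference between the two 16-bit formats is easy to quantify: float16 has 5 exponent bits and tops out at 65504, while bfloat16 keeps float32's 8 exponent bits and reaches roughly 3.4 × 10³⁸. The arithmetic below is purely illustrative and not tied to any Heretic API:

```python
# float16: 5 exponent bits (max exponent 15), 10 mantissa bits
fp16_max = (2 - 2**-10) * 2**15
# bfloat16: 8 exponent bits (max exponent 127), 7 mantissa bits
bf16_max = (2 - 2**-7) * 2**127

print(fp16_max)           # 65504.0
print(f"{bf16_max:.2e}")  # ~3.39e+38
```

Any intermediate value above 65504 overflows to infinity in float16, which is the source of the numerical issues mentioned above.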

Quantization

Quantization reduces model size by using lower-precision integers to represent weights. This dramatically reduces VRAM requirements but may slightly impact quality.

Configuration

# No quantization (default)
quantization = "none"

# 4-bit quantization with bitsandbytes
quantization = "bnb_4bit"

Available methods

"none" (default)

No quantization. Models are loaded in their native precision (determined by dtypes).
Best for: When you have sufficient VRAM and want maximum quality

"bnb_4bit"

4-bit quantization using bitsandbytes. Reduces VRAM usage by approximately 75% compared to float16.
Best for: Running large models on consumer GPUs (e.g., 70B models on 24GB VRAM)

Quantization requires the bitsandbytes library. Install it with: pip install bitsandbytes
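The ~75% saving follows directly from the bits per weight. A rough back-of-the-envelope estimator (ignoring activation memory and quantization overhead such as scale factors and layers bitsandbytes keeps in higher precision):

```python
def approx_weight_gb(params_billions, bits_per_weight):
    """Rough weight-storage estimate: parameter count x bits per weight, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(approx_weight_gb(70, 16))  # 140.0 GB in float16
print(approx_weight_gb(70, 4))   # 35.0 GB at 4-bit
```

Note that even at 4 bits, a 70B model's weights (~35GB) still exceed 24GB of VRAM, which is why the example configuration below also offloads overflow to CPU via max_memory.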

Example: Loading a 70B model on 24GB VRAM

# Enable 4-bit quantization
quantization = "bnb_4bit"

# Let transformers handle device placement
device_map = "auto"

# Limit memory per device
max_memory = {"0": "22GB", "cpu": "32GB"}

Device Mapping

The device_map option controls how the model is distributed across available devices (GPUs, CPU).

Configuration

# Automatic device placement (recommended)
device_map = "auto"
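Conceptually, "auto" placement walks through the model's layers and assigns each one to the first device with enough remaining memory, spilling to the next GPU and finally to CPU. The greedy sketch below illustrates the idea only; it is not accelerate's actual algorithm:

```python
def greedy_device_map(layer_sizes_gb, device_budgets_gb):
    """Assign layers to devices in order, spilling over when a budget runs out."""
    device_map = {}
    devices = list(device_budgets_gb.items())  # e.g. [("0", 22.0), ("cpu", 32.0)]
    idx = 0
    for layer, size in layer_sizes_gb.items():
        while idx < len(devices) and devices[idx][1] < size:
            idx += 1  # current device is full; move to the next one
        if idx == len(devices):
            raise MemoryError(f"no room for {layer}")
        name, budget = devices[idx]
        devices[idx] = (name, budget - size)
        device_map[layer] = name
    return device_map

# 30 layers of ~1GB each against a 22GB GPU with CPU overflow:
layers = {f"layer_{i}": 1.0 for i in range(30)}
placement = greedy_device_map(layers, {"0": 22.0, "cpu": 32.0})
print(placement["layer_0"], placement["layer_29"])  # 0 cpu
```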

Device map strategies

"auto" (recommended) lets transformers (via accelerate) place model layers across available GPUs and CPU automatically, based on the free memory of each device.

Memory Limits

The max_memory option sets maximum memory allocation per device. This is useful for multi-GPU setups or when you want to reserve memory for other processes.

Configuration

# Limit VRAM per GPU and CPU memory
max_memory = {"0": "20GB", "1": "20GB", "cpu": "64GB"}

# Single GPU with CPU offload
max_memory = {"0": "22GB", "cpu": "32GB"}

Understanding memory limits

  • GPU memory: Specified by device index ("0", "1", etc.)
  • CPU memory: Specified as "cpu"
  • Units: Use "GB" (gigabytes) or "MB" (megabytes)

Leave some headroom (1-2GB) below your actual VRAM to account for activation memory during inference.
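A small helper shows how these strings translate to bytes and how much headroom a given limit leaves on a card. This is illustrative only; transformers/accelerate parse max_memory strings internally:

```python
def parse_mem(s):
    """Convert a max_memory string like "22GB" or "512MB" to bytes."""
    units = {"GB": 10**9, "MB": 10**6}
    for suffix, factor in units.items():
        if s.endswith(suffix):
            return int(float(s[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized memory string: {s}")

limit = parse_mem("22GB")
vram = parse_mem("24GB")  # e.g. a 24GB consumer card
headroom_gb = (vram - limit) / 10**9
print(headroom_gb)  # 2.0 GB left for activations
```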

Example: Dual GPU setup

config.toml
# Distribute model across 2 GPUs with CPU offload
device_map = "auto"
max_memory = {
    "0": "22GB",  # First GPU (leave 2GB for activations)
    "1": "22GB",  # Second GPU
    "cpu": "48GB" # CPU RAM for overflow
}

Trust Remote Code

Some models on Hugging Face Hub require custom code to run. The trust_remote_code option controls whether to execute this code.

Configuration

# Allow models to run custom code
trust_remote_code = true

# Block custom code (default)
trust_remote_code = false
Only enable trust_remote_code for models from sources you trust. Custom code can potentially execute arbitrary operations on your system.

When to enable

  • Required for some models: Models like Microsoft Phi, Qwen, and others with custom architectures
  • You trust the source: Official model releases from reputable organizations
  • You’ve reviewed the code: Check the model’s modeling_*.py files on Hugging Face

When to keep disabled

  • Unknown sources: Models from unverified publishers
  • Security-critical environments: Production systems, shared infrastructure
  • Standard architectures: Most Llama, Mistral, and Gemma models don’t need it

Complete Example

Here’s a comprehensive model loading configuration for a large model on limited VRAM:
# Try bfloat16 first, fall back to float16
dtypes = ["bfloat16", "float16"]

# Enable 4-bit quantization to fit larger models
quantization = "bnb_4bit"

# Automatic device placement
device_map = "auto"

# Reserve memory for activations
max_memory = {"0": "22GB", "cpu": "32GB"}

# Enable for models that need it (e.g., Qwen)
trust_remote_code = true

Troubleshooting

Out of memory (CUDA OOM) when loading the model

Solutions:
  1. Enable 4-bit quantization: quantization = "bnb_4bit"
  2. Set max_memory to leave headroom for activations
  3. Reduce batch_size (see Optimization Configuration)
  4. Use CPU offloading with max_memory = {"0": "20GB", "cpu": "64GB"}

Model fails to load with a dtype error

Solution: Try specifying dtypes explicitly:
dtypes = ["float16", "float32"]

bfloat16 is not supported on your hardware

Solution: Your GPU is pre-Ampere. Use float16 instead:
dtypes = ["float16", "float32"]

Model complains that trust_remote_code is required

Solution: Enable it if you trust the model source:
trust_remote_code = true
Then review the model's code on Hugging Face before proceeding.

Optimization Settings

Configure batch sizes and performance tuning

Evaluation Settings

Set up prompts and refusal detection