
Overview

Model loading configuration controls how Heretic loads and initializes language models. These settings affect VRAM usage, loading speed, and numerical precision.

Data Types (dtypes)

The dtypes option specifies a list of PyTorch data types to try when loading model tensors. If loading with one dtype fails, Heretic automatically tries the next one in the list.
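The fallback behavior amounts to a simple try-in-order loop. The sketch below is illustrative only, not Heretic's actual implementation; `load_model` is a hypothetical callable standing in for whatever loader is used (e.g. transformers' `from_pretrained`):

```python
def load_with_dtype_fallback(load_model, dtypes):
    """Try each dtype in order and return the first successful load.

    `load_model` is a hypothetical callable that raises on failure,
    standing in for the real model loader.
    """
    errors = {}
    for dtype in dtypes:
        try:
            return dtype, load_model(dtype)
        except Exception as exc:  # e.g. dtype unsupported on this GPU
            errors[dtype] = exc
    raise RuntimeError(f"All dtypes failed: {errors}")

# Example with a stub loader that only supports float16:
def stub_loader(dtype):
    if dtype != "float16":
        raise ValueError(f"{dtype} not supported")
    return "model"

dtype, model = load_with_dtype_fallback(stub_loader, ["auto", "float16"])
print(dtype)  # float16
```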

Configuration

dtypes = [
    "auto",      # Let transformers choose (usually bfloat16)
    "float16",   # Half precision (pre-Ampere GPUs)
    "bfloat16",  # Brain float (better range than float16)
    "float32",   # Full precision (requires more VRAM)
]

Available dtypes

"auto"

Lets transformers choose the dtype from the model's configuration (usually bfloat16 for modern models).
Best for: Most situations; this is the recommended first choice.

"float16"

Uses 16-bit floating point numbers, reducing memory usage by 50% compared to float32. Works on older GPUs that don't support bfloat16.
Best for: Pre-Ampere GPUs (RTX 20 series, GTX 1000 series, V100)
Caution: Has a smaller range than bfloat16, which can cause numerical issues in some models.

"bfloat16"

Google's 16-bit format that maintains the same range as float32 but with lower precision. More numerically stable than float16 for most models.
Best for: Modern GPUs with bfloat16 support (Ampere and newer)
Note: Not supported on pre-Ampere hardware.

"float32"

Standard 32-bit floating point. Highest precision, but requires double the VRAM of the half-precision formats.
Best for: When precision is critical and you have sufficient VRAM (rare)

The default list tries "auto" first, then falls back to the other formats if loading fails. This works well for most scenarios, so you rarely need to change it.
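The range difference between the two 16-bit formats is easy to quantify: float16 has 5 exponent bits and tops out at 65504, while bfloat16 keeps float32's 8 exponent bits and reaches roughly 3.4 × 10³⁸. The arithmetic below is purely illustrative and not tied to any Heretic API:

```python
# float16: 5 exponent bits (max exponent 15), 10 mantissa bits
fp16_max = (2 - 2**-10) * 2**15
# bfloat16: 8 exponent bits (max exponent 127), 7 mantissa bits
bf16_max = (2 - 2**-7) * 2**127

print(fp16_max)           # 65504.0
print(f"{bf16_max:.2e}")  # ~3.39e+38
```

Any intermediate value above 65504 overflows to infinity in float16, which is the source of the numerical issues mentioned above.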

Quantization

Quantization reduces model size by using lower-precision integers to represent weights. This dramatically reduces VRAM requirements but may slightly impact quality.

Configuration

# No quantization (default)
quantization = "none"

# 4-bit quantization with bitsandbytes
quantization = "bnb_4bit"

Available methods

"none" (default)

No quantization. Models are loaded in their native precision (determined by dtypes).
Best for: When you have sufficient VRAM and want maximum quality

"bnb_4bit"

4-bit quantization using bitsandbytes. Reduces VRAM usage by approximately 75% compared to float16.
Best for: Running large models on consumer GPUs (e.g., 70B models on 24GB VRAM)

Quantization requires the bitsandbytes library. Install it with: pip install bitsandbytes
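The ~75% saving follows directly from the bits per weight. A rough back-of-the-envelope estimator (ignoring activation memory and quantization overhead such as scale factors and layers bitsandbytes keeps in higher precision):

```python
def approx_weight_gb(params_billions, bits_per_weight):
    """Rough weight-storage estimate: parameter count x bits per weight, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(approx_weight_gb(70, 16))  # 140.0 GB in float16
print(approx_weight_gb(70, 4))   # 35.0 GB at 4-bit
```

Note that even at 4 bits, a 70B model's weights (~35GB) still exceed 24GB of VRAM, which is why the example configuration below also offloads overflow to CPU via max_memory.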

Example: Loading a 70B model on 24GB VRAM

# Enable 4-bit quantization
quantization = "bnb_4bit"

# Let transformers handle device placement
device_map = "auto"

# Limit memory per device
max_memory = {"0": "22GB", "cpu": "32GB"}

Device Mapping

The device_map option controls how the model is distributed across available devices (GPUs, CPU).

Configuration

# Automatic device placement (recommended)
device_map = "auto"
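Conceptually, "auto" placement walks through the model's layers and assigns each one to the first device with enough remaining memory, spilling to the next GPU and finally to CPU. The greedy sketch below illustrates the idea only; it is not accelerate's actual algorithm:

```python
def greedy_device_map(layer_sizes_gb, device_budgets_gb):
    """Assign layers to devices in order, spilling over when a budget runs out."""
    device_map = {}
    devices = list(device_budgets_gb.items())  # e.g. [("0", 22.0), ("cpu", 32.0)]
    idx = 0
    for layer, size in layer_sizes_gb.items():
        while idx < len(devices) and devices[idx][1] < size:
            idx += 1  # current device is full; move to the next one
        if idx == len(devices):
            raise MemoryError(f"no room for {layer}")
        name, budget = devices[idx]
        devices[idx] = (name, budget - size)
        device_map[layer] = name
    return device_map

# 30 layers of ~1GB each against a 22GB GPU with CPU overflow:
layers = {f"layer_{i}": 1.0 for i in range(30)}
placement = greedy_device_map(layers, {"0": 22.0, "cpu": 32.0})
print(placement["layer_0"], placement["layer_29"])  # 0 cpu
```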

Device map strategies

"auto" (recommended) lets transformers (via accelerate) place model layers across available GPUs and CPU automatically, based on the free memory of each device.

Memory Limits

The max_memory option sets maximum memory allocation per device. This is useful for multi-GPU setups or when you want to reserve memory for other processes.

Configuration

# Limit VRAM per GPU and CPU memory
max_memory = {"0": "20GB", "1": "20GB", "cpu": "64GB"}

# Single GPU with CPU offload
max_memory = {"0": "22GB", "cpu": "32GB"}

Understanding memory limits

  • GPU memory: Specified by device index ("0", "1", etc.)
  • CPU memory: Specified as "cpu"
  • Units: Use "GB" (gigabytes) or "MB" (megabytes)

Leave some headroom (1-2GB) below your actual VRAM to account for activation memory during inference.
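A small helper shows how these strings translate to bytes and how much headroom a given limit leaves on a card. This is illustrative only; transformers/accelerate parse max_memory strings internally:

```python
def parse_mem(s):
    """Convert a max_memory string like "22GB" or "512MB" to bytes."""
    units = {"GB": 10**9, "MB": 10**6}
    for suffix, factor in units.items():
        if s.endswith(suffix):
            return int(float(s[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized memory string: {s}")

limit = parse_mem("22GB")
vram = parse_mem("24GB")  # e.g. a 24GB consumer card
headroom_gb = (vram - limit) / 10**9
print(headroom_gb)  # 2.0 GB left for activations
```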

Example: Dual GPU setup

config.toml
# Distribute model across 2 GPUs with CPU offload
device_map = "auto"
max_memory = {
    "0": "22GB",  # First GPU (leave 2GB for activations)
    "1": "22GB",  # Second GPU
    "cpu": "48GB" # CPU RAM for overflow
}

Trust Remote Code

Some models on Hugging Face Hub require custom code to run. The trust_remote_code option controls whether to execute this code.

Configuration

# Allow models to run custom code
trust_remote_code = true

# Block custom code (default)
trust_remote_code = false
Only enable trust_remote_code for models from sources you trust. Custom code can potentially execute arbitrary operations on your system.

When to enable

  • Required for some models: Models like Microsoft Phi, Qwen, and others with custom architectures
  • You trust the source: Official model releases from reputable organizations
  • You’ve reviewed the code: Check the model’s modeling_*.py files on Hugging Face

When to keep disabled

  • Unknown sources: Models from unverified publishers
  • Security-critical environments: Production systems, shared infrastructure
  • Standard architectures: Most Llama, Mistral, and Gemma models don’t need it

Complete Example

Here’s a comprehensive model loading configuration for a large model on limited VRAM:
# Try bfloat16 first, fall back to float16
dtypes = ["bfloat16", "float16"]

# Enable 4-bit quantization to fit larger models
quantization = "bnb_4bit"

# Automatic device placement
device_map = "auto"

# Reserve memory for activations
max_memory = {"0": "22GB", "cpu": "32GB"}

# Enable for models that need it (e.g., Qwen)
trust_remote_code = true

Troubleshooting

Out of memory (CUDA OOM) when loading the model

Solutions:
  1. Enable 4-bit quantization: quantization = "bnb_4bit"
  2. Set max_memory to leave headroom for activations
  3. Reduce batch_size (see Optimization Configuration)
  4. Use CPU offloading with max_memory = {"0": "20GB", "cpu": "64GB"}

Model fails to load with a dtype error

Solution: Try specifying dtypes explicitly:
dtypes = ["float16", "float32"]

bfloat16 is not supported on your hardware

Solution: Your GPU is pre-Ampere. Use float16 instead:
dtypes = ["float16", "float32"]

Model complains that trust_remote_code is required

Solution: Enable it if you trust the model source:
trust_remote_code = true
Then review the model's code on Hugging Face before proceeding.

Optimization Settings

Configure batch sizes and performance tuning

Evaluation Settings

Set up prompts and refusal detection