Heretic includes sophisticated hardware detection and optimization features that automatically tune performance for your system. This guide covers both automatic and manual optimization techniques.

Automatic Batch Size Detection

By default, Heretic automatically determines the optimal batch size for your hardware:
config.toml
# Automatic batch size detection (default)
batch_size = 0  # 0 = auto-detect
When set to 0, Heretic will:
1. Benchmark different batch sizes: starting from batch size 1, double the batch size and test performance (2, 4, 8, 16, …)
2. Measure throughput: for each batch size, measure tokens/second after a warmup run
3. Find the optimal size: select the batch size that achieves the highest throughput before OOM
4. Use it throughout the session: apply the chosen batch size to all subsequent operations

How It Works

The automatic detection process:
# Pseudo-code from main.py:332-376
batch_size = 1
best_batch_size = -1
best_performance = -1

while batch_size <= max_batch_size:
    try:
        # Warmup run to build computation graph
        model.get_responses(prompts)
        
        # Benchmark run
        start_time = time.perf_counter()
        responses = model.get_responses(prompts)
        end_time = time.perf_counter()
        
        # Calculate throughput
        performance = total_tokens / (end_time - start_time)
        
        if performance > best_performance:
            best_batch_size = batch_size
            best_performance = performance
            
    except Exception:
        # OOM or other error - stop here
        break
        
    batch_size *= 2
Automatic detection typically adds 1-3 minutes to startup time but ensures optimal performance throughout the entire run.
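The doubling search above can be distilled into a small self-contained helper. This is an illustrative sketch, not Heretic's actual API: the `benchmark` callback stands in for the warmup-plus-timed-run step from the pseudocode.

```python
def find_best_batch_size(benchmark, max_batch_size=128):
    """Return the batch size with the highest measured throughput.

    `benchmark(batch_size)` runs a warmed-up batch and returns
    tokens/second; it should raise RuntimeError on OOM.
    """
    batch_size = 1
    best_batch_size, best_performance = -1, -1.0
    while batch_size <= max_batch_size:
        try:
            performance = benchmark(batch_size)
        except RuntimeError:
            break  # OOM (or similar): stop the search here
        if performance > best_performance:
            best_batch_size, best_performance = batch_size, performance
        batch_size *= 2
    return best_batch_size
```

With a synthetic benchmark whose throughput keeps rising until memory runs out, the helper settles on the last size that completed successfully.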

Manual Batch Size Tuning

For more control, you can set the batch size manually:
# Set explicit batch size
batch_size = 16

When to Use Manual Tuning

  • Reproducibility: ensure consistent behavior across multiple runs
  • Shared resources: control memory usage on multi-user systems
  • Known configuration: skip detection when you already know the optimal value
  • Debugging: isolate issues by fixing the batch size

Maximum Batch Size Limit

Control the upper bound for automatic detection:
config.toml
# Prevent OOM during batch size detection
max_batch_size = 128  # default
Lower this value if automatic detection causes OOM errors or takes too long.
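Because the search doubles from 1, this cap directly bounds how many candidate sizes get benchmarked. A small sketch of the candidate list (illustrative, not Heretic's code):

```python
def candidate_batch_sizes(max_batch_size):
    """Batch sizes the doubling search will try, in order."""
    sizes, size = [], 1
    while size <= max_batch_size:
        sizes.append(size)
        size *= 2
    return sizes
```

With the default cap of 128 the search tries eight sizes (1 through 128); lowering the cap to 32 trims it to six, shortening detection proportionally.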

Multi-GPU Configuration

Heretic automatically detects and utilizes multiple GPUs:
Detected 2 CUDA device(s) (49.14 GB total VRAM):
* GPU 0: NVIDIA RTX 3090 (24.57 GB)
* GPU 1: NVIDIA RTX 3090 (24.57 GB)

Device Map Strategies

Control how the model is distributed across devices:
# Automatically distribute across all devices
device_map = "auto"

Per-Device Memory Limits

Set maximum memory allocation per device:
config.toml
# Limit memory usage per device
max_memory = {"0": "20GB", "1": "20GB", "cpu": "64GB"}
This is useful for:
  • Sharing GPUs with other processes
  • Preventing a single model from consuming all VRAM
  • Forcing CPU offloading for memory-intensive layers
When using max_memory, make sure the total allocated memory is sufficient for your model; overly restrictive limits will cause loading failures.
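One way to sanity-check a budget before loading is to total the per-device limits against a rough model footprint. This is a hypothetical helper for illustration; Heretic does not ship such a check:

```python
def parse_size(size):
    """Convert a size string like '20GB' or '512MB' to bytes."""
    units = {"GB": 1024**3, "MB": 1024**2}
    for suffix, multiplier in units.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * multiplier)
    raise ValueError(f"unrecognized size: {size!r}")

def budget_is_sufficient(max_memory, model_bytes):
    """True if the combined device limits can hold the model weights."""
    return sum(parse_size(v) for v in max_memory.values()) >= model_bytes
```

For example, an 8B-parameter model in 16-bit weights needs roughly 16 GB, so a single "20GB" limit passes while "12GB" would not, and that is before accounting for activations and KV cache.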

Performance on Different Hardware

From the README, here are typical processing times:

RTX 3090 Performance

Model: Llama-3.1-8B-Instruct
Configuration: Default settings (200 trials)
Duration: ~45 minutes
This includes:
  • Model loading
  • Batch size detection
  • 200 optimization trials
  • Evaluation
Smaller models (4B-7B) typically complete in 20-40 minutes, while larger models (70B+) may take several hours even with quantization.

Duration Estimates by Model Size

| Model Size | Hardware | Quantization | Estimated Time |
|------------|----------|--------------|----------------|
| 4B-7B      | RTX 3090 | No           | 20-30 min      |
| 8B-13B     | RTX 3090 | No           | 40-60 min      |
| 27B-34B    | RTX 3090 | Yes          | 2-4 hours      |
| 70B+       | RTX 3090 | Yes          | 4-8 hours      |
These are rough estimates for 200 trials with default settings. Actual time varies based on model architecture and prompt datasets.

Advanced Memory Optimization

Expandable Segments

Heretic automatically enables PyTorch expandable segments to reduce memory fragmentation:
# Enabled automatically in main.py:133-137
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
This is particularly beneficial for multi-GPU setups.

TorchDynamo Cache

The compilation cache is increased during batch size detection:
# From main.py:222
torch._dynamo.config.cache_size_limit = 64
This prevents errors from excessive recompilation during the batch size search.

Supported Accelerators

Heretic supports a wide range of hardware. NVIDIA CUDA GPUs are the best supported and recommended for most users:
# Automatic detection
Detected 1 CUDA device(s):
* GPU 0: NVIDIA RTX 4090 (24.00 GB)

Optimization Best Practices

1. Start with defaults: let automatic batch size detection find the optimal setting.
   batch_size = 0
   device_map = "auto"
2. Enable quantization for large models: use 4-bit quantization for models >13B on consumer GPUs.
   quantization = "bnb_4bit"
3. Monitor memory usage: watch VRAM during processing; if it is near capacity, reduce max_batch_size.
   nvidia-smi -l 1  # Monitor VRAM in real-time
4. Tune for your workload: if running many short sessions, fix the batch size to skip detection.
   batch_size = 16  # from previous detection run

Configuration Examples

Single High-End GPU

config.toml
# RTX 4090 or similar (24 GB)
device_map = "auto"
batch_size = 0  # auto-detect
max_batch_size = 128
quantization = "none"  # full precision

Consumer GPU with Limited VRAM

config.toml
# RTX 3060 or similar (12 GB)
device_map = "auto"
batch_size = 0
max_batch_size = 32  # limit exploration
quantization = "bnb_4bit"  # essential for larger models

Multi-GPU Server

config.toml
# 4x GPU setup
device_map = "auto"
batch_size = 0
max_batch_size = 256  # higher limit for more VRAM
quantization = "none"

# Optional: reserve some VRAM for other tasks
# max_memory = {"0": "20GB", "1": "20GB", "2": "20GB", "3": "20GB"}

CPU Offloading

config.toml
# When model doesn't fit in VRAM
device_map = "auto"
max_memory = {"0": "20GB", "cpu": "128GB"}
batch_size = 4  # smaller for slower CPU offload
quantization = "bnb_4bit"

Troubleshooting

Out of Memory (OOM)

Symptoms: RuntimeError: CUDA out of memory

Solutions:
1. Enable quantization:
   quantization = "bnb_4bit"
2. Reduce the maximum batch size:
   max_batch_size = 32
3. Set memory limits:
   max_memory = {"0": "22GB"}  # leave 2GB headroom
4. Use CPU offloading:
   max_memory = {"0": "20GB", "cpu": "64GB"}
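Deriving headroom-aware limits by hand is easy to get wrong across several GPUs; a small illustrative helper (not part of Heretic) can build the max_memory mapping:

```python
def memory_limits(gpu_gb, num_gpus=1, headroom_gb=2.0, cpu_gb=None):
    """Build a max_memory mapping that reserves headroom on each GPU."""
    limits = {str(i): f"{gpu_gb - headroom_gb:g}GB" for i in range(num_gpus)}
    if cpu_gb is not None:
        limits["cpu"] = f"{cpu_gb:g}GB"
    return limits
```

For a single 24 GB card this yields {"0": "22GB"}, i.e. 2 GB of headroom; adding cpu_gb enables the offloading variant.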

Slow Batch Size Detection

Symptoms: Detection takes >5 minutes

Solutions:
  • Lower max_batch_size to reduce search space
  • Set explicit batch_size based on previous runs
  • Use a smaller model for initial testing

Suboptimal Performance

Symptoms: Low tokens/second during processing

Solutions:
  • Verify automatic detection chose a reasonable batch size
  • Check if CPU offloading is active (slow)
  • Ensure model fits entirely in VRAM
  • Monitor GPU utilization with nvidia-smi
