Heretic includes sophisticated hardware detection and optimization features that automatically tune performance for your system. This guide covers both automatic and manual optimization techniques.
Automatic Batch Size Detection
By default, Heretic automatically determines the optimal batch size for your hardware:
```
# Automatic batch size detection (default)
batch_size = 0  # 0 = auto-detect
```
When set to 0, Heretic will:

1. **Benchmark different batch sizes.** Starting from batch size 1, Heretic doubles the batch size and tests performance (2, 4, 8, 16, …).
2. **Measure throughput.** For each batch size, it measures tokens/second after a warmup run.
3. **Find the optimal size.** It selects the batch size that achieves the highest throughput before hitting OOM.
4. **Use it throughout the session.** The chosen batch size is applied to all subsequent operations.
How It Works
The automatic detection process:
```python
# Pseudo-code from main.py:332-376
batch_size = 1
best_batch_size = -1
best_performance = -1
while batch_size <= max_batch_size:
    try:
        # Warmup run to build computation graph
        model.get_responses(prompts)

        # Benchmark run
        start_time = time.perf_counter()
        responses = model.get_responses(prompts)
        end_time = time.perf_counter()

        # Calculate throughput in tokens/second
        performance = total_tokens / (end_time - start_time)
        if performance > best_performance:
            best_batch_size = batch_size
            best_performance = performance
    except Exception:
        # OOM or other error - stop here
        break
    batch_size *= 2
```
Automatic detection typically adds 1-3 minutes to startup time but ensures optimal performance throughout the entire run.
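The search loop above can be exercised with a toy stand-in for the real model call. Here `fake_benchmark` is a hypothetical function, not part of Heretic; it simulates throughput that rises with batch size until a simulated OOM at 32:

```python
def find_best_batch_size(benchmark, max_batch_size=128):
    """Doubling search: probe 1, 2, 4, ... and keep the fastest size."""
    batch_size = 1
    best_batch_size = -1
    best_performance = -1.0
    while batch_size <= max_batch_size:
        try:
            performance = benchmark(batch_size)  # tokens/second
        except MemoryError:
            break  # OOM: stop searching, keep the best size found so far
        if performance > best_performance:
            best_batch_size = batch_size
            best_performance = performance
        batch_size *= 2
    return best_batch_size

def fake_benchmark(batch_size):
    # Throughput grows linearly until a simulated OOM limit of 32.
    if batch_size > 32:
        raise MemoryError
    return batch_size * 100.0

print(find_best_batch_size(fake_benchmark))  # → 32
```

Note that an OOM ends the search rather than aborting the run: the best size found so far is kept, which is why detection is safe to leave enabled.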
Manual Batch Size Tuning
For more control, you can set the batch size manually:
```
# Set explicit batch size
batch_size = 16
```
When to Use Manual Tuning
- **Reproducibility:** ensure consistent behavior across multiple runs
- **Shared resources:** control memory usage on multi-user systems
- **Known configuration:** skip detection when you already know the optimal value
- **Debugging:** isolate issues by fixing the batch size
Maximum Batch Size Limit
Control the upper bound for automatic detection:
```
# Prevent OOM during batch size detection
max_batch_size = 128  # default
```
Lower this value if automatic detection causes OOM errors or takes too long.
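Because the search doubles from 1, `max_batch_size` directly caps how many sizes are probed. A small illustration (not Heretic's actual code):

```python
def probed_sizes(max_batch_size):
    """Return the batch sizes the doubling search would try."""
    sizes, b = [], 1
    while b <= max_batch_size:
        sizes.append(b)
        b *= 2
    return sizes

print(probed_sizes(128))  # → [1, 2, 4, 8, 16, 32, 64, 128]
print(probed_sizes(32))   # → [1, 2, 4, 8, 16, 32]
```

Halving the limit removes only one probe, but that last probe is the largest and slowest, so it often dominates detection time.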
Multi-GPU Configuration
Heretic automatically detects and utilizes multiple GPUs:
```
Detected 2 CUDA device(s) (49.14 GB total VRAM):
* GPU 0: NVIDIA RTX 3090 (24.57 GB)
* GPU 1: NVIDIA RTX 3090 (24.57 GB)
```
Device Map Strategies
Control how the model is distributed across devices:
Three strategies are available: automatic (the default), a specific GPU, or a custom distribution. Automatic is recommended for most setups:

```
# Automatically distribute across all devices
device_map = "auto"
```
Per-Device Memory Limits
Set maximum memory allocation per device:
```
# Limit memory usage per device
max_memory = {"0": "20GB", "1": "20GB", "cpu": "64GB"}
```
This is useful for:
- Sharing GPUs with other processes
- Preventing a single model from consuming all VRAM
- Forcing CPU offloading for memory-intensive layers
When using max_memory, make sure the total allocated memory is sufficient for your model. Overly restrictive limits will cause loading failures.
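As a quick sanity check, you can total the configured limits before loading. `parse_limit` and `total_limit_gb` below are hypothetical helpers for illustration, not part of Heretic's API:

```python
UNITS = {"GB": 1024**3, "MB": 1024**2}

def parse_limit(limit: str) -> int:
    """Convert a limit string like '20GB' into bytes."""
    for suffix, factor in UNITS.items():
        if limit.endswith(suffix):
            return int(limit[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized memory limit: {limit!r}")

def total_limit_gb(max_memory: dict) -> float:
    """Sum all per-device limits, in GB."""
    return sum(parse_limit(v) for v in max_memory.values()) / 1024**3

max_memory = {"0": "20GB", "1": "20GB", "cpu": "64GB"}
print(total_limit_gb(max_memory))  # → 104.0
```

If the total is below the model's rough footprint (parameters times bytes per parameter, plus activation overhead), loading will fail before any trials run.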
Typical Processing Times

From the README, here are typical processing times:

- Model: Llama-3.1-8B-Instruct
- Configuration: default settings (200 trials)
- Duration: ~45 minutes
This includes:

- Model loading
- Batch size detection
- 200 optimization trials
- Evaluation
Smaller models (4B-7B) typically complete in 20-40 minutes, while larger models (70B+) may take several hours even with quantization.
Duration Estimates by Model Size
| Model Size | Hardware | Quantization | Estimated Time |
|------------|----------|--------------|----------------|
| 4B-7B      | RTX 3090 | No           | 20-30 min      |
| 8B-13B     | RTX 3090 | No           | 40-60 min      |
| 27B-34B    | RTX 3090 | Yes          | 2-4 hours      |
| 70B+       | RTX 3090 | Yes          | 4-8 hours      |
These are rough estimates for 200 trials with default settings. Actual time varies based on model architecture and prompt datasets.
Advanced Memory Optimization
Expandable Segments
Heretic automatically enables PyTorch expandable segments to reduce memory fragmentation:
```python
# Enabled automatically in main.py:133-137
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
```
This is particularly beneficial for multi-GPU setups.
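If you embed Heretic in a larger script, the same setting can be applied manually before importing torch. Using `setdefault` here is an assumption about the desired behavior (don't clobber a value the user has already exported), not a claim about Heretic's exact code:

```python
import os

# Reduce allocator fragmentation without overriding a pre-existing
# user setting (assumed behavior; compare main.py:133-137).
os.environ.setdefault("PYTORCH_ALLOC_CONF", "expandable_segments:True")
```

The variable must be set before the first torch import, since the allocator reads it at initialization.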
TorchDynamo Cache
The compilation cache is increased during batch size detection:
```python
# From main.py:222
torch._dynamo.config.cache_size_limit = 64
```
This prevents errors from excessive recompilation during the batch size search.
Supported Accelerators
Heretic supports a wide range of hardware:
NVIDIA CUDA

```
# Automatic detection
Detected 1 CUDA device(s):
* GPU 0: NVIDIA RTX 4090 (24.00 GB)
```

Best supported; recommended for most users.

Intel XPU

```
# Automatic detection
Detected 1 XPU device(s):
* XPU 0: Intel Data Center GPU Max 1550
```

For Intel Data Center GPUs.

Apple Metal (MPS)

```
# Automatic detection
Detected 1 MPS device (Apple Metal)
```

For Apple Silicon Macs.

Other Accelerators

Heretic also supports:

- MLU (Cambricon)
- SDAA (SambaNova)
- MUSA (Moore Threads)
- NPU (Ascend CANN)

Detection is automatic based on available hardware.
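Conceptually, this amounts to walking a priority-ordered list of backends and using the first one present. The sketch below injects availability flags for illustration; the actual probe calls and priority order in Heretic are assumptions:

```python
# Hypothetical priority order; real detection queries each runtime.
BACKENDS = ["cuda", "xpu", "mps", "mlu", "sdaa", "musa", "npu"]

def pick_backend(available: set) -> str:
    """Return the first available backend, falling back to CPU."""
    for name in BACKENDS:
        if name in available:
            return name
    return "cpu"

print(pick_backend({"mps"}))          # → mps
print(pick_backend({"xpu", "cuda"}))  # → cuda
```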
Optimization Best Practices
1. **Start with defaults.** Let automatic batch size detection find the optimal setting:

   ```
   batch_size = 0
   device_map = "auto"
   ```

2. **Enable quantization for large models.** Use 4-bit quantization for models >13B on consumer GPUs:

   ```
   quantization = "bnb_4bit"
   ```

3. **Monitor memory usage.** Watch VRAM during processing; if it is near capacity, reduce `max_batch_size`:

   ```
   nvidia-smi -l 1  # Monitor VRAM in real-time
   ```

4. **Tune for your workload.** If running many short sessions, fix the batch size to skip detection:

   ```
   batch_size = 16  # from a previous detection run
   ```
Configuration Examples
Single High-End GPU
```
# RTX 4090 or similar (24 GB)
device_map = "auto"
batch_size = 0  # auto-detect
max_batch_size = 128
quantization = "none"  # full precision
```
Consumer GPU with Limited VRAM
```
# RTX 3060 or similar (12 GB)
device_map = "auto"
batch_size = 0
max_batch_size = 32  # limit exploration
quantization = "bnb_4bit"  # essential for larger models
```
Multi-GPU Server
```
# 4x GPU setup
device_map = "auto"
batch_size = 0
max_batch_size = 256  # higher limit for more VRAM
quantization = "none"
# Optional: reserve some VRAM for other tasks
# max_memory = {"0": "20GB", "1": "20GB", "2": "20GB", "3": "20GB"}
```
CPU Offloading
```
# When model doesn't fit in VRAM
device_map = "auto"
max_memory = {"0": "20GB", "cpu": "128GB"}
batch_size = 4  # smaller for slower CPU offload
quantization = "bnb_4bit"
```
Troubleshooting
Out of Memory (OOM)
Symptoms: `RuntimeError: CUDA out of memory`
Solutions:
1. **Enable quantization:**

   ```
   quantization = "bnb_4bit"
   ```

2. **Set memory limits:**

   ```
   max_memory = {"0": "22GB"}  # leave 2GB headroom
   ```

3. **Use CPU offloading:**

   ```
   max_memory = {"0": "20GB", "cpu": "64GB"}
   ```
Slow Batch Size Detection
Symptoms: Detection takes >5 minutes
Solutions:
- Lower `max_batch_size` to reduce the search space
- Set an explicit `batch_size` based on previous runs
- Use a smaller model for initial testing
Slow Processing

Symptoms: Low tokens/second during processing

Solutions:

- Verify that automatic detection chose a reasonable batch size
- Check whether CPU offloading is active (it is slow)
- Ensure the model fits entirely in VRAM
- Monitor GPU utilization with `nvidia-smi`