OpenCLIP has beta support for int8 training and inference using the bitsandbytes library. This enables faster training with lower memory usage while maintaining accuracy, particularly beneficial for large models like ViT-Huge.

Overview

Int8 training replaces standard linear layers with 8-bit quantized versions that:
  • Reduce memory usage for weights and activations
  • Accelerate matrix multiplications
  • Maintain numerical stability through specialized quantization schemes
  • Preserve accuracy with minimal degradation
For CLIP ViT-Huge models, int8 training provides approximately 10% training speedup with no accuracy loss.
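The core idea behind int8 quantization can be illustrated with a minimal row-wise sketch: each row of floats is mapped to 8-bit integers plus one floating-point scale. This is illustrative only, not the bitsandbytes implementation:

```python
def quantize_rowwise(row):
    """Map a row of floats to int8 values plus a per-row scale."""
    scale = max(abs(x) for x in row) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero row
    q = [round(x / scale) for x in row]  # each value fits in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 + scale."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.02]
q, scale = quantize_rowwise(row)
approx = dequantize(q, scale)
# approx is close to row; per-element error is bounded by scale / 2
```

Storing `q` takes one byte per value instead of four, which is where the memory savings come from; the rounding step is also why quantization schemes must be designed carefully to keep training stable.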

Requirements

Install the bitsandbytes library:
pip install bitsandbytes
Note: bitsandbytes requires CUDA and is currently only available for NVIDIA GPUs.

Basic Usage

Enable int8 training with the --use-bnb-linear flag:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32

Available Linear Layer Types

OpenCLIP supports two int8 linear layer implementations from bitsandbytes:

SwitchBackLinearGlobal

Standard 8-bit linear layer using the SwitchBack scheme, which runs the forward matmul in int8 while keeping the weight-gradient matmul in 16-bit for stability:
--use-bnb-linear SwitchBackLinearGlobal
Characteristics:
  • Good balance of speed and memory efficiency
  • Recommended for most use cases
  • Stable gradient computation
  • Works well with all model sizes
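The int8 forward pass of such a layer can be sketched as: quantize both operands with per-tensor scales, multiply in integer arithmetic with int32 accumulation, then rescale back to float. This numpy sketch is illustrative only and is not the bitsandbytes kernel:

```python
import numpy as np

def int8_matmul(x, w):
    """Int8 linear forward: quantize, integer matmul, rescale."""
    sx = np.abs(x).max() / 127.0          # per-tensor scale for activations
    sw = np.abs(w).max() / 127.0          # per-tensor scale for weights
    qx = np.round(x / sx).astype(np.int8)
    qw = np.round(w / sw).astype(np.int8)
    # Accumulate in int32 to avoid overflow, then rescale back to float
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    return acc.astype(np.float32) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((8, 3)).astype(np.float32)
out = int8_matmul(x, w)
ref = x @ w  # close to out, up to quantization error
```

The speedup comes from the integer matmul; the accuracy cost is the small rounding error visible when comparing `out` against `ref`.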

SwitchBackLinearGlobalMemEfficient

Memory-optimized 8-bit linear layer:
--use-bnb-linear SwitchBackLinearGlobalMemEfficient
Characteristics:
  • Further reduces memory usage
  • Slightly slower than standard version
  • Best for very large models or limited memory
  • Useful when training huge models (ViT-H, ViT-g)

Performance Benefits

Training Speed

ViT-Huge Model:
  • Standard training: baseline
  • Int8 training: ~10% faster (a ~1.1x speedup)
Memory Usage:
  • Reduced weight storage (8-bit vs 16/32-bit)
  • Lower activation memory
  • Enables larger batch sizes
  • Can train larger models on same hardware
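As a rough back-of-the-envelope illustration of the weight-storage savings (the ~632M parameter count for a ViT-H image tower is approximate, and activations, gradients, and optimizer state are excluded):

```python
# Approximate weight memory for a ViT-H-sized image tower
params = 632_000_000           # approximate parameter count, for illustration
fp32_gib = params * 4 / 2**30  # 4 bytes per weight
fp16_gib = params * 2 / 2**30  # 2 bytes per weight
int8_gib = params * 1 / 2**30  # 1 byte per weight
print(f"fp32: {fp32_gib:.2f} GiB, fp16: {fp16_gib:.2f} GiB, int8: {int8_gib:.2f} GiB")
```

Halving weight storage relative to fp16 is what frees headroom for larger batch sizes or larger models on the same hardware.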

Accuracy

Int8 training maintains accuracy:
  • No significant accuracy degradation observed
  • Contrastive learning is robust to quantization
  • Zero-shot performance remains comparable
  • Fine-tuning results are preserved

Examples

Training ViT-B-32 with Int8

python -m open_clip_train.main \
    --train-data "/data/cc12m/train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 8 \
    --model ViT-B-32 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --imagenet-val /data/imagenet/validation/

Training ViT-L-14 with Int8

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion400m/train-{0000..4000}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 8 \
    --model ViT-L-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --warmup 2000 \
    --lr 1e-3 \
    --epochs 32

Training ViT-H-14 with Memory-Efficient Int8

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion2b/train-{00000..20000}.tar" \
    --train-num-samples 2000000000 \
    --dataset-type webdataset \
    --batch-size 128 \
    --precision amp \
    --workers 8 \
    --model ViT-H-14 \
    --use-bnb-linear SwitchBackLinearGlobalMemEfficient \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --accum-freq 2 \
    --warmup 2000 \
    --lr 5e-4 \
    --epochs 32

Combining with Other Optimizations

Int8 training works well with other memory and speed optimizations:

With Mixed Precision

python -m open_clip_train.main \
    --model ViT-L-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --precision amp \
    --train-data "/data/train.tar" \
    --batch-size 256

With Gradient Checkpointing

python -m open_clip_train.main \
    --model ViT-H-14 \
    --use-bnb-linear SwitchBackLinearGlobalMemEfficient \
    --grad-checkpointing \
    --precision amp \
    --train-data "/data/train.tar" \
    --batch-size 128

With Gradient Accumulation

python -m open_clip_train.main \
    --model ViT-H-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --accum-freq 4 \
    --batch-size 64 \
    --precision amp \
    --grad-checkpointing \
    --train-data "/data/train.tar"

With Distributed Training

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-L-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --precision amp \
    --local-loss \
    --gather-with-grad \
    --train-data "/data/train.tar" \
    --batch-size 256

Int8 Inference

You can also load and use int8 models for inference:
import torch
import open_clip
from PIL import Image

# Create model with int8 layers (requires bitsandbytes)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)

# Replace linear layers with int8 versions, copying over the pretrained weights
import bitsandbytes as bnb

def replace_linear_with_int8(module):
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            int8_layer = bnb.nn.triton_based_modules.SwitchBackLinearGlobal(
                child.in_features,
                child.out_features,
                bias=child.bias is not None
            )
            # Without this copy the new layer would start with random weights
            int8_layer.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                int8_layer.bias.data.copy_(child.bias.data)
            setattr(module, name, int8_layer)
        else:
            replace_linear_with_int8(child)

replace_linear_with_int8(model)
model.eval()

# Use model for inference
image = preprocess(Image.open("image.jpg")).unsqueeze(0)
text = open_clip.tokenize(["a photo of a cat", "a photo of a dog"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(similarity)

Tutorial Notebook

For a detailed walkthrough of int8 training and inference, see the tutorial notebook:
tutorials/int8_tutorial.ipynb
The notebook covers:
  • Setting up int8 training
  • Comparing performance with standard training
  • Memory usage analysis
  • Accuracy evaluation
  • Inference optimization
  • Best practices

Current Limitations

Attention Layers

Currently, only linear layers are replaced with int8 versions; attention layers still run in standard precision. Planned improvements include:
  • Int8 attention layers
  • Further speedups once attention is refactored
  • Full model quantization

Platform Support

  • Supported: NVIDIA GPUs with CUDA
  • Not Supported: CPU, AMD GPUs, Apple Silicon
  • Requires CUDA-compatible bitsandbytes installation

Optimizer State

Optimizer states (Adam, AdamW) still use higher precision:
  • Int8 only applies to model weights
  • Gradients are computed in higher precision
  • Optimizer momentum and variance use fp32
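To see why optimizer state still dominates per-parameter memory, here is a hedged byte count per parameter under the scheme described above (int8 weight, fp16 gradient, fp32 Adam moments; actual layouts vary by optimizer and precision settings):

```python
weight = 1       # int8 weight, bytes per parameter
grad = 2         # fp16 gradient under mixed precision
exp_avg = 4      # fp32 Adam first moment ("m")
exp_avg_sq = 4   # fp32 Adam second moment ("v")
total = weight + grad + exp_avg + exp_avg_sq
# Optimizer state alone (8 B/param) outweighs the int8 weight (1 B/param) 8:1
```

This is why quantizing weights alone has a ceiling on total memory savings during training, and why 8-bit optimizers are a separate line of work.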

When to Use Int8

  1. Large Models
    • ViT-Huge and larger
    • Models that are close to memory limits
    • When you want to increase batch size
  2. Limited GPU Memory
    • Training on consumer GPUs (RTX 3090, 4090)
    • Maximizing model size on available hardware
    • Enabling larger experiments
  3. Speed-Critical Training
    • When 10% speedup matters
    • Large-scale training runs
    • Cost-sensitive training

Not Necessary For:

  1. Small Models (ViT-B-32, ResNet-50)
    • Limited benefit for smaller models
    • Standard training is already fast enough
  2. Abundant Memory
    • If memory is not a constraint
    • When using small batch sizes
  3. Maximum Precision Needed
    • Research requiring exact reproducibility
    • When numerical precision is critical

Best Practices

  1. Start with SwitchBackLinearGlobal
    • Good default choice for most use cases
    • Balance of speed and memory
  2. Use with Mixed Precision
    • Combine --use-bnb-linear with --precision amp
    • Maximizes speed benefits
  3. Monitor Accuracy
    • Run regular zero-shot evaluations
    • Compare with baseline runs
    • Check final model performance
  4. Test Before Large Runs
    • Validate int8 training on small dataset first
    • Ensure stability and convergence
    • Measure actual speedup on your hardware
  5. Enable for Large Models
    • Most beneficial for ViT-L and larger
    • Use SwitchBackLinearGlobalMemEfficient for ViT-H/ViT-g

Troubleshooting

Import Error

ImportError: cannot import name 'SwitchBackLinearGlobal' from 'bitsandbytes'
Solution: Install or update bitsandbytes:
pip install --upgrade bitsandbytes

CUDA Error

RuntimeError: CUDA error: invalid device function
Solution: Ensure bitsandbytes is installed with correct CUDA version:
pip uninstall bitsandbytes
pip install bitsandbytes --no-cache-dir

Slower Than Expected

  • Ensure CUDA is properly installed
  • Check GPU utilization (should be high)
  • Verify mixed precision is enabled (--precision amp)
  • Some models benefit more than others

Numerical Issues

  • Increase warmup: --warmup 5000
  • Reduce learning rate: --lr 5e-4
  • Enable gradient clipping: --grad-clip-norm 1.0
  • Try SwitchBackLinearGlobal instead of MemEfficient version
