OpenCLIP has beta support for int8 training and inference using the bitsandbytes library. This enables faster training with lower memory usage while maintaining accuracy, particularly beneficial for large models like ViT-Huge.

Overview

Int8 training replaces standard linear layers with 8-bit quantized versions that:
  • Reduce memory usage for weights and activations
  • Accelerate matrix multiplications
  • Maintain numerical stability through specialized quantization schemes
  • Preserve accuracy with minimal degradation
For CLIP ViT-Huge models, int8 training provides approximately 10% training speedup with no accuracy loss.
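The core idea behind int8 quantization can be illustrated with a minimal row-wise sketch: each row of floats is mapped to 8-bit integers plus one floating-point scale. This is illustrative only, not the bitsandbytes implementation:

```python
def quantize_rowwise(row):
    """Map a row of floats to int8 values plus a per-row scale."""
    scale = max(abs(x) for x in row) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero row
    q = [round(x / scale) for x in row]  # each value fits in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 + scale."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.02]
q, scale = quantize_rowwise(row)
approx = dequantize(q, scale)
# approx is close to row; per-element error is bounded by scale / 2
```

Storing `q` takes one byte per value instead of four, which is where the memory savings come from; the rounding step is also why quantization schemes must be designed carefully to keep training stable.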

Requirements

Install the bitsandbytes library:
pip install bitsandbytes
Note: bitsandbytes requires CUDA and is currently only available for NVIDIA GPUs.

Basic Usage

Enable int8 training with the --use-bnb-linear flag:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32

Available Linear Layer Types

OpenCLIP supports two int8 linear layer implementations from bitsandbytes:

SwitchBackLinearGlobal

Standard 8-bit linear layer using the SwitchBack scheme, which runs the forward matmul in int8 while keeping the weight-gradient matmul in 16-bit for stability:
--use-bnb-linear SwitchBackLinearGlobal
Characteristics:
  • Good balance of speed and memory efficiency
  • Recommended for most use cases
  • Stable gradient computation
  • Works well with all model sizes
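The int8 forward pass of such a layer can be sketched as: quantize both operands with per-tensor scales, multiply in integer arithmetic with int32 accumulation, then rescale back to float. This numpy sketch is illustrative only and is not the bitsandbytes kernel:

```python
import numpy as np

def int8_matmul(x, w):
    """Int8 linear forward: quantize, integer matmul, rescale."""
    sx = np.abs(x).max() / 127.0          # per-tensor scale for activations
    sw = np.abs(w).max() / 127.0          # per-tensor scale for weights
    qx = np.round(x / sx).astype(np.int8)
    qw = np.round(w / sw).astype(np.int8)
    # Accumulate in int32 to avoid overflow, then rescale back to float
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    return acc.astype(np.float32) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((8, 3)).astype(np.float32)
out = int8_matmul(x, w)
ref = x @ w  # close to out, up to quantization error
```

The speedup comes from the integer matmul; the accuracy cost is the small rounding error visible when comparing `out` against `ref`.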

SwitchBackLinearGlobalMemEfficient

Memory-optimized 8-bit linear layer:
--use-bnb-linear SwitchBackLinearGlobalMemEfficient
Characteristics:
  • Further reduces memory usage
  • Slightly slower than standard version
  • Best for very large models or limited memory
  • Useful when training huge models (ViT-H, ViT-g)

Performance Benefits

Training Speed

ViT-Huge Model:
  • Standard training: baseline
  • Int8 training: ~10% faster (a ~1.1x speedup)
Memory Usage:
  • Reduced weight storage (8-bit vs 16/32-bit)
  • Lower activation memory
  • Enables larger batch sizes
  • Can train larger models on same hardware
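As a rough back-of-the-envelope illustration of the weight-storage savings (the ~632M parameter count for a ViT-H image tower is approximate, and activations, gradients, and optimizer state are excluded):

```python
# Approximate weight memory for a ViT-H-sized image tower
params = 632_000_000           # approximate parameter count, for illustration
fp32_gib = params * 4 / 2**30  # 4 bytes per weight
fp16_gib = params * 2 / 2**30  # 2 bytes per weight
int8_gib = params * 1 / 2**30  # 1 byte per weight
print(f"fp32: {fp32_gib:.2f} GiB, fp16: {fp16_gib:.2f} GiB, int8: {int8_gib:.2f} GiB")
```

Halving weight storage relative to fp16 is what frees headroom for larger batch sizes or larger models on the same hardware.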

Accuracy

Int8 training maintains accuracy:
  • No significant accuracy degradation observed
  • Contrastive learning is robust to quantization
  • Zero-shot performance remains comparable
  • Fine-tuning results are preserved

Examples

Training ViT-B-32 with Int8

python -m open_clip_train.main \
    --train-data "/data/cc12m/train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 8 \
    --model ViT-B-32 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --imagenet-val /data/imagenet/validation/

Training ViT-L-14 with Int8

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion400m/train-{0000..4000}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 8 \
    --model ViT-L-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --warmup 2000 \
    --lr 1e-3 \
    --epochs 32

Training ViT-H-14 with Memory-Efficient Int8

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion2b/train-{00000..20000}.tar" \
    --train-num-samples 2000000000 \
    --dataset-type webdataset \
    --batch-size 128 \
    --precision amp \
    --workers 8 \
    --model ViT-H-14 \
    --use-bnb-linear SwitchBackLinearGlobalMemEfficient \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --accum-freq 2 \
    --warmup 2000 \
    --lr 5e-4 \
    --epochs 32

Combining with Other Optimizations

Int8 training works well with other memory and speed optimizations:

With Mixed Precision

python -m open_clip_train.main \
    --model ViT-L-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --precision amp \
    --train-data "/data/train.tar" \
    --batch-size 256

With Gradient Checkpointing

python -m open_clip_train.main \
    --model ViT-H-14 \
    --use-bnb-linear SwitchBackLinearGlobalMemEfficient \
    --grad-checkpointing \
    --precision amp \
    --train-data "/data/train.tar" \
    --batch-size 128

With Gradient Accumulation

python -m open_clip_train.main \
    --model ViT-H-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --accum-freq 4 \
    --batch-size 64 \
    --precision amp \
    --grad-checkpointing \
    --train-data "/data/train.tar"

With Distributed Training

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-L-14 \
    --use-bnb-linear SwitchBackLinearGlobal \
    --precision amp \
    --local-loss \
    --gather-with-grad \
    --train-data "/data/train.tar" \
    --batch-size 256

Int8 Inference

You can also load and use int8 models for inference:
import torch
import open_clip
from PIL import Image

# Create model with int8 layers (requires bitsandbytes)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)

# Replace linear layers with int8 versions, copying over the pretrained weights
import bitsandbytes as bnb

def replace_linear_with_int8(module):
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            int8_layer = bnb.nn.triton_based_modules.SwitchBackLinearGlobal(
                child.in_features,
                child.out_features,
                bias=child.bias is not None
            )
            # Without this copy the new layer would start with random weights
            int8_layer.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                int8_layer.bias.data.copy_(child.bias.data)
            setattr(module, name, int8_layer)
        else:
            replace_linear_with_int8(child)

replace_linear_with_int8(model)
model.eval()

# Use model for inference
image = preprocess(Image.open("image.jpg")).unsqueeze(0)
text = open_clip.tokenize(["a photo of a cat", "a photo of a dog"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(similarity)

Tutorial Notebook

For a detailed walkthrough of int8 training and inference, see the tutorial notebook:
tutorials/int8_tutorial.ipynb
The notebook covers:
  • Setting up int8 training
  • Comparing performance with standard training
  • Memory usage analysis
  • Accuracy evaluation
  • Inference optimization
  • Best practices

Current Limitations

Attention Layers

Currently, only linear layers are replaced with int8 versions; attention layers still run in standard precision. Planned improvements include:
  • Int8 attention layers
  • Further speedups once attention is refactored
  • Full model quantization

Platform Support

  • Supported: NVIDIA GPUs with CUDA
  • Not Supported: CPU, AMD GPUs, Apple Silicon
  • Requires CUDA-compatible bitsandbytes installation

Optimizer State

Optimizer states (Adam, AdamW) still use higher precision:
  • Int8 only applies to model weights
  • Gradients are computed in higher precision
  • Optimizer momentum and variance use fp32
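To see why optimizer state still dominates per-parameter memory, here is a hedged byte count per parameter under the scheme described above (int8 weight, fp16 gradient, fp32 Adam moments; actual layouts vary by optimizer and precision settings):

```python
weight = 1       # int8 weight, bytes per parameter
grad = 2         # fp16 gradient under mixed precision
exp_avg = 4      # fp32 Adam first moment ("m")
exp_avg_sq = 4   # fp32 Adam second moment ("v")
total = weight + grad + exp_avg + exp_avg_sq
# Optimizer state alone (8 B/param) outweighs the int8 weight (1 B/param) 8:1
```

This is why quantizing weights alone has a ceiling on total memory savings during training, and why 8-bit optimizers are a separate line of work.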

When to Use Int8

  1. Large Models
    • ViT-Huge and larger
    • Models that are close to memory limits
    • When you want to increase batch size
  2. Limited GPU Memory
    • Training on consumer GPUs (RTX 3090, 4090)
    • Maximizing model size on available hardware
    • Enabling larger experiments
  3. Speed-Critical Training
    • When 10% speedup matters
    • Large-scale training runs
    • Cost-sensitive training

Not Necessary For:

  1. Small Models (ViT-B-32, ResNet-50)
    • Limited benefit for smaller models
    • Standard training is already fast enough
  2. Abundant Memory
    • If memory is not a constraint
    • When using small batch sizes
  3. Maximum Precision Needed
    • Research requiring exact reproducibility
    • When numerical precision is critical

Best Practices

  1. Start with SwitchBackLinearGlobal
    • Good default choice for most use cases
    • Balance of speed and memory
  2. Use with Mixed Precision
    • Combine --use-bnb-linear with --precision amp
    • Maximizes speed benefits
  3. Monitor Accuracy
    • Run regular zero-shot evaluations
    • Compare with baseline runs
    • Check final model performance
  4. Test Before Large Runs
    • Validate int8 training on small dataset first
    • Ensure stability and convergence
    • Measure actual speedup on your hardware
  5. Enable for Large Models
    • Most beneficial for ViT-L and larger
    • Use SwitchBackLinearGlobalMemEfficient for ViT-H/ViT-g

Troubleshooting

Import Error

ImportError: cannot import name 'SwitchBackLinearGlobal' from 'bitsandbytes'
Solution: Install or update bitsandbytes:
pip install --upgrade bitsandbytes

CUDA Error

RuntimeError: CUDA error: invalid device function
Solution: Ensure bitsandbytes is installed with correct CUDA version:
pip uninstall bitsandbytes
pip install bitsandbytes --no-cache-dir

Slower Than Expected

  • Ensure CUDA is properly installed
  • Check GPU utilization (should be high)
  • Verify mixed precision is enabled (--precision amp)
  • Some models benefit more than others

Numerical Issues

  • Increase warmup: --warmup 5000
  • Reduce learning rate: --lr 5e-4
  • Enable gradient clipping: --grad-clip-norm 1.0
  • Try SwitchBackLinearGlobal instead of MemEfficient version
