
Overview

The SigLipLoss class implements the sigmoid loss from the paper “Sigmoid Loss for Language Image Pre-Training” (SigLIP). Unlike the softmax-based contrastive loss in standard CLIP, the sigmoid loss:
  • Removes global normalization - no softmax across the batch
  • Computes loss per pair - each image-text pair is treated independently
  • Scales better - more efficient for very large batch sizes
  • Often performs better - frequently achieves better results than the standard CLIP loss
The loss operates on pairwise similarities without requiring global batch statistics, making it particularly suitable for distributed training.
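As a rough sketch (not the library's actual implementation; `sigmoid_loss_sketch` is an illustrative name), the per-pair computation can be written in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

def sigmoid_loss_sketch(img, txt, scale, bias):
    # Pairwise logits: one scalar per (image, text) pair
    logits = scale * img @ txt.T + bias
    n = logits.size(0)
    # Targets: +1 on the diagonal (matching pairs), -1 elsewhere
    labels = 2 * torch.eye(n) - 1
    # Per-pair binary loss: -log sigmoid(label * logit), averaged over images
    return -F.logsigmoid(labels * logits).sum() / n

img = F.normalize(torch.randn(4, 8), dim=-1)
txt = F.normalize(torch.randn(4, 8), dim=-1)
loss = sigmoid_loss_sketch(img, txt, scale=10.0, bias=-10.0)
```

Note that no term depends on a softmax over the whole batch, which is why the loss decomposes cleanly across GPUs.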

Reference

@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

Class Definition

from open_clip import SigLipLoss

Initialization Parameters

cache_labels
bool
default:"False"
If True, caches ground truth labels to avoid recomputing them. Currently not actively used but reserved for future optimization.
rank
int
default:"0"
Current process rank in distributed training.
world_size
int
default:"1"
Total number of processes in distributed training. Set to 1 for single-GPU training.
dist_impl
Optional[str]
default:"None"
Distributed implementation strategy. If None, "bidir" is used. Options:
  • "bidir" - Bidirectional exchange between neighboring ranks (the default behavior)
  • "shift" - Sequential shift pattern through all ranks
  • "reduce" - All-reduce operations
  • "gather" - All-gather operations
The "bidir" strategy is generally the most efficient.

Attributes

  • cache_labels: Whether label caching is enabled
  • rank: Current process rank
  • world_size: Total number of processes
  • dist_impl: Distribution strategy being used
  • prev_num_logits: Cached logits count (for potential future optimizations)
  • labels: Dictionary for cached labels (reserved for future use)

Key Methods

forward

def forward(
    self,
    image_features: torch.Tensor,
    text_features: torch.Tensor,
    logit_scale: torch.Tensor,
    logit_bias: torch.Tensor,
    output_dict: bool = False,
) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
Computes the sigmoid contrastive loss. Parameters:
  • image_features: Normalized image features of shape (batch_size, embed_dim)
  • text_features: Normalized text features of shape (batch_size, embed_dim)
  • logit_scale: Temperature parameter (typically model.logit_scale.exp())
  • logit_bias: Bias term added to logits (SigLIP typically uses a learned bias)
  • output_dict: If True, returns dict with key “contrastive_loss”, else returns scalar
Returns: Sigmoid contrastive loss value. Note: Unlike ClipLoss, the logit_bias parameter is required (not optional) for SigLIP.

get_logits

def get_logits(
    self,
    image_features: torch.Tensor,
    text_features: torch.Tensor,
    logit_scale: torch.Tensor,
    logit_bias: Optional[torch.Tensor] = None
) -> torch.Tensor:
Computes similarity logits between image and text features. Returns: Logits tensor of shape (batch_size, batch_size)
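In effect, the computation is a scaled pairwise similarity matrix plus the bias (a sketch; the actual source may differ in details):

```python
import torch

def get_logits_sketch(image_features, text_features, logit_scale, logit_bias=None):
    # Scaled pairwise similarities, shape (batch_size, batch_size)
    logits = logit_scale * image_features @ text_features.T
    if logit_bias is not None:
        logits = logits + logit_bias
    return logits

img = torch.randn(3, 5)
txt = torch.randn(3, 5)
logits = get_logits_sketch(img, txt, torch.tensor(2.0), torch.tensor(-1.0))
```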

get_ground_truth

def get_ground_truth(
    self,
    device: torch.device,
    dtype: torch.dtype,
    num_logits: int,
    negative_only: bool = False
) -> torch.Tensor:
Generates ground truth labels for sigmoid loss. Parameters:
  • negative_only: If True, returns labels of all -1 (used for cross-GPU negative pairs)
Returns:
  • If negative_only=False: Matrix with +1 on diagonal, -1 elsewhere (matching pairs are positive)
  • If negative_only=True: Matrix of all -1 (all pairs are negative)
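The label construction can be sketched as follows (illustrative, not the library code):

```python
import torch

def get_ground_truth_sketch(num_logits, negative_only=False, dtype=torch.float32):
    # Start with all pairs marked negative (-1)
    labels = -torch.ones(num_logits, num_logits, dtype=dtype)
    if not negative_only:
        # Flip the diagonal to +1: the i-th image matches the i-th text
        labels = labels + 2 * torch.eye(num_logits, dtype=dtype)
    return labels

labels = get_ground_truth_sketch(3)
```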

Usage Example

import torch
from open_clip import create_model, SigLipLoss

# Create model with bias (required for SigLIP)
model = create_model(
    'ViT-B-32',
    init_logit_bias=0.0  # Initialize learnable bias
)

# Create SigLIP loss
loss_fn = SigLipLoss()

# Training loop
images = torch.randn(32, 3, 224, 224)
texts = torch.randint(0, 49408, (32, 77))

# Forward pass
image_features = model.encode_image(images, normalize=True)
text_features = model.encode_text(texts, normalize=True)
logit_scale = model.logit_scale.exp()
logit_bias = model.logit_bias  # Required for SigLIP

# Compute loss
loss = loss_fn(
    image_features,
    text_features,
    logit_scale,
    logit_bias
)
loss.backward()

Distributed Training Example

import torch
import torch.distributed as dist
from open_clip import create_model, SigLipLoss

# Initialize distributed
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create model with bias
model = create_model('ViT-B-32', init_logit_bias=0.0).to(rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

# Create SigLIP loss with bidirectional exchange strategy
loss_fn = SigLipLoss(
    rank=rank,
    world_size=world_size,
    dist_impl='bidir'  # Most efficient for multi-GPU
)

# Training
for images, texts in dataloader:
    images = images.to(rank)
    texts = texts.to(rank)
    
    # Forward
    image_features = model.module.encode_image(images, normalize=True)
    text_features = model.module.encode_text(texts, normalize=True)
    logit_scale = model.module.logit_scale.exp()
    logit_bias = model.module.logit_bias
    
    # Loss automatically handles cross-GPU negatives
    loss = loss_fn(
        image_features,
        text_features,
        logit_scale,
        logit_bias
    )
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Comparing Distribution Strategies

import time
from open_clip import SigLipLoss

# Test different distribution implementations
strategies = ['bidir', 'shift', 'reduce', 'gather']

for strategy in strategies:
    loss_fn = SigLipLoss(
        rank=rank,
        world_size=world_size,
        dist_impl=strategy
    )
    
    start = time.time()
    for _ in range(100):
        # Detach inputs so each iteration builds a fresh autograd graph;
        # calling backward() twice through the same freed graph would error.
        loss = loss_fn(
            image_features.detach().requires_grad_(),
            text_features.detach().requires_grad_(),
            logit_scale.detach(),
            logit_bias.detach(),
        )
        loss.backward()
    
    elapsed = time.time() - start
    print(f"{strategy}: {elapsed:.2f}s per 100 iterations")

# Typically: bidir ≈ shift < reduce < gather

Dictionary Output

loss_fn = SigLipLoss()

loss_dict = loss_fn(
    image_features,
    text_features,
    logit_scale,
    logit_bias,
    output_dict=True
)

print(loss_dict)  # {'contrastive_loss': tensor(0.45)}

Mathematical Formulation

Given normalized image features $I \in \mathbb{R}^{N \times D}$ and text features $T \in \mathbb{R}^{N \times D}$:
  1. Compute logits: $z_{ij} = \tau \cdot (i_i^\top t_j) + b$, where $\tau$ is logit_scale and $b$ is logit_bias
  2. Create targets: $$y_{ij} = \begin{cases} +1 & \text{if } i = j \text{ (matching pair)} \\ -1 & \text{otherwise (negative pair)} \end{cases}$$
  3. Compute loss: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log\sigma(y_{ij} \cdot z_{ij})$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function
  4. In a distributed setting (with world_size > 1):
    • Compute the local loss on diagonal pairs $(i_i, t_i)$
    • Exchange text features with other GPUs
    • Compute cross-GPU losses on off-diagonal pairs
    • Sum all losses
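The formulation above can be checked numerically; a verification sketch with arbitrary values for $\tau$ and $b$, comparing a literal evaluation of the formula against PyTorch's numerically stable log-sigmoid:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D = 4, 16
I = F.normalize(torch.randn(N, D), dim=-1)  # image features
T = F.normalize(torch.randn(N, D), dim=-1)  # text features
tau, b = 10.0, -10.0                        # arbitrary scale / bias

z = tau * I @ T.T + b        # logits z_ij
y = 2 * torch.eye(N) - 1     # targets y_ij
# Stable form via logsigmoid
loss_stable = -F.logsigmoid(y * z).sum() / N
# Literal evaluation of -1/N * sum_ij log sigma(y_ij * z_ij)
loss_direct = -sum(
    math.log(1.0 / (1.0 + math.exp(-(y[i, j] * z[i, j]).item())))
    for i in range(N)
    for j in range(N)
) / N
```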

Key Differences from ClipLoss

| Aspect | ClipLoss | SigLipLoss |
|---|---|---|
| Loss function | Softmax + Cross-Entropy | Sigmoid + Binary Cross-Entropy |
| Normalization | Global (across batch) | Local (per pair) |
| Target labels | Class indices [0, 1, …, N-1] | Binary +1/-1 matrix |
| Logit bias | Optional | Required |
| Batch size scaling | Sublinear improvement | Near-linear improvement |
| Memory efficiency | Lower | Higher |
| Gradient flow | Coupled across batch | Independent per pair |

Advantages of SigLIP

  1. Better scaling: Performance improves more consistently with larger batch sizes
  2. Memory efficient: No need to compute full batch softmax
  3. Simpler gradients: Each pair contributes independently
  4. Improved performance: Often achieves better zero-shot accuracy
  5. Distributed friendly: Natural decomposition for multi-GPU training

Distribution Strategy Guide

bidir (bidirectional exchange) - Default, recommended
  • Exchanges features with left and right neighbors simultaneously
  • Most efficient for typical multi-GPU setups
  • Requires world_size-1 exchanges in (world_size-1)/2 steps
shift - Sequential circular shift
  • Exchanges features in a ring pattern
  • Slightly slower than bidir but simpler
  • Requires world_size-1 sequential steps
reduce - All-reduce based
  • Uses all-reduce to broadcast one GPU’s features at a time
  • Less efficient but works on all hardware
  • Good fallback option
gather - All-gather based
  • Gathers all features to all GPUs
  • Most memory intensive
  • Simplest to understand

Best Practices

  1. Always use logit_bias with SigLIP:
    model = create_model('ViT-B-32', init_logit_bias=0.0)
    
  2. Use bidir strategy for distributed training:
    loss_fn = SigLipLoss(dist_impl='bidir')
    
  3. Normalize features before computing loss:
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    
  4. Scale batch size larger than with ClipLoss:
    • SigLIP benefits more from large batches
    • Aim for 4096+ global batch size if possible
  5. Monitor logit_bias during training:
    print(f"Logit bias: {model.logit_bias.item():.4f}")
    
