
Overview

The SigLipLoss class implements the sigmoid loss from the paper “Sigmoid Loss for Language Image Pre-Training” (SigLIP). Unlike the softmax-based contrastive loss in standard CLIP, the sigmoid loss:
  • Removes global normalization - no softmax across the batch
  • Computes loss per pair - each image-text pair is treated independently
  • Scales better - more efficient for very large batch sizes
  • Often performs better - frequently achieves better results than the standard CLIP loss
The loss operates on pairwise similarities without requiring global batch statistics, making it particularly suitable for distributed training.
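As a rough sketch (not the library's actual implementation; `sigmoid_loss_sketch` is an illustrative name), the per-pair computation can be written in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

def sigmoid_loss_sketch(img, txt, scale, bias):
    # Pairwise logits: one scalar per (image, text) pair
    logits = scale * img @ txt.T + bias
    n = logits.size(0)
    # Targets: +1 on the diagonal (matching pairs), -1 elsewhere
    labels = 2 * torch.eye(n) - 1
    # Per-pair binary loss: -log sigmoid(label * logit), averaged over images
    return -F.logsigmoid(labels * logits).sum() / n

img = F.normalize(torch.randn(4, 8), dim=-1)
txt = F.normalize(torch.randn(4, 8), dim=-1)
loss = sigmoid_loss_sketch(img, txt, scale=10.0, bias=-10.0)
```

Note that no term depends on a softmax over the whole batch, which is why the loss decomposes cleanly across GPUs.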

Reference

@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

Class Definition

from open_clip import SigLipLoss

Initialization Parameters

cache_labels
bool
default:"False"
If True, caches ground truth labels to avoid recomputing them. Currently not actively used but reserved for future optimization.
rank
int
default:"0"
Current process rank in distributed training.
world_size
int
default:"1"
Total number of processes in distributed training. Set to 1 for single-GPU training.
dist_impl
Optional[str]
default:"None"
Distributed implementation strategy. If None, "bidir" is used. Options:
  • "bidir" - Bidirectional exchange between neighboring ranks (the default behavior)
  • "shift" - Sequential shift pattern through all ranks
  • "reduce" - All-reduce operations
  • "gather" - All-gather operations
The "bidir" strategy is generally the most efficient.

Attributes

  • cache_labels: Whether label caching is enabled
  • rank: Current process rank
  • world_size: Total number of processes
  • dist_impl: Distribution strategy being used
  • prev_num_logits: Cached logits count (for potential future optimizations)
  • labels: Dictionary for cached labels (reserved for future use)

Key Methods

forward

def forward(
    self,
    image_features: torch.Tensor,
    text_features: torch.Tensor,
    logit_scale: torch.Tensor,
    logit_bias: torch.Tensor,
    output_dict: bool = False,
) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
Computes the sigmoid contrastive loss. Parameters:
  • image_features: Normalized image features of shape (batch_size, embed_dim)
  • text_features: Normalized text features of shape (batch_size, embed_dim)
  • logit_scale: Temperature parameter (typically model.logit_scale.exp())
  • logit_bias: Bias term added to logits (SigLIP typically uses a learned bias)
  • output_dict: If True, returns dict with key “contrastive_loss”, else returns scalar
Returns: Sigmoid contrastive loss value. Note: Unlike ClipLoss, the logit_bias parameter is required (not optional) for SigLIP.

get_logits

def get_logits(
    self,
    image_features: torch.Tensor,
    text_features: torch.Tensor,
    logit_scale: torch.Tensor,
    logit_bias: Optional[torch.Tensor] = None
) -> torch.Tensor:
Computes similarity logits between image and text features. Returns: Logits tensor of shape (batch_size, batch_size)
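In effect, the computation is a scaled pairwise similarity matrix plus the bias (a sketch; the actual source may differ in details):

```python
import torch

def get_logits_sketch(image_features, text_features, logit_scale, logit_bias=None):
    # Scaled pairwise similarities, shape (batch_size, batch_size)
    logits = logit_scale * image_features @ text_features.T
    if logit_bias is not None:
        logits = logits + logit_bias
    return logits

img = torch.randn(3, 5)
txt = torch.randn(3, 5)
logits = get_logits_sketch(img, txt, torch.tensor(2.0), torch.tensor(-1.0))
```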

get_ground_truth

def get_ground_truth(
    self,
    device: torch.device,
    dtype: torch.dtype,
    num_logits: int,
    negative_only: bool = False
) -> torch.Tensor:
Generates ground truth labels for sigmoid loss. Parameters:
  • negative_only: If True, returns labels of all -1 (used for cross-GPU negative pairs)
Returns:
  • If negative_only=False: Matrix with +1 on diagonal, -1 elsewhere (matching pairs are positive)
  • If negative_only=True: Matrix of all -1 (all pairs are negative)
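The label construction can be sketched as follows (illustrative, not the library code):

```python
import torch

def get_ground_truth_sketch(num_logits, negative_only=False, dtype=torch.float32):
    # Start with all pairs marked negative (-1)
    labels = -torch.ones(num_logits, num_logits, dtype=dtype)
    if not negative_only:
        # Flip the diagonal to +1: the i-th image matches the i-th text
        labels = labels + 2 * torch.eye(num_logits, dtype=dtype)
    return labels

labels = get_ground_truth_sketch(3)
```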

Usage Example

import torch
from open_clip import create_model, SigLipLoss

# Create model with bias (required for SigLIP)
model = create_model(
    'ViT-B-32',
    init_logit_bias=0.0  # Initialize learnable bias
)

# Create SigLIP loss
loss_fn = SigLipLoss()

# Training loop
images = torch.randn(32, 3, 224, 224)
texts = torch.randint(0, 49408, (32, 77))

# Forward pass
image_features = model.encode_image(images, normalize=True)
text_features = model.encode_text(texts, normalize=True)
logit_scale = model.logit_scale.exp()
logit_bias = model.logit_bias  # Required for SigLIP

# Compute loss
loss = loss_fn(
    image_features,
    text_features,
    logit_scale,
    logit_bias
)
loss.backward()

Distributed Training Example

import torch
import torch.distributed as dist
from open_clip import create_model, SigLipLoss

# Initialize distributed
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create model with bias
model = create_model('ViT-B-32', init_logit_bias=0.0).to(rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

# Create SigLIP loss with bidirectional exchange strategy
loss_fn = SigLipLoss(
    rank=rank,
    world_size=world_size,
    dist_impl='bidir'  # Most efficient for multi-GPU
)

# Training
for images, texts in dataloader:
    images = images.to(rank)
    texts = texts.to(rank)
    
    # Forward
    image_features = model.module.encode_image(images, normalize=True)
    text_features = model.module.encode_text(texts, normalize=True)
    logit_scale = model.module.logit_scale.exp()
    logit_bias = model.module.logit_bias
    
    # Loss automatically handles cross-GPU negatives
    loss = loss_fn(
        image_features,
        text_features,
        logit_scale,
        logit_bias
    )
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Comparing Distribution Strategies

import time
from open_clip import SigLipLoss

# Test different distribution implementations
strategies = ['bidir', 'shift', 'reduce', 'gather']

for strategy in strategies:
    loss_fn = SigLipLoss(
        rank=rank,
        world_size=world_size,
        dist_impl=strategy
    )
    
    start = time.time()
    for _ in range(100):
        # Detach inputs so each iteration builds a fresh autograd graph;
        # calling backward() twice through the same freed graph would error.
        loss = loss_fn(
            image_features.detach().requires_grad_(),
            text_features.detach().requires_grad_(),
            logit_scale.detach(),
            logit_bias.detach(),
        )
        loss.backward()
    
    elapsed = time.time() - start
    print(f"{strategy}: {elapsed:.2f}s per 100 iterations")

# Typically: bidir ≈ shift < reduce < gather

Dictionary Output

loss_fn = SigLipLoss()

loss_dict = loss_fn(
    image_features,
    text_features,
    logit_scale,
    logit_bias,
    output_dict=True
)

print(loss_dict)  # {'contrastive_loss': tensor(0.45)}

Mathematical Formulation

Given normalized image features $I \in \mathbb{R}^{N \times D}$ and text features $T \in \mathbb{R}^{N \times D}$:
  1. Compute logits: $z_{ij} = \tau \cdot (i_i^\top t_j) + b$, where $\tau$ is logit_scale and $b$ is logit_bias
  2. Create targets: $$y_{ij} = \begin{cases} +1 & \text{if } i = j \text{ (matching pair)} \\ -1 & \text{otherwise (negative pair)} \end{cases}$$
  3. Compute loss: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log\sigma(y_{ij} \cdot z_{ij})$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function
  4. In a distributed setting (with world_size > 1):
    • Compute the local loss on diagonal pairs $(i_i, t_i)$
    • Exchange text features with other GPUs
    • Compute cross-GPU losses on off-diagonal pairs
    • Sum all losses
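The formulation above can be checked numerically; a verification sketch with arbitrary values for $\tau$ and $b$, comparing a literal evaluation of the formula against PyTorch's numerically stable log-sigmoid:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D = 4, 16
I = F.normalize(torch.randn(N, D), dim=-1)  # image features
T = F.normalize(torch.randn(N, D), dim=-1)  # text features
tau, b = 10.0, -10.0                        # arbitrary scale / bias

z = tau * I @ T.T + b        # logits z_ij
y = 2 * torch.eye(N) - 1     # targets y_ij
# Stable form via logsigmoid
loss_stable = -F.logsigmoid(y * z).sum() / N
# Literal evaluation of -1/N * sum_ij log sigma(y_ij * z_ij)
loss_direct = -sum(
    math.log(1.0 / (1.0 + math.exp(-(y[i, j] * z[i, j]).item())))
    for i in range(N)
    for j in range(N)
) / N
```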

Key Differences from ClipLoss

| Aspect | ClipLoss | SigLipLoss |
|---|---|---|
| Loss function | Softmax + Cross-Entropy | Sigmoid + Binary Cross-Entropy |
| Normalization | Global (across batch) | Local (per pair) |
| Target labels | Class indices [0, 1, …, N-1] | Binary +1/-1 matrix |
| Logit bias | Optional | Required |
| Batch size scaling | Sublinear improvement | Near-linear improvement |
| Memory efficiency | Lower | Higher |
| Gradient flow | Coupled across batch | Independent per pair |

Advantages of SigLIP

  1. Better scaling: Performance improves more consistently with larger batch sizes
  2. Memory efficient: No need to compute full batch softmax
  3. Simpler gradients: Each pair contributes independently
  4. Improved performance: Often achieves better zero-shot accuracy
  5. Distributed friendly: Natural decomposition for multi-GPU training

Distribution Strategy Guide

bidir (bidirectional exchange) - Default, recommended
  • Exchanges features with left and right neighbors simultaneously
  • Most efficient for typical multi-GPU setups
  • Requires world_size-1 exchanges in (world_size-1)/2 steps
shift - Sequential circular shift
  • Exchanges features in a ring pattern
  • Slightly slower than bidir but simpler
  • Requires world_size-1 sequential steps
reduce - All-reduce based
  • Uses all-reduce to broadcast one GPU’s features at a time
  • Less efficient but works on all hardware
  • Good fallback option
gather - All-gather based
  • Gathers all features to all GPUs
  • Most memory intensive
  • Simplest to understand

Best Practices

  1. Always use logit_bias with SigLIP:
    model = create_model('ViT-B-32', init_logit_bias=0.0)
    
  2. Use bidir strategy for distributed training:
    loss_fn = SigLipLoss(dist_impl='bidir')
    
  3. Normalize features before computing loss:
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    
  4. Scale batch size larger than with ClipLoss:
    • SigLIP benefits more from large batches
    • Aim for 4096+ global batch size if possible
  5. Monitor logit_bias during training:
    print(f"Logit bias: {model.logit_bias.item():.4f}")
    
