Overview
The SigLipLoss class implements the sigmoid loss from the paper "Sigmoid Loss for Language Image Pre-Training" (SigLIP). Unlike standard CLIP's softmax-based contrastive loss, SigLIP uses a sigmoid loss that:
- Removes global normalization - no softmax is computed across the batch
- Computes loss per pair - each image-text pair contributes an independent binary term
- Scales better - remains efficient at very large batch sizes
- Improves performance - often achieves better results than the standard CLIP loss
Reference
Class Definition
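A minimal sketch of the class interface, assuming the open_clip-style constructor described in the parameters below; the real import path, defaults, and forward logic may differ.

```python
import torch

class SigLipLoss(torch.nn.Module):
    """Sketch of the SigLipLoss interface (constructor only)."""

    def __init__(self, cache_labels=False, rank=0, world_size=1, dist_impl="bidir"):
        super().__init__()
        self.cache_labels = cache_labels   # reserved for label caching
        self.rank = rank                   # this process's rank
        self.world_size = world_size       # total number of processes
        assert dist_impl in ("bidir", "shift", "reduce", "gather")
        self.dist_impl = dist_impl
        self.prev_num_logits = 0           # cached logits count (unused for now)
        self.labels = {}                   # cached labels (reserved)
```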
Initialization Parameters
- cache_labels: If True, caches ground-truth labels to avoid recomputing them. Currently not actively used but reserved for future optimization.
- rank: Current process rank in distributed training.
- world_size: Total number of processes in distributed training. Set to 1 for single-GPU training.
- dist_impl: Distributed implementation strategy. Options:
  - "bidir" (default) - bidirectional exchange between neighboring ranks
  - "shift" - sequential shift pattern through all ranks
  - "reduce" - all-reduce operations
  - "gather" - all-gather operations
The "bidir" strategy is generally the most efficient.

Attributes
- cache_labels: Whether label caching is enabled
- rank: Current process rank
- world_size: Total number of processes
- dist_impl: Distribution strategy being used
- prev_num_logits: Cached logits count (for potential future optimizations)
- labels: Dictionary for cached labels (reserved for future use)
Key Methods
forward
- image_features: Normalized image features of shape (batch_size, embed_dim)
- text_features: Normalized text features of shape (batch_size, embed_dim)
- logit_scale: Temperature parameter (typically model.logit_scale.exp())
- logit_bias: Bias term added to the logits (SigLIP typically uses a learned bias)
- output_dict: If True, returns a dict with key "contrastive_loss"; otherwise returns a scalar
The logit_bias parameter is required (not optional) for SigLIP.
get_logits
Returns the pairwise logits matrix of shape (batch_size, batch_size).
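The logits computation can be sketched as a standalone function (the method itself lives on the class; this mirrors its behavior for illustration):

```python
import torch
import torch.nn.functional as F

def get_logits(image_features, text_features, logit_scale, logit_bias=None):
    # Pairwise cosine similarities scaled by temperature: (batch, batch)
    logits = logit_scale * image_features @ text_features.T
    if logit_bias is not None:
        logits = logits + logit_bias
    return logits

img = F.normalize(torch.randn(4, 8), dim=-1)
txt = F.normalize(torch.randn(4, 8), dim=-1)
logits = get_logits(img, txt, logit_scale=10.0, logit_bias=-10.0)
```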
get_ground_truth
negative_only: If True, returns labels of all -1 (used for cross-GPU negative pairs)
Returns:
- If negative_only=False: matrix with +1 on the diagonal and -1 elsewhere (matching pairs are positive)
- If negative_only=True: matrix of all -1 (all pairs are negative)
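A minimal sketch of how such a target matrix can be built:

```python
import torch

def get_ground_truth(num_logits, negative_only=False):
    # Start with every pair labelled as a negative (-1).
    labels = -torch.ones(num_logits, num_logits)
    if not negative_only:
        # Flip the diagonal (matching pairs) to +1: -1 + 2 = +1.
        labels = labels + 2 * torch.eye(num_logits)
    return labels
```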
Usage Example
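A self-contained single-GPU sketch; `siglip_loss` stands in for `SigLipLoss.forward` so the snippet runs without the class, and the scale/bias values below mirror SigLIP's typical initialization:

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, logit_scale, logit_bias):
    logits = logit_scale * image_features @ text_features.T + logit_bias
    labels = 2 * torch.eye(len(logits)) - 1   # +1 diagonal, -1 elsewhere
    # -log sigmoid(labels * logits), averaged over the batch dimension
    return -F.logsigmoid(labels * logits).sum() / len(logits)

torch.manual_seed(0)
image_features = F.normalize(torch.randn(8, 64), dim=-1)
text_features = F.normalize(torch.randn(8, 64), dim=-1)
logit_scale = torch.tensor(10.0)    # e.g. model.logit_scale.exp()
logit_bias = torch.tensor(-10.0)    # SigLIP's bias is typically initialized near -10

loss = siglip_loss(image_features, text_features, logit_scale, logit_bias)
```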
Distributed Training Example
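A real multi-GPU run requires torch.distributed; as an illustration of the decomposition, this sketch simulates two "ranks" in a single process, each combining its local loss with negative-only losses against the other rank's text features:

```python
import torch
import torch.nn.functional as F

def local_siglip_loss(img, txt, scale, bias, negative_only=False):
    logits = scale * img @ txt.T + bias
    labels = -torch.ones_like(logits)
    if not negative_only:
        labels = labels + 2 * torch.eye(len(logits))
    return -F.logsigmoid(labels * logits).sum() / len(img)

# Simulate two ranks in one process; each holds its own batch shard.
torch.manual_seed(0)
shards = [(F.normalize(torch.randn(4, 16), dim=-1),
           F.normalize(torch.randn(4, 16), dim=-1)) for _ in range(2)]
scale, bias = torch.tensor(10.0), torch.tensor(-10.0)

total = 0.0
for r, (img, txt) in enumerate(shards):
    # Local block: positives on the diagonal, local negatives off it.
    loss = local_siglip_loss(img, txt, scale, bias)
    # Cross-rank blocks: features "received" from other ranks are all negatives.
    for r2, (_, txt2) in enumerate(shards):
        if r2 != r:
            loss = loss + local_siglip_loss(img, txt2, scale, bias, negative_only=True)
    total = total + loss
```

In a real run, the inner loop is replaced by the neighbor exchange selected via `dist_impl`.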
Comparing Distribution Strategies
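The strategies compute the same loss and differ only in communication pattern, so a crude per-step timing harness is one way to compare them; the commented-out `SigLipLoss` usage is a hypothetical sketch, not a confirmed API:

```python
import time
import torch
import torch.nn.functional as F

def time_loss(loss_fn, steps=5, batch=8, dim=16):
    """Average wall-clock seconds per loss computation."""
    img = F.normalize(torch.randn(batch, dim), dim=-1)
    txt = F.normalize(torch.randn(batch, dim), dim=-1)
    start = time.perf_counter()
    for _ in range(steps):
        loss_fn(img, txt)
    return (time.perf_counter() - start) / steps

# Hypothetical usage against real SigLipLoss instances:
# for impl in ("bidir", "shift", "reduce", "gather"):
#     loss_fn = SigLipLoss(rank=rank, world_size=world_size, dist_impl=impl)
#     print(impl, time_loss(lambda i, t: loss_fn(i, t, logit_scale, logit_bias)))
```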
Dictionary Output
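A sketch of the output_dict behavior, again with a stand-in function so the snippet is self-contained:

```python
import torch
import torch.nn.functional as F

def siglip_forward(img, txt, logit_scale, logit_bias, output_dict=False):
    logits = logit_scale * img @ txt.T + logit_bias
    labels = 2 * torch.eye(len(logits)) - 1
    loss = -F.logsigmoid(labels * logits).sum() / len(logits)
    # output_dict=True wraps the scalar under the "contrastive_loss" key.
    return {"contrastive_loss": loss} if output_dict else loss

img = F.normalize(torch.randn(4, 16), dim=-1)
txt = F.normalize(torch.randn(4, 16), dim=-1)
out = siglip_forward(img, txt, 10.0, -10.0, output_dict=True)
```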
Mathematical Formulation
Given normalized image features $x_i$ and text features $y_j$:

1. Compute logits: $$z_{ij} = t \, (x_i \cdot y_j) + b$$ where $t$ is logit_scale and $b$ is logit_bias
2. Create targets: $$l_{ij} = \begin{cases} +1 & \text{if } i = j \text{ (matching pair)} \\ -1 & \text{otherwise (negative pair)} \end{cases}$$
3. Compute the loss: $$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\left(l_{ij} \, z_{ij}\right)$$ where $\sigma$ is the sigmoid function and $N$ is the batch size
In the distributed setting (world_size > 1):
1. Compute the local loss on the rank's own pairs (positives on the diagonal)
2. Exchange text features with the other GPUs
3. Compute cross-GPU losses, treating all exchanged pairs as negatives
4. Sum all losses
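The formulation above can be checked numerically; the sigmoid loss with ±1 targets is exactly binary cross-entropy with logits on 0/1 targets:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 4, 8
x = F.normalize(torch.randn(n, d), dim=-1)   # image features x_i
y = F.normalize(torch.randn(n, d), dim=-1)   # text features y_j
t, b = 10.0, -10.0                           # logit_scale t, logit_bias b

z = t * (x @ y.T) + b                        # logits z_ij
l = 2 * torch.eye(n) - 1                     # targets l_ij: +1 diagonal, -1 elsewhere
loss = -(F.logsigmoid(l * z)).sum() / n      # -1/N sum log sigmoid(l_ij * z_ij)

# Same quantity via binary cross-entropy with 0/1 targets:
bce = F.binary_cross_entropy_with_logits(z, (l + 1) / 2, reduction="sum") / n
```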
Key Differences from ClipLoss
| Aspect | ClipLoss | SigLipLoss |
|---|---|---|
| Loss function | Softmax + Cross-Entropy | Sigmoid + Binary Cross-Entropy |
| Normalization | Global (across batch) | Local (per pair) |
| Target labels | Class indices [0, 1, …, N-1] | Binary +1/-1 matrix |
| Logit bias | Optional | Typically required |
| Batch size scaling | Sublinear improvement | Near-linear improvement |
| Memory efficiency | Lower | Higher |
| Gradient flow | Coupled across batch | Independent per pair |
Advantages of SigLIP
- Better scaling: Performance improves more consistently with larger batch sizes
- Memory efficient: No need to compute full batch softmax
- Simpler gradients: Each pair contributes independently
- Improved performance: Often achieves better zero-shot accuracy
- Distributed friendly: Natural decomposition for multi-GPU training
Distribution Strategy Guide
bidir (bidirectional exchange) - default, recommended
- Exchanges features with the left and right neighbors simultaneously
- Most efficient for typical multi-GPU setups
- Requires world_size - 1 exchanges in (world_size - 1) / 2 steps

shift (sequential shift)
- Exchanges features in a ring pattern
- Slightly slower than bidir but simpler
- Requires world_size - 1 sequential steps

reduce (all-reduce)
- Uses all-reduce to broadcast one GPU's features at a time
- Less efficient but works on all hardware
- Good fallback option

gather (all-gather)
- Gathers all features to all GPUs
- Most memory intensive
- Simplest to understand
Best Practices
1. Always use logit_bias with SigLIP
2. Use the bidir strategy for distributed training
3. Normalize features before computing the loss
4. Scale the batch size larger than with ClipLoss:
   - SigLIP benefits more from large batches
   - Aim for a global batch size of 4096+ if possible
5. Monitor logit_bias during training
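For the monitoring point, a hypothetical logging helper (the parameter names and initialization values here are assumptions chosen to match SigLIP's typical setup, not a confirmed API):

```python
import math
import torch

# Hypothetical learned temperature/bias parameters, SigLIP-style init.
logit_scale = torch.nn.Parameter(torch.tensor(math.log(10.0)))
logit_bias = torch.nn.Parameter(torch.tensor(-10.0))

def temperature_stats(step):
    # Track that the effective scale stays positive and the bias stays
    # negative; large drifts in either can signal training instability.
    return {"step": step,
            "logit_scale": logit_scale.exp().item(),
            "logit_bias": logit_bias.item()}
```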
Related
- ClipLoss - Standard CLIP contrastive loss
- CLIP - Base model architecture
- Training Guide - Full training examples
- Paper - "Sigmoid Loss for Language Image Pre-Training" (Zhai et al., 2023)
