Overview
The CoCa (Contrastive Captioner) class implements a multimodal model that combines contrastive learning (as in CLIP) with generative captioning. It consists of:
- Image encoder - Encodes images into a latent space and produces token embeddings
- Text encoder - Encodes text into contrastive features for CLIP-style learning
- Multimodal text decoder - Generates captions conditioned on image embeddings
Class Definition
Initialization Parameters
- embed_dim: Dimensionality of the joint embedding space for contrastive image and text features.
- multimodal_cfg: Configuration for the multimodal text decoder. Controls the cross-attention layers that condition text generation on image features.
- text_cfg: Configuration for the unimodal text encoder used for contrastive learning.
- vision_cfg: Configuration for the vision encoder.
- quick_gelu: Use the QuickGELU activation instead of standard GELU.
- init_logit_scale: Initial value for the learned temperature parameter in contrastive learning.
- init_logit_bias: Optional learnable bias term added to the contrastive logits.
- nonscalar_logit_scale: If True, logit_scale has shape [1] instead of [] (a scalar tensor).
- cast_dtype: Precision for model computations (e.g., torch.float16, torch.bfloat16).
- pad_id: Token ID used for padding in the vocabulary.
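The open_clip default for init_logit_scale follows CLIP's convention of log(1/0.07), so the exponentiated temperature starts near 14.3. A minimal numpy sketch (random features and hypothetical shapes, not the library's code) of how the scale and optional bias enter the contrastive logits:

```python
import numpy as np

# Default init follows CLIP: log(1/0.07), so exp(logit_scale) ≈ 14.29.
init_logit_scale = np.log(1 / 0.07)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 512)).astype(np.float32)  # stand-in image features
txt = rng.normal(size=(4, 512)).astype(np.float32)  # stand-in text features

# L2-normalize so the dot product is a cosine similarity.
img /= np.linalg.norm(img, axis=-1, keepdims=True)
txt /= np.linalg.norm(txt, axis=-1, keepdims=True)

# Contrastive logits: temperature-scaled cosine similarity matrix,
# plus the learned bias when init_logit_bias is configured.
logits = np.exp(init_logit_scale) * img @ txt.T  # + logit_bias if configured
```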
Attributes
- visual: Vision encoder module
- text: Unimodal text encoder for contrastive learning
- text_decoder: Multimodal transformer decoder for caption generation
- logit_scale: Learned temperature parameter for contrastive learning
- logit_bias: Optional learned bias for contrastive logits
- pad_id: Padding token ID
- context_length: Maximum sequence length for caption generation
Key Methods
encode_image
Parameters:
- images: Image tensor of shape (batch_size, channels, height, width)
- normalize: If True, L2-normalizes the output features

Returns: Image features of shape (batch_size, embed_dim)
encode_text
Parameters:
- text: Tokenized text tensor of shape (batch_size, context_length)
- normalize: If True, L2-normalizes the output features

Returns: Text features of shape (batch_size, embed_dim)
forward
Parameters:
- image: Image tensor of shape (batch_size, channels, height, width)
- text: Optional tokenized text for teacher-forced caption generation
- image_latent: Optional pre-computed contrastive image features
- image_embs: Optional pre-computed image token embeddings
- output_labels: If True, creates caption labels by shifting the text input

Returns:
- image_features: Contrastive image embeddings (batch_size, embed_dim)
- text_features: Contrastive text embeddings (batch_size, embed_dim) (if text provided)
- logits: Caption generation logits (batch_size, seq_len, vocab_size) (if text provided)
- labels: Ground-truth labels for the caption loss (batch_size, seq_len - 1) (if output_labels=True)
- logit_scale: Exponential of the learned temperature parameter
- logit_bias: Learned bias (if initialized with init_logit_bias)
- image_embs: Image token embeddings (if text not provided)
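The logits/labels pairing above can be sketched with toy tensors (shapes only, not the real model): with output_labels=True the labels are the input tokens shifted by one position, so the logits at position i are scored against token i + 1 by a token-level cross-entropy.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 11, 6
text = torch.randint(1, vocab_size, (2, seq_len))  # toy tokenized captions

# With output_labels=True, labels are the input shifted left by one token.
labels = text[:, 1:]                               # (batch, seq_len - 1)

# Stand-in decoder logits, one vector per label position (random here).
logits = torch.randn(2, seq_len - 1, vocab_size)

# Caption (generative) loss: cross-entropy of logits against shifted labels.
caption_loss = F.cross_entropy(logits.permute(0, 2, 1), labels)
```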
generate
Parameters:
- image: Image tensor to caption
- text: Optional text prefix to continue from
- seq_len: Target sequence length for generation
- max_seq_len: Maximum context length (default 77)
- temperature: Sampling temperature (higher = more random)
- generation_type: One of "beam_search", "top_p", or "top_k"
- top_p: Nucleus sampling parameter (keep tokens within the top-p cumulative probability mass)
- top_k: Top-k sampling parameter (keep the top k tokens)
- pad_token_id: Padding token (default 0)
- eos_token_id: End-of-sequence token (default 49407)
- sot_token_id: Start-of-text token (default 49406)
- num_beams: Number of beams for beam search
- num_beam_groups: Number of beam groups for diverse beam search
- min_seq_len: Minimum generated sequence length
- stopping_criteria: Optional list of stopping criteria
- repetition_penalty: Penalty for repeated tokens (1.0 = no penalty)
- fixed_output_length: If True, pad output to seq_len

Returns: Generated token IDs of shape (batch_size, seq_len)
Note: generate requires the transformers library (pip install transformers).
set_grad_checkpointing
Enables or disables gradient checkpointing across the vision encoder, text encoder, and multimodal decoder, trading recomputation for reduced memory during training.
Usage Example
Contrastive Learning Example
Caption Generation with Custom Parameters
Related
- CLIP - Base contrastive learning model
- CoCaLoss - Combined loss function for CoCa training
- MultimodalTransformer - Decoder architecture
