Overview
Decodes token tensors back into human-readable text. Reverses the tokenization process by converting token IDs to their corresponding text representation.
Function Signature
def decode(output_ids: torch.Tensor) -> str
Parameters
output_ids (torch.Tensor): Tensor of token IDs to decode. Can be 1D (a single sequence) or 2D (a batch of sequences). The tensor is moved from GPU to CPU automatically if needed.
Returns
Decoded text string. Special tokens (SOT, EOT) and padding are included in the output. The </w> BPE markers are converted to spaces.
Examples
Basic decoding
import open_clip
import torch
# Tokenize text
text = "a photo of a cat"
tokens = open_clip.tokenize(text)
# Decode back to text
decoded = open_clip.decode(tokens[0])
print(decoded)
# Output: '<start_of_text> a photo of a cat <end_of_text>'
Decode batch of tokens
# Tokenize multiple texts
texts = ["a cat", "a dog", "a bird"]
tokens = open_clip.tokenize(texts)
# Decode each sequence
for i, token_seq in enumerate(tokens):
    decoded = open_clip.decode(token_seq)
    print(f"Text {i}: {decoded}")
Decode model predictions
import torch
import open_clip
# Load model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)
# Tokenize the text (the model itself is not needed for decoding)
text = "a photo of a cat"
tokens = open_clip.tokenize(text)
# Decode the tokens
decoded = open_clip.decode(tokens[0])
print(f"Original: {text}")
print(f"Decoded: {decoded}")
Handle padding and special tokens
import open_clip
# Short text with padding
text = "cat"
tokens = open_clip.tokenize(text)
print(f"Token shape: {tokens.shape}") # torch.Size([1, 77])
# Decode includes padding (as empty space after EOT)
decoded = open_clip.decode(tokens[0])
print(f"Decoded: '{decoded}'")
# Contains: <start_of_text> cat <end_of_text> followed by padding
Remove special tokens
import open_clip
text = "a photo of a cat"
tokens = open_clip.tokenize(text)
decoded = open_clip.decode(tokens[0])
# Clean up the decoded text
cleaned = decoded.replace('<start_of_text>', '').replace('<end_of_text>', '').strip()
print(f"Cleaned: {cleaned}")
# Output: 'a photo of a cat'
Decode only non-padding tokens
import open_clip
import torch
text = "hello world"
tokens = open_clip.tokenize(text)
# Find non-zero (non-padding) tokens
non_padding_mask = tokens[0] != 0
non_padding_tokens = tokens[0][non_padding_mask]
# Decode only the actual content
decoded = open_clip.decode(non_padding_tokens)
print(decoded)
Decoding Process
The decode function:
- Converts token IDs to BPE subword strings
- Joins the subwords together
- Decodes the byte representation to UTF-8 text
- Replaces </w> markers with spaces
- Handles special tokens like <start_of_text> and <end_of_text>
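The steps above can be sketched in plain Python with a toy vocabulary. This is illustrative only: the real SimpleTokenizer uses the full CLIP BPE vocabulary and a byte-level decoder, and the IDs below (other than the special tokens) are made up:

```python
# Toy ID-to-subword table; the real decoder covers the full BPE vocab
toy_decoder = {
    49406: "<start_of_text>",
    10: "a</w>",
    11: "pho",
    12: "to</w>",
    49407: "<end_of_text>",
}

def toy_decode(token_ids):
    # 1. Convert token IDs to BPE subword strings
    subwords = [toy_decoder[t] for t in token_ids]
    # 2. Join the subwords together
    text = "".join(subwords)
    # 3. (The real tokenizer decodes a byte representation to UTF-8 here)
    # 4. Replace </w> word-boundary markers with spaces
    return text.replace("</w>", " ")

print(toy_decode([49406, 10, 11, 12, 49407]))
# → '<start_of_text>a photo <end_of_text>'
```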
Token ID Reference
| Token | ID | Description |
|---|---|---|
| <start_of_text> | 49406 | Start of sequence marker |
| <end_of_text> | 49407 | End of sequence marker |
| Padding | 0 | Zero padding |
Notes
- The function automatically moves tensors from GPU to CPU for decoding
- Decoded text includes special tokens (<start_of_text>, <end_of_text>)
- Padding tokens (ID: 0) decode to empty strings but may appear as spaces
- BPE word boundaries (</w>) are converted to spaces in the output
- This uses the module-level SimpleTokenizer instance
- For custom tokenizers, call the .decode() method on the tokenizer instance directly
Error Handling
import open_clip
import torch
# Out-of-vocabulary token IDs are not mapped by the tokenizer's decoder
# and will typically raise a KeyError, so guard untrusted input
invalid_tokens = torch.tensor([99999, 100000, 100001])
try:
    decoded = open_clip.decode(invalid_tokens)
    print(f"Decoded: {decoded}")
except KeyError as err:
    print(f"Unknown token ID: {err}")
# Invalid UTF-8 byte sequences within known tokens are replaced with the
# Unicode replacement character rather than raising an error
See Also