Overview

TokenizerModule provides a class-based interface for tokenization. It converts text to token IDs and back, which is essential for preparing text for language models.

When to Use

Use TokenizerModule when:
  • You need manual control over tokenization
  • You’re working outside React components
  • You need low-level token manipulation
  • You want to integrate tokenization into non-React code
Use useTokenizer hook when:
  • Building React components
  • You want automatic lifecycle management
  • You prefer declarative state management
  • You need React state integration

Constructor

new TokenizerModule()
Creates a new tokenizer module instance.

Example

import { TokenizerModule } from 'react-native-executorch';

const tokenizer = new TokenizerModule();

Methods

load()

async load(
  tokenizer: { tokenizerSource: ResourceSource },
  onDownloadProgressCallback?: (progress: number) => void
): Promise<void>
Loads the tokenizer from the specified source.

Parameters

tokenizer.tokenizerSource
ResourceSource
required
Resource location pointing to the tokenizer JSON file.
onDownloadProgressCallback
(progress: number) => void
Optional callback to monitor download progress (value between 0 and 1).

Example

await tokenizer.load(
  { tokenizerSource: 'https://example.com/tokenizer.json' },
  (progress) => {
    console.log(`Download: ${(progress * 100).toFixed(1)}%`);
  }
);

encode()

async encode(input: string): Promise<number[]>
Converts a string into an array of token IDs.

Parameters

input
string
required
The input string to be tokenized.

Returns

An array of token IDs.

Example

const tokens = await tokenizer.encode('Hello, world!');
console.log('Token IDs:', tokens);
// [9906, 11, 1917, 0]

console.log('Number of tokens:', tokens.length);

decode()

async decode(
  tokens: number[],
  skipSpecialTokens?: boolean
): Promise<string>
Converts an array of token IDs into a string.

Parameters

tokens
number[]
required
Array of token IDs to be decoded.
skipSpecialTokens
boolean
default: true
Whether to skip special tokens during decoding (e.g., [PAD], [CLS], [SEP]).

Returns

The decoded string.

Example

const text = await tokenizer.decode([9906, 11, 1917, 0]);
console.log('Decoded text:', text);
// "Hello, world!"

// Include special tokens
const textWithSpecial = await tokenizer.decode([101, 9906, 102], false);
console.log('With special tokens:', textWithSpecial);
// "[CLS] Hello [SEP]"

getVocabSize()

async getVocabSize(): Promise<number>
Returns the size of the tokenizer’s vocabulary.

Returns

The vocabulary size.

Example

const vocabSize = await tokenizer.getVocabSize();
console.log('Vocabulary size:', vocabSize);
// 50257 (for GPT-2 tokenizer)

idToToken()

async idToToken(tokenId: number): Promise<string>
Returns the token string associated with the given ID.

Parameters

tokenId
number
required
ID of the token.

Returns

The token string associated with the ID.

Example

const token = await tokenizer.idToToken(9906);
console.log('Token:', token);
// "Hello"

tokenToId()

async tokenToId(token: string): Promise<number>
Returns the ID associated with the given token string.

Parameters

token
string
required
The token string.

Returns

The ID associated with the token.

Example

const id = await tokenizer.tokenToId('Hello');
console.log('Token ID:', id);
// 9906

Complete Example: Text Analysis

import { TokenizerModule } from 'react-native-executorch';

class TextAnalyzer {
  private tokenizer: TokenizerModule;

  constructor() {
    this.tokenizer = new TokenizerModule();
  }

  async initialize() {
    console.log('Loading tokenizer...');
    await this.tokenizer.load(
      { tokenizerSource: 'https://example.com/tokenizer.json' },
      (progress) => {
        console.log(`Loading: ${(progress * 100).toFixed(0)}%`);
      }
    );
    console.log('Tokenizer ready!');
  }

  async analyzeText(text: string) {
    const tokens = await this.tokenizer.encode(text);
    const decoded = await this.tokenizer.decode(tokens);
    const vocabSize = await this.tokenizer.getVocabSize();
    
    // Get individual tokens
    const tokenStrings = await Promise.all(
      tokens.map(id => this.tokenizer.idToToken(id))
    );
    
    return {
      originalText: text,
      decodedText: decoded,
      tokenCount: tokens.length,
      tokenIds: tokens,
      tokens: tokenStrings,
      vocabSize,
      avgTokenLength: text.length / tokens.length
    };
  }

  async countTokens(text: string): Promise<number> {
    const tokens = await this.tokenizer.encode(text);
    return tokens.length;
  }
}

// Usage
const analyzer = new TextAnalyzer();
await analyzer.initialize();

const result = await analyzer.analyzeText('Hello, how are you doing today?');

console.log('Analysis Results:');
console.log('Original text:', result.originalText);
console.log('Token count:', result.tokenCount);
console.log('Tokens:', result.tokens);
console.log('Token IDs:', result.tokenIds);
console.log('Vocabulary size:', result.vocabSize);
console.log('Avg chars per token:', result.avgTokenLength.toFixed(2));

// Quick token counting
const count = await analyzer.countTokens('This is a test.');
console.log('Tokens in text:', count);

Example: Token Budget Manager

class TokenBudgetManager {
  private tokenizer: TokenizerModule;
  private maxTokens: number;

  constructor(maxTokens: number) {
    this.tokenizer = new TokenizerModule();
    this.maxTokens = maxTokens;
  }

  async initialize() {
    await this.tokenizer.load({
      tokenizerSource: 'https://example.com/tokenizer.json'
    });
  }

  async truncateToFit(text: string): Promise<string> {
    const tokens = await this.tokenizer.encode(text);
    
    if (tokens.length <= this.maxTokens) {
      return text;
    }
    
    // Truncate tokens
    const truncatedTokens = tokens.slice(0, this.maxTokens);
    return await this.tokenizer.decode(truncatedTokens);
  }

  async splitIntoChunks(text: string): Promise<string[]> {
    const tokens = await this.tokenizer.encode(text);
    const chunks: string[] = [];
    
    for (let i = 0; i < tokens.length; i += this.maxTokens) {
      const chunkTokens = tokens.slice(i, i + this.maxTokens);
      const chunkText = await this.tokenizer.decode(chunkTokens);
      chunks.push(chunkText);
    }
    
    return chunks;
  }

  async getRemainingBudget(text: string): Promise<number> {
    const tokens = await this.tokenizer.encode(text);
    return Math.max(0, this.maxTokens - tokens.length);
  }
}

// Usage
const budgetManager = new TokenBudgetManager(100); // Max 100 tokens
await budgetManager.initialize();

const longText = 'Very long text that exceeds the token budget...';

// Truncate to fit
const truncated = await budgetManager.truncateToFit(longText);
console.log('Truncated text:', truncated);

// Split into chunks
const chunks = await budgetManager.splitIntoChunks(longText);
console.log('Chunks:', chunks.length);

// Check remaining budget
const remaining = await budgetManager.getRemainingBudget('Current text');
console.log('Tokens remaining:', remaining);

Example: Vocabulary Explorer

class VocabularyExplorer {
  private tokenizer: TokenizerModule;

  constructor() {
    this.tokenizer = new TokenizerModule();
  }

  async initialize() {
    await this.tokenizer.load({
      tokenizerSource: 'https://example.com/tokenizer.json'
    });
  }

  async findTokenIds(words: string[]): Promise<Map<string, number>> {
    const results = new Map<string, number>();
    
    for (const word of words) {
      const id = await this.tokenizer.tokenToId(word);
      results.set(word, id);
    }
    
    return results;
  }

  async compareTokenization(texts: string[]) {
    const results = [];
    
    for (const text of texts) {
      const tokens = await this.tokenizer.encode(text);
      const tokenStrings = await Promise.all(
        tokens.map(id => this.tokenizer.idToToken(id))
      );
      
      results.push({
        text,
        tokenCount: tokens.length,
        tokens: tokenStrings
      });
    }
    
    return results;
  }

  async exploreRange(startId: number, endId: number) {
    const tokens = [];
    
    for (let id = startId; id <= endId; id++) {
      const token = await this.tokenizer.idToToken(id);
      tokens.push({ id, token });
    }
    
    return tokens;
  }
}

// Usage
const explorer = new VocabularyExplorer();
await explorer.initialize();

// Find specific token IDs
const ids = await explorer.findTokenIds(['hello', 'world', 'AI']);
ids.forEach((id, word) => {
  console.log(`"${word}" -> ${id}`);
});

// Compare tokenization
const comparison = await explorer.compareTokenization([
  'Hello world',
  'Hello, world!',
  'hello world'
]);
comparison.forEach(result => {
  console.log(`"${result.text}" -> ${result.tokenCount} tokens`);
  console.log('  Tokens:', result.tokens);
});

// Explore token range
const range = await explorer.exploreRange(0, 10);
range.forEach(({ id, token }) => {
  console.log(`${id}: "${token}"`);
});

Example: Encode/Decode Utilities

class TokenizerUtils {
  private tokenizer: TokenizerModule;

  constructor() {
    this.tokenizer = new TokenizerModule();
  }

  async initialize() {
    await this.tokenizer.load({
      tokenizerSource: 'https://example.com/tokenizer.json'
    });
  }

  async encodeWithDetails(text: string) {
    const tokens = await this.tokenizer.encode(text);
    
    const details = await Promise.all(
      tokens.map(async (id, index) => ({
        index,
        id,
        token: await this.tokenizer.idToToken(id)
      }))
    );
    
    return {
      text,
      totalTokens: tokens.length,
      details
    };
  }

  async roundTrip(text: string): Promise<boolean> {
    const tokens = await this.tokenizer.encode(text);
    const decoded = await this.tokenizer.decode(tokens);
    return text === decoded;
  }

  async getTokenBoundaries(text: string) {
    const tokens = await this.tokenizer.encode(text);
    const tokenStrings = await Promise.all(
      tokens.map(id => this.tokenizer.idToToken(id))
    );
    
    // Note: subword tokenizers may include marker characters in token strings
    // (e.g. "Ġ" or "##"), so these offsets are approximate and may not map
    // exactly onto the original text.
    const boundaries = [];
    let position = 0;
    
    for (const token of tokenStrings) {
      const start = position;
      const end = position + token.length;
      boundaries.push({ token, start, end });
      position = end;
    }
    
    return boundaries;
  }
}

// Usage
const utils = new TokenizerUtils();
await utils.initialize();

// Detailed encoding
const details = await utils.encodeWithDetails('Hello, world!');
console.log('Encoding details:');
details.details.forEach(d => {
  console.log(`  [${d.index}] ID ${d.id}: "${d.token}"`);
});

// Test round-trip encoding
const isLossless = await utils.roundTrip('Test text');
console.log('Lossless encoding:', isLossless);

// Get token boundaries
const boundaries = await utils.getTokenBoundaries('Hello, world!');
boundaries.forEach(b => {
  console.log(`"${b.token}" at ${b.start}-${b.end}`);
});

Use Cases

  • Token Counting: Count tokens for API limits or cost estimation
  • Text Truncation: Trim text to fit within token budgets
  • Text Chunking: Split long texts into token-based chunks
  • Vocabulary Analysis: Explore tokenizer vocabulary
  • Model Input Preparation: Convert text to token IDs for models
  • Token-level Processing: Manipulate text at the token level
  • Cost Estimation: Estimate API costs based on token usage
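
The cost-estimation use case above reduces to simple arithmetic once you have a token count. A minimal sketch follows; the per-token price is a made-up placeholder (real rates depend on the provider and model), and in practice you would obtain the count from `(await tokenizer.encode(text)).length` or a `countTokens` helper like the one shown earlier.

```typescript
// Hypothetical pricing; substitute your provider's actual rate.
const PRICE_PER_1K_TOKENS_USD = 0.002;

function estimateCostUSD(tokenCount: number): number {
  return (tokenCount / 1000) * PRICE_PER_1K_TOKENS_USD;
}

// With a loaded tokenizer, tokenCount would come from
// (await tokenizer.encode(text)).length.
const cost = estimateCostUSD(1500);
console.log(`Estimated cost: $${cost.toFixed(4)}`);
```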

Common Tokenizers

  • GPT-2/GPT-3: Byte-pair encoding (BPE) with ~50k vocab
  • BERT: WordPiece tokenization with ~30k vocab
  • T5: SentencePiece with ~32k vocab
  • LLaMA: SentencePiece with ~32k vocab

Performance Considerations

  • Tokenization is very fast (typically < 1ms for short texts)
  • Cache tokenized results for frequently used texts
  • Decoding an empty token array returns an empty string immediately
  • Always use the same tokenizer as the model you’re working with
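
The caching advice above can be sketched as a thin memoizing wrapper around `encode()`. `CachedEncoder` is illustrative only, not part of the library API; it keys the cache on the raw input string, which is reasonable because a given tokenizer always maps the same string to the same token IDs.

```typescript
// Illustrative memoizing wrapper around an async encode function.
// Avoids re-tokenizing strings that are encoded repeatedly.
class CachedEncoder {
  private cache = new Map<string, number[]>();

  constructor(private encode: (text: string) => Promise<number[]>) {}

  async encodeCached(text: string): Promise<number[]> {
    const hit = this.cache.get(text);
    if (hit !== undefined) return hit;
    const tokens = await this.encode(text);
    this.cache.set(text, tokens);
    return tokens;
  }
}

// Usage with a loaded TokenizerModule:
// const cached = new CachedEncoder((t) => tokenizer.encode(t));
// const tokens = await cached.encodeCached('Hello, world!');
```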
