## Overview

`TokenizerModule` provides a class-based interface for tokenizer functionality. It converts text to token IDs and back, a step required for most text processing with language models.
## When to Use

Use `TokenizerModule` when:

- You need manual control over tokenization
- You’re working outside React components
- You need low-level token manipulation
- You want to integrate tokenization into non-React code

Use the `useTokenizer` hook when:

- Building React components
- You want automatic lifecycle management
- You prefer declarative state management
- You need React state integration
## Constructor

Creates a new tokenizer module instance.

#### Example

```typescript
import { TokenizerModule } from 'react-native-executorch';

const tokenizer = new TokenizerModule();
```
## Methods

### load()

```typescript
async load(
  tokenizer: { tokenizerSource: ResourceSource },
  onDownloadProgressCallback?: (progress: number) => void
): Promise<void>
```

Loads the tokenizer from the specified source.

#### Parameters

- `tokenizer.tokenizerSource` (`ResourceSource`) - Resource location pointing to the tokenizer JSON file.
- `onDownloadProgressCallback` (`(progress: number) => void`) - Optional callback to monitor download progress (value between 0 and 1).

#### Example

```typescript
await tokenizer.load(
  { tokenizerSource: 'https://example.com/tokenizer.json' },
  (progress) => {
    console.log(`Download: ${(progress * 100).toFixed(1)}%`);
  }
);
```
### encode()

```typescript
async encode(input: string): Promise<number[]>
```

Converts a string into an array of token IDs.

#### Parameters

- `input` (`string`) - The input string to be tokenized.

#### Returns

An array of token IDs.

#### Example

```typescript
const tokens = await tokenizer.encode('Hello, world!');
console.log('Token IDs:', tokens);
// [9906, 11, 1917, 0]

console.log('Number of tokens:', tokens.length);
```
### decode()

```typescript
async decode(
  tokens: number[],
  skipSpecialTokens?: boolean
): Promise<string>
```

Converts an array of token IDs into a string.

#### Parameters

- `tokens` (`number[]`) - Array of token IDs to be decoded.
- `skipSpecialTokens` (`boolean`) - Whether to skip special tokens (e.g., `[PAD]`, `[CLS]`, `[SEP]`) during decoding.

#### Returns

The decoded string.

#### Example

```typescript
const text = await tokenizer.decode([9906, 11, 1917, 0]);
console.log('Decoded text:', text);
// "Hello, world!"

// Include special tokens
const textWithSpecial = await tokenizer.decode([101, 9906, 102], false);
console.log('With special tokens:', textWithSpecial);
// "[CLS] Hello [SEP]"
```
### getVocabSize()

```typescript
async getVocabSize(): Promise<number>
```

Returns the size of the tokenizer’s vocabulary.

#### Returns

The vocabulary size.

#### Example

```typescript
const vocabSize = await tokenizer.getVocabSize();
console.log('Vocabulary size:', vocabSize);
// 50257 (for the GPT-2 tokenizer)
```
### idToToken()

```typescript
async idToToken(tokenId: number): Promise<string>
```

Returns the token string associated with the given ID.

#### Parameters

- `tokenId` (`number`) - The token ID to look up.

#### Returns

The token string associated with the ID.

#### Example

```typescript
const token = await tokenizer.idToToken(9906);
console.log('Token:', token);
// "Hello"
```
### tokenToId()

```typescript
async tokenToId(token: string): Promise<number>
```

Returns the ID associated with the given token string.

#### Parameters

- `token` (`string`) - The token string to look up.

#### Returns

The ID associated with the token.

#### Example

```typescript
const id = await tokenizer.tokenToId('Hello');
console.log('Token ID:', id);
// 9906
```
## Complete Example: Text Analysis

```typescript
import { TokenizerModule } from 'react-native-executorch';

class TextAnalyzer {
  private tokenizer: TokenizerModule;

  constructor() {
    this.tokenizer = new TokenizerModule();
  }

  async initialize() {
    console.log('Loading tokenizer...');
    await this.tokenizer.load(
      { tokenizerSource: 'https://example.com/tokenizer.json' },
      (progress) => {
        console.log(`Loading: ${(progress * 100).toFixed(0)}%`);
      }
    );
    console.log('Tokenizer ready!');
  }

  async analyzeText(text: string) {
    const tokens = await this.tokenizer.encode(text);
    const decoded = await this.tokenizer.decode(tokens);
    const vocabSize = await this.tokenizer.getVocabSize();

    // Get the string form of each individual token
    const tokenStrings = await Promise.all(
      tokens.map(id => this.tokenizer.idToToken(id))
    );

    return {
      originalText: text,
      decodedText: decoded,
      tokenCount: tokens.length,
      tokenIds: tokens,
      tokens: tokenStrings,
      vocabSize,
      avgTokenLength: text.length / tokens.length
    };
  }

  async countTokens(text: string): Promise<number> {
    const tokens = await this.tokenizer.encode(text);
    return tokens.length;
  }
}

// Usage
const analyzer = new TextAnalyzer();
await analyzer.initialize();

const result = await analyzer.analyzeText('Hello, how are you doing today?');
console.log('Analysis Results:');
console.log('Original text:', result.originalText);
console.log('Token count:', result.tokenCount);
console.log('Tokens:', result.tokens);
console.log('Token IDs:', result.tokenIds);
console.log('Vocabulary size:', result.vocabSize);
console.log('Avg chars per token:', result.avgTokenLength.toFixed(2));

// Quick token counting
const count = await analyzer.countTokens('This is a test.');
console.log('Tokens in text:', count);
```
## Example: Token Budget Manager

```typescript
class TokenBudgetManager {
  private tokenizer: TokenizerModule;
  private maxTokens: number;

  constructor(maxTokens: number) {
    this.tokenizer = new TokenizerModule();
    this.maxTokens = maxTokens;
  }

  async initialize() {
    await this.tokenizer.load({
      tokenizerSource: 'https://example.com/tokenizer.json'
    });
  }

  async truncateToFit(text: string): Promise<string> {
    const tokens = await this.tokenizer.encode(text);
    if (tokens.length <= this.maxTokens) {
      return text;
    }
    // Truncate to the first maxTokens tokens
    const truncatedTokens = tokens.slice(0, this.maxTokens);
    return await this.tokenizer.decode(truncatedTokens);
  }

  async splitIntoChunks(text: string): Promise<string[]> {
    const tokens = await this.tokenizer.encode(text);
    const chunks: string[] = [];
    for (let i = 0; i < tokens.length; i += this.maxTokens) {
      const chunkTokens = tokens.slice(i, i + this.maxTokens);
      const chunkText = await this.tokenizer.decode(chunkTokens);
      chunks.push(chunkText);
    }
    return chunks;
  }

  async getRemainingBudget(text: string): Promise<number> {
    const tokens = await this.tokenizer.encode(text);
    return Math.max(0, this.maxTokens - tokens.length);
  }
}

// Usage
const budgetManager = new TokenBudgetManager(100); // Max 100 tokens
await budgetManager.initialize();

const longText = 'Very long text that exceeds the token budget...';

// Truncate to fit
const truncated = await budgetManager.truncateToFit(longText);
console.log('Truncated text:', truncated);

// Split into chunks
const chunks = await budgetManager.splitIntoChunks(longText);
console.log('Chunks:', chunks.length);

// Check remaining budget
const remaining = await budgetManager.getRemainingBudget('Current text');
console.log('Tokens remaining:', remaining);
```
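`splitIntoChunks` produces non-overlapping chunks. When chunks feed a model that benefits from shared context at chunk edges, overlapping windows are a common variation. Below is a minimal sketch over plain token arrays (decoding each window would follow exactly as in `splitIntoChunks`); `chunkWithOverlap` and its parameters are illustrative helpers, not part of the library:

```typescript
// Sketch: split a token array into overlapping windows of at most
// `maxTokens` tokens, where consecutive windows share `overlap` tokens.
// Pure array utility, independent of TokenizerModule.
function chunkWithOverlap(
  tokens: number[],
  maxTokens: number,
  overlap: number
): number[][] {
  if (overlap >= maxTokens) {
    throw new Error('overlap must be smaller than maxTokens');
  }
  const chunks: number[][] = [];
  const step = maxTokens - overlap;
  for (let i = 0; i < tokens.length; i += step) {
    chunks.push(tokens.slice(i, i + maxTokens));
    if (i + maxTokens >= tokens.length) break; // last window covers the tail
  }
  return chunks;
}
```

For 10 tokens with `maxTokens = 4` and `overlap = 2`, this yields windows starting at offsets 0, 2, 4, and 6, so every boundary token appears in two windows.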
## Example: Vocabulary Explorer

```typescript
class VocabularyExplorer {
  private tokenizer: TokenizerModule;

  constructor() {
    this.tokenizer = new TokenizerModule();
  }

  async initialize() {
    await this.tokenizer.load({
      tokenizerSource: 'https://example.com/tokenizer.json'
    });
  }

  async findTokenIds(words: string[]): Promise<Map<string, number>> {
    const results = new Map<string, number>();
    for (const word of words) {
      // Note: a word only has an ID if it exists as a single entry
      // in the vocabulary; subword tokenizers may split it instead
      const id = await this.tokenizer.tokenToId(word);
      results.set(word, id);
    }
    return results;
  }

  async compareTokenization(texts: string[]) {
    const results = [];
    for (const text of texts) {
      const tokens = await this.tokenizer.encode(text);
      const tokenStrings = await Promise.all(
        tokens.map(id => this.tokenizer.idToToken(id))
      );
      results.push({
        text,
        tokenCount: tokens.length,
        tokens: tokenStrings
      });
    }
    return results;
  }

  async exploreRange(startId: number, endId: number) {
    const tokens = [];
    for (let id = startId; id <= endId; id++) {
      const token = await this.tokenizer.idToToken(id);
      tokens.push({ id, token });
    }
    return tokens;
  }
}

// Usage
const explorer = new VocabularyExplorer();
await explorer.initialize();

// Find specific token IDs
const ids = await explorer.findTokenIds(['hello', 'world', 'AI']);
ids.forEach((id, word) => {
  console.log(`"${word}" -> ${id}`);
});

// Compare tokenization
const comparison = await explorer.compareTokenization([
  'Hello world',
  'Hello, world!',
  'hello world'
]);
comparison.forEach(result => {
  console.log(`"${result.text}" -> ${result.tokenCount} tokens`);
  console.log('  Tokens:', result.tokens);
});

// Explore a token range
const range = await explorer.exploreRange(0, 10);
range.forEach(({ id, token }) => {
  console.log(`${id}: "${token}"`);
});
```
## Example: Encode/Decode Utilities

```typescript
class TokenizerUtils {
  private tokenizer: TokenizerModule;

  constructor() {
    this.tokenizer = new TokenizerModule();
  }

  async initialize() {
    await this.tokenizer.load({
      tokenizerSource: 'https://example.com/tokenizer.json'
    });
  }

  async encodeWithDetails(text: string) {
    const tokens = await this.tokenizer.encode(text);
    const details = await Promise.all(
      tokens.map(async (id, index) => ({
        index,
        id,
        token: await this.tokenizer.idToToken(id)
      }))
    );
    return {
      text,
      totalTokens: tokens.length,
      details
    };
  }

  async roundTrip(text: string): Promise<boolean> {
    const tokens = await this.tokenizer.encode(text);
    const decoded = await this.tokenizer.decode(tokens);
    return text === decoded;
  }

  async getTokenBoundaries(text: string) {
    const tokens = await this.tokenizer.encode(text);
    const tokenStrings = await Promise.all(
      tokens.map(id => this.tokenizer.idToToken(id))
    );
    // Note: this assumes the token strings concatenate exactly back to
    // the input; subword tokenizers that use marker characters (e.g.
    // 'Ġ' or '##') make these boundaries approximate
    const boundaries = [];
    let position = 0;
    for (const token of tokenStrings) {
      const start = position;
      const end = position + token.length;
      boundaries.push({ token, start, end });
      position = end;
    }
    return boundaries;
  }
}

// Usage
const utils = new TokenizerUtils();
await utils.initialize();

// Detailed encoding
const details = await utils.encodeWithDetails('Hello, world!');
console.log('Encoding details:');
details.details.forEach(d => {
  console.log(`  [${d.index}] ID ${d.id}: "${d.token}"`);
});

// Test round-trip encoding
const isLossless = await utils.roundTrip('Test text');
console.log('Lossless encoding:', isLossless);

// Get token boundaries
const boundaries = await utils.getTokenBoundaries('Hello, world!');
boundaries.forEach(b => {
  console.log(`"${b.token}" at ${b.start}-${b.end}`);
});
```
## Use Cases

- **Token Counting**: Count tokens for API limits or cost estimation
- **Text Truncation**: Trim text to fit within token budgets
- **Text Chunking**: Split long texts into token-based chunks
- **Vocabulary Analysis**: Explore the tokenizer vocabulary
- **Model Input Preparation**: Convert text to token IDs for models
- **Token-level Processing**: Manipulate text at the token level
- **Cost Estimation**: Estimate API costs based on token usage
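For the cost-estimation use case, the count returned by `encode` plugs into a simple per-token rate calculation. A sketch with made-up model names and rates (real pricing varies by provider and changes over time):

```typescript
// Sketch: estimate request cost from a token count. The rate table
// below is a hypothetical placeholder, not real pricing.
const COST_PER_1K_TOKENS: Record<string, number> = {
  'example-small': 0.0005,
  'example-large': 0.003,
};

function estimateCost(tokenCount: number, model: string): number {
  const rate = COST_PER_1K_TOKENS[model];
  if (rate === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  return (tokenCount / 1000) * rate;
}
```

In practice you would call `const tokens = await tokenizer.encode(prompt)` and pass `tokens.length` as `tokenCount`.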
## Common Tokenizers

- **GPT-2/GPT-3**: Byte-pair encoding (BPE) with a ~50k vocabulary
- **BERT**: WordPiece tokenization with a ~30k vocabulary
- **T5**: SentencePiece with a ~32k vocabulary
- **LLaMA**: SentencePiece with a ~32k vocabulary

## Notes

- Tokenization is fast (typically under 1 ms for short texts)
- Cache tokenized results for frequently used texts
- Decoding an empty array returns an empty string immediately
- Always use the same tokenizer as the model you’re working with
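The caching tip above can be sketched as a small memoizing wrapper. `EncodeCache`, its size limit, and the injected `encodeFn` are illustrative, not part of the library; in real use you would pass `(text) => tokenizer.encode(text)`:

```typescript
// Sketch: memoize an async encode function so repeated texts are
// tokenized only once. Oldest entries are evicted when the cache fills
// (Map preserves insertion order, so the first key is the oldest).
class EncodeCache {
  private cache = new Map<string, number[]>();

  constructor(
    private encodeFn: (text: string) => Promise<number[]>,
    private maxEntries: number = 256
  ) {}

  async encode(text: string): Promise<number[]> {
    const hit = this.cache.get(text);
    if (hit !== undefined) return hit;

    const tokens = await this.encodeFn(text);
    if (this.cache.size >= this.maxEntries) {
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    this.cache.set(text, tokens);
    return tokens;
  }
}
```

This simple first-in-first-out eviction is enough for repeated prompts; a true LRU would additionally re-insert entries on each hit.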
## See Also