
Overview

The Qwen3-TTS-Tokenizer-12Hz is a neural audio codec that converts speech waveforms into discrete codes and reconstructs them back to audio. It serves as the foundation of the Qwen3-TTS system, enabling efficient acoustic compression and high-fidelity speech reconstruction.
The tokenizer operates at a 12 Hz frame rate, representing each second of audio as 12 code frames, which balances compression efficiency against reconstruction quality.

Key Features

Efficient Compression

Achieves ~2000x temporal compression (24kHz audio to 12Hz codes)

High Fidelity

Preserves paralinguistic information and acoustic details

Multi-Codebook

Uses 16 residual quantizers for fine-grained representation

Lightweight

Non-DiT architecture enables high-speed encoding/decoding

Architecture

The tokenizer consists of two main components:

1. Encoder

Based on: Mimi architecture (a modified EnCodec)
Function: Converts raw audio waveforms into discrete codes
Configuration:
  • Input: 24kHz audio waveform
  • Output: Discrete codes of shape (codes_length, 16 quantizers)
  • Codebook Size: 2048 entries per quantizer
  • Frame Rate: 12Hz (2,000 samples per frame)
  • Quantizers: 16 residual vector quantizers
Process:
  1. Convolutional downsampling reduces audio to latent representation
  2. Residual Vector Quantization (RVQ) compresses latents to discrete codes
  3. Each of 16 quantizers captures different levels of acoustic detail
  4. Output is a matrix of shape (T, 16) where T = duration_seconds × 12
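The shape rule in step 4 can be checked with a few lines of arithmetic (`code_shape` is a hypothetical helper, shown only to make the relationship concrete):

```python
# Illustrative check of the encoder's output shape at a 12 Hz frame rate.
SAMPLE_RATE = 24_000   # input sample rate (Hz)
FRAME_RATE = 12        # code frames per second
NUM_QUANTIZERS = 16    # residual vector quantizers

def code_shape(duration_seconds: float) -> tuple[int, int]:
    """Expected (T, 16) code matrix shape for a clip of the given duration."""
    frames = int(duration_seconds * FRAME_RATE)
    return (frames, NUM_QUANTIZERS)

print(code_shape(10.0))  # a 10-second clip -> (120, 16)
```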

2. Decoder

Architecture: Autoregressive transformer with sliding window attention
Function: Reconstructs audio waveforms from discrete codes
Configuration:
  • Hidden Size: 1024
  • Layers: 8 transformer blocks
  • Attention: Sliding window (window size = 72)
  • Heads: 16 attention heads (16 key-value heads)
  • Upsampling: Transposed convolutions with rates (2, 2)
  • Output Rate: 24kHz audio
Process:
  1. Code embeddings are processed by transformer layers
  2. Sliding window attention captures local acoustic context
  3. ConvNeXt blocks refine intermediate representations
  4. Transposed convolutions upsample to original sample rate
  5. Final layer produces 24kHz audio waveform
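Step 3's sliding-window attention can be illustrated with a minimal mask over code frames. This is a sketch only: the window size (72) comes from the configuration above, and the causal direction is assumed from the decoder being autoregressive; the real decoder's masking details may differ.

```python
# Sketch: a causal sliding-window attention mask over decoder frames.
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 72) -> np.ndarray:
    """mask[i, j] is True where frame i may attend to frame j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Attend only to the current frame and the (window - 1) frames before it.
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(5, window=2)
print(mask.astype(int))  # each row attends to at most 2 recent frames
```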

Technical Specifications

Compression Details

| Parameter | Value |
| --- | --- |
| Input Sample Rate | 24,000 Hz |
| Output Sample Rate | 24,000 Hz |
| Frame Rate | 12 Hz |
| Samples per Frame | 2,000 |
| Compression Ratio | ~2000x |
| Number of Quantizers | 16 |
| Codebook Size | 2,048 per quantizer |
| Bits per Frame | 176 bits (16 quantizers × 11 bits) |
| Bitrate | ~2.1 kbps |
The ~2000x compression ratio follows from: 24,000 samples/sec ÷ 12 frames/sec = 2,000 samples per code frame.
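The bit-level figures in the table above reduce to the same few constants; a short arithmetic check:

```python
# Bitrate arithmetic for the specs above: 16 quantizers, 2048-entry codebooks, 12 Hz.
import math

codebook_size = 2048
bits_per_code = int(math.log2(codebook_size))   # 11 bits per code
bits_per_frame = 16 * bits_per_code             # 176 bits per frame
bitrate_bps = bits_per_frame * 12               # 2112 bps, i.e. ~2.1 kbps
samples_per_frame = 24_000 // 12                # 2000 samples per code frame

print(bits_per_code, bits_per_frame, bitrate_bps, samples_per_frame)
```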

Model Parameters

Encoder (from MimiConfig):
  • Based on Kyutai Mimi architecture
  • Multi-scale discriminator for training
  • Perceptual loss functions
Decoder:
  • Hidden dimension: 1024
  • Intermediate size: 3072
  • Decoder output dimension: 1536
  • Total parameters: ~100M

How It Works

Encoding Process

from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0"
)

# Encode audio to discrete codes
codes = tokenizer.encode("input.wav")

duration = sf.info("input.wav").duration  # clip length in seconds
print(f"Input duration: {duration:.2f}s")
print(f"Code shape: {codes[0].shape}")  # (duration*12, 16)
What happens during encoding:
  1. Audio Loading: Input audio is loaded and resampled to 24kHz if needed
  2. Convolutional Encoding: Strided convolutions downsample audio to latent space
  3. Quantization: 16-layer residual vector quantization converts continuous latents to discrete codes
  4. Code Output: Each frame (1/12 second) becomes 16 discrete codes
The encode method accepts multiple input formats:
  • File path: Local audio file (WAV, MP3, FLAC, etc.)
  • URL: HTTP/HTTPS URL to audio file
  • NumPy array: (samples,) or (batch, samples)
  • Tuple: (numpy_array, sample_rate)
  • Base64: Base64-encoded audio string

Decoding Process

# Decode codes back to audio
wavs, sr = tokenizer.decode(codes)

print(f"Output sample rate: {sr}")  # 24000
print(f"Audio shape: {wavs[0].shape}")  # (duration*24000,)

# Save to file
sf.write("output.wav", wavs[0], sr)
What happens during decoding:
  1. Code Embedding: Each of 16 codes is embedded into continuous space
  2. Transformer Processing: Autoregressive transformer refines representations
  3. Sliding Window Attention: Captures local dependencies (window=72 frames)
  4. Upsampling: Transposed convolutions upsample to 24kHz
  5. Waveform Generation: Final layer produces high-quality audio
Accepts codes from encoder:
  • List[torch.LongTensor]: List of code tensors
  • dict: {"audio_codes": [...]} format
  • Shape: (codes_length, 16) per item

Acoustic Compression

Residual Vector Quantization (RVQ)

The tokenizer uses 16 hierarchical quantizers to capture acoustic information:
Layer 1 (Q0):  Coarse acoustic features (fundamental frequency, energy)
Layer 2 (Q1):  Refined pitch and loudness contours  
Layer 3 (Q2):  Spectral envelope (vowel formants)
Layers 4-8 (Q3-Q7):    Fine-grained spectral details
Layers 9-12 (Q8-Q11):  Micro-prosodic features
Layers 13-16 (Q12-Q15): Background acoustics and residuals
How RVQ works:
  1. First Quantizer: Approximates latent vector using 2048 codebook entries
  2. Residual Calculation: Computes difference between approximation and target
  3. Second Quantizer: Quantizes the residual
  4. Iteration: Repeats for all 16 layers
  5. Reconstruction: Sum of all quantized layers reconstructs original signal
Each quantizer captures progressively finer details. Early quantizers (Q0-Q3) are most important for intelligibility, while later quantizers (Q8-Q15) preserve naturalness and speaker characteristics.
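The five steps above can be sketched with toy random codebooks. The real tokenizer's codebooks are learned; this shows only the mechanism, including the identity that the input equals the reconstruction plus the final residual.

```python
# Toy residual vector quantization (RVQ) with random codebooks.
import numpy as np

rng = np.random.default_rng(0)
dim, n_quantizers, codebook_size = 8, 16, 64
codebooks = rng.normal(size=(n_quantizers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Greedy RVQ: each layer quantizes the residual left by the previous one."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest entry
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

x = rng.normal(size=dim)
codes, residual = rvq_encode(x, codebooks)

# Reconstruction is the sum of the selected codebook entries.
recon = sum(cb[i] for cb, i in zip(codebooks, codes))
print(np.allclose(x, recon + residual))  # True: x = reconstruction + final residual
```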

Compression Efficiency

For a 10-second audio clip:
| Format | Size | Compression vs. WAV |
| --- | --- | --- |
| Raw WAV (24 kHz, 16-bit) | 480 KB | 1x (baseline) |
| MP3 (128 kbps) | 160 KB | 3x |
| Qwen3 Codes | 2.64 KB | 182x |
The codes are stored as integers, not compressed audio. For transmission, codes can be further compressed with standard algorithms (gzip, etc.) for additional 2-3x reduction.
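The sizes in the table reduce to simple arithmetic:

```python
# Back-of-envelope sizes for a 10-second mono clip, matching the table above.
duration_s = 10
wav_bytes = 24_000 * 2 * duration_s          # 24 kHz, 16-bit -> 480,000 B
mp3_bytes = 128_000 // 8 * duration_s        # 128 kbps -> 160,000 B
code_bytes = 176 * 12 * duration_s // 8      # 176 bits/frame at 12 Hz -> 2,640 B

print(wav_bytes // 1000, mp3_bytes // 1000, round(wav_bytes / code_bytes))
# 480 160 182
```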

Batch Processing

Batch Encoding

# Encode multiple audio files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
codes_list = tokenizer.encode(files)

# Or encode from NumPy arrays
import numpy as np
audio_arrays = [np.random.randn(24000), np.random.randn(48000)]
codes_list = tokenizer.encode(audio_arrays)

Batch Decoding

# Decode multiple code sequences
wavs, sr = tokenizer.decode(codes_list)

# Save all outputs
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)
Batch processing requires all items to be padded to the same length. The tokenizer handles this automatically but may increase memory usage for variable-length inputs.
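One common pattern, sketched here under the assumption that you record each item's frame count before padding, is to trim decoded waveforms back to their true lengths (the arrays below stand in for real decoder outputs):

```python
# Sketch: trim padded batch-decoded waveforms back to their true lengths.
import numpy as np

frame_samples = 24_000 // 12  # 2000 output samples per code frame

# Hypothetical per-item frame counts recorded before padding (3s, 5s, 2s clips).
frame_counts = [36, 60, 24]
padded_len = max(frame_counts) * frame_samples
wavs = [np.zeros(padded_len) for _ in frame_counts]  # stand-in decoder outputs

# Keep only each item's real samples.
trimmed = [wav[: n * frame_samples] for wav, n in zip(wavs, frame_counts)]
print([len(t) for t in trimmed])  # [72000, 120000, 48000]
```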

Use Cases

Compress audio files to discrete codes for efficient storage:
# Compress entire audio library
import torch
codes = tokenizer.encode("large_audiobook.wav")
torch.save(codes, "audiobook_compressed.pt")

# Later: decompress and play
codes = torch.load("audiobook_compressed.pt")
wavs, sr = tokenizer.decode(codes)
Achieves ~182x compression vs. WAV, ~60x vs. MP3.
Stream audio over networks with minimal bandwidth:
# Sender: encode and transmit codes
codes = tokenizer.encode("speech.wav")
send_over_network(codes)  # Only 2.1 kbps needed

# Receiver: decode and play
codes = receive_from_network()
wavs, sr = tokenizer.decode(codes)
play_audio(wavs[0], sr)
Pre-encode audio datasets for faster TTS training:
# Pre-process training data
for audio_file in training_dataset:
    codes = tokenizer.encode(audio_file)
    save_codes(codes, audio_file.stem + ".codes")

# During training: load codes directly (faster than loading audio)
Analyze speech representations and acoustic properties:
# Encode and analyze codebook usage
codes = tokenizer.encode("speech.wav")

# Analyze which quantizers capture what information
import numpy as np
for q_idx in range(16):
    quantizer_codes = codes[0][:, q_idx].cpu().numpy()
    used = np.unique(quantizer_codes).size
    print(f"Q{q_idx}: {used} distinct codebook entries used")

Performance Characteristics

Latency

| Operation | Latency (GPU) | Latency (CPU) |
| --- | --- | --- |
| Encode 1s | ~10-15ms | ~100-150ms |
| Decode 1s | ~20-30ms | ~200-300ms |
| Round-trip | ~30-45ms | ~300-450ms |
Latencies measured on NVIDIA A100 (GPU) and Intel Xeon (CPU). Actual times vary by hardware.

Quality Metrics

| Metric | Value |
| --- | --- |
| PESQ (Perceptual Quality) | 4.2-4.4 |
| MOS (Mean Opinion Score) | 4.3-4.5 |
| Speaker Similarity | >0.85 |
| Word Error Rate | <1% (with a strong ASR system) |
Quality is comparable to high-bitrate traditional codecs (128 kbps MP3) despite ~60x lower bitrate.

Advanced Usage

Custom Quantizer Subsets

# Load the tokenizer as usual; a quantizer subset is selected later by slicing the codes
tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0"
)

# Encode with all quantizers
codes = tokenizer.encode("input.wav")

# Decode using only first 8 quantizers
partial_codes = [c[:, :8] for c in codes]
wavs, sr = tokenizer.decode(partial_codes)  # Lower quality, faster
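Dropping quantizers also reduces the effective bitrate proportionally; quick arithmetic from the specs above (11 bits per code, 12 frames per second):

```python
# Effective bitrate when keeping only the first k of 16 quantizers.
def bitrate_bps(k: int) -> int:
    return k * 11 * 12  # k codes/frame x 11 bits/code x 12 frames/sec

print(bitrate_bps(16), bitrate_bps(8))  # 2112 1056
```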

Code Manipulation

# Encode two audio files
import torch

codes1 = tokenizer.encode("speaker1.wav")
codes2 = tokenizer.encode("speaker2.wav")

# Mix coarse features from speaker1 with fine details from speaker2
# (both tensors must span the same number of frames, so trim to the shorter)
T = min(codes1[0].shape[0], codes2[0].shape[0])
mixed_codes = torch.cat([
    codes1[0][:T, :4],   # First 4 quantizers from speaker 1
    codes2[0][:T, 4:]    # Last 12 quantizers from speaker 2
], dim=1)

# Decode hybrid codes
wavs, sr = tokenizer.decode([mixed_codes])
Code manipulation is experimental and may produce artifacts. Not all code combinations result in valid audio.

Integration with TTS Models

The tokenizer works seamlessly with TTS models:
from qwen_tts import Qwen3TTSModel, Qwen3TTSTokenizer

# Load model and tokenizer
model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base")
tokenizer = Qwen3TTSTokenizer.from_pretrained("Qwen/Qwen3-TTS-Tokenizer-12Hz")

# Generate speech codes (model output)
codes = model.generate_codes(
    text="Hello world",
    ref_audio="reference.wav",
    ref_text="Reference transcript"
)

# Convert codes to audio using tokenizer
wavs, sr = tokenizer.decode(codes)
TTS models generate codes in the same format as the tokenizer encoder, enabling seamless integration.

Limitations

Important limitations to consider:
  • Music quality: Optimized for speech; music may have artifacts
  • Background noise: Very noisy audio may lose fidelity
  • Extreme pitch: Very high/low pitch may not encode perfectly
  • Non-speech sounds: Best for human speech; other sounds may degrade

Next Steps

Architecture

Learn how the tokenizer fits into the overall system

Voice Cloning

Use the tokenizer for voice cloning applications

API Reference

Detailed API documentation for the tokenizer

Examples

See practical examples and code snippets
