Overview
The Qwen3-TTS-Tokenizer-12Hz is a neural audio codec that converts speech waveforms into discrete codes and reconstructs them back to audio. It serves as the foundation of the Qwen3-TTS system, enabling efficient acoustic compression and high-fidelity speech reconstruction. The tokenizer operates at a 12 Hz frame rate, meaning it processes audio at 12 frames per second, striking a balance between compression efficiency and reconstruction quality.
Key Features
Efficient Compression
Achieves ~2000x compression ratio (24kHz audio to 12Hz codes)
High Fidelity
Preserves paralinguistic information and acoustic details
Multi-Codebook
Uses 16 residual quantizers for fine-grained representation
Lightweight
Non-DiT architecture enables high-speed encoding/decoding
Architecture
The tokenizer consists of two main components:
1. Encoder
Based on: Mimi architecture (modified Encodec)
Function: Converts raw audio waveforms into discrete codes
Configuration:
- Input: 24kHz audio waveform
- Output: Discrete codes (codes_length, 16 quantizers)
- Codebook Size: 2048 entries per quantizer
- Frame Rate: 12Hz (2,000 samples per frame)
- Quantizers: 16 residual vector quantizers
- Convolutional downsampling reduces audio to latent representation
- Residual Vector Quantization (RVQ) compresses latents to discrete codes
- Each of 16 quantizers captures different levels of acoustic detail
- Output is a matrix of shape `(T, 16)`, where T = duration_seconds × 12
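As a quick sanity check on the shapes above, here is a minimal sketch (the constants and the helper name are illustrative, not part of the tokenizer's API):

```python
FRAME_RATE = 12      # code frames per second
NUM_QUANTIZERS = 16  # residual quantizers per frame

def code_shape(duration_seconds: float) -> tuple:
    """Shape (T, 16) of the code matrix for a clip of the given length."""
    return (int(duration_seconds * FRAME_RATE), NUM_QUANTIZERS)

print(code_shape(10.0))  # a 10-second clip yields a (120, 16) code matrix
```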
2. Decoder
Architecture: Autoregressive transformer with sliding window attention
Function: Reconstructs audio waveforms from discrete codes
Configuration:
- Hidden Size: 1024
- Layers: 8 transformer blocks
- Attention: Sliding window (window size = 72)
- Heads: 16 attention heads (16 key-value heads)
- Upsampling: Transposed convolutions with rates (2, 2)
- Output Rate: 24kHz audio
- Code embeddings are processed by transformer layers
- Sliding window attention captures local acoustic context
- ConvNeXt blocks refine intermediate representations
- Transposed convolutions upsample to original sample rate
- Final layer produces 24kHz audio waveform
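To make the attention pattern concrete, the sketch below builds the boolean mask that a causal sliding-window attention with window 72 would use. This is a hand-rolled illustration of the pattern, not the model's actual implementation:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 72) -> np.ndarray:
    """True where frame i may attend to frame j: causal, within the last `window` frames."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(120)
# Frame 0 attends only to itself; frame 119 attends to frames 48..119 (72 frames).
```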
Technical Specifications
Compression Details
| Parameter | Value |
|---|---|
| Input Sample Rate | 24,000 Hz |
| Output Sample Rate | 24,000 Hz |
| Frame Rate | 12 Hz |
| Samples per Frame | 2,000 |
| Compression Ratio | ~2000x |
| Number of Quantizers | 16 |
| Codebook Size | 2,048 per quantizer |
| Bits per Frame | 176 bits (16 quantizers × 11 bits) |
| Bitrate | ~2.1 kbps |
The ~2000x compression ratio follows directly from the sample and frame rates: 24,000 samples/sec ÷ 12 frames/sec = 2,000 samples per frame, so each frame's 16 codes stand in for 2,000 raw samples.
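The same arithmetic, spelled out in plain Python (no tokenizer code involved):

```python
sample_rate = 24_000   # Hz
frame_rate = 12        # Hz
num_quantizers = 16
bits_per_code = 11     # log2(2048) for a 2,048-entry codebook

samples_per_frame = sample_rate // frame_rate    # 2,000 samples per code frame
bits_per_frame = num_quantizers * bits_per_code  # 176 bits
bitrate_bps = bits_per_frame * frame_rate        # 2,112 bps ≈ 2.1 kbps
```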
Model Parameters
Encoder (from MimiConfig):
- Based on Kyutai Mimi architecture
- Multi-scale discriminator for training
- Perceptual loss functions
- Hidden dimension: 1024
- Intermediate size: 3072
- Decoder output dimension: 1536
- Total parameters: ~100M
How It Works
Encoding Process
- Audio Loading: Input audio is loaded and resampled to 24kHz if needed
- Convolutional Encoding: Strided convolutions downsample audio to latent space
- Quantization: 16-layer residual vector quantization converts continuous latents to discrete codes
- Code Output: Each frame (1/12 second) becomes 16 discrete codes
Supported Input Formats
- File path: Local audio file (WAV, MP3, FLAC, etc.)
- URL: HTTP/HTTPS URL to audio file
- NumPy array: `(samples,)` or `(batch, samples)`
- Tuple: `(numpy_array, sample_rate)`
- Base64: Base64-encoded audio string

Output Format
- Discrete codes of shape `(codes_length, 16)` per input
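The array-like input variants can be normalized with a few lines of NumPy. The helper below is purely illustrative (`normalize_audio_input` is not a tokenizer API, and the file-path/URL branches are omitted to keep it self-contained); the base64 branch assumes raw 16-bit PCM, which may differ from the real loader:

```python
import base64
import numpy as np

def normalize_audio_input(audio, default_sr=24_000):
    """Coerce array-like inputs into a (float32 waveform, sample_rate) pair."""
    if isinstance(audio, tuple):           # (numpy_array, sample_rate)
        wav, sr = audio
        return np.asarray(wav, dtype=np.float32), sr
    if isinstance(audio, np.ndarray):      # (samples,) or (batch, samples)
        return audio.astype(np.float32), default_sr
    if isinstance(audio, str):             # base64-encoded raw PCM16 (assumed)
        raw = base64.b64decode(audio)
        wav = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        return wav, default_sr
    raise TypeError(f"unsupported audio input: {type(audio)!r}")
```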
Decoding Process
- Code Embedding: Each of 16 codes is embedded into continuous space
- Transformer Processing: Autoregressive transformer refines representations
- Sliding Window Attention: Captures local dependencies (window=72 frames)
- Upsampling: Transposed convolutions upsample to 24kHz
- Waveform Generation: Final layer produces high-quality audio
Input Format
Accepts codes from the encoder:
- `List[torch.LongTensor]`: List of code tensors
- `dict`: `{"audio_codes": [...]}` format
- Shape: `(codes_length, 16)` per item

Output Format
- Reconstructed 24kHz audio waveform
Acoustic Compression
Residual Vector Quantization (RVQ)
The tokenizer uses 16 hierarchical quantizers to capture acoustic information:
- First Quantizer: Approximates the latent vector using 2048 codebook entries
- Residual Calculation: Computes difference between approximation and target
- Second Quantizer: Quantizes the residual
- Iteration: Repeats for all 16 layers
- Reconstruction: Sum of all quantized layers reconstructs original signal
Each quantizer captures progressively finer details. Early quantizers (Q0-Q3) are most important for intelligibility, while later quantizers (Q8-Q15) preserve naturalness and speaker characteristics.
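The loop described above can be sketched in a few lines of NumPy. The codebooks here are random rather than trained, so reconstruction quality is purely illustrative; the structure of the algorithm is what matters:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NUM_Q, CODEBOOK_SIZE = 8, 16, 2048

# Random stand-in codebooks; the real model learns these during training.
codebooks = rng.normal(size=(NUM_Q, CODEBOOK_SIZE, DIM))

def rvq_encode(latent):
    """Greedy residual VQ: each layer quantizes what previous layers missed."""
    codes, residual = [], latent.copy()
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest entry
        codes.append(idx)
        residual = residual - cb[idx]  # pass the leftover error to the next layer
    return np.array(codes)

def rvq_decode(codes):
    """Reconstruction is the sum of the chosen entries across all 16 layers."""
    return sum(codebooks[q][c] for q, c in enumerate(codes))

latent = rng.normal(size=DIM)
codes = rvq_encode(latent)       # 16 integers in [0, 2048)
recon = rvq_decode(codes)        # approximation of `latent`
```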
Compression Efficiency
For a 10-second audio clip:

| Format | Size | Compression vs. WAV |
|---|---|---|
| Raw WAV (24kHz, 16-bit) | 480 KB | 1x (baseline) |
| MP3 (128 kbps) | 160 KB | 3x |
| Qwen3 Codes | 2.64 KB | 182x |
The codes are stored as integers, not compressed audio. For transmission, codes can be further compressed with standard algorithms (gzip, etc.) for additional 2-3x reduction.
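The table's numbers can be reproduced with simple arithmetic, assuming codes are bit-packed at 11 bits each (consistent with the Bits per Frame figure above):

```python
duration_s = 10
wav_bytes = 24_000 * 2 * duration_s   # 16-bit mono PCM: 480,000 B = 480 KB
num_codes = 12 * duration_s * 16      # 120 frames x 16 codes = 1,920 codes
packed_bytes = num_codes * 11 // 8    # 11 bits/code, packed: 2,640 B ~= 2.64 KB
ratio = wav_bytes / packed_bytes      # ~182x
```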
Batch Processing
Batch Encoding
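A practical concern when encoding a batch is that clips have different lengths; a common approach is to zero-pad waveforms to the longest clip and keep the true lengths for later trimming. A self-contained sketch of that padding step (not the tokenizer's own batching code):

```python
import numpy as np

def pad_batch(waveforms):
    """Zero-pad a list of 1-D waveforms into a (batch, max_len) array."""
    lengths = [len(w) for w in waveforms]
    batch = np.zeros((len(waveforms), max(lengths)), dtype=np.float32)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w
    return batch, lengths

# A 1-second and a 2-second clip at 24 kHz, padded into one batch.
batch, lengths = pad_batch([np.ones(24_000), np.ones(48_000)])
```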
Batch Decoding
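On the decoding side, padded items must be trimmed back to their true durations. At the 12 Hz frame rate, T code frames correspond to T × 2,000 output samples, so the trimming step might look like this (illustrative helper, not a tokenizer API):

```python
import numpy as np

def trim_batch(decoded, code_lengths, samples_per_frame=2_000):
    """Cut each decoded waveform back to its true length in samples."""
    return [wav[: t * samples_per_frame] for wav, t in zip(decoded, code_lengths)]

decoded = [np.zeros(240_000), np.zeros(240_000)]     # padded decoder output
clips = trim_batch(decoded, code_lengths=[120, 60])  # 10 s and 5 s of codes
```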
Use Cases
Audio Compression & Storage
Compress audio files to discrete codes for efficient storage. This achieves ~182x compression vs. WAV and ~60x vs. MP3.
Network Transmission
Stream audio over networks with minimal bandwidth.
Training Data Preparation
Pre-encode audio datasets for faster TTS training.
Speech Codec Research
Analyze speech representations and acoustic properties.
Performance Characteristics
Latency
| Operation | Latency (GPU) | Latency (CPU) |
|---|---|---|
| Encode 1s | ~10-15ms | ~100-150ms |
| Decode 1s | ~20-30ms | ~200-300ms |
| Round-trip | ~30-45ms | ~300-450ms |
Latencies measured on NVIDIA A100 (GPU) and Intel Xeon (CPU). Actual times vary by hardware.
Quality Metrics
| Metric | Value |
|---|---|
| PESQ (Perceptual Quality) | 4.2-4.4 |
| MOS (Mean Opinion Score) | 4.3-4.5 |
| Speaker Similarity | >0.85 |
| Word Error Rate | Less than 1% (with good ASR) |
Quality is comparable to high-bitrate traditional codecs (128 kbps MP3) despite ~60x lower bitrate.
Advanced Usage
Custom Quantizer Subsets
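Because the quantizers are residual and ordered coarse-to-fine, keeping only the first k columns of the code matrix yields a lower-bitrate, lower-fidelity stream. A sketch of the trade-off using plain array slicing (whether the released decoder accepts partial code subsets is not guaranteed here):

```python
import numpy as np

# 10 seconds of codes: 120 frames x 16 quantizers, values in [0, 2048).
codes = np.random.default_rng(0).integers(0, 2048, size=(120, 16))

def subset_bitrate(num_quantizers, bits_per_code=11, frame_rate=12):
    """Bits per second when keeping only the first `num_quantizers` layers."""
    return num_quantizers * bits_per_code * frame_rate

coarse = codes[:, :4]                         # Q0-Q3: intelligibility-critical layers
print(subset_bitrate(4), subset_bitrate(16))  # 528 vs 2112 bps
```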
Code Manipulation
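Because codes are just integer matrices, simple edits such as trimming or splicing clips can be done directly in code space before decoding. For example (pure NumPy; seams at splice points may still be audible after decoding):

```python
import numpy as np

rng = np.random.default_rng(0)
clip_a = rng.integers(0, 2048, size=(120, 16))  # 10 s of codes
clip_b = rng.integers(0, 2048, size=(60, 16))   # 5 s of codes

joined = np.concatenate([clip_a, clip_b], axis=0)  # splice: 15 s of codes
first_3s = joined[: 3 * 12]                        # trim to 3 s (12 frames/s)
```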
Integration with TTS Models
The tokenizer works seamlessly with TTS models: they generate codes in the same format as the tokenizer encoder, so their output can be decoded without any conversion.
Limitations
Next Steps
Architecture
Learn how the tokenizer fits into the overall system
Voice Cloning
Use the tokenizer for voice cloning applications
API Reference
Detailed API documentation for the tokenizer
Examples
See practical examples and code snippets