Qwen3-TTS features a Dual-Track hybrid streaming generation architecture, enabling both streaming and non-streaming generation from a single model, with end-to-end synthesis latency as low as 97ms for real-time interactive applications.

Overview

Streaming generation allows you to:
  • Start receiving audio immediately after input
  • Achieve 97ms first-packet latency
  • Support real-time conversational AI applications
  • Process long-form content efficiently
  • Maintain high audio quality during streaming
All Qwen3-TTS models support streaming generation out of the box. No special configuration required.

Streaming Support by Model

| Model | Streaming Support | First Packet Latency |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice | ✅ Yes | ~97ms |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | ✅ Yes | ~97ms |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | ✅ Yes | ~97ms |
| Qwen3-TTS-12Hz-1.7B-Base | ✅ Yes | ~97ms |
| Qwen3-TTS-12Hz-0.6B-Base | ✅ Yes | ~97ms |

How Streaming Works

Qwen3-TTS uses a Dual-Track hybrid architecture that:
  1. Processes text incrementally - audio codes are generated while the text is still being consumed
  2. Outputs the first audio packet immediately - audio can be emitted after as little as a single character of input
  3. Maintains consistency - the same model handles both streaming and non-streaming generation
  4. Optimizes latency - avoids the information bottlenecks of traditional LM+DiT pipelines
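Conceptually, the pipeline behaves like a generator that yields audio chunks while the text is still being consumed. The following is a toy sketch of that idea only, not the actual qwen_tts API; real chunks would be audio codec tokens, which we stand in for with character codes:

```python
from typing import Iterator, List

def stream_tts_sketch(text: str, chunk_chars: int = 4) -> Iterator[List[int]]:
    """Toy illustration of dual-track streaming: emit an audio-code chunk
    for each small slice of text instead of waiting for the full input."""
    for i in range(0, len(text), chunk_chars):
        piece = text[i:i + chunk_chars]
        # In the real model these would be audio codec tokens; here we
        # simply stand in with character codes.
        yield [ord(c) for c in piece]

# The first chunk is available after only a few characters of input,
# which is what keeps first-packet latency low.
chunks = list(stream_tts_sketch("Hello, world!"))
```

Because each chunk depends only on the text seen so far, the first packet can be produced long before the full input has been processed.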

Enabling Streaming Mode

Set non_streaming_mode=False to enable streaming behavior:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Streaming generation
wavs, sr = model.generate_custom_voice(
    text="Hello, this is a streaming generation test.",
    language="English",
    speaker="Ryan",
    non_streaming_mode=False,  # Enable streaming
)

sf.write("output.wav", wavs[0], sr)
Currently, non_streaming_mode=False simulates streaming behavior but processes the complete text input. True character-by-character streaming input will be supported in a future update.

Streaming with Different Models

CustomVoice Streaming

import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Low-latency streaming
wavs, sr = model.generate_custom_voice(
    text="Welcome to our customer service. How may I help you today?",
    language="English",
    speaker="Ryan",
    non_streaming_mode=False,
)

VoiceDesign Streaming

import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="This is a real-time voice generation test.",
    language="English",
    instruct="Male, professional news anchor, clear and authoritative",
    non_streaming_mode=False,
)

Base Model Streaming

import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

wavs, sr = model.generate_voice_clone(
    text="This message is being generated in real-time with minimal latency.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
    non_streaming_mode=False,
)

Real-Time Application Examples

Interactive Voice Assistant

import torch
from qwen_tts import Qwen3TTSModel
import sounddevice as sd

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

def speak(text: str):
    """Generate and play speech with low latency"""
    wavs, sr = model.generate_custom_voice(
        text=text,
        language="Auto",
        speaker="Ryan",
        non_streaming_mode=False,  # Streaming for low latency
    )
    
    # Play audio immediately
    sd.play(wavs[0], sr)
    sd.wait()

# Real-time responses
speak("Hello! How can I assist you today?")
speak("I'm processing your request now.")
speak("Here are the results you requested.")

Live Commentary System

import torch
from qwen_tts import Qwen3TTSModel
from typing import List

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

def generate_commentary(events: List[str]):
    """Generate live commentary with minimal delay"""
    for event in events:
        wavs, sr = model.generate_custom_voice(
            text=event,
            language="English",
            speaker="Aiden",
            instruct="Excited sports commentator style",
            non_streaming_mode=False,
        )
        # Play immediately as each chunk is generated
        # (play_audio is a placeholder for your playback routine,
        #  e.g. sounddevice as in the previous example)
        play_audio(wavs[0], sr)

events = [
    "And here comes the player with the ball!",
    "What an incredible move!",
    "The crowd is going wild!",
]

generate_commentary(events)

Phone System IVR

import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

def ivr_prompt(message: str):
    """Generate IVR prompts with low latency"""
    wavs, sr = model.generate_custom_voice(
        text=message,
        language="English",
        speaker="Ryan",
        instruct="Clear, professional phone system voice",
        non_streaming_mode=False,
    )
    return wavs[0], sr

# Generate IVR prompts
greeting = ivr_prompt("Thank you for calling. Please select from the following options.")
option1 = ivr_prompt("Press 1 for customer service.")
option2 = ivr_prompt("Press 2 for technical support.")

Performance Optimization

Model Selection

Choose the right model for your latency requirements:
import torch
from qwen_tts import Qwen3TTSModel

# Ultra-low latency: use the 0.6B model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",  # Smaller, faster
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# Better quality, slightly higher latency: Use 1.7B model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",  # Larger, higher quality
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

Hardware Acceleration

import torch
from qwen_tts import Qwen3TTSModel

# Use FlashAttention-2 for best performance
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Critical for performance
)

Generation Parameters

# Optimize for speed
wavs, sr = model.generate_custom_voice(
    text="Your text",
    language="English",
    speaker="Ryan",
    non_streaming_mode=False,
    max_new_tokens=2048,     # Limit max length
    temperature=0.7,         # Lower for faster, more deterministic
    top_k=20,                # Reduce for speed
)

Latency Benchmarks

First Packet Latency

| Model | Streaming Mode | First Packet | Total Time (10s audio) |
|---|---|---|---|
| 1.7B-CustomVoice | Enabled | 97ms | 850ms |
| 0.6B-CustomVoice | Enabled | 95ms | 680ms |
| 1.7B-VoiceDesign | Enabled | 98ms | 920ms |
| 1.7B-Base | Enabled | 97ms | 870ms |
Benchmarks measured on NVIDIA A100 GPU with FlashAttention-2 enabled.
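To get rough numbers on your own hardware, you can wrap any generation call in a simple timer. Note this measures total wall-clock generation time, not true first-packet latency, which would require chunk-level callbacks:

```python
import time
from typing import Any, Callable, Tuple

def time_generation(generate_fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run a generation callable and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = generate_fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Example with a model loaded as in the snippets above:
# (wavs, sr), ms = time_generation(lambda: model.generate_custom_voice(
#     text="Benchmark sentence.", language="English",
#     speaker="Ryan", non_streaming_mode=False,
# ))
# print(f"Total generation time: {ms:.1f} ms")
```

Averaging over several runs, after a warm-up call, gives more stable numbers than a single measurement.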

DashScope API Streaming

For production deployments, use the DashScope API with native streaming support:
from dashscope import SpeechSynthesizer

# Real-time streaming API
response = SpeechSynthesizer.call(
    model='qwen3-tts',
    text='Your text here',
    format='wav',
    sample_rate=24000,
    streaming=True,  # Enable streaming
)

# Process chunks as they arrive
for chunk in response:
    play_audio_chunk(chunk)  # placeholder for your chunk playback routine
See the DashScope API documentation for complete details.

Comparison: Streaming vs Non-Streaming

| Feature | Streaming Mode | Non-Streaming Mode |
|---|---|---|
| First packet latency | ~97ms | Wait for completion |
| Total generation time | Similar | Similar |
| Memory usage | Lower | Higher |
| Best for | Real-time apps | Batch processing |
| Quality | Identical | Identical |

Limitations and Considerations

The current implementation simulates streaming by processing complete text input with optimized latency. True character-by-character streaming input will be available in future updates.
For remote deployments, network latency adds to the ~97ms model latency. Deploy at the edge to minimize total latency.
Streaming mode is optimized for single requests. For batch processing, use non_streaming_mode=True.
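Until true incremental input lands, one practical workaround for long-form content is to split the text at sentence boundaries and synthesize each chunk sequentially, so playback can begin after the first sentence instead of after the whole document. A minimal splitter sketch (the commented generation loop assumes a model loaded as in the examples above, with `play_audio` as a placeholder playback routine):

```python
import re
from typing import List

def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks of at most max_chars so each synthesis call stays short."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# for chunk in split_into_chunks(long_article):
#     wavs, sr = model.generate_custom_voice(
#         text=chunk, language="English", speaker="Ryan",
#         non_streaming_mode=False,
#     )
#     play_audio(wavs[0], sr)  # playback starts after the first chunk
```

Keeping chunks short also bounds per-call memory and makes failures recoverable mid-document; tune `max_chars` to balance call overhead against time-to-first-audio.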
