Ultra-low latency streaming with 97ms first-packet latency
Qwen3-TTS features a Dual-Track hybrid streaming generation architecture that supports both streaming and non-streaming generation from a single model, achieving end-to-end synthesis latency as low as 97ms for real-time interactive applications.
Set `non_streaming_mode=False` to enable streaming behavior:
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Streaming generation
wavs, sr = model.generate_custom_voice(
    text="Hello, this is a streaming generation test.",
    language="English",
    speaker="Ryan",
    non_streaming_mode=False,  # Enable streaming
)
sf.write("output.wav", wavs[0], sr)
```
Currently, `non_streaming_mode=False` simulates streaming behavior but processes the complete text input. True character-by-character streaming input will be supported in a future update.
```python
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Low-latency streaming
wavs, sr = model.generate_custom_voice(
    text="Welcome to our customer service. How may I help you today?",
    language="English",
    speaker="Ryan",
    non_streaming_mode=False,
)
```
```python
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

wavs, sr = model.generate_voice_clone(
    text="This message is being generated in real-time with minimal latency.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
    non_streaming_mode=False,
)
```
```python
import torch
import sounddevice as sd
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

def speak(text: str):
    """Generate and play speech with low latency."""
    wavs, sr = model.generate_custom_voice(
        text=text,
        language="Auto",
        speaker="Ryan",
        non_streaming_mode=False,  # Streaming for low latency
    )
    # Play audio immediately
    sd.play(wavs[0], sr)
    sd.wait()

# Real-time responses
speak("Hello! How can I assist you today?")
speak("I'm processing your request now.")
speak("Here are the results you requested.")
```
```python
import torch
from typing import List
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

def generate_commentary(events: List[str]):
    """Generate live commentary with minimal delay."""
    for event in events:
        wavs, sr = model.generate_custom_voice(
            text=event,
            language="English",
            speaker="Aiden",
            instruct="Excited sports commentator style",
            non_streaming_mode=False,
        )
        # Play immediately as each segment is generated
        # (play_audio is your playback function, e.g. sounddevice.play)
        play_audio(wavs[0], sr)

events = [
    "And here comes the player with the ball!",
    "What an incredible move!",
    "The crowd is going wild!",
]
generate_commentary(events)
```
```python
# Use FlashAttention-2 for best performance
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Critical for performance
)
```
```python
# Optimize for speed
wavs, sr = model.generate_custom_voice(
    text="Your text",
    language="English",
    speaker="Ryan",
    non_streaming_mode=False,
    max_new_tokens=2048,  # Limit maximum output length
    temperature=0.7,      # Lower is faster and more deterministic
    top_k=20,             # Smaller top_k speeds up sampling
)
```
For production deployments, use the DashScope API with native streaming support:
```python
from dashscope import SpeechSynthesizer

# Real-time streaming API
response = SpeechSynthesizer.call(
    model='qwen3-tts',
    text='Your text here',
    format='wav',
    sample_rate=24000,
    streaming=True,  # Enable streaming
)

# Process chunks as they arrive
for chunk in response:
    play_audio_chunk(chunk)
```
See the DashScope API documentation for complete details.
The current implementation simulates streaming by processing complete text input with optimized latency. True character-by-character streaming input will be available in future updates.
Network considerations
For remote deployments, network round-trip time adds on top of the 97ms model latency. Deploy at the edge, close to your users, to minimize total latency.
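As a rough illustration of this budget (the RTT figures below are assumptions for the example, not measurements), end-to-end first-packet latency is simply additive:

```python
MODEL_FIRST_PACKET_MS = 97  # model-side first-packet latency

def end_to_end_latency_ms(network_rtt_ms: float) -> float:
    """Estimate total first-packet latency when the model runs remotely."""
    return MODEL_FIRST_PACKET_MS + network_rtt_ms

# An edge deployment with ~5ms RTT stays close to the model's own latency,
# while a cross-region call with ~80ms RTT nearly doubles it.
print(end_to_end_latency_ms(5))   # → 102
print(end_to_end_latency_ms(80))  # → 177
```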
Batch processing
Streaming mode is optimized for single requests. For batch processing, use `non_streaming_mode=True`.
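As a minimal sketch of a batch workflow (the `batch_texts` helper and batch size are illustrative, not part of the qwen_tts API), group requests and synthesize each one on the non-streaming path:

```python
from typing import List

def batch_texts(texts: List[str], batch_size: int) -> List[List[str]]:
    """Group texts into fixed-size batches for non-streaming synthesis."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# Each batch would then be synthesized with non_streaming_mode=True, e.g.:
# for batch in batch_texts(texts, 8):
#     for text in batch:
#         wavs, sr = model.generate_custom_voice(
#             text=text, language="English", speaker="Ryan",
#             non_streaming_mode=True,  # Throughput-optimized path
#         )

print(batch_texts(["a", "b", "c", "d", "e"], 2))  # → [['a', 'b'], ['c', 'd'], ['e']]
```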