Overview
The TextToSpeech class provides an async interface to the Unmute TTS server. It streams text input and receives synthesized audio in real time with precise timing control.
Class Definition
class TextToSpeech(ServiceWithStartup)
Constructor
def __init__(
self,
tts_instance: str = TTS_SERVER,
recorder: Recorder | None = None,
get_time: Callable[[], float] | None = None,
voice: str | None = None,
)
Initializes the text-to-speech client.
tts_instance
str
default:"TTS_SERVER"
URL of the TTS server instance
recorder
Recorder | None
default:"None"
Optional recorder instance for logging TTS events
get_time
Callable[[], float] | None
default:"None"
Optional callback function to get the current time (for synchronization)
voice
str | None
default:"None"
Voice identifier. Can be a preset voice name or "custom:"-prefixed for custom voice embeddings
Properties
voice
The currently configured voice identifier.
received_samples
Total number of audio samples received from the TTS server.
received_samples_yielded
received_samples_yielded: int
Number of audio samples that have been yielded to the consumer (after buffering).
Core Methods
send
async def send(self, message: str | TTSClientMessage) -> None
Sends text or a message to the TTS server for synthesis.
message
str | TTSClientMessage
required
Text string or structured message to synthesize. Strings are automatically preprocessed to remove unpronounceable characters.
Notes:
- Empty strings are ignored
- String messages are preprocessed by prepare_text_for_tts()
- TTSClientTextMessage bypasses preprocessing
start_up
Establishes WebSocket connection to the TTS server and configures the voice.
Raises:
MissingServiceAtCapacity: If TTS server is at capacity
AssertionError: If connection setup fails
Notes:
- Sends custom voice embeddings if voice starts with "custom:"
- Waits for TTSReadyMessage before considering startup complete
shutdown
Closes the WebSocket connection and records session metrics.
Metrics recorded:
- Active sessions count
- Total audio duration
- Generation duration
state
def state(self) -> WebsocketState
Returns the current WebSocket connection state.
Returns: Literal["not_created", "connecting", "connected", "closing", "closed"]
Async Iterator
__aiter__
async def __aiter__(self) -> AsyncIterator[TTSMessage]
Iterates over synthesized audio and text alignment messages from the TTS server.
Yields:
- TTSAudioMessage: Synthesized audio chunks
- TTSTextMessage: Text alignment with timing information
Notes:
- Audio is buffered and released with an AUDIO_BUFFER_SEC delay (approx. 160 ms)
- Text messages are synchronized with audio playback timing
Example:
async for message in tts:
if isinstance(message, TTSAudioMessage):
# Audio data as list of float32 PCM samples
audio_array = np.array(message.pcm, dtype=np.float32)
# Play or process audio...
elif isinstance(message, TTSTextMessage):
print(f"[{message.start_s:.2f}s - {message.stop_s:.2f}s] {message.text}")
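The ~160 ms buffer noted above works out to a fixed number of withheld samples. A back-of-the-envelope sketch (the constant names mirror this page's notes; they are not imported from the library):

```python
# Buffering arithmetic from the notes: 4 frames of ~40 ms each at 24 kHz.
SAMPLE_RATE = 24_000              # TTS output rate (Hz)
FRAME_SEC = 0.040                 # one audio frame is ~40 ms
AUDIO_BUFFER_SEC = 4 * FRAME_SEC  # 4 frames ~= 160 ms

buffer_samples = round(AUDIO_BUFFER_SEC * SAMPLE_RATE)
print(buffer_samples)  # 3840 samples withheld before audio is yielded
```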
Message Types
Client Messages (sent to server)
TTSClientTextMessage
class TTSClientTextMessage(BaseModel):
type: Literal["Text"] = "Text"
text: str
Text to synthesize.
TTSClientVoiceMessage
class TTSClientVoiceMessage(BaseModel):
type: Literal["Voice"] = "Voice"
embeddings: list[float]
shape: list[int]
Custom voice embeddings.
TTSClientEosMessage
class TTSClientEosMessage(BaseModel):
type: Literal["Eos"] = "Eos"
End of stream signal indicating no more text will be sent.
Server Messages (received from server)
TTSAudioMessage
class TTSAudioMessage(BaseModel):
type: Literal["Audio"]
pcm: list[float]
Synthesized audio chunk in PCM float32 format at 24kHz.
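Because chunks arrive as raw float samples at a known rate, the playback duration of a chunk is simply `len(pcm) / 24000`. An illustrative sketch (`pcm` here is a stand-in for `message.pcm`):

```python
SAMPLE_RATE = 24_000  # per the message format: float32 PCM at 24 kHz

pcm = [0.0] * 1920    # a hypothetical chunk of 1920 samples
chunk_duration_s = len(pcm) / SAMPLE_RATE
print(chunk_duration_s)  # 0.08 -> an 80 ms chunk
```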
TTSTextMessage
class TTSTextMessage(BaseModel):
type: Literal["Text"]
text: str
start_s: float
stop_s: float
Text alignment information with timing.
TTSErrorMessage
class TTSErrorMessage(BaseModel):
type: Literal["Error"]
message: str
Error message from the server.
TTSReadyMessage
class TTSReadyMessage(BaseModel):
type: Literal["Ready"]
Server ready signal.
Helper Functions
prepare_text_for_tts
def prepare_text_for_tts(text: str) -> str
Preprocesses text for better TTS pronunciation.
Transformations:
- Strips leading/trailing whitespace
- Removes unpronounceable characters:
*, _, `
- Normalizes curly quotes to straight quotes
- Removes spaces around colons
Example:
text = prepare_text_for_tts("What's *this* thing?")
# Result: "What's this thing?"
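The transformations listed above can be sketched as a standalone function. This is an illustrative reimplementation of the documented behavior, not the library's actual code:

```python
import re

def prepare_text_for_tts_sketch(text: str) -> str:
    """Approximates the documented preprocessing steps."""
    text = text.strip()                       # strip surrounding whitespace
    for ch in "*_`":                          # drop unpronounceable characters
        text = text.replace(ch, "")
    for curly, straight in {"\u2018": "'", "\u2019": "'",
                            "\u201c": '"', "\u201d": '"'}.items():
        text = text.replace(curly, straight)  # normalize curly quotes
    text = re.sub(r"\s*:\s*", ":", text)      # remove spaces around colons
    return text

print(prepare_text_for_tts_sketch("What's *this* thing?"))
# What's this thing?
```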
Example Usage
Basic Synthesis
import asyncio
from unmute.tts.text_to_speech import TextToSpeech, TTSClientEosMessage
async def synthesize_speech():
tts = TextToSpeech(voice="alloy")
try:
await tts.start_up()
print(f"TTS state: {tts.state()}")
# Send text for synthesis
await tts.send("Hello, world!")
await tts.send("This is a test of the text to speech system.")
await tts.send(TTSClientEosMessage()) # Signal end of input
# Receive synthesized audio
async for message in tts:
if isinstance(message, TTSAudioMessage):
print(f"Received {len(message.pcm)} audio samples")
elif isinstance(message, TTSTextMessage):
print(f"Text timing: '{message.text}' from {message.start_s}s to {message.stop_s}s")
finally:
await tts.shutdown()
asyncio.run(synthesize_speech())
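To persist the synthesized audio, the float samples can be written out as a mono 16-bit WAV at the documented 24 kHz rate. This helper is an illustrative sketch, not part of the library:

```python
import struct
import wave

SAMPLE_RATE = 24_000  # TTS output rate per the docs

def write_pcm_to_wav(path: str, pcm: list[float]) -> None:
    """Write float PCM in [-1, 1] as a mono 16-bit WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        clipped = (max(-1.0, min(1.0, s)) for s in pcm)
        wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))
```

Collecting each `message.pcm` chunk into one list and passing it here yields a playable file.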
Streaming Synthesis
async def stream_synthesis():
tts = TextToSpeech()
await tts.start_up()
async def send_text():
"""Send text word by word"""
words = "The quick brown fox jumps over the lazy dog".split()
for word in words:
await tts.send(word + " ")
await asyncio.sleep(0.1) # Simulate streaming
await tts.send(TTSClientEosMessage())
async def receive_audio():
"""Receive and process audio"""
async for message in tts:
if isinstance(message, TTSAudioMessage):
# Process audio in real-time
pass
await asyncio.gather(send_text(), receive_audio())
await tts.shutdown()
Custom Voice
async def use_custom_voice():
# Custom voice embeddings must be pre-loaded in voice_embeddings_cache
tts = TextToSpeech(voice="custom:my_voice")
await tts.start_up()
await tts.send("Speaking with a custom voice.")
await tts.send(TTSClientEosMessage())
async for message in tts:
pass # Process messages
await tts.shutdown()
Configuration
TtsStreamingQuery
class TtsStreamingQuery(BaseModel):
seed: int | None = None
temperature: float | None = None
top_k: int | None = None
format: str = "PcmMessagePack"
voice: str | None = None
voices: list[str] | None = None
max_seq_len: int | None = None
cfg_alpha: float | None = None # Default: 1.5 in code
auth_id: str | None = None
Query parameters sent to the TTS server during connection.
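Only fields that are actually set need to reach the server. A hypothetical sketch of serializing such a query as a URL query string (the real client may encode this differently):

```python
from urllib.parse import urlencode

def build_query_string(params: dict) -> str:
    """Drop None-valued fields and URL-encode the rest (illustrative)."""
    return urlencode({k: v for k, v in params.items() if v is not None})

qs = build_query_string({"format": "PcmMessagePack", "voice": "alloy", "seed": None})
print(qs)  # format=PcmMessagePack&voice=alloy
```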
Metrics
The class automatically tracks:
TTS_SESSIONS: Total TTS sessions
TTS_ACTIVE_SESSIONS: Active synthesis sessions
TTS_SENT_FRAMES: Text chunks sent
TTS_RECV_FRAMES: Audio chunks received
TTS_RECV_WORDS: Words with timing info received
TTS_TTFT: Time to first token (audio)
TTS_AUDIO_DURATION: Total audio generated
TTS_GEN_DURATION: Total generation time
Notes
- Audio output is 24kHz PCM float32 format
- Audio buffering is approximately 160ms (4 frames × 40ms)
- Text preprocessing improves pronunciation quality
- Custom voices require pre-cached embeddings
- Messages are encoded using MessagePack format
- Connection automatically closes when iteration completes