Overview

The TextToSpeech class provides an async interface to the Unmute TTS server. It streams text input and receives synthesized audio in real time, along with per-word timing alignment.

Class Definition

class TextToSpeech(ServiceWithStartup)

Constructor

def __init__(
    self,
    tts_instance: str = TTS_SERVER,
    recorder: Recorder | None = None,
    get_time: Callable[[], float] | None = None,
    voice: str | None = None,
)
Initializes the text-to-speech client.

Parameters:
  • tts_instance (str, default: TTS_SERVER): URL of the TTS server instance
  • recorder (Recorder | None, default: None): Optional recorder instance for logging TTS events
  • get_time (Callable[[], float] | None, default: None): Optional callback returning the current time, used for synchronization
  • voice (str | None, default: None): Voice identifier. Either a preset voice name or a "custom:"-prefixed identifier for custom voice embeddings

Properties

voice

voice: str | None
The currently configured voice identifier.

received_samples

received_samples: int
Total number of audio samples received from the TTS server.

received_samples_yielded

received_samples_yielded: int
Number of audio samples that have been yielded to the consumer (after buffering).
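The gap between these two counters is the audio currently sitting in the buffer. A minimal sketch of converting that backlog into seconds, assuming the 24kHz output rate stated in the Notes (the helper name is illustrative):

```python
SAMPLE_RATE = 24_000  # the TTS server outputs 24kHz PCM (see Notes)

def buffered_seconds(received_samples: int, received_samples_yielded: int) -> float:
    """Audio received from the server but not yet yielded, in seconds."""
    return (received_samples - received_samples_yielded) / SAMPLE_RATE

# e.g. 3840 samples still buffered corresponds to 0.16 s of audio
```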

Core Methods

send

async def send(self, message: str | TTSClientMessage) -> None
Sends text or a structured message to the TTS server for synthesis.

Parameters:
  • message (str | TTSClientMessage, required): Text string or structured message to synthesize. Strings are automatically preprocessed to remove unpronounceable characters.
Notes:
  • Empty strings are ignored
  • String messages are preprocessed by prepare_text_for_tts()
  • TTSClientTextMessage bypasses preprocessing

start_up

async def start_up(self)
Establishes a WebSocket connection to the TTS server and configures the voice. Raises:
  • MissingServiceAtCapacity: If the TTS server is at capacity
  • AssertionError: If connection setup fails
Notes:
  • Sends custom voice embeddings if voice starts with "custom:"
  • Waits for TTSReadyMessage before considering startup complete

shutdown

async def shutdown(self)
Closes the WebSocket connection and records session metrics. Metrics recorded:
  • Active sessions count
  • Total audio duration
  • Generation duration

state

def state(self) -> WebsocketState
Returns the current WebSocket connection state as a Literal["not_created", "connecting", "connected", "closing", "closed"].

Async Iterator

__aiter__

async def __aiter__(self) -> AsyncIterator[TTSMessage]
Iterates over synthesized audio and text alignment messages from the TTS server. Yields:
  • TTSAudioMessage: Synthesized audio chunks
  • TTSTextMessage: Text alignment with timing information
Notes:
  • Audio is buffered and released with AUDIO_BUFFER_SEC delay (approx. 160ms)
  • Text messages are synchronized with audio playback timing
Example:
async for message in tts:
    if isinstance(message, TTSAudioMessage):
        # Audio data as list of float32 PCM samples
        audio_array = np.array(message.pcm, dtype=np.float32)
        # Play or process audio...
    elif isinstance(message, TTSTextMessage):
        print(f"[{message.start_s:.2f}s - {message.stop_s:.2f}s] {message.text}")

Message Types

Client Messages (sent to server)

TTSClientTextMessage

class TTSClientTextMessage(BaseModel):
    type: Literal["Text"] = "Text"
    text: str
Text to synthesize.

TTSClientVoiceMessage

class TTSClientVoiceMessage(BaseModel):
    type: Literal["Voice"] = "Voice"
    embeddings: list[float]
    shape: list[int]
Custom voice embeddings.

TTSClientEosMessage

class TTSClientEosMessage(BaseModel):
    type: Literal["Eos"] = "Eos"
End of stream signal indicating no more text will be sent.

Server Messages (received from server)

TTSAudioMessage

class TTSAudioMessage(BaseModel):
    type: Literal["Audio"]
    pcm: list[float]
Synthesized audio chunk in PCM float32 format at 24kHz.
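A chunk's playback duration and a 16-bit conversion (common for playback APIs) can be computed from the pcm list alone. A sketch assuming the documented 24kHz rate; the helper names are illustrative:

```python
SAMPLE_RATE = 24_000  # documented output rate

def chunk_duration_s(pcm: list[float]) -> float:
    """Playback duration of one TTSAudioMessage chunk in seconds."""
    return len(pcm) / SAMPLE_RATE

def to_int16(pcm: list[float]) -> list[int]:
    """Convert float PCM in [-1.0, 1.0] to 16-bit integer samples, clamped."""
    return [max(-32768, min(32767, round(s * 32767))) for s in pcm]

# A 40 ms frame is 960 samples at 24 kHz, i.e. chunk_duration_s of 0.04 s
```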

TTSTextMessage

class TTSTextMessage(BaseModel):
    type: Literal["Text"]
    text: str
    start_s: float
    stop_s: float
Text alignment information with timing.
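These alignment messages can be reassembled into the spoken text, for example to drive captions. A minimal sketch using a stand-in dataclass for the message fields (the timing values below are made up):

```python
from dataclasses import dataclass

@dataclass
class Word:
    """Stand-in for TTSTextMessage's text/start_s/stop_s fields."""
    text: str
    start_s: float
    stop_s: float

def to_transcript(words: list[Word]) -> str:
    """Join alignment messages back into the spoken sentence."""
    return " ".join(w.text for w in words)

words = [Word("Hello,", 0.0, 0.32), Word("world!", 0.36, 0.80)]
# to_transcript(words) == "Hello, world!"
```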

TTSErrorMessage

class TTSErrorMessage(BaseModel):
    type: Literal["Error"]
    message: str
Error message from the server.

TTSReadyMessage

class TTSReadyMessage(BaseModel):
    type: Literal["Ready"]
Server ready signal.

Helper Functions

prepare_text_for_tts

def prepare_text_for_tts(text: str) -> str
Preprocesses text for better TTS pronunciation. Transformations:
  • Strips leading/trailing whitespace
  • Removes unpronounceable characters: *, _, `
  • Normalizes curly quotes to straight quotes
  • Removes spaces around colons
Example:
text = prepare_text_for_tts("What's *this* thing?")
# Result: "What's this thing?"
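Based on the transformations listed above, the helper might look roughly like the sketch below. This is an approximation of the documented behavior, not the library's actual code; details such as colon handling may differ.

```python
import re

def prepare_text_for_tts_sketch(text: str) -> str:
    """Approximate the documented preprocessing steps."""
    text = text.strip()                                       # leading/trailing whitespace
    text = text.translate(str.maketrans("", "", "*_`"))       # unpronounceable characters
    text = text.translate(
        str.maketrans("\u2018\u2019\u201c\u201d", "''\"\"")   # curly -> straight quotes
    )
    text = re.sub(r"\s*:\s*", ":", text)                      # spaces around colons
    return text
```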

Example Usage

Basic Synthesis

import asyncio
from unmute.tts.text_to_speech import (
    TextToSpeech,
    TTSAudioMessage,
    TTSClientEosMessage,
    TTSTextMessage,
)

async def synthesize_speech():
    tts = TextToSpeech(voice="alloy")
    
    try:
        await tts.start_up()
        print(f"TTS state: {tts.state()}")
        
        # Send text for synthesis
        await tts.send("Hello, world!")
        await tts.send("This is a test of the text to speech system.")
        await tts.send(TTSClientEosMessage())  # Signal end of input
        
        # Receive synthesized audio
        async for message in tts:
            if isinstance(message, TTSAudioMessage):
                print(f"Received {len(message.pcm)} audio samples")
            elif isinstance(message, TTSTextMessage):
                print(f"Text timing: '{message.text}' from {message.start_s}s to {message.stop_s}s")
    
    finally:
        await tts.shutdown()

asyncio.run(synthesize_speech())

Streaming Synthesis

async def stream_synthesis():
    tts = TextToSpeech()
    await tts.start_up()
    
    async def send_text():
        """Send text word by word"""
        words = "The quick brown fox jumps over the lazy dog".split()
        for word in words:
            await tts.send(word + " ")
            await asyncio.sleep(0.1)  # Simulate streaming
        await tts.send(TTSClientEosMessage())
    
    async def receive_audio():
        """Receive and process audio"""
        async for message in tts:
            if isinstance(message, TTSAudioMessage):
                # Process audio in real-time
                pass
    
    await asyncio.gather(send_text(), receive_audio())
    await tts.shutdown()

Custom Voice

async def use_custom_voice():
    # Custom voice embeddings must be pre-loaded in voice_embeddings_cache
    tts = TextToSpeech(voice="custom:my_voice")
    await tts.start_up()
    
    await tts.send("Speaking with a custom voice.")
    await tts.send(TTSClientEosMessage())
    
    async for message in tts:
        pass  # Process messages
    
    await tts.shutdown()

Configuration

TtsStreamingQuery

class TtsStreamingQuery(BaseModel):
    seed: int | None = None
    temperature: float | None = None
    top_k: int | None = None
    format: str = "PcmMessagePack"
    voice: str | None = None
    voices: list[str] | None = None
    max_seq_len: int | None = None
    cfg_alpha: float | None = None  # Default: 1.5 in code
    auth_id: str | None = None
Query parameters sent to the TTS server during connection.
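Fields left as None are presumably omitted from the request. A sketch of how such parameters could be serialized into the connection URL's query string; the build_query_string helper is illustrative, not part of the library:

```python
from urllib.parse import urlencode

def build_query_string(**params) -> str:
    """Drop unset (None) fields, then URL-encode the rest."""
    return urlencode({k: v for k, v in params.items() if v is not None})

qs = build_query_string(voice="alloy", format="PcmMessagePack", temperature=None)
# qs == "voice=alloy&format=PcmMessagePack"
```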

Metrics

The class automatically tracks:
  • TTS_SESSIONS: Total TTS sessions
  • TTS_ACTIVE_SESSIONS: Active synthesis sessions
  • TTS_SENT_FRAMES: Text chunks sent
  • TTS_RECV_FRAMES: Audio chunks received
  • TTS_RECV_WORDS: Words with timing info received
  • TTS_TTFT: Time to first token (audio)
  • TTS_AUDIO_DURATION: Total audio generated
  • TTS_GEN_DURATION: Total generation time

Notes

  • Audio output is 24kHz PCM float32 format
  • Audio buffering is approximately 160ms (4 frames × 40ms)
  • Text preprocessing improves pronunciation quality
  • Custom voices require pre-cached embeddings
  • Messages are encoded using MessagePack format
  • Connection automatically closes when iteration completes
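The buffering figure from the notes works out as follows (constant names here are illustrative; only the 40 ms frame size, the 4-frame buffer, and the 24 kHz rate come from the documentation above):

```python
FRAME_MS = 40        # one audio frame
N_FRAMES = 4         # frames held back by the buffer
SAMPLE_RATE = 24_000

buffer_ms = FRAME_MS * N_FRAMES                   # 160 ms, matching AUDIO_BUFFER_SEC
buffer_samples = SAMPLE_RATE * buffer_ms // 1000  # samples withheld before yielding
```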
