Overview
The TextToSpeech class provides an async interface to the Unmute TTS server. It streams text input and receives synthesized audio in real time with precise timing control.
Class Definition
class TextToSpeech(ServiceWithStartup)
Constructor
def __init__(
self,
tts_instance: str = TTS_SERVER,
recorder: Recorder | None = None,
get_time: Callable[[], float] | None = None,
voice: str | None = None,
)
Initializes the text-to-speech client.
tts_instance
str
default:"TTS_SERVER"
URL of the TTS server instance
recorder
Recorder | None
default:"None"
Optional recorder instance for logging TTS events
get_time
Callable[[], float] | None
default:"None"
Optional callback function to get the current time (for synchronization)
voice
str | None
default:"None"
Voice identifier. Can be a preset voice name or "custom:"-prefixed for custom voice embeddings
Properties
voice
The currently configured voice identifier.
received_samples
Total number of audio samples received from the TTS server.
received_samples_yielded
received_samples_yielded: int
Number of audio samples that have been yielded to the consumer (after buffering).
Core Methods
send
async def send(self, message: str | TTSClientMessage) -> None
Sends text or a message to the TTS server for synthesis.
message
str | TTSClientMessage
required
Text string or structured message to synthesize. Strings are automatically preprocessed to remove unpronounceable characters.
Notes:
- Empty strings are ignored
- String messages are preprocessed by prepare_text_for_tts()
- TTSClientTextMessage bypasses preprocessing
start_up
Establishes WebSocket connection to the TTS server and configures the voice.
Raises:
MissingServiceAtCapacity: If TTS server is at capacity
AssertionError: If connection setup fails
Notes:
- Sends custom voice embeddings if voice starts with "custom:"
- Waits for TTSReadyMessage before considering startup complete
shutdown
Closes the WebSocket connection and records session metrics.
Metrics recorded:
- Active sessions count
- Total audio duration
- Generation duration
state
def state(self) -> WebsocketState
Returns the current WebSocket connection state.
Returns: Literal["not_created", "connecting", "connected", "closing", "closed"]
Async Iterator
__aiter__
async def __aiter__(self) -> AsyncIterator[TTSMessage]
Iterates over synthesized audio and text alignment messages from the TTS server.
Yields:
- TTSAudioMessage: Synthesized audio chunks
- TTSTextMessage: Text alignment with timing information
Notes:
- Audio is buffered and released with an AUDIO_BUFFER_SEC delay (approx. 160 ms)
- Text messages are synchronized with audio playback timing
Example:
async for message in tts:
if isinstance(message, TTSAudioMessage):
# Audio data as list of float32 PCM samples
audio_array = np.array(message.pcm, dtype=np.float32)
# Play or process audio...
elif isinstance(message, TTSTextMessage):
print(f"[{message.start_s:.2f}s - {message.stop_s:.2f}s] {message.text}")
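The ~160 ms buffer noted above works out to a fixed number of withheld samples. A back-of-the-envelope sketch (the constant names mirror this page's notes; they are not imported from the library):

```python
# Buffering arithmetic from the notes: 4 frames of ~40 ms each at 24 kHz.
SAMPLE_RATE = 24_000              # TTS output rate (Hz)
FRAME_SEC = 0.040                 # one audio frame is ~40 ms
AUDIO_BUFFER_SEC = 4 * FRAME_SEC  # 4 frames ~= 160 ms

buffer_samples = round(AUDIO_BUFFER_SEC * SAMPLE_RATE)
print(buffer_samples)  # 3840 samples withheld before audio is yielded
```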
Message Types
Client Messages (sent to server)
TTSClientTextMessage
class TTSClientTextMessage(BaseModel):
type: Literal["Text"] = "Text"
text: str
Text to synthesize.
TTSClientVoiceMessage
class TTSClientVoiceMessage(BaseModel):
type: Literal["Voice"] = "Voice"
embeddings: list[float]
shape: list[int]
Custom voice embeddings.
TTSClientEosMessage
class TTSClientEosMessage(BaseModel):
type: Literal["Eos"] = "Eos"
End of stream signal indicating no more text will be sent.
Server Messages (received from server)
TTSAudioMessage
class TTSAudioMessage(BaseModel):
type: Literal["Audio"]
pcm: list[float]
Synthesized audio chunk in PCM float32 format at 24kHz.
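Because chunks arrive as raw float samples at a known rate, the playback duration of a chunk is simply `len(pcm) / 24000`. An illustrative sketch (`pcm` here is a stand-in for `message.pcm`):

```python
SAMPLE_RATE = 24_000  # per the message format: float32 PCM at 24 kHz

pcm = [0.0] * 1920    # a hypothetical chunk of 1920 samples
chunk_duration_s = len(pcm) / SAMPLE_RATE
print(chunk_duration_s)  # 0.08 -> an 80 ms chunk
```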
TTSTextMessage
class TTSTextMessage(BaseModel):
type: Literal["Text"]
text: str
start_s: float
stop_s: float
Text alignment information with timing.
TTSErrorMessage
class TTSErrorMessage(BaseModel):
type: Literal["Error"]
message: str
Error message from the server.
TTSReadyMessage
class TTSReadyMessage(BaseModel):
type: Literal["Ready"]
Server ready signal.
Helper Functions
prepare_text_for_tts
def prepare_text_for_tts(text: str) -> str
Preprocesses text for better TTS pronunciation.
Transformations:
- Strips leading/trailing whitespace
- Removes unpronounceable characters:
*, _, `
- Normalizes curly quotes to straight quotes
- Removes spaces around colons
Example:
text = prepare_text_for_tts("What's *this* thing?")
# Result: "What's this thing?"
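The transformations listed above can be sketched as a standalone function. This is an illustrative reimplementation of the documented behavior, not the library's actual code:

```python
import re

def prepare_text_for_tts_sketch(text: str) -> str:
    """Approximates the documented preprocessing steps."""
    text = text.strip()                       # strip surrounding whitespace
    for ch in "*_`":                          # drop unpronounceable characters
        text = text.replace(ch, "")
    for curly, straight in {"\u2018": "'", "\u2019": "'",
                            "\u201c": '"', "\u201d": '"'}.items():
        text = text.replace(curly, straight)  # normalize curly quotes
    text = re.sub(r"\s*:\s*", ":", text)      # remove spaces around colons
    return text

print(prepare_text_for_tts_sketch("What's *this* thing?"))
# What's this thing?
```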
Example Usage
Basic Synthesis
import asyncio
from unmute.tts.text_to_speech import TextToSpeech, TTSClientEosMessage
async def synthesize_speech():
tts = TextToSpeech(voice="alloy")
try:
await tts.start_up()
print(f"TTS state: {tts.state()}")
# Send text for synthesis
await tts.send("Hello, world!")
await tts.send("This is a test of the text to speech system.")
await tts.send(TTSClientEosMessage()) # Signal end of input
# Receive synthesized audio
async for message in tts:
if isinstance(message, TTSAudioMessage):
print(f"Received {len(message.pcm)} audio samples")
elif isinstance(message, TTSTextMessage):
print(f"Text timing: '{message.text}' from {message.start_s}s to {message.stop_s}s")
finally:
await tts.shutdown()
asyncio.run(synthesize_speech())
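To persist the synthesized audio, the float samples can be written out as a mono 16-bit WAV at the documented 24 kHz rate. This helper is an illustrative sketch, not part of the library:

```python
import struct
import wave

SAMPLE_RATE = 24_000  # TTS output rate per the docs

def write_pcm_to_wav(path: str, pcm: list[float]) -> None:
    """Write float PCM in [-1, 1] as a mono 16-bit WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        clipped = (max(-1.0, min(1.0, s)) for s in pcm)
        wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))
```

Collecting each `message.pcm` chunk into one list and passing it here yields a playable file.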
Streaming Synthesis
async def stream_synthesis():
tts = TextToSpeech()
await tts.start_up()
async def send_text():
"""Send text word by word"""
words = "The quick brown fox jumps over the lazy dog".split()
for word in words:
await tts.send(word + " ")
await asyncio.sleep(0.1) # Simulate streaming
await tts.send(TTSClientEosMessage())
async def receive_audio():
"""Receive and process audio"""
async for message in tts:
if isinstance(message, TTSAudioMessage):
# Process audio in real-time
pass
await asyncio.gather(send_text(), receive_audio())
await tts.shutdown()
Custom Voice
async def use_custom_voice():
# Custom voice embeddings must be pre-loaded in voice_embeddings_cache
tts = TextToSpeech(voice="custom:my_voice")
await tts.start_up()
await tts.send("Speaking with a custom voice.")
await tts.send(TTSClientEosMessage())
async for message in tts:
pass # Process messages
await tts.shutdown()
Configuration
TtsStreamingQuery
class TtsStreamingQuery(BaseModel):
seed: int | None = None
temperature: float | None = None
top_k: int | None = None
format: str = "PcmMessagePack"
voice: str | None = None
voices: list[str] | None = None
max_seq_len: int | None = None
cfg_alpha: float | None = None # Default: 1.5 in code
auth_id: str | None = None
Query parameters sent to the TTS server during connection.
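Only fields that are actually set need to reach the server. A hypothetical sketch of serializing such a query as a URL query string (the real client may encode this differently):

```python
from urllib.parse import urlencode

def build_query_string(params: dict) -> str:
    """Drop None-valued fields and URL-encode the rest (illustrative)."""
    return urlencode({k: v for k, v in params.items() if v is not None})

qs = build_query_string({"format": "PcmMessagePack", "voice": "alloy", "seed": None})
print(qs)  # format=PcmMessagePack&voice=alloy
```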
Metrics
The class automatically tracks:
TTS_SESSIONS: Total TTS sessions
TTS_ACTIVE_SESSIONS: Active synthesis sessions
TTS_SENT_FRAMES: Text chunks sent
TTS_RECV_FRAMES: Audio chunks received
TTS_RECV_WORDS: Words with timing info received
TTS_TTFT: Time to first token (audio)
TTS_AUDIO_DURATION: Total audio generated
TTS_GEN_DURATION: Total generation time
Notes
- Audio output is 24kHz PCM float32 format
- Audio buffering is approximately 160ms (4 frames × 40ms)
- Text preprocessing improves pronunciation quality
- Custom voices require pre-cached embeddings
- Messages are encoded using MessagePack format
- Connection automatically closes when iteration completes