Overview
Once you’ve exported a Matcha-TTS model to ONNX format, you can run inference using ONNX Runtime. This enables deployment on various platforms with optimized CPU and GPU execution.

Installation
Install ONNX Runtime for inference:

CPU Inference
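For CPU-only inference, install the standard package from PyPI:

```shell
pip install onnxruntime
```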
GPU Inference
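For GPU inference, install the CUDA-enabled package instead:

```shell
pip install onnxruntime-gpu
```

Install only one of the two packages; `onnxruntime-gpu` also includes the CPU execution provider.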
Basic Inference
Text Input
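A minimal invocation might look like this (a sketch; the model path and text are illustrative, and the entry point is the `matcha/onnx/infer.py` module referenced later in this page):

```shell
python -m matcha.onnx.infer model.onnx --text "Hello world" --output-dir ./outputs
```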
Synthesize speech directly from a string with the --text argument.

File Input
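File input follows the same pattern (a sketch; `input.txt` is an illustrative path):

```shell
python -m matcha.onnx.infer model.onnx --file input.txt --output-dir ./outputs
```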
Synthesize every line of a text file with the --file argument.

Command Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| model | str | Required | Path to ONNX model file |
| --text | str | None | Text to synthesize |
| --file | str | None | Path to text file to synthesize |
| --vocoder | str | None | Path to external vocoder ONNX model |
| --spk | int | None | Speaker ID for multi-speaker models |
| --temperature | float | 0.667 | Variance of the noise (controls randomness) |
| --speaking-rate | float | 1.0 | Speaking rate (higher = slower speech) |
| --gpu | flag | False | Use GPU for inference |
| --output-dir | str | Current dir | Output folder to save results |
Synthesis Parameters
Temperature
Controls the variance of the noise in the diffusion process:

- Lower values (0.3-0.5): More deterministic, less variation
- Default (0.667): Balanced naturalness and variation
- Higher values (0.8-1.0): More variation, potentially less stable
Speaking Rate
Adjust the speed of generated speech:

- < 1.0: Faster speech
- 1.0: Normal speed
- > 1.0: Slower speech
The speaking rate parameter works inversely: higher values result in slower speech.
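Since the parameter behaves as a length scale, the effect on output duration can be sketched in plain Python. This is an illustrative model of the relationship described above, not the repository's internal code:

```python
def expected_duration(base_seconds: float, speaking_rate: float) -> float:
    """Estimate output duration if the speaking rate scales predicted durations."""
    # speaking_rate > 1.0 stretches durations -> slower, longer audio
    return base_seconds * speaking_rate

print(expected_duration(2.0, 1.0))   # 2.0  (normal speed)
print(expected_duration(2.0, 1.25))  # 2.5  (slower, longer output)
```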
GPU Inference
For faster inference on GPU, the script selects the ONNX Runtime execution provider automatically:

- `CUDAExecutionProvider` when `--gpu` is specified
- `CPUExecutionProvider` by default
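Provider selection can be sketched in plain Python (an illustrative helper, not the repository's exact code; the list order defines ONNX Runtime's fallback priority):

```python
def select_providers(use_gpu: bool) -> list:
    # ONNX Runtime tries providers in order and falls back to the
    # next one if a provider is unavailable on the current machine.
    if use_gpu:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# Typical usage with onnxruntime (shown as a comment to stay self-contained):
# session = onnxruntime.InferenceSession("model.onnx",
#                                        providers=select_providers(use_gpu=True))
print(select_providers(False))  # ['CPUExecutionProvider']
```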
Output Formats
With Embedded Vocoder
If the ONNX model has an embedded vocoder, waveforms are generated directly as `output_1.wav`, `output_2.wav`, etc. (24-bit PCM, 22,050 Hz).
Without Embedded Vocoder
If the model only contains the Matcha-TTS acoustic model:

- `output_1.npy` - Mel-spectrogram as a NumPy array
- `output_1.png` - Mel-spectrogram visualization
Using External Vocoder
Run the full TTS pipeline with an external ONNX vocoder; outputs are written as `output_1.wav`, `output_2.wav`, etc.
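For example (a sketch; `hifigan.onnx` is an illustrative vocoder path):

```shell
python -m matcha.onnx.infer model.onnx --text "Hello world" --vocoder hifigan.onnx --output-dir ./outputs
```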
Multi-Speaker Inference
For multi-speaker models, specify the speaker ID with the --spk argument.

Performance Metrics
The inference script automatically reports:

- Inference seconds: Total time for generation
- Generated audio seconds: Duration of output audio
- RTF (Real-Time Factor): Ratio of inference time to audio duration
  - RTF < 1.0: Faster than real-time
  - RTF = 1.0: Real-time
  - RTF > 1.0: Slower than real-time
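RTF is straightforward to compute from the two reported times:

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    # RTF < 1.0 means synthesis runs faster than real time
    return inference_seconds / audio_seconds

print(real_time_factor(0.5, 2.0))  # 0.25: four times faster than real time
```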
Complete Example
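Full synthesis with all parameters might look like this (a sketch; the paths, speaker ID, and text are illustrative):

```shell
python -m matcha.onnx.infer model.onnx \
  --text "This is a complete example." \
  --vocoder hifigan.onnx \
  --spk 5 \
  --temperature 0.667 \
  --speaking-rate 1.0 \
  --gpu \
  --output-dir ./outputs
```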
A complete run can combine text input, an external vocoder, speaker selection, temperature, and speaking rate in one command.

Batch Processing
When you process multiple lines from a file, each line is synthesized separately:

- `output_1.wav` - First sentence
- `output_2.wav` - Second sentence
- `output_3.wav` - Third sentence
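A batch run over a three-line file might look like this (a sketch; the file contents and paths are illustrative):

```shell
printf 'First sentence.\nSecond sentence.\nThird sentence.\n' > sentences.txt
python -m matcha.onnx.infer model.onnx --file sentences.txt --output-dir ./outputs
```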
Implementation Details
The inference process (matcha/onnx/infer.py:24-64):

- Loads the ONNX model with the specified execution provider
- Processes input text to phoneme sequences
- Pads sequences for batching
- Runs ONNX inference
- Optionally runs external vocoder
- Writes output files and reports performance metrics
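The padding step above can be sketched in plain Python; this is an illustrative helper, not the repository's exact implementation:

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad phoneme ID sequences to a common length for batching."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

batch, lengths = pad_sequences([[5, 7, 2], [9, 4]])
print(batch)    # [[5, 7, 2], [9, 4, 0]]
print(lengths)  # [3, 2]
```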
Troubleshooting
GPU Not Being Used
Ensure:

- `onnxruntime-gpu` is installed
- CUDA is properly configured
- The `--gpu` flag is passed to the command
Speaker ID Error
For multi-speaker models, ensure the speaker ID is within the valid range:

- VCTK model: 0-107
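A quick range check before running inference can catch this early (a sketch; 108 speakers for the VCTK model, IDs 0-107):

```python
def check_speaker_id(spk: int, n_spks: int = 108) -> int:
    # The VCTK-trained multi-speaker checkpoint exposes speaker IDs 0-107
    if not 0 <= spk < n_spks:
        raise ValueError(f"--spk must be in [0, {n_spks - 1}], got {spk}")
    return spk

print(check_speaker_id(42))  # 42
```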
Memory Issues
For large batch processing, reduce the number of sentences per run or use CPU inference.

Next Steps
- ONNX Export - Learn how to export models to ONNX
- Pre-trained Models - Download and use pre-trained checkpoints