## Overview
Matcha-TTS supports exporting trained checkpoints to ONNX format, enabling deployment on various platforms and inference engines. The export process converts the PyTorch model to an optimized ONNX graph with configurable parameters.

Special thanks to @mush42 for implementing ONNX export and inference support.
## Installation

Before exporting to ONNX, install the required dependencies:
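In a standard pip environment this is typically just the `onnx` package (add `onnxruntime` as well if you also plan to run inference locally):

```shell
pip install onnx
```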
## Basic Export

Export a Matcha-TTS checkpoint to ONNX format with the arguments described below.

### Command Arguments
| Argument | Type | Required | Description |
|---|---|---|---|
| `checkpoint_path` | str | Yes | Path to the model checkpoint |
| `output` | str | Yes | Path to the output `.onnx` file |
| `--n-timesteps` | int | No | Number of ODE solver steps (default: 5) |
| `--vocoder-name` | str | No | Name of the vocoder to embed in the ONNX graph |
| `--vocoder-checkpoint-path` | str | No | Path to the vocoder checkpoint |
| `--opset` | int | No | ONNX opset version (default: 15) |
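Putting these together, a minimal export invocation might look like the following, assuming the `matcha` package is importable; `matcha.ckpt` and `model.onnx` are placeholder paths:

```shell
# Export the acoustic model only (produces mel-spectrogram outputs)
python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5
```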
## Export with Embedded Vocoder
You can embed a vocoder in the exported ONNX graph to enable end-to-end waveform generation.

### Available Vocoders
The following vocoders are supported for embedding (from `matcha/cli.py:25-28`):

- `hifigan_T2_v1` - HiFi-GAN trained on LJ Speech
- `hifigan_univ_v1` - Universal HiFi-GAN for multi-speaker models
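For example, an end-to-end export with the universal vocoder might look like this (the checkpoint paths are placeholders):

```shell
# Embed the vocoder so the exported graph outputs waveforms directly
python3 -m matcha.onnx.export matcha.ckpt model_full.onnx \
  --n-timesteps 5 \
  --vocoder-name hifigan_univ_v1 \
  --vocoder-checkpoint-path /path/to/vocoder.ckpt
```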
When embedding a vocoder, both the `--vocoder-name` and `--vocoder-checkpoint-path` arguments are required.

## Export Behavior
### Model Inputs
The exported ONNX model accepts the following inputs:

- `x` - Phoneme sequence (int64, shape: `[batch_size, time]`)
- `x_lengths` - Length of each sequence (int64, shape: `[batch_size]`)
- `scales` - Temperature and length scale (float32, shape: `[2]`)
- `spks` - Speaker IDs for multi-speaker models (int64, shape: `[batch_size]`)
### Model Outputs
Without vocoder:

- `mel` - Mel-spectrogram (float32, shape: `[batch_size, 80, time]`)
- `mel_lengths` - Mel-spectrogram lengths (int64, shape: `[batch_size]`)

With embedded vocoder:

- `wav` - Waveform audio (float32, shape: `[batch_size, time]`)
- `wav_lengths` - Waveform lengths (int64, shape: `[batch_size]`)
### Dynamic Shapes
The exported ONNX model supports dynamic batch sizes and sequence lengths, making it flexible for various deployment scenarios. Dynamic axes are configured for:

- Batch dimension (axis 0) for all inputs and outputs
- Time dimension for sequence inputs and generated outputs
### Timesteps Configuration
The number of ODE solver steps affects quality and speed:

- Lower values (2-5): Faster inference, slightly lower quality
- Higher values (10-20): Better quality, slower inference
- Recommended: 5 timesteps for most use cases
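As a sketch, re-exporting the same checkpoint at both ends of the trade-off (paths are placeholders):

```shell
# Fast export: fewer solver steps, slightly lower quality
python3 -m matcha.onnx.export matcha.ckpt model_fast.onnx --n-timesteps 2

# Higher-quality export: more solver steps, slower inference
python3 -m matcha.onnx.export matcha.ckpt model_hq.onnx --n-timesteps 10
```

Because `--n-timesteps` is an export-time argument, changing the step count means re-exporting the model.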
## Multi-Speaker Model Export
When exporting a multi-speaker model (`matcha/onnx/export.py:134`), the ONNX graph automatically includes the `spks` input: the export code checks `matcha.n_spks > 1` and adds the appropriate input nodes.
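For instance, a multi-speaker checkpoint would typically be exported together with the universal vocoder (checkpoint names and paths here are placeholders):

```shell
# The exported graph will expose the extra `spks` input automatically
python3 -m matcha.onnx.export multispeaker.ckpt model_ms.onnx \
  --n-timesteps 5 \
  --vocoder-name hifigan_univ_v1 \
  --vocoder-checkpoint-path /path/to/vocoder.ckpt
```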
## Export Implementation Details
The export process (`matcha/onnx/export.py:35-60`):

- Loads the Matcha-TTS checkpoint
- Optionally loads and combines with vocoder
- Creates dummy inputs for tracing
- Monkey-patches the forward function to accept scale parameters as tensors
- Exports to ONNX with dynamic axes and constant folding optimizations
## Next Steps
- **ONNX Inference** - Run inference on exported ONNX models
- **Multi-Speaker Setup** - Configure multi-speaker TTS models