Matcha Models

Matcha is a high-quality TTS model that uses an acoustic model + vocoder pipeline to produce very natural-sounding speech. It’s designed for applications where quality is the top priority.

Model Architecture

Matcha uses a two-stage architecture:
  • Acoustic Model (acoustic_model.onnx) – Generates mel-spectrogram from text
  • Vocoder (vocoder.onnx) – Converts mel-spectrogram to waveform
  • Tokens (tokens.txt) – Text token vocabulary
This separation allows for high-quality synthesis by using specialized neural networks for each stage.

When to Use

High-Quality Audio

When naturalness and quality are more important than speed

Audiobook Narration

Professional-quality narration for long-form content

Content Creation

Voiceovers for videos, podcasts, and media

Expressive Speech

Natural prosody and intonation

Supported Languages

Matcha models are available for:
  • English (primary focus)
  • Some multilingual variants
Check the download page for available languages.

Performance Characteristics

| Aspect | Rating | Notes |
| --- | --- | --- |
| Streaming | ✅ Supported | Streaming generation available |
| Quality | ⭐⭐⭐⭐⭐ | Excellent, very natural-sounding |
| Speed | ⭐⭐⭐⭐ | Fast, but slower than VITS |
| Memory | ⭐⭐⭐ | Moderate (two models: acoustic + vocoder) |
| Model Size | Medium | Typically 50-100 MB (acoustic + vocoder) |

Matcha Models

Browse and download pretrained Matcha models

Configuration Example

Basic TTS

import { createTTS } from 'react-native-sherpa-onnx/tts';

const tts = await createTTS({
  modelPath: {
    type: 'asset',
    path: 'models/matcha-icefall-en'
  },
  modelType: 'matcha', // or 'auto'
  numThreads: 2,
});

const audio = await tts.generateSpeech('Hello, world!');
console.log('Generated:', audio.samples.length, 'samples at', audio.sampleRate, 'Hz');

await tts.destroy();
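
As a quick sanity check on the returned audio, the clip length follows from the two fields logged above. This is a minimal sketch, assuming mono output; `GeneratedAudio` and `audioDurationSeconds` are illustrative names, not part of the library:

```typescript
// Shape of the object returned by generateSpeech(), as used in the example above.
// The element type of `samples` is an assumption; only `.length` is needed here.
interface GeneratedAudio {
  samples: ArrayLike<number>;
  sampleRate: number;
}

// Duration in seconds = number of samples / samples per second (mono assumed).
function audioDurationSeconds(audio: GeneratedAudio): number {
  if (audio.sampleRate <= 0) {
    throw new RangeError('sampleRate must be positive');
  }
  return audio.samples.length / audio.sampleRate;
}

// Synthetic data for illustration: 44100 samples at 22050 Hz.
const demoAudio: GeneratedAudio = {
  samples: new Float32Array(44100),
  sampleRate: 22050,
};
const demoSeconds = audioDurationSeconds(demoAudio); // 2
```

A duration far from what the text length suggests can point to a sample rate mismatch during playback.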

With Model Options

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
  modelOptions: {
    matcha: {
      noiseScale: 0.667,   // Voice variation
      lengthScale: 1.0,    // Speech speed
    }
  },
  numThreads: 2,
});

Streaming TTS

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
});

const sampleRate = await tts.getSampleRate();
await tts.startPcmPlayer(sampleRate, 1);

await tts.generateSpeechStream('High quality streaming speech', { sid: 0, speed: 1.0 }, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples);
  },
  onEnd: async () => {
    await tts.stopPcmPlayer();
  },
});

await tts.destroy();
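
If the streamed audio is also needed as a single buffer (for example, to save it afterwards), the chunks can be collected inside onChunk and joined once generation ends. This is a minimal sketch; `concatSamples` is a hypothetical helper, and the exact type of chunk.samples is an assumption, so it accepts any array-like of numbers:

```typescript
// Joins streamed sample chunks into one contiguous Float32Array.
// Accepts plain number arrays or typed arrays, since the chunk type is assumed.
function concatSamples(chunks: ReadonlyArray<ArrayLike<number>>): Float32Array {
  // First pass: total length, so the output is allocated only once.
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Float32Array(total);

  // Second pass: copy each chunk at its running offset.
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// Synthetic chunks for illustration.
const joined = concatSamples([[0.5, -0.5], new Float32Array([0.25])]);
```

Inside onChunk, push chunk.samples onto an array, then call concatSamples on that array in onEnd.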

Save to File

import { createTTS, saveAudioToFile } from 'react-native-sherpa-onnx/tts';

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
});

const audio = await tts.generateSpeech('Save this to a file');
await saveAudioToFile(audio, '/path/to/output.wav');

await tts.destroy();

Model Options

Matcha models support two tuning parameters:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| noiseScale | number | 0.667 | Controls voice variation and expressiveness. Range: 0.0-1.0 |
| lengthScale | number | 1.0 | Speech speed. < 1.0 = faster, > 1.0 = slower |
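
When these values come from user input (a settings slider, for instance), it can help to normalize them before passing them to createTTS. This is a sketch built on the defaults and range listed above; `normalizeMatchaOptions` is a hypothetical helper, and rejecting a non-positive lengthScale is an assumption rather than documented library behavior:

```typescript
interface MatchaOptions {
  noiseScale: number;
  lengthScale: number;
}

// Defaults as listed in the options table.
const MATCHA_DEFAULTS: MatchaOptions = { noiseScale: 0.667, lengthScale: 1.0 };

// Fills in defaults, clamps noiseScale to [0, 1], and rejects a lengthScale <= 0.
function normalizeMatchaOptions(input: Partial<MatchaOptions>): MatchaOptions {
  const noiseScale = input.noiseScale ?? MATCHA_DEFAULTS.noiseScale;
  const lengthScale = input.lengthScale ?? MATCHA_DEFAULTS.lengthScale;

  if (!Number.isFinite(noiseScale) || !Number.isFinite(lengthScale)) {
    throw new RangeError('noiseScale and lengthScale must be finite numbers');
  }
  if (lengthScale <= 0) {
    // Assumption: a zero or negative speed factor is never meaningful.
    throw new RangeError('lengthScale must be greater than 0');
  }

  return {
    noiseScale: Math.min(1, Math.max(0, noiseScale)),
    lengthScale,
  };
}

// noiseScale above the documented range is clamped to 1.
const safeOptions = normalizeMatchaOptions({ noiseScale: 1.4, lengthScale: 0.9 });
```

The result can be passed as `modelOptions: { matcha: safeOptions }`.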

Tuning Examples

// Clear, fast speech
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
  modelOptions: {
    matcha: {
      noiseScale: 0.4,    // Less variation
      lengthScale: 0.9,   // Slightly faster
    }
  },
});

// Expressive, natural speech
const tts2 = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
  modelOptions: {
    matcha: {
      noiseScale: 0.8,    // More expressive
      lengthScale: 1.1,   // Slightly slower
    }
  },
});

Runtime Updates

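// Any createTTS configuration works here; this one matches the earlier examples.
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
});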

await tts.updateParams({
  modelOptions: {
    matcha: {
      noiseScale: 0.7,
      lengthScale: 1.2,
    }
  },
});

Model Detection

Matcha models are detected automatically from the files in the model folder:
  • Both acoustic_model.onnx and vocoder.onnx are present
  • No folder name pattern is required
Expected files:
  • acoustic_model.onnx
  • vocoder.onnx
  • tokens.txt
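
The expected-files list above can be turned into a pre-flight check that reports what is missing before createTTS is called. This is a minimal sketch; `missingMatchaFiles` is a hypothetical helper, and how the folder listing is obtained (a file-system module, a bundled manifest) is left to the app:

```typescript
// File names taken from the expected-files list above.
const MATCHA_REQUIRED_FILES = [
  'acoustic_model.onnx',
  'vocoder.onnx',
  'tokens.txt',
] as const;

// Returns the required files that are absent from a folder listing.
// Matching is exact and case-sensitive.
function missingMatchaFiles(fileNames: readonly string[]): string[] {
  const present = new Set(fileNames);
  return MATCHA_REQUIRED_FILES.filter((name) => !present.has(name));
}

// Example listing with the vocoder missing.
const missing = missingMatchaFiles(['acoustic_model.onnx', 'tokens.txt', 'README.md']);
// missing is ['vocoder.onnx']
```

An empty result means all three files are in place.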

Performance Tips

Optimize Thread Count

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  numThreads: 4, // More threads for faster generation
});

Use Streaming for Long Text

For better perceived performance:
const longText = 'Long audiobook paragraph...';

await tts.generateSpeechStream(longText, { sid: 0, speed: 1.0 }, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples); // Play while generating
  },
});
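
For very long input such as a full chapter, one option is to feed the text to generateSpeechStream one sentence at a time, so playback of the first sentence starts sooner. Whether this is needed depends on the model and device, and the library may already handle long input well on its own. `splitIntoSentences` below is a hypothetical helper with a deliberately simple rule: it splits after `.`, `!` or `?` when followed by whitespace or the end of the text.

```typescript
// Splits text into sentences at '.', '!' or '?' followed by whitespace or end of text.
// A terminator inside a token (for example "3.5") does not split.
function splitIntoSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;

  for (let i = 0; i < text.length; i++) {
    const isTerminator = '.!?'.includes(text.charAt(i));
    const atBoundary = i + 1 === text.length || /\s/.test(text.charAt(i + 1));
    if (isTerminator && atBoundary) {
      const sentence = text.slice(start, i + 1).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = i + 1;
    }
  }

  // Keep any trailing text that has no terminator.
  const rest = text.slice(start).trim();
  if (rest.length > 0) sentences.push(rest);
  return sentences;
}

const parts = splitIntoSentences('Hello world. How are you? It costs 3.5 dollars');
// parts is ['Hello world.', 'How are you?', 'It costs 3.5 dollars']
```

Each element can then be passed to generateSpeechStream in order, awaiting one call before starting the next.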

Hardware Acceleration

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  provider: 'nnapi', // Android NNAPI
  // provider: 'xnnpack', // XNNPACK
});

Streaming Support

Streaming: ✅ Yes

Matcha models support streaming generation. Use generateSpeechStream() for incremental audio generation and low-latency playback.

Advantages

  1. Excellent Quality: Very natural-sounding speech
  2. Natural Prosody: Good intonation and rhythm
  3. Streaming: Supports incremental generation
  4. Acoustic Model + Vocoder: Flexible two-stage architecture
  5. Multi-Speaker: Some models support multiple speakers

Limitations

  1. Slower than VITS: The two-stage pipeline runs a vocoder after the acoustic model, so generation is slightly slower
  2. Larger Size: Requires both acoustic model and vocoder
  3. No Voice Cloning: Cannot synthesize custom voices
  4. Limited Languages: Primarily English-focused

Use Cases

Audiobook Narration

Professional-quality long-form narration

Content Production

Voiceovers for videos and media

E-Learning

High-quality educational content

Podcasts

Natural-sounding podcast narration

Common Issues

Model fails to load
  • Verify both acoustic_model.onnx and vocoder.onnx are present
  • Check that tokens.txt exists
  • Ensure sufficient device memory for both models

Generation is slow
  • Increase numThreads on multi-core devices
  • Use hardware acceleration (provider: 'nnapi')
  • Consider using VITS for faster generation
  • Ensure no other heavy apps are running

Speech sounds unnatural
  • Adjust noiseScale for more/less expressiveness
  • Try different lengthScale values

Playback sounds distorted or at the wrong pitch
  • Ensure correct sample rate for playback
  • Check that vocoder output is not being resampled incorrectly

Comparison with Other Models

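| Feature | Matcha | VITS | Zipvoice | Kokoro |
| --- | --- | --- | --- | --- |
| Quality | Very High | High | Very High | High |
| Speed | Fast | Very Fast | Medium | Fast |
| Streaming | Yes | Yes | No | Yes |
| Voice Cloning | No | No | Yes | No |
| Model Size | Medium | Small | Large | Small |
| Architecture | Acoustic + Vocoder | End-to-End | Encoder + Decoder + Vocoder | End-to-End |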
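
The comparison can also be written as data, which makes it easy to shortlist models by requirement. The values below are copied from the table; the `TtsModelTraits` type and the `modelsMatching` helper are illustrative and not part of the library:

```typescript
interface TtsModelTraits {
  quality: 'High' | 'Very High';
  speed: 'Medium' | 'Fast' | 'Very Fast';
  streaming: boolean;
  voiceCloning: boolean;
  size: 'Small' | 'Medium' | 'Large';
}

// One entry per column of the comparison table.
const TTS_MODELS: Record<string, TtsModelTraits> = {
  Matcha: { quality: 'Very High', speed: 'Fast', streaming: true, voiceCloning: false, size: 'Medium' },
  VITS: { quality: 'High', speed: 'Very Fast', streaming: true, voiceCloning: false, size: 'Small' },
  Zipvoice: { quality: 'Very High', speed: 'Medium', streaming: false, voiceCloning: true, size: 'Large' },
  Kokoro: { quality: 'High', speed: 'Fast', streaming: true, voiceCloning: false, size: 'Small' },
};

// Returns the names of models whose traits equal every requested value.
function modelsMatching(required: Partial<TtsModelTraits>): string[] {
  const keys = Object.keys(required) as Array<keyof TtsModelTraits>;
  return Object.entries(TTS_MODELS)
    .filter(([, traits]) => keys.every((key) => traits[key] === required[key]))
    .map(([name]) => name);
}

// Highest quality tier with streaming support.
const shortlist = modelsMatching({ quality: 'Very High', streaming: true });
// shortlist is ['Matcha']
```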

Next Steps

TTS API

Detailed API documentation

Streaming TTS

Low-latency streaming guide

Model Setup

How to download and bundle models

Execution Providers

Hardware acceleration options
