## Overview
Matcha-TTS uses HiFi-GAN as the neural vocoder to convert generated mel-spectrograms into high-quality audio waveforms. The vocoder is a separate model that runs after Matcha-TTS synthesis.
## Available Vocoders
### hifigan_T2_v1

Optimized for the single-speaker LJSpeech model.

```python
import torch

from matcha.hifigan.models import Generator
from matcha.hifigan.config import v1
from matcha.hifigan.env import AttrDict

h = AttrDict(v1)
vocoder = Generator(h)
vocoder.load_state_dict(
    torch.load("hifigan_T2_v1", map_location="cuda")["generator"]
)
vocoder.eval()
vocoder.remove_weight_norm()
```
Specifications:

- Sample rate: 22,050 Hz
- Mel channels: 80
- Optimized for: LJSpeech dataset
- Use with: `matcha_ljspeech`
### hifigan_univ_v1

Universal vocoder for multi-speaker models.

```python
vocoder = Generator(h)
vocoder.load_state_dict(
    torch.load("hifigan_univ_v1", map_location="cuda")["generator"]
)
vocoder.eval()
vocoder.remove_weight_norm()
```
Specifications:

- Sample rate: 22,050 Hz
- Mel channels: 80
- Optimized for: Multi-speaker datasets (VCTK, etc.)
- Use with: `matcha_vctk` and custom models
## Generator Class

Main HiFi-GAN generator network.

```python
from matcha.hifigan.models import Generator
from matcha.hifigan.env import AttrDict
from matcha.hifigan.config import v1

h = AttrDict(v1)
vocoder = Generator(h)
```
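The `AttrDict` wrapper simply exposes dictionary keys as attributes, which is why `Generator` can read fields like `h.upsample_rates` from the plain `v1` dict. A minimal sketch of such a wrapper (this mirrors the common HiFi-GAN `env.py` pattern and is illustrative, not necessarily Matcha-TTS's exact code):

```python
class AttrDict(dict):
    """A dict whose keys are also readable/writable as attributes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Point the attribute namespace at the dict itself, so that
        # d.key and d["key"] access the same storage.
        self.__dict__ = self

h = AttrDict({"upsample_initial_channel": 512, "upsample_rates": [8, 8, 2, 2]})
print(h.upsample_initial_channel)  # 512, same as h["upsample_initial_channel"]
```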
### Configuration

Key settings in `h`:

- `h.resblock`: Residual block type, `"1"` or `"2"`
- `h.upsample_rates`: Upsampling rates for each layer (e.g., `[8, 8, 2, 2]`)
- `h.upsample_kernel_sizes`: Kernel sizes for the upsampling layers
- `h.upsample_initial_channel`: Number of channels after the initial convolution (e.g., 512)
- `h.resblock_kernel_sizes`: Kernel sizes for the residual blocks
- `h.resblock_dilation_sizes`: Dilation rates for the residual blocks
### Methods

#### forward()

Generate a waveform from a mel-spectrogram.

```python
def forward(x: torch.Tensor) -> torch.Tensor
```

Parameters:

- `x`: Mel-spectrogram input. Shape: `(batch_size, 80, mel_length)`

Returns:

- Generated audio waveform. Shape: `(batch_size, 1, audio_length)`, where `audio_length = mel_length * hop_length` (typically 256)
#### remove_weight_norm()

Removes weight normalization for faster inference.

```python
vocoder.remove_weight_norm()
```

Always call this after loading weights and before running inference.
## Denoiser

Post-processing denoiser to reduce vocoder artifacts.

```python
from matcha.hifigan.denoiser import Denoiser

denoiser = Denoiser(vocoder, mode="zeros")
```

### Parameters

- `vocoder`: The HiFi-GAN generator model
- `mode`: Denoising mode, `"zeros"` or `"normal"`
### Methods

#### __call__()

```python
def __call__(
    audio: torch.Tensor,
    strength: float = 0.00025
) -> torch.Tensor
```

Parameters:

- `audio`: Input audio waveform. Shape: `(audio_length,)` or `(1, audio_length)`
- `strength`: Denoising strength. Higher values denoise more but may affect quality.

Returns:

- The denoised audio waveform.
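Conceptually, this style of denoiser estimates the vocoder's bias by running it on an "empty" mel-spectrogram (all zeros, or random normal noise, per `mode`), then subtracts `strength` times that bias magnitude from the audio in the spectral domain. The NumPy sketch below illustrates the spectral-subtraction step only; the single-FFT formulation and the `bias_mag` input are illustrative assumptions, not Matcha-TTS's exact (STFT-based) implementation:

```python
import numpy as np

def spectral_subtract(audio: np.ndarray, bias_mag: np.ndarray, strength: float) -> np.ndarray:
    """Subtract a scaled bias magnitude from the audio's spectrum.

    audio:    1-D waveform
    bias_mag: magnitude spectrum of the vocoder's "silence" output,
              same length as np.fft.rfft(audio)
    """
    spec = np.fft.rfft(audio)
    mag = np.abs(spec)
    phase = np.angle(spec)
    # Subtract the scaled bias, clamping magnitudes at zero.
    mag = np.clip(mag - strength * bias_mag, 0.0, None)
    # Reconstruct with the original phase.
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(audio))

# Toy example: a sine wave plus a small constant "bias" tone.
t = np.arange(1024)
clean = np.sin(2 * np.pi * 8 * t / 1024)
tone = 0.05 * np.sin(2 * np.pi * 100 * t / 1024)
noisy = clean + tone
bias = np.abs(np.fft.rfft(tone))  # pretend this is the vocoder bias
denoised = spectral_subtract(noisy, bias, strength=1.0)
```

Here the bias tone sits in its own frequency bin, so subtracting its full magnitude recovers the clean signal almost exactly; real vocoder bias is broadband, which is why small `strength` values are used.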
## Complete Usage Example

### Loading Vocoder

```python
import torch

from matcha.hifigan.models import Generator
from matcha.hifigan.config import v1
from matcha.hifigan.env import AttrDict
from matcha.hifigan.denoiser import Denoiser

def load_vocoder(checkpoint_path: str, device: str = "cuda"):
    """Load the HiFi-GAN vocoder and its denoiser."""
    # Load configuration
    h = AttrDict(v1)

    # Create model
    vocoder = Generator(h).to(device)

    # Load weights
    state_dict = torch.load(checkpoint_path, map_location=device)
    vocoder.load_state_dict(state_dict["generator"])

    # Prepare for inference
    vocoder.eval()
    vocoder.remove_weight_norm()

    # Create denoiser
    denoiser = Denoiser(vocoder, mode="zeros")

    return vocoder, denoiser

# Usage
vocoder, denoiser = load_vocoder("hifigan_T2_v1")
```
### Mel to Audio Conversion

```python
import torch
import soundfile as sf

def mel_to_audio(
    mel: torch.Tensor,
    vocoder,
    denoiser=None,
    denoiser_strength: float = 0.00025,
) -> torch.Tensor:
    """Convert a mel-spectrogram to an audio waveform."""
    with torch.inference_mode():
        # Generate audio
        audio = vocoder(mel)

        # Clamp to valid range
        audio = audio.clamp(-1, 1)

        # Apply denoising
        if denoiser is not None:
            audio = denoiser(
                audio.squeeze(),
                strength=denoiser_strength,
            )

    # Move to CPU
    audio = audio.cpu().squeeze()
    return audio

# Usage
mel = torch.randn(1, 80, 100).cuda()  # Example mel
audio = mel_to_audio(mel, vocoder, denoiser)

# Save audio
sf.write("output.wav", audio.numpy(), 22050)
```
### End-to-End Synthesis

```python
import torch
import soundfile as sf

from matcha.models.matcha_tts import MatchaTTS
from matcha.text import text_to_sequence
from matcha.utils.utils import intersperse

def synthesize(
    text: str,
    matcha_model,
    vocoder,
    denoiser=None,
    n_timesteps: int = 10,
    temperature: float = 0.667,
    length_scale: float = 1.0,
    denoiser_strength: float = 0.00025,
):
    """Full text-to-speech synthesis."""
    device = next(matcha_model.parameters()).device

    # Preprocess text
    sequence, _ = text_to_sequence(text, ["english_cleaners2"])
    sequence = intersperse(sequence, 0)
    x = torch.LongTensor(sequence).unsqueeze(0).to(device)
    x_lengths = torch.LongTensor([len(sequence)]).to(device)

    # Generate mel-spectrogram
    output = matcha_model.synthesise(
        x=x,
        x_lengths=x_lengths,
        n_timesteps=n_timesteps,
        temperature=temperature,
        length_scale=length_scale,
    )

    # Convert to audio
    mel = output["mel"]
    audio = vocoder(mel).clamp(-1, 1)
    if denoiser is not None:
        audio = denoiser(audio.squeeze(), strength=denoiser_strength)

    return {
        "audio": audio.cpu().squeeze().numpy(),
        "mel": mel.cpu().squeeze().numpy(),
        "rtf": output["rtf"],
    }

# Usage
result = synthesize(
    "Hello, this is a test.",
    matcha_model,
    vocoder,
    denoiser,
    n_timesteps=10,
)

sf.write("output.wav", result["audio"], 22050)
print(f"RTF: {result['rtf']:.4f}")
```
## Model Architecture

HiFi-GAN uses a multi-scale architecture:

- Input: 80-channel mel-spectrogram
- Initial Conv: Expand to 512 channels
- Upsampling: Multiple transposed convolutions (8x → 8x → 2x → 2x), halving the channel count at each stage
- Residual Blocks: Multi-receptive field fusion (MRF) after each upsampling stage
- Output: Single-channel waveform

```
Mel (80, T) → Conv1d → (512, T)
    ↓
Upsample (8x) → (256, 8T)
    ↓
Upsample (8x) → (128, 64T)
    ↓
Upsample (2x) → (64, 128T)
    ↓
Upsample (2x) → (32, 256T)
    ↓
Conv1d → Waveform (1, 256T)
```

Total upsampling: 8 × 8 × 2 × 2 = 256 (the hop length)
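The relationship between mel frames and output samples follows directly from the upsample rates; a quick sanity check:

```python
import math

upsample_rates = [8, 8, 2, 2]           # from the v1 config
hop_length = math.prod(upsample_rates)  # total upsampling factor
print(hop_length)  # 256

# A mel-spectrogram with 100 frames therefore yields
# 100 * 256 = 25,600 audio samples (~1.16 s at 22,050 Hz).
mel_frames = 100
audio_samples = mel_frames * hop_length
print(audio_samples)  # 25600
print(audio_samples / 22050)  # roughly 1.16 seconds
```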
## Speed

On an RTX 3090:

- Single utterance: ~0.001 RTF (1000x faster than real time)
- Batch of 32: ~0.01 RTF

On CPU:

- Single utterance: ~0.1 RTF (10x faster than real time)

## Memory

- Model size: ~14 MB
- GPU memory: ~200 MB
- Inference: Minimal additional memory
## Configuration Details

### v1 Config

```python
from matcha.hifigan.config import v1

print(v1)
# {
#     "resblock": "1",
#     "num_gpus": 0,
#     "batch_size": 16,
#     "learning_rate": 0.0002,
#     "adam_b1": 0.8,
#     "adam_b2": 0.99,
#     "lr_decay": 0.999,
#     "seed": 1234,
#     "upsample_rates": [8, 8, 2, 2],
#     "upsample_kernel_sizes": [16, 16, 4, 4],
#     "upsample_initial_channel": 512,
#     "resblock_kernel_sizes": [3, 7, 11],
#     "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
# }
```
## Denoising Strategies

### No Denoising

```python
audio = vocoder(mel).clamp(-1, 1).cpu().squeeze()
```

Fastest, but may leave artifacts.

### Light Denoising

```python
audio = denoiser(audio, strength=0.00025)
```

Balanced quality and speed (recommended).

### Heavy Denoising

```python
audio = denoiser(audio, strength=0.001)
```

Cleaner, but may affect naturalness.
## Best Practices

- Remove weight norm: Always call `remove_weight_norm()` before inference
- Use a matching vocoder: LJSpeech → `hifigan_T2_v1`, VCTK → `hifigan_univ_v1`
- Clamp outputs: Always clamp audio to the [-1, 1] range
- Batch processing: Process multiple mels together for efficiency
- Denoising: Use light denoising (strength 0.00025) for the best quality/speed trade-off
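Batch processing requires padding mel-spectrograms of different lengths to a common length and trimming each output afterwards. A minimal sketch of that pad/trim logic, using a stand-in `fake_vocoder` (nearest-neighbour upsampling by the hop length) in place of the real model:

```python
import numpy as np

HOP_LENGTH = 256

def fake_vocoder(mel_batch: np.ndarray) -> np.ndarray:
    """Stand-in for HiFi-GAN: maps (B, 80, T) -> (B, 1, T * HOP_LENGTH)."""
    energy = mel_batch.mean(axis=1, keepdims=True)  # (B, 1, T)
    return np.repeat(energy, HOP_LENGTH, axis=-1)   # (B, 1, T * 256)

def batch_vocode(mels: list) -> list:
    """Zero-pad variable-length mels, vocode in one call, trim outputs."""
    lengths = [m.shape[-1] for m in mels]
    max_len = max(lengths)
    batch = np.stack([
        np.pad(m, ((0, 0), (0, max_len - m.shape[-1])))  # pad time axis
        for m in mels
    ])
    audio = fake_vocoder(batch)  # one batched call instead of N single calls
    # Trim the padded tail of each utterance using its true length.
    return [audio[i, 0, : n * HOP_LENGTH] for i, n in enumerate(lengths)]

mels = [np.random.randn(80, 100), np.random.randn(80, 73)]
audios = batch_vocode(mels)
print([a.shape for a in audios])  # [(25600,), (18688,)]
```

The same pattern applies with `torch` tensors and the real `vocoder`; padding with zeros is a simplification, since HiFi-GAN's padded frames can bleed slightly into neighbouring output samples through the convolutions' receptive field.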
## Custom Vocoder

To use a different vocoder, wrap it in an object with the same call signature:

```python
class CustomVocoder:
    def __init__(self, checkpoint_path):
        # Load your vocoder
        pass

    def __call__(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, time)
        # return: (batch, 1, time * hop_length)
        pass

# Use with Matcha-TTS
vocoder = CustomVocoder("model.pt")
audio = vocoder(mel)
```
## Troubleshooting

### Artifacts in Audio

```python
# Increase denoiser strength
audio = denoiser(audio, strength=0.0005)
```

### Clipping/Distortion

```python
# Ensure proper clamping
audio = vocoder(mel).clamp(-1, 1)
```

### Slow Inference

```python
# Check that weight norm was removed
vocoder.remove_weight_norm()

# Use the GPU
vocoder = vocoder.cuda()
mel = mel.cuda()
```
## Source Reference

- Implementation: `matcha/hifigan/models.py:148`
- Denoiser: `matcha/hifigan/denoiser.py`
- Config: `matcha/hifigan/config.py`