FunASR MLX
GPU-accelerated Chinese speech recognition on Apple Silicon. Uses the Paraformer-large model from FunASR, running on the GPU through MLX.
Features
- Non-autoregressive ASR: Predicts all tokens in parallel (more than 18x faster than real time)
- Pure Rust: No Python dependencies at runtime
- GPU Accelerated: Metal GPU via MLX for all operations
- High Quality: FunASR-compatible audio preprocessing
Architecture
The Paraformer model consists of:
- Mel Frontend: 80-bin mel spectrogram with LFR stacking (7 frames, stride 6)
- SAN-M Encoder: 50-layer self-attention with FSMN memory enhancement
- CIF Predictor: Continuous integrate-and-fire for acoustic alignment
- Bidirectional Decoder: 16-layer transformer decoder
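The CIF step in the list above can be illustrated with a small pure-Rust sketch. It is not the crate's implementation: the function name `cif`, the plain `Vec<f32>` frames, and the threshold argument are assumptions made for illustration, and the sketch assumes every weight is below the threshold so that a frame fires at most once.

```rust
/// Continuous integrate-and-fire over encoder frames (illustrative sketch).
/// `frames[t]` is the hidden vector of frame t and `alphas[t]` its weight.
/// Assumes 0 <= alphas[t] < threshold, so a frame fires at most one token.
/// Returns one integrated vector per fired token.
fn cif(frames: &[Vec<f32>], alphas: &[f32], threshold: f32) -> Vec<Vec<f32>> {
    assert_eq!(frames.len(), alphas.len());
    let dim = frames.first().map_or(0, |f| f.len());
    let mut tokens = Vec::new();
    let mut acc_weight = 0.0f32;
    let mut acc_vec = vec![0.0f32; dim];

    for (frame, &alpha) in frames.iter().zip(alphas) {
        if acc_weight + alpha < threshold {
            // Not enough weight yet: integrate the whole frame.
            acc_weight += alpha;
            for (a, x) in acc_vec.iter_mut().zip(frame) {
                *a += alpha * x;
            }
        } else {
            // Fire: use only the weight needed to reach the threshold.
            let used = threshold - acc_weight;
            for (a, x) in acc_vec.iter_mut().zip(frame) {
                *a += used * x;
            }
            tokens.push(std::mem::replace(&mut acc_vec, vec![0.0; dim]));

            // The leftover weight of this frame starts the next token.
            let rest = alpha - used;
            acc_weight = rest;
            for (a, x) in acc_vec.iter_mut().zip(frame) {
                *a += rest * x;
            }
        }
    }
    tokens
}
```

With a threshold of 1.0, the number of fired tokens is roughly the sum of the weights, which is how the predictor decides the output length before the decoder runs in parallel.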
Installation
Quick start
Functions
load_model
Load a Paraformer model from a safetensors file.
- Parameter: path to the paraformer.safetensors file
- Returns: the loaded Paraformer model instance
load_model_with_config
Load a Paraformer model with custom configuration.
- Parameter: path to the model safetensors file
- Parameter: custom model configuration
- Returns: the loaded model with the custom configuration applied
parse_cmvn_file
Parse a CMVN (Cepstral Mean and Variance Normalization) file.
- Parameter: path to the am.mvn file from the FunASR model
- Returns: tuple of (addshift, rescale) vectors for normalization
transcribe
High-level transcription function.
- Parameter: loaded Paraformer model with CMVN set
- Parameter: audio samples as f32 in range [-1, 1]
- Parameter: vocabulary for decoding token IDs to text
- Returns: transcribed Chinese text
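Since transcribe expects mono f32 samples in [-1, 1], audio decoded as 16-bit PCM needs converting first. The following is a minimal sketch of that conversion. The function name `pcm16_to_mono_f32` is hypothetical and not part of this crate, and the sketch does not resample, so the input is assumed to be 16 kHz already.

```rust
/// Convert interleaved 16-bit PCM to mono f32 samples in [-1, 1].
/// Hypothetical helper for illustration; it does not resample.
fn pcm16_to_mono_f32(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels >= 1, "at least one channel is required");
    interleaved
        // One chunk per sample frame; a trailing partial frame is dropped.
        .chunks_exact(channels)
        .map(|frame| {
            // Scale each channel to [-1, 1), then average the channels.
            let sum: f32 = frame.iter().map(|&s| s as f32 / 32768.0).sum();
            sum / channels as f32
        })
        .collect()
}
```

Dividing by 32768 maps the most negative i16 value to exactly -1.0 and keeps the most positive value just below 1.0.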
transcribe_with_punctuation
Transcribe audio and apply punctuation restoration.
- Parameter: loaded Paraformer model
- Parameter: audio samples
- Parameter: vocabulary for decoding
- Parameter: CT-Transformer punctuation model
- Returns: transcribed text with punctuation restored
Works like transcribe, but passes the result through the CT-Transformer punctuation model.
Paraformer
Main model struct for FunASR Paraformer.
Paraformer::transcribe
Transcribe audio samples to token IDs.
- Parameter: audio samples as an MLX Array, shape [num_samples], 16 kHz mono
- Returns: token IDs as an MLX Array, shape [num_tokens]
Paraformer::set_cmvn
Set CMVN normalization parameters.
- Parameter: additive shift for mean normalization
- Parameter: multiplicative rescale for variance normalization
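To make the roles of the two CMVN vectors concrete, here is a minimal sketch of how an additive shift and a multiplicative rescale are typically applied to a feature matrix: each feature is shifted first and then scaled. The function `apply_cmvn` and the `Vec<Vec<f32>>` layout are assumptions for illustration; the crate itself operates on MLX arrays.

```rust
/// Apply CMVN in place: x <- (x + addshift) * rescale, per feature dimension.
/// `features` holds one vector per frame; both parameter vectors are assumed
/// to have the same length as each frame.
fn apply_cmvn(features: &mut [Vec<f32>], addshift: &[f32], rescale: &[f32]) {
    for frame in features.iter_mut() {
        for ((x, shift), scale) in frame.iter_mut().zip(addshift).zip(rescale) {
            // Shift towards zero mean, then scale towards unit variance.
            *x = (*x + shift) * scale;
        }
    }
}
```

Under this convention the shift is usually the negated mean and the rescale the inverse standard deviation of each feature dimension.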
Vocabulary
Vocabulary for decoding token IDs to text.
Vocabulary::load
Load vocabulary from a text file.
- Parameter: path to tokens.txt or vocab.txt (one token per line)
- Returns: the loaded vocabulary with 8404 tokens
Vocabulary::decode
Decode token IDs to text.
- Parameter: array of token IDs from model output
- Returns: decoded text with special tokens filtered
The filtered special tokens are <blank>, <s>, </s>, <unk>, and <pad>.
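The decoding behaviour described above can be sketched in a few lines of plain Rust. This is an illustration under assumptions, not the crate's code: `parse_vocab` and `decode_tokens` are hypothetical names, token IDs are taken to be zero-based line numbers of the vocabulary file, and tokens are simply concatenated, which suits Chinese text but ignores any subword joining rules.

```rust
/// Special tokens that are dropped from decoded text.
const SPECIAL_TOKENS: [&str; 5] = ["<blank>", "<s>", "</s>", "<unk>", "<pad>"];

/// Build a token table from file contents with one token per line.
/// The line index is assumed to be the token ID.
fn parse_vocab(text: &str) -> Vec<String> {
    text.lines().map(|line| line.to_string()).collect()
}

/// Map token IDs to text, skipping special tokens and out-of-range IDs.
fn decode_tokens(tokens: &[String], ids: &[usize]) -> String {
    ids.iter()
        .filter_map(|&id| tokens.get(id))
        .filter(|t| !SPECIAL_TOKENS.contains(&t.as_str()))
        .map(String::as_str)
        .collect()
}
```

For example, with a table whose lines are `<blank>`, `<s>`, `</s>`, `你`, `好`, the ID sequence 1, 3, 4, 2 decodes to the two Chinese characters only.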
Vocabulary::len
Get the number of tokens in the vocabulary.
- Returns: number of tokens (typically 8404)
MelFrontend
Mel spectrogram frontend for audio preprocessing.
MelFrontend::new
Create a new mel frontend.
- Parameter: model configuration
- Returns: mel frontend with 80-bin mel filters and FFT planner
Types
ParaformerConfig
Configuration for the Paraformer model.
Error
Error type for FunASR operations.
Model files
You need to download and convert the FunASR Paraformer-large model:
- Weights: paraformer.safetensors (converted from FunASR PyTorch)
- CMVN: am.mvn (from FunASR model directory)
- Vocabulary: tokens.txt or vocab.txt (8404 tokens)
Audio preprocessing
Audio requirements
- Sample rate: 16kHz
- Format: Mono, f32 samples in range [-1, 1]
- Processing: Mel spectrogram with LFR (Low Frame Rate) stacking
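LFR stacking with 7 frames and stride 6 concatenates neighbouring mel frames into one wider frame and keeps only every sixth position, so 80-bin frames become 560-dimensional frames at one sixth of the rate. The sketch below shows one common way to do this, with the first frame repeated as left padding and the last frame repeated at the tail. The function `apply_lfr` and the padding details are assumptions for illustration and may differ from what the crate does internally.

```rust
/// Low Frame Rate stacking: concatenate `m` consecutive frames, advancing by `n`.
/// The sequence is left-padded with (m - 1) / 2 copies of the first frame and
/// the tail is filled by repeating the last frame.
fn apply_lfr(frames: &[Vec<f32>], m: usize, n: usize) -> Vec<Vec<f32>> {
    assert!(m >= 1 && n >= 1, "m and n must be positive");
    if frames.is_empty() {
        return Vec::new();
    }
    let t = frames.len();
    let left_pad = (m - 1) / 2;
    let out_len = (t + n - 1) / n; // ceil(t / n)
    let mut out = Vec::with_capacity(out_len);

    for i in 0..out_len {
        let mut stacked = Vec::with_capacity(m * frames[0].len());
        for j in 0..m {
            // Position in the left-padded sequence, mapped back to a source
            // frame and clamped to the valid range.
            let padded_idx = i * n + j;
            let src = padded_idx.saturating_sub(left_pad).min(t - 1);
            stacked.extend_from_slice(&frames[src]);
        }
        out.push(stacked);
    }
    out
}
```

Called with m = 7 and n = 6 on ten one-dimensional frames holding the values 0 to 9, this produces two stacked frames: one built from source frames 0, 0, 0, 0, 1, 2, 3 and one from frames 3 to 9.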