NeMo CTC Models

NeMo CTC models are developed by NVIDIA and are aimed primarily at English speech recognition. They use Connectionist Temporal Classification (CTC), which decodes without an autoregressive loop, so recognition is fast and well suited to streaming.

Model Architecture

NeMo CTC models use a simple, efficient architecture:

Model (model.onnx or model.int8.onnx) – Single neural network
Tokens (tokens.txt) – Token vocabulary

CTC models are faster than encoder-decoder models because they don’t require autoregressive decoding.
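
To illustrate why no autoregressive loop is needed, here is a minimal sketch of greedy CTC decoding: pick the best token for each frame independently, collapse consecutive repeats, and drop the blank symbol. The toy vocabulary, the blank ID of 0, and the function name are assumptions made for this example; they are not taken from the library or from any particular model's tokens.txt.

```typescript
// Greedy CTC decoding sketch: per-frame argmax, collapse repeats, drop blanks.
// `blankId` and the toy vocabulary below are illustrative assumptions.
function ctcGreedyDecode(logits: number[][], blankId: number): number[] {
  const ids: number[] = [];
  let prev = -1;
  for (const frame of logits) {
    // Argmax over the vocabulary for this frame (no dependence on earlier outputs)
    let best = 0;
    for (let i = 1; i < frame.length; i++) {
      if (frame[i] > frame[best]) best = i;
    }
    // Emit only when the token differs from the previous frame and is not blank
    if (best !== prev && best !== blankId) ids.push(best);
    prev = best;
  }
  return ids;
}

const vocab = ['<blk>', 'h', 'i'];
const frames = [
  [0.1, 0.8, 0.1],   // h
  [0.1, 0.7, 0.2],   // h again (collapsed into the previous h)
  [0.9, 0.05, 0.05], // blank
  [0.1, 0.1, 0.8],   // i
];
const text = ctcGreedyDecode(frames, 0).map((id) => vocab[id]).join('');
console.log(text);
```

Because each frame is scored on its own, the whole pass is a single loop over the network output, in contrast to encoder-decoder models that generate one token at a time.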

When to Use

English Streaming

Real-time English transcription with low latency

Live Captions

English subtitles for videos or meetings

Fast Recognition

Quick batch transcription of English audio

Voice Assistants

English voice interfaces and commands

Supported Languages

NeMo CTC models are primarily designed for:

English (US, UK, and other variants)
Some multilingual variants available (check download page)

For other languages, consider Whisper, Paraformer (Chinese), or multilingual transducer models.

Performance Characteristics

Aspect	Rating	Notes
Streaming	✅ Excellent	Native streaming support with low latency
Accuracy	⭐⭐⭐⭐⭐	Very high accuracy for English
Speed	⭐⭐⭐⭐⭐	Fast CTC decoding
Memory	⭐⭐⭐⭐⭐	Low memory footprint
Model Size	Small-Medium	Typically 50-150 MB

Download Links

NeMo CTC Models

Browse and download pretrained NeMo CTC models

Configuration Example

Offline Transcription

import { createSTT } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-nemo-ctc-en-citrinet-512'
  },
  modelType: 'nemo_ctc', // or 'auto'
  preferInt8: true,
  numThreads: 2,
});

const result = await stt.transcribeFile('/path/to/audio.wav');
console.log('Transcription:', result.text);

await stt.destroy();

Streaming Recognition

import { createStreamingSTT } from 'react-native-sherpa-onnx/stt';

const engine = await createStreamingSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-streaming-nemo-ctc-en'
  },
  modelType: 'nemo_ctc',
  enableEndpoint: true,
  numThreads: 2,
});

const stream = await engine.createStream();

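// Feed audio chunks (repeat this call as new audio arrives)
// getPcmSamplesFromMic() is a placeholder for your own audio capture code
const samples = getPcmSamplesFromMic(); // float[] in [-1, 1]
const { result, isEndpoint } = await stream.processAudioChunk(samples, 16000); // 16000 = sample rate in Hz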

console.log('Partial result:', result.text);
if (isEndpoint) {
  console.log('Utterance ended');
}

await stream.release();
await engine.destroy();
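
The example above expects float samples in [-1, 1], while many capture libraries deliver 16-bit PCM. A minimal sketch of the conversion and of splitting a buffer into fixed-size chunks follows. Both helper names are hypothetical and not part of react-native-sherpa-onnx, and whether processAudioChunk accepts plain arrays or typed arrays is an assumption to check against the STT API reference.

```typescript
// Hypothetical helpers, not part of the library.

// Convert signed 16-bit PCM to floats in [-1, 1].
function int16ToFloat(pcm: Int16Array): number[] {
  const out: number[] = [];
  for (let i = 0; i < pcm.length; i++) {
    out.push(pcm[i] / 32768);
  }
  return out;
}

// Split samples into chunks of `chunkSize`; the last chunk may be shorter.
function splitIntoChunks(samples: number[], chunkSize: number): number[][] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  const chunks: number[][] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    chunks.push(samples.slice(start, start + chunkSize));
  }
  return chunks;
}

const pcm = new Int16Array([0, 16384, -32768, 8192, -16384]);
const floats = int16ToFloat(pcm);
const chunks = splitIntoChunks(floats, 2);
console.log(floats, chunks.length);
```

At 16 kHz, a chunk size of 1600 samples corresponds to 100 ms of audio; each chunk would then be passed to stream.processAudioChunk(chunk, 16000) in turn.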

With Hardware Acceleration

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/nemo-ctc-en' },
  modelType: 'nemo_ctc',
  provider: 'nnapi', // Android NNAPI
  numThreads: 4,
});

Model Detection

NeMo CTC models are detected by:

Folder name containing nemo or parakeet
Presence of model.onnx (or model.int8.onnx) and tokens.txt

Expected files:

model.onnx (or model.int8.onnx)
tokens.txt
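
The two detection rules can be sketched as a small predicate, which is handy for checking a model folder before bundling it. This is an illustration of the rules as listed here, not the library's actual detection code; it assumes both conditions must hold and that the folder-name match is case-insensitive.

```typescript
// Illustrative only: mirrors the documented rules, not the library's implementation.
// Assumes both rules must hold and that name matching ignores case.
function looksLikeNemoCtc(folderName: string, files: string[]): boolean {
  const name = folderName.toLowerCase();
  const nameMatches = name.indexOf('nemo') !== -1 || name.indexOf('parakeet') !== -1;
  const hasModel =
    files.indexOf('model.onnx') !== -1 || files.indexOf('model.int8.onnx') !== -1;
  const hasTokens = files.indexOf('tokens.txt') !== -1;
  return nameMatches && hasModel && hasTokens;
}

console.log(
  looksLikeNemoCtc('sherpa-onnx-nemo-ctc-en-citrinet-512', ['model.int8.onnx', 'tokens.txt'])
);
```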

Performance Tips

Use Quantized Models

Int8 quantization typically shrinks the model file and speeds up CPU inference, usually at a small cost in accuracy:

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/nemo-ctc-en' },
  preferInt8: true, // Use model.int8.onnx if available
});
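
The effect of preferInt8 can be pictured as a simple file-selection rule. The sketch below is an assumption about the behavior implied by the comment "Use model.int8.onnx if available"; the function name is hypothetical, and the fallback to the int8 file when it is the only one present is a guess, not documented behavior.

```typescript
// Hypothetical sketch of model file selection; not the library's code.
function pickModelFile(files: string[], preferInt8: boolean): string | undefined {
  const hasInt8 = files.indexOf('model.int8.onnx') !== -1;
  const hasFull = files.indexOf('model.onnx') !== -1;
  // Prefer the quantized model when asked for and available
  if (preferInt8 && hasInt8) return 'model.int8.onnx';
  // Otherwise use the full-precision model if present
  if (hasFull) return 'model.onnx';
  // Assumed fallback: use the int8 file if it is the only model present
  return hasInt8 ? 'model.int8.onnx' : undefined;
}

console.log(pickModelFile(['model.onnx', 'model.int8.onnx', 'tokens.txt'], true));
```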

Optimize for Real-Time

For streaming applications:

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/nemo-ctc-en' },
  modelType: 'nemo_ctc',
  numThreads: 4,          // More threads can lower latency on multi-core devices
  enableEndpoint: true,   // Detect utterance boundaries
  endpointConfig: {
    rule2: {
      mustContainNonSilence: true,
      minTrailingSilence: 0.8, // 800ms of silence = end
      minUtteranceLength: 0,
    }
  },
});
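
The rule2 fields above can be read as a single boolean condition. The sketch below follows the Kaldi-style endpoint rule that the field names suggest: an endpoint fires once enough trailing silence has accumulated, the utterance is long enough, and speech has been seen if required. The types and function name are hypothetical, and how this library evaluates the rule internally is an assumption.

```typescript
// Hypothetical types; field names follow the endpointConfig example above.
interface EndpointRule {
  mustContainNonSilence: boolean;
  minTrailingSilence: number; // seconds
  minUtteranceLength: number; // seconds
}

interface EndpointState {
  containsNonSilence: boolean; // has any speech been decoded so far?
  trailingSilence: number;     // seconds of silence at the end of the audio
  utteranceLength: number;     // seconds since the utterance started
}

// Assumed Kaldi-style semantics: all three conditions must hold.
function endpointRuleFires(rule: EndpointRule, state: EndpointState): boolean {
  const speechOk = !rule.mustContainNonSilence || state.containsNonSilence;
  return (
    speechOk &&
    state.trailingSilence >= rule.minTrailingSilence &&
    state.utteranceLength >= rule.minUtteranceLength
  );
}

const rule2: EndpointRule = {
  mustContainNonSilence: true,
  minTrailingSilence: 0.8,
  minUtteranceLength: 0,
};

console.log(
  endpointRuleFires(rule2, { containsNonSilence: true, trailingSilence: 0.9, utteranceLength: 3 })
);
```

Lowering minTrailingSilence makes endpoints fire sooner, at the risk of cutting off speakers who pause mid-sentence.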

Hardware Acceleration

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/nemo-ctc-en' },
  provider: 'nnapi', // Android Neural Networks API
  // provider: 'qnn',    // Qualcomm QNN for Snapdragon devices
});

Streaming Support

Streaming: ✅ Yes

NeMo CTC models have excellent streaming support. Use createStreamingSTT() for real-time recognition with low latency.

Advantages

Fast: CTC decoding is very fast
Low Latency: Excellent for real-time applications
Streaming: Native streaming support
High Accuracy: NVIDIA-trained models with excellent English accuracy
Low Memory: Efficient single-model architecture
Mobile-Friendly: Small models suitable for mobile deployment

Limitations

English-Focused: Primarily designed for English (limited multilingual support)
No Hotwords: Does not support contextual biasing (use transducer models for hotwords)
Domain-Specific: Best for general English (specialized domains may need fine-tuning)

Parakeet Models

NeMo Parakeet is a family of streaming ASR models:

Detected with parakeet in folder name
Same nemo_ctc model type
Optimized for low latency

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/parakeet-ctc-en' }, // folder must hold a CTC export
  modelType: 'nemo_ctc', // Parakeet uses same type
});

Use Cases

Voice Commands

English voice control for apps and IoT devices

Live Captions

Real-time English subtitles for videos

Call Transcription

Transcribing English phone calls and meetings

Voice Assistants

English voice interfaces with fast response

Common Issues

Model not loading

Verify folder name contains nemo or parakeet
Check that model.onnx (or model.int8.onnx) and tokens.txt are present
Ensure sufficient device memory

Poor accuracy on non-English audio

NeMo CTC models are optimized for English
Use Whisper or Paraformer for other languages
Check if a multilingual variant is available

High latency in streaming

Increase numThreads on multi-core devices
Use preferInt8: true for quantized models
Enable hardware acceleration with provider
Adjust endpoint config for faster utterance detection
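
When tuning these settings it helps to measure rather than guess. A common metric is the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean the recognizer keeps up with incoming audio. The helper below is a generic sketch, not a library API; you would time each processAudioChunk call yourself and pass the elapsed milliseconds in.

```typescript
// Generic helper, not part of the library.
// RTF = time spent processing / duration of the audio processed.
function realTimeFactor(processingMs: number, numSamples: number, sampleRate: number): number {
  if (numSamples <= 0 || sampleRate <= 0) {
    throw new Error('numSamples and sampleRate must be positive');
  }
  const audioMs = (numSamples / sampleRate) * 1000;
  return processingMs / audioMs;
}

// Example: a 100 ms chunk (1600 samples at 16 kHz) that took 50 ms to process
const rtf = realTimeFactor(50, 1600, 16000);
console.log(rtf);
```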

Comparison with Other Models

Feature	NeMo CTC	Transducer	Whisper
Speed	Very Fast	Fast	Medium
English Accuracy	Excellent	Excellent	Very Good
Streaming	Yes	Yes	No
Hotwords	No	Yes	No
Multilingual	Limited	Varies	Excellent
Model Size	Small	Medium	Large
Latency	Very Low	Low	N/A (offline)

Next Steps

Streaming STT

Learn about real-time recognition

STT API

Detailed API documentation

Model Setup

How to download and bundle models

Execution Providers

Hardware acceleration options
