Overview
Streaming STT (also called “online” recognition) enables real-time transcription as audio is being captured. Unlike offline STT, which processes complete files, streaming STT:

- Provides partial results as you speak
- Detects end-of-utterance automatically
- Works with live microphone input
- Supports low-latency applications like voice assistants
Use streaming STT when:

- You need real-time transcription during recording
- You want to show partial results to users
- You’re building voice assistants or live captioning

Use offline STT when:

- You have complete audio files to transcribe
- You don’t need real-time results
- You’re processing pre-recorded audio
Supported Models
Only specific model types support streaming:

| Model Type | Description | Files |
|---|---|---|
| transducer | Transducer (zipformer) | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| paraformer | Paraformer streaming | encoder.onnx, decoder.onnx, tokens.txt |
| zipformer2_ctc | Zipformer2 CTC | model.onnx, tokens.txt |
| nemo_ctc | NeMo CTC | model.onnx, tokens.txt |
| tone_ctc | T-One CTC | model.onnx, tokens.txt |
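The `'auto'` detection implied by this table can be sketched from the file layout alone. The helper below is illustrative, not the library's real implementation; note that the three CTC variants share the same layout, so files alone cannot distinguish them.

```typescript
// Illustrative sketch (not the library's real implementation): infer the
// streaming model type from the files present in a model directory, per the
// table above. Returns null for unsupported (offline-only) layouts.
type StreamingModelType =
  | "transducer" | "paraformer" | "zipformer2_ctc" | "nemo_ctc" | "tone_ctc";

function detectStreamingModelType(files: string[]): StreamingModelType | null {
  const has = (f: string) => files.includes(f);
  if (!has("tokens.txt")) return null;
  // Transducer: encoder + decoder + joiner
  if (has("encoder.onnx") && has("decoder.onnx") && has("joiner.onnx")) {
    return "transducer";
  }
  // Paraformer: encoder + decoder, no joiner
  if (has("encoder.onnx") && has("decoder.onnx")) return "paraformer";
  // The CTC variants all ship a single model.onnx; the layout alone cannot
  // distinguish zipformer2_ctc / nemo_ctc / tone_ctc, so callers would pass
  // an explicit type for those. We default to zipformer2_ctc here.
  if (has("model.onnx")) return "zipformer2_ctc";
  return null;
}
```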
Quick Start
Checking Model Support
Before creating a streaming engine, check that the model supports streaming, e.g. with getOnlineTypeOrNull().

API Reference
createStreamingSTT(options)

Creates a streaming STT engine for real-time recognition. Source: src/stt/streaming.ts
Options:

- Model directory path configuration. Use { type: 'asset', path: '...' } for bundled models.
- Model type: 'transducer', 'paraformer', 'zipformer2_ctc', 'nemo_ctc', 'tone_ctc', or 'auto' to detect.
- Enable automatic end-of-utterance detection.
- endpointConfig: Fine-tune endpoint detection rules. See Endpoint Detection.
- Decoding algorithm. Beam search is slower but may be more accurate.
- Beam size for beam search decoding.
- Path to hotwords file (transducer models only).
- hotwordsScore: Hotwords boost score.
- numThreads: Number of threads for inference.
- Execution provider (e.g., 'cpu', 'qnn', 'nnapi').
- enableInputNormalization: Automatically scale audio chunks to optimal levels. Disable if your audio is already normalized.
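Putting the options together, a configuration might look like the sketch below. Only endpointConfig, numThreads, hotwordsScore, and enableInputNormalization are named elsewhere in this guide; the remaining field names (modelDir, modelType, decodingMethod, provider) and the model path are illustrative guesses, not the library's confirmed API.

```typescript
// Hypothetical options object for createStreamingSTT(). Field names other
// than endpointConfig, numThreads, hotwordsScore and enableInputNormalization
// are illustrative guesses.
const options = {
  modelDir: { type: "asset", path: "models/streaming-en" }, // bundled model (illustrative path)
  modelType: "auto",               // or 'transducer', 'paraformer', ...
  decodingMethod: "greedy_search", // beam search is slower but may be more accurate
  numThreads: 2,                   // inference threads
  provider: "cpu",                 // or 'qnn', 'nnapi'
  hotwordsScore: 1.5,              // boost for hotword matches (transducer only)
  enableInputNormalization: true,  // scale chunks toward an optimal peak
};
```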
StreamingSttEngine
The engine manages the recognizer and creates streams.

- Read-only engine identifier.
- Creates a new recognition stream. Accepts an optional hotwords string for per-stream contextual biasing.
- Releases native resources. Must be called when done.
SttStream
A stream represents one recognition session (e.g., one utterance).

acceptWaveform(samples, sampleRate)

Feed audio samples to the stream.

isReady()

Check if there’s enough audio buffered to decode.

decode()

Run decoding on buffered audio. Call when isReady() returns true.

getResult()

Get the current partial or final result.

isEndpoint()

Check if end-of-utterance was detected.

reset()

Reset stream state for reuse. Call after an endpoint or to start a new utterance.

inputFinished()

Signal that no more audio will be fed. Use when recording stops.

release()

Release native stream resources. Do not use the stream after this.

processAudioChunk(samples, sampleRate)

Convenience method that combines accept + decode + getResult in one call.

Endpoint Detection
Endpoint detection automatically determines when the user has stopped speaking.

Default Rules

Three rules are evaluated in order (first match wins):

- Rule 1: 2.4s of trailing silence (no speech required)
- Rule 2: 1.4s of trailing silence + speech detected
- Rule 3: Max utterance length of 20s
Custom Endpoint Configuration
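A custom configuration mirroring the default rules might look like the sketch below. The names minTrailingSilence, minUtteranceLength, and rule3 are referenced in Troubleshooting; the exact shape of the config object (rule1/rule2 keys) is an assumption.

```typescript
// Sketch of an endpointConfig mirroring the default rules described above.
// Field names minTrailingSilence / minUtteranceLength / rule3 come from this
// guide; the overall object shape is an assumption.
const endpointConfig = {
  rule1: { minTrailingSilence: 2.4 },  // silence only, no speech required
  rule2: { minTrailingSilence: 1.4 },  // trailing silence after detected speech
  rule3: { minUtteranceLength: 20.0 }, // hard cap on utterance length (seconds)
};
```

Raise the minTrailingSilence values if utterances get cut off mid-sentence; lower them for snappier turn-taking.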
Live Microphone Integration
For live microphone capture with automatic resampling, use the PCM Live Stream API.

Common Patterns
Typical Recognition Loop
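The loop over acceptWaveform / isReady / decode / isEndpoint described in the API reference can be sketched as follows. The stub stream at the bottom stands in for a real native SttStream so the sketch is self-contained; only the loop structure is the point.

```typescript
// Recognition loop sketch over the SttStream API described above.
interface SttStream {
  acceptWaveform(samples: Float32Array, sampleRate: number): void;
  isReady(): boolean;
  decode(): void;
  getResult(): { text: string };
  isEndpoint(): boolean;
  reset(): void;
}

function runLoop(stream: SttStream, chunks: Float32Array[], sampleRate: number): string[] {
  const utterances: string[] = [];
  for (const chunk of chunks) {
    stream.acceptWaveform(chunk, sampleRate);
    while (stream.isReady()) stream.decode(); // drain buffered audio
    if (stream.isEndpoint()) {
      const { text } = stream.getResult();    // final result for this utterance
      if (text) utterances.push(text);
      stream.reset();                         // ready for the next utterance
    }
  }
  return utterances;
}

// Minimal stub: "decodes" one word per chunk and endpoints every third chunk.
function makeStubStream(): SttStream {
  let fed = 0, decoded = 0, words: string[] = [];
  return {
    acceptWaveform: () => { fed++; },
    isReady: () => decoded < fed,
    decode: () => { decoded++; words.push(`w${decoded}`); },
    getResult: () => ({ text: words.join(" ") }),
    isEndpoint: () => fed % 3 === 0,
    reset: () => { words = []; },
  };
}
```

In a real app the chunks would come from the microphone; after an endpoint fires, getResult() holds the final text and reset() prepares the stream for the next utterance.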
Using processAudioChunk (Simplified)
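With processAudioChunk() the same pattern collapses to one call per chunk, which also reduces JS-to-native bridge traffic. A sketch, again with a stub standing in for the real stream:

```typescript
// Simplified loop sketch using processAudioChunk(), which bundles
// accept + decode + getResult into a single call.
interface ChunkStream {
  processAudioChunk(samples: Float32Array, sampleRate: number): { text: string };
  isEndpoint(): boolean;
  reset(): void;
}

function transcribeChunks(stream: ChunkStream, chunks: Float32Array[]): string {
  let lastText = "";
  for (const chunk of chunks) {
    lastText = stream.processAudioChunk(chunk, 16000).text; // partial result
    if (stream.isEndpoint()) break;                          // utterance finished
  }
  return lastText;
}

// Stub: accumulates one word per chunk, endpoints after the second chunk.
function makeChunkStub(): ChunkStream {
  let n = 0;
  return {
    processAudioChunk: () =>
      ({ text: Array.from({ length: ++n }, (_, i) => `w${i + 1}`).join(" ") }),
    isEndpoint: () => n >= 2,
    reset: () => { n = 0; },
  };
}
```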
Multiple Streams
Create multiple streams from one engine (e.g., for different channels).

Hotwords in Streaming
For transducer models, you can use hotwords for contextual biasing.

Input Normalization
By default, processAudioChunk() applies adaptive normalization to handle varying microphone levels. Normalization scales each chunk so the peak is around 0.8, which helps with quiet iOS mics or varying Android devices.
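The peak scaling can be illustrated with a standalone sketch (not the library's exact algorithm):

```typescript
// Sketch of peak normalization as described above: scale a chunk so its peak
// lands around 0.8. Not the library's exact code, just the idea.
function normalizeChunk(samples: Float32Array, targetPeak = 0.8): Float32Array {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    peak = Math.max(peak, Math.abs(samples[i]));
  }
  if (peak === 0) return samples;  // silence: nothing to scale
  const gain = targetPeak / peak;  // e.g. a quiet mic peaking at 0.1 gets 8x gain
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] * gain;
  return out;
}
```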
Performance Tips
Threading
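numThreads is referenced throughout this guide; a typical tweak might look like this (the surrounding object shape is illustrative):

```typescript
// Illustrative: more inference threads can help on multi-core devices, at the
// cost of contention with the rest of the app. Try small values first.
const threadingOptions = { numThreads: 4 }; // e.g. 2-4 on modern phones
```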
Hardware Acceleration
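The provider values 'cpu', 'qnn', and 'nnapi' appear in the API reference above; a fallback chain that prefers an accelerator and degrades to CPU is one reasonable pattern (the helper below is illustrative, not part of the library):

```typescript
// Illustrative fallback chain: prefer a hardware execution provider where
// available ('qnn' on Qualcomm, 'nnapi' on Android), else fall back to 'cpu'.
type Provider = "cpu" | "qnn" | "nnapi";

function pickProvider(available: Provider[]): Provider {
  const preferred: Provider[] = ["qnn", "nnapi", "cpu"];
  for (const p of preferred) {
    if (available.includes(p)) return p;
  }
  return "cpu";
}
```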
Reduce Latency
- Use processAudioChunk() instead of separate method calls
- Keep audio chunk sizes reasonable (e.g., 0.1s - 0.5s worth of samples)
- Increase numThreads on multi-core devices
- Use hardware acceleration when available
Troubleshooting
Error: Model type not supported for streaming

Only transducer, paraformer, zipformer2_ctc, nemo_ctc, and tone_ctc support streaming. Whisper, SenseVoice, and Dolphin are offline-only. Use getOnlineTypeOrNull() to check support.

Poor recognition quality
- Ensure audio is 16 kHz mono
- Check microphone permissions and quality
- Verify audio samples are in range [-1, 1]
- Try disabling enableInputNormalization if audio is already normalized
- Increase hotwordsScore for better keyword recognition
Endpoint triggers too early/late

Adjust endpointConfig rules:

- Too early: increase minTrailingSilence
- Too late: decrease minTrailingSilence
- For long utterances: increase minUtteranceLength in rule3
High latency or stuttering

- Reduce audio chunk size
- Increase numThreads
- Use hardware acceleration (QNN, NNAPI)
- Use processAudioChunk() to reduce bridge calls
Next Steps
- Offline STT: Transcribe complete audio files
- Model Setup: Learn how to bundle and load models
- Execution Providers: Hardware acceleration options
- Text-to-Speech: Generate speech from text