Overview
Streaming STT enables real-time speech recognition with incremental results and automatic endpoint detection. It is well suited to live transcription from microphones or other continuous audio streams.

Quick Start
Convenient Single-Call API
Process audio chunks with one call:

Supported Model Types
Only streaming-capable models work with this API:

| Model Type | Description |
|---|---|
| transducer | Zipformer streaming transducer |
| paraformer | Paraformer streaming |
| zipformer2_ctc | Zipformer2 CTC |
| nemo_ctc | NVIDIA NeMo CTC |
| tone_ctc | Tone CTC |
Check Model Compatibility
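A minimal sketch of a compatibility check against the table above. The union type and helper are defined locally here for illustration; only the model-type strings themselves come from this page.

```typescript
// Streaming-capable model types, taken from the table above. The local type
// name mirrors OnlineSTTModelType but is an illustrative stand-in.
type StreamingModelType =
  | 'transducer'
  | 'paraformer'
  | 'zipformer2_ctc'
  | 'nemo_ctc'
  | 'tone_ctc';

const STREAMING_TYPES: ReadonlySet<string> = new Set<StreamingModelType>([
  'transducer', 'paraformer', 'zipformer2_ctc', 'nemo_ctc', 'tone_ctc',
]);

// Returns true when the given model type supports the streaming API.
function isStreamingModelType(modelType: string): boolean {
  return STREAMING_TYPES.has(modelType);
}
```

Checking before engine creation lets you fail fast with a clear message instead of a native initialization error.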
Engine Initialization
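A sketch of an initialization options object. The option names mirror the table below; the interface shape, the `modelDir` field, and the commented-out factory call are illustrative assumptions, not confirmed API.

```typescript
// Option names come from the "Initialization Options" table below; the
// interface itself and createOnlineSTTEngine are assumed names.
interface OnlineSTTOptions {
  modelPath: { modelDir: string };  // stand-in for ModelPathConfig
  modelType?: string;               // 'auto' detects the architecture
  enableEndpoint?: boolean;
  decodingMethod?: 'greedy_search' | 'modified_beam_search';
  maxActivePaths?: number;
  numThreads?: number;
  enableInputNormalization?: boolean;
}

const options: OnlineSTTOptions = {
  modelPath: { modelDir: '/models/zipformer-streaming' },
  modelType: 'auto',
  enableEndpoint: true,
  decodingMethod: 'greedy_search',
  numThreads: 1,
};

// const engine = await createOnlineSTTEngine(options); // hypothetical factory
```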
Initialization Options
| Option | Type | Description |
|---|---|---|
| modelPath | ModelPathConfig | Path to model directory |
| modelType | OnlineSTTModelType \| 'auto' | Model architecture |
| enableEndpoint | boolean | Enable end-of-utterance detection (default: true) |
| endpointConfig | EndpointConfig | Endpoint detection rules |
| decodingMethod | string | 'greedy_search' or 'modified_beam_search' |
| maxActivePaths | number | Beam search size (default: 4) |
| hotwordsFile | string | Path to hotwords file (transducer only) |
| hotwordsScore | number | Hotwords boost score (default: 1.5) |
| numThreads | number | Inference threads (default: 1) |
| provider | string | Execution provider |
| enableInputNormalization | boolean | Auto-scale input audio (default: true) |
Stream Lifecycle
Create Stream
Create one stream per recognition session:

Feed Audio
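A sketch combining the create and feed steps, written against assumed method names (`createStream`, `acceptWaveform`); only the one-stream-per-session rule comes from this page.

```typescript
// Illustrative interfaces; the method names below are assumptions about the
// binding's API, not confirmed by this page.
interface STTStream {
  acceptWaveform(sampleRate: number, samples: Float32Array): void;
}
interface STTEngine {
  createStream(): STTStream;
}

// Feed one PCM chunk (float samples) into an existing stream.
function feedChunk(stream: STTStream, chunk: Float32Array, sampleRate = 16000): void {
  stream.acceptWaveform(sampleRate, chunk);
}
```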
Signal End of Input
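When the audio source closes, the decoder needs to be told no more audio is coming so it can flush a final result. A minimal sketch, assuming a method named `inputFinished()` (not confirmed by this page):

```typescript
// inputFinished() is an assumed method name for signaling end of input.
interface FinishableStream { inputFinished(): void; }

function endAudio(stream: FinishableStream): void {
  stream.inputFinished(); // lets the decoder emit trailing context as final text
}
```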
Reset Stream
Reuse the same stream for the next utterance:

Release Stream
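A lifecycle sketch covering both steps above, under the assumption that the stream exposes `reset()` and `release()` methods (names not confirmed by this page):

```typescript
// reset() reuses a stream for the next utterance; release() frees its native
// resources. Both method names are illustrative assumptions.
interface ReusableStream {
  reset(): void;
  release(): void;
}

function finishUtterance(stream: ReusableStream, moreUtterancesExpected: boolean): void {
  if (moreUtterancesExpected) {
    stream.reset();    // keep the stream; clear decoder state for the next utterance
  } else {
    stream.release();  // session over: free native resources
  }
}
```

Resetting is cheaper than releasing and recreating a stream, so prefer it inside a long recording session.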
Free resources when done:

Endpoint Detection
Automatic detection of when an utterance ends:

Using Endpoints
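A sketch of reacting to an endpoint: consume the final text, then reset for the next utterance. The result shape (`text`, `isEndpoint`) and the `reset()` method are assumptions consistent with the rest of this page.

```typescript
// Illustrative result shape and method name; not confirmed API.
interface EndpointResult { text: string; isEndpoint: boolean; }
interface EndpointStream { reset(): void; }

function handleResult(
  stream: EndpointStream,
  result: EndpointResult,
  onFinal: (text: string) => void,
): void {
  if (result.isEndpoint) {
    if (result.text.length > 0) onFinal(result.text); // utterance complete
    stream.reset();                                   // start the next utterance
  }
}
```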
Typical Recording Loop
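A recording-loop sketch tying the pieces together. Only `processAudioChunk()` is named on this page; the result shape, `reset()`, and the chunk source are illustrative assumptions standing in for your microphone pipeline and UI callbacks.

```typescript
// Assumed shapes; only processAudioChunk is named by this page.
interface LoopResult { text: string; isEndpoint: boolean; }
interface LoopStream {
  processAudioChunk(chunk: Float32Array): LoopResult;
  reset(): void;
}

// Pull chunks from the audio source, surface partial text as it arrives,
// and finalize + reset whenever an endpoint is detected.
function runLoop(
  stream: LoopStream,
  chunks: Iterable<Float32Array>,
  onPartial: (text: string) => void,
  onFinal: (text: string) => void,
): void {
  for (const chunk of chunks) {
    const result = stream.processAudioChunk(chunk);
    if (result.isEndpoint) {
      if (result.text) onFinal(result.text);
      stream.reset();
    } else if (result.text) {
      onPartial(result.text);
    }
  }
}
```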
Input Normalization
By default, processAudioChunk() applies adaptive normalization (scaling the peak to ~0.8) to handle varying device input levels.
Disable if audio is pre-normalized:
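A sketch of switching the scaling off at engine creation. Only the `enableInputNormalization` option name comes from the table above; the surrounding object shape is assumed.

```typescript
// enableInputNormalization comes from the initialization table; the rest of
// this options object is an illustrative assumption.
const preNormalizedOptions = {
  modelPath: { modelDir: '/models/zipformer-streaming' },
  enableInputNormalization: false, // samples pass through unmodified
};
```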
Multiple Streams
Create multiple streams from one engine:

Hotwords for Streaming
For transducer models, boost specific phrases:

Result Fields
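A sketch of the fields a streaming result typically carries. Only the transcript text and endpoint status are implied by this page; per-token output and timestamps are common in streaming recognizers but are assumptions here.

```typescript
// Illustrative result shape; optional fields are assumptions.
interface StreamingResult {
  text: string;          // transcript so far for the current utterance
  isEndpoint: boolean;   // true when endpoint detection fired on this chunk
  tokens?: string[];     // per-token output, if the model exposes it
  timestamps?: number[]; // per-token start times in seconds, if available
}

const example: StreamingResult = { text: 'hello world', isEndpoint: false };
```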
Performance Tips
Threading
Execution Providers
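A tuning sketch covering both options above. `numThreads` and `provider` come from the initialization table; the specific provider values shown ('cpu', and platform accelerators) are typical for ONNX-based engines but are assumptions here.

```typescript
// numThreads and provider are option names from the initialization table;
// the values and surrounding shape are illustrative assumptions.
const tunedOptions = {
  modelPath: { modelDir: '/models/zipformer-streaming' },
  numThreads: 2,   // small streaming models rarely benefit from many threads
  provider: 'cpu', // switch to an accelerator provider only after measuring
};
```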
Chunk Size
Balance between latency and overhead:
- Too small: Frequent bridge calls, higher CPU overhead
- Too large: Delayed partial results
- Recommended: 100-200ms chunks (1600-3200 samples at 16 kHz)
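The conversion behind the recommendation above, as a small helper:

```typescript
// Convert a chunk duration in milliseconds to a sample count.
function chunkSamples(durationMs: number, sampleRate = 16000): number {
  return Math.round((durationMs / 1000) * sampleRate);
}

// 100 ms at 16 kHz → 1600 samples; 200 ms → 3200 samples.
```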
Error Handling
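An error-handling sketch that also covers cleanup: wrap the session in try/finally so the stream is always released, even when decoding throws. All names except the try/finally pattern itself are illustrative assumptions.

```typescript
// Assumed shapes; only processAudioChunk is named by this page.
interface SafeStream {
  processAudioChunk(chunk: Float32Array): { text: string };
  release(): void;
}

function transcribeSafely(stream: SafeStream, chunks: Float32Array[]): string {
  let text = '';
  try {
    for (const chunk of chunks) {
      text = stream.processAudioChunk(chunk).text;
    }
  } finally {
    stream.release(); // runs on success and on error alike
  }
  return text;
}
```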
Cleanup
Always release resources:

Next Steps
Offline STT
Batch transcription of audio files
Model Setup
Download and configure streaming models