
Overview

Streaming STT (also called “online” recognition) transcribes audio in real time as it is captured. Unlike offline STT, which processes complete files, streaming STT:
  • Provides partial results as you speak
  • Detects end-of-utterance automatically
  • Works with live microphone input
  • Supports low-latency applications like voice assistants
Use streaming STT when:
  • You need real-time transcription during recording
  • You want to show partial results to users
  • You’re building voice assistants or live captioning
Use offline STT when:
  • You have complete audio files to transcribe
  • You don’t need real-time results
  • You’re processing pre-recorded audio

Supported Models

Only specific model types support streaming:
Model Type        Description              Files
----------        -----------              -----
transducer        Transducer (zipformer)   encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt
paraformer        Paraformer streaming     encoder.onnx, decoder.onnx, tokens.txt
zipformer2_ctc    Zipformer2 CTC           model.onnx, tokens.txt
nemo_ctc          NeMo CTC                 model.onnx, tokens.txt
tone_ctc          T-One CTC                model.onnx, tokens.txt
Offline-only models like Whisper, SenseVoice, and Dolphin do not support streaming. Use getOnlineTypeOrNull() to check if a model supports streaming.

Quick Start

import { createStreamingSTT } from 'react-native-sherpa-onnx/stt';

// Create streaming engine
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/sherpa-onnx-streaming-zipformer-en' },
  modelType: 'transducer',  // or 'auto' to detect
  enableEndpoint: true,
});

// Create a stream
const stream = await engine.createStream();

// Feed audio chunks from microphone
const samples = getPcmSamplesFromMic(); // Float array [-1, 1]
await stream.acceptWaveform(samples, 16000);

// Check if ready to decode
if (await stream.isReady()) {
  await stream.decode();
  const result = await stream.getResult();
  console.log('Partial:', result.text);
  
  // Check for end of utterance
  if (await stream.isEndpoint()) {
    console.log('Utterance ended');
    await stream.reset();  // Reset for next utterance
  }
}

// Cleanup
await stream.release();
await engine.destroy();

Checking Model Support

Before creating a streaming engine, check if the model supports streaming:
import { detectSttModel, getOnlineTypeOrNull } from 'react-native-sherpa-onnx/stt';

// Detect model type
const result = await detectSttModel(
  { type: 'asset', path: 'models/my-model' }
);

// Check if streaming is supported
const onlineType = getOnlineTypeOrNull(result.modelType);

if (onlineType !== null) {
  // Model supports streaming
  console.log('Can use streaming with type:', onlineType);
  const engine = await createStreamingSTT({
    modelPath: { type: 'asset', path: 'models/my-model' },
    modelType: onlineType,
  });
} else {
  // Model is offline-only
  console.log('Model does not support streaming, use createSTT() instead');
}

API Reference

createStreamingSTT(options)

Creates a streaming STT engine for real-time recognition.
src/stt/streaming.ts
export async function createStreamingSTT(
  options: StreamingSttInitOptions
): Promise<StreamingSttEngine>;
Options:

modelPath (ModelPathConfig, required)
  Model directory path configuration. Use { type: 'asset', path: '...' } for bundled models.

modelType (OnlineSTTModelType | 'auto', default: 'auto')
  Model type: 'transducer', 'paraformer', 'zipformer2_ctc', 'nemo_ctc', 'tone_ctc', or 'auto' to detect.

enableEndpoint (boolean, default: true)
  Enable automatic end-of-utterance detection.

endpointConfig (EndpointConfig)
  Fine-tune endpoint detection rules. See Endpoint Detection.

decodingMethod ('greedy_search' | 'modified_beam_search', default: 'greedy_search')
  Decoding algorithm. Beam search is slower but may be more accurate.

maxActivePaths (number, default: 4)
  Beam size for beam search decoding.

hotwordsFile (string)
  Path to a hotwords file (transducer models only).

hotwordsScore (number, default: 1.5)
  Hotwords boost score.

numThreads (number, default: 1)
  Number of threads for inference.

provider (string, default: 'cpu')
  Execution provider (e.g., 'cpu', 'qnn', 'nnapi').

enableInputNormalization (boolean, default: true)
  Automatically scale audio chunks to optimal levels. Disable if your audio is already normalized.

StreamingSttEngine

The engine manages the recognizer and creates streams.
instanceId (string)
  Read-only engine identifier.

createStream ((hotwords?: string) => Promise<SttStream>)
  Creates a new recognition stream. Accepts an optional hotwords string for per-stream contextual biasing.

destroy (() => Promise<void>)
  Releases native resources. Must be called when done.

SttStream

A stream represents one recognition session (e.g., one utterance).

acceptWaveform(samples, sampleRate)

Feed audio samples to the stream.
await stream.acceptWaveform(
  samples,    // Float32Array or number[] in [-1, 1]
  16000       // Sample rate in Hz
);

isReady()

Check if there’s enough audio buffered to decode.
const ready = await stream.isReady();
if (ready) {
  await stream.decode();
}

decode()

Run decoding on buffered audio. Call when isReady() returns true.
await stream.decode();

getResult()

Get the current partial or final result.
const result = await stream.getResult();
console.log('Text:', result.text);
console.log('Tokens:', result.tokens);
console.log('Timestamps:', result.timestamps);

isEndpoint()

Check if end-of-utterance was detected.
if (await stream.isEndpoint()) {
  console.log('Speaker stopped talking');
  await stream.reset();  // Reset for next utterance
}

reset()

Reset stream state for reuse. Call after endpoint or to start a new utterance.
await stream.reset();

inputFinished()

Signal that no more audio will be fed. Use when recording stops.
await stream.inputFinished();
// Decode any remaining audio
while (await stream.isReady()) {
  await stream.decode();
}
const finalResult = await stream.getResult();

release()

Release native stream resources. Do not use the stream after this.
await stream.release();

processAudioChunk(samples, sampleRate)

Convenience method that combines accept + decode + getResult in one call.
const { result, isEndpoint } = await stream.processAudioChunk(samples, 16000);
console.log(result.text);
if (isEndpoint) {
  console.log('Utterance ended');
}
Use processAudioChunk() to reduce bridge round-trips from 5 calls to 1 per audio chunk.

Endpoint Detection

Endpoint detection automatically determines when the user has stopped speaking.

Default Rules

Three rules are evaluated in order (first match wins):
  1. Rule 1: 2.4s of trailing silence (no speech required)
  2. Rule 2: 1.4s of trailing silence + speech detected
  3. Rule 3: Max utterance length of 20s
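For reference, the defaults above can be written out as an EndpointConfig-shaped object. The field names follow the custom configuration example in the next subsection; treat this as an illustration of the documented defaults, not library source:

```typescript
// Documented default endpoint rules, expressed in the EndpointConfig shape.
const defaultEndpointConfig = {
  rule1: { mustContainNonSilence: false, minTrailingSilence: 2.4, minUtteranceLength: 0 },
  rule2: { mustContainNonSilence: true, minTrailingSilence: 1.4, minUtteranceLength: 0 },
  rule3: { mustContainNonSilence: false, minTrailingSilence: 0, minUtteranceLength: 20 },
};
```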

Custom Endpoint Configuration

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'transducer',
  enableEndpoint: true,
  endpointConfig: {
    rule1: {
      mustContainNonSilence: false,
      minTrailingSilence: 1.0,    // Shorter = faster end
      minUtteranceLength: 0,
    },
    rule2: {
      mustContainNonSilence: true,
      minTrailingSilence: 0.8,
      minUtteranceLength: 0,
    },
    rule3: {
      mustContainNonSilence: false,
      minTrailingSilence: 0,
      minUtteranceLength: 30,      // Max 30s per utterance
    },
  },
});

Live Microphone Integration

For live microphone capture with automatic resampling, use the PCM Live Stream API:
import { createStreamingSTT } from 'react-native-sherpa-onnx/stt';
import { createPcmLiveStream } from 'react-native-sherpa-onnx/audio';

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'transducer',
});

const stream = await engine.createStream();

// Create microphone stream with automatic resampling to 16kHz
const micStream = await createPcmLiveStream({
  sampleRate: 16000,  // Resample to 16kHz
  channels: 1,        // Mono
});

micStream.onData = async (chunk) => {
  // Feed to STT
  const { result, isEndpoint } = await stream.processAudioChunk(
    chunk.samples,
    chunk.sampleRate
  );
  
  if (result.text) {
    console.log('Partial:', result.text);
  }
  
  if (isEndpoint) {
    console.log('Final:', result.text);
    await stream.reset();
  }
};

// Start recording
await micStream.start();

// Later: stop recording
await micStream.stop();
await stream.release();
await engine.destroy();

Common Patterns

Typical Recognition Loop

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'transducer',
});

const stream = await engine.createStream();

async function onAudioChunk(samples: number[], sampleRate: number) {
  await stream.acceptWaveform(samples, sampleRate);
  
  while (await stream.isReady()) {
    await stream.decode();
    const result = await stream.getResult();
    
    if (result.text) {
      updateUI(result.text);
    }
    
    if (await stream.isEndpoint()) {
      console.log('Final result:', result.text);
      await stream.reset();
      break;
    }
  }
}

// Feed chunks from microphone
microphone.onData = (chunk) => onAudioChunk(chunk.samples, chunk.sampleRate);

Using processAudioChunk (Simplified)

const stream = await engine.createStream();

for (const chunk of audioChunks) {
  const { result, isEndpoint } = await stream.processAudioChunk(
    chunk.samples,
    16000
  );
  
  if (result.text) {
    setTranscript((prev) => prev + ' ' + result.text);
  }
  
  if (isEndpoint) {
    console.log('Utterance complete');
    break;
  }
}

await stream.release();

Multiple Streams

Create multiple streams from one engine (e.g., for different channels):
const engine = await createStreamingSTT({ /* ... */ });

const stream1 = await engine.createStream();
const stream2 = await engine.createStream();

// Use streams independently
// ...

await stream1.release();
await stream2.release();
await engine.destroy();

Hotwords in Streaming

For transducer models, you can use hotwords for contextual biasing:
// At engine creation
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'transducer',
  hotwordsFile: '/path/to/hotwords.txt',
  hotwordsScore: 1.5,
});

// Or per stream
const stream = await engine.createStream(
  'REACT NATIVE 2.0\nSHERPA ONNX\nTURBOMODULE 1.5'
);
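The per-stream hotwords argument above is a single newline-separated string; building it from an array keeps the phrase list readable (illustrative only):

```typescript
// Build the newline-separated hotwords string from a phrase list.
const phrases = ['REACT NATIVE 2.0', 'SHERPA ONNX', 'TURBOMODULE 1.5'];
const hotwords = phrases.join('\n');
```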

Input Normalization

By default, processAudioChunk() applies adaptive normalization to handle varying microphone levels:
// Default: normalization enabled
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  enableInputNormalization: true,  // Default
});

// Disable if your audio is already normalized
const engine2 = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  enableInputNormalization: false,
});
Normalization scales each chunk so its peak is around 0.8, which helps with quiet iOS microphones and the varying input levels across Android devices.
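As a rough illustration of what peak normalization does (a sketch, not the library’s actual implementation): scale the chunk so its loudest sample lands near the target level.

```typescript
// Illustrative peak normalization: scale samples so the absolute peak
// is close to targetPeak (0.8 here). Silent chunks are returned unchanged.
function normalizeChunk(samples: number[], targetPeak = 0.8): number[] {
  const peak = Math.max(...samples.map((s) => Math.abs(s)));
  if (peak === 0) return samples;
  const gain = targetPeak / peak;
  return samples.map((s) => s * gain);
}

// A quiet chunk peaking at 0.2 is boosted so its peak becomes ~0.8.
const boosted = normalizeChunk([0.1, -0.2, 0.05]);
```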

Performance Tips

Threading

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  numThreads: 2,  // Increase for faster decoding
});

Hardware Acceleration

import { getQnnSupport } from 'react-native-sherpa-onnx';

const qnnSupport = await getQnnSupport();
if (qnnSupport.canInit) {
  const engine = await createStreamingSTT({
    modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
    provider: 'qnn',  // Use Qualcomm NPU
  });
}

Reduce Latency

  • Use processAudioChunk() instead of separate method calls
  • Keep audio chunk sizes reasonable (e.g., 0.1s - 0.5s worth of samples)
  • Increase numThreads on multi-core devices
  • Use hardware acceleration when available
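For the chunk-size guideline above, a quick back-of-envelope: at 16 kHz, 0.1 s to 0.5 s of audio corresponds to 1,600 to 8,000 samples (chunkSamples is a hypothetical helper, shown only for illustration):

```typescript
// Convert a chunk duration in seconds to a sample count at a given rate.
function chunkSamples(durationSec: number, sampleRate: number): number {
  return Math.round(durationSec * sampleRate);
}

// At 16 kHz, the recommended 0.1 s - 0.5 s range is 1600 - 8000 samples.
const minChunk = chunkSamples(0.1, 16000); // 1600
const maxChunk = chunkSamples(0.5, 16000); // 8000
```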

Troubleshooting

Model doesn’t support streaming
  Only transducer, paraformer, zipformer2_ctc, nemo_ctc, and tone_ctc support streaming. Whisper, SenseVoice, and Dolphin are offline-only. Use getOnlineTypeOrNull() to check support.

Poor recognition accuracy
  • Ensure audio is 16 kHz mono
  • Check microphone permissions and quality
  • Verify audio samples are in range [-1, 1]
  • Try disabling enableInputNormalization if audio is already normalized
  • Increase hotwordsScore for better keyword recognition

Endpoint triggers too early or too late
  Adjust the endpointConfig rules:
  • Too early: increase minTrailingSilence
  • Too late: decrease minTrailingSilence
  • For long utterances: increase minUtteranceLength in rule3

High latency
  • Reduce audio chunk size
  • Increase numThreads
  • Use hardware acceleration (QNN, NNAPI)
  • Use processAudioChunk() to reduce bridge calls
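To rule out out-of-range input quickly, a minimal validation sketch (isValidPcm is a hypothetical helper, not part of the library):

```typescript
// Returns true when every sample is a finite float within [-1, 1].
function isValidPcm(samples: number[]): boolean {
  return samples.every((s) => Number.isFinite(s) && s >= -1 && s <= 1);
}

const ok = isValidPcm([0.25, -0.9, 1.0]);  // true
const bad = isValidPcm([0.25, 32767]);     // false: looks like raw 16-bit integer PCM
```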

Next Steps

Offline STT

Transcribe complete audio files

Model Setup

Learn how to bundle and load models

Execution Providers

Hardware acceleration options

Text-to-Speech

Generate speech from text
