Overview

Streaming STT enables real-time speech recognition with incremental results and automatic endpoint detection, making it well suited for live transcription from microphones or other continuous audio streams.

Quick Start

import { createStreamingSTT } from 'react-native-sherpa-onnx/stt';

// 1) Create streaming engine
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'auto',
  enableEndpoint: true,
});

// 2) Create a stream (one per session)
const stream = await engine.createStream();

// 3) Feed audio chunks
const samples = getPcmSamplesFromMic(); // float[] in [-1, 1]
await stream.acceptWaveform(samples, 16000);

if (await stream.isReady()) {
  await stream.decode();
  const result = await stream.getResult();
  console.log('Partial:', result.text);
  
  if (await stream.isEndpoint()) {
    console.log('Utterance ended');
  }
}

// 4) Cleanup
await stream.release();
await engine.destroy();

Convenient Single-Call API

Process audio chunks with one call:
const { result, isEndpoint } = await stream.processAudioChunk(samples, 16000);
console.log(result.text);
if (isEndpoint) {
  console.log('End of utterance');
}

Supported Model Types

Only streaming-capable models work with this API:
Model Type      Description
transducer      Zipformer streaming transducer
paraformer      Paraformer streaming
zipformer2_ctc  Zipformer2 CTC
nemo_ctc        NVIDIA NeMo CTC
tone_ctc        Tone CTC
Note: Offline-only models like Whisper and SenseVoice are not supported for streaming.

Check Model Compatibility

import { getOnlineTypeOrNull } from 'react-native-sherpa-onnx/stt';

const detectedType = 'transducer';
const onlineType = getOnlineTypeOrNull(detectedType);

if (onlineType !== null) {
  // Model supports streaming
  const engine = await createStreamingSTT({
    modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
    modelType: onlineType,
  });
} else {
  console.log('Model is offline-only');
}

Engine Initialization

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'auto',  // or explicit: 'transducer', 'paraformer', etc.
  
  // Endpoint detection
  enableEndpoint: true,
  endpointConfig: {
    rule1: {
      mustContainNonSilence: false,
      minTrailingSilence: 2.4,
      minUtteranceLength: 0,
    },
    rule2: {
      mustContainNonSilence: true,
      minTrailingSilence: 1.4,
      minUtteranceLength: 0,
    },
    rule3: {
      mustContainNonSilence: false,
      minTrailingSilence: 0,
      minUtteranceLength: 20,  // max utterance length
    },
  },
  
  // Decoding
  decodingMethod: 'greedy_search',  // or 'modified_beam_search'
  maxActivePaths: 4,
  
  // Hotwords (transducer only)
  hotwordsFile: '/path/to/hotwords.txt',
  hotwordsScore: 1.5,
  
  // Performance
  numThreads: 2,
  provider: 'cpu',  // 'nnapi', 'qnn', 'xnnpack'
  
  // Input normalization
  enableInputNormalization: true,  // Auto-scale audio chunks
});

Initialization Options

Option                     Type                          Description
modelPath                  ModelPathConfig               Path to model directory
modelType                  OnlineSTTModelType | 'auto'   Model architecture
enableEndpoint             boolean                       Enable end-of-utterance detection (default: true)
endpointConfig             EndpointConfig                Endpoint detection rules
decodingMethod             string                        'greedy_search' or 'modified_beam_search'
maxActivePaths             number                        Beam search size (default: 4)
hotwordsFile               string                        Path to hotwords file (transducer only)
hotwordsScore              number                        Hotwords boost score (default: 1.5)
numThreads                 number                        Inference threads (default: 1)
provider                   string                        Execution provider
enableInputNormalization   boolean                       Auto-scale input audio (default: true)

Stream Lifecycle

Create Stream

Create one stream per recognition session:
const stream = await engine.createStream();

// Optional: pass hotwords inline
const streamWithHotwords = await engine.createStream('CUSTOM PHRASE 2.0');

Feed Audio

// Accept waveform samples
await stream.acceptWaveform(samples, sampleRate);

// Check if ready to decode
if (await stream.isReady()) {
  await stream.decode();
  const result = await stream.getResult();
  console.log('Text:', result.text);
  console.log('Tokens:', result.tokens);
}

Signal End of Input

// When no more audio will be fed
await stream.inputFinished();

// Decode final buffered audio
while (await stream.isReady()) {
  await stream.decode();
  const result = await stream.getResult();
  console.log('Final:', result.text);
}

Reset Stream

Reuse the same stream for next utterance:
await stream.reset();
// Stream is now ready for new audio

Release Stream

Free resources when done:
await stream.release();
// Do not use stream after release

Endpoint Detection

Automatic detection of when an utterance ends:
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
  modelType: 'transducer',
  enableEndpoint: true,
  endpointConfig: {
    // Rule 1: Long silence, no speech required
    rule1: {
      mustContainNonSilence: false,
      minTrailingSilence: 1.0,
      minUtteranceLength: 0,
    },
    // Rule 2: Shorter silence after speech
    rule2: {
      mustContainNonSilence: true,
      minTrailingSilence: 0.8,
      minUtteranceLength: 0,
    },
    // Rule 3: Max utterance length
    rule3: {
      mustContainNonSilence: false,
      minTrailingSilence: 0,
      minUtteranceLength: 30,  // 30 seconds max
    },
  },
});

Using Endpoints

await stream.acceptWaveform(samples, 16000);

while (await stream.isReady()) {
  await stream.decode();
  const result = await stream.getResult();
  updateUI(result.text);
  
  if (await stream.isEndpoint()) {
    console.log('Utterance complete:', result.text);
    await stream.reset();  // Start fresh for next utterance
    break;
  }
}

Typical Recording Loop

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer-en' },
  modelType: 'transducer',
  enableEndpoint: true,
});

const stream = await engine.createStream();

// Start recording
const audioRecorder = startMicRecording({
  onChunk: async (samples: number[], sampleRate: number) => {
    await stream.acceptWaveform(samples, sampleRate);
    
    while (await stream.isReady()) {
      await stream.decode();
      const result = await stream.getResult();
      
      // Update UI with partial result
      setTranscript(result.text);
      
      if (await stream.isEndpoint()) {
        // Save final transcript
        saveFinalTranscript(result.text);
        
        // Reset for next utterance
        await stream.reset();
      }
    }
  },
});

// When user stops recording
async function stopRecording() {
  audioRecorder.stop();
  await stream.inputFinished();
  await stream.release();
  await engine.destroy();
}

Input Normalization

By default, processAudioChunk() applies adaptive normalization (scales peak to ~0.8) to handle varying device levels. Disable if audio is pre-normalized:
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
  modelType: 'transducer',
  enableInputNormalization: false,  // Pass audio unchanged
});

Multiple Streams

Create multiple streams from one engine:
const engine = await createStreamingSTT({ /* ... */ });

const stream1 = await engine.createStream();
const stream2 = await engine.createStream();

// Use independently
await stream1.acceptWaveform(samples1, 16000);
await stream2.acceptWaveform(samples2, 16000);

// Release when done
await stream1.release();
await stream2.release();
await engine.destroy();

Hotwords for Streaming

For transducer models, boost specific phrases:
const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
  modelType: 'transducer',
  hotwordsFile: '/path/to/hotwords.txt',
  hotwordsScore: 1.5,
});

// Or pass inline per stream
const stream = await engine.createStream('REACT NATIVE 2.0\nSHERPA ONNX 1.8');

Result Fields

interface StreamingSttResult {
  text: string;           // Transcribed text
  tokens: string[];       // Token list
  timestamps: number[];   // Token timestamps (model-dependent)
}
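
As a sketch of consuming these fields (the result shape above is the only assumption), tokens can be paired with their timestamps for token-level display. Note that timestamps may be empty for models that do not emit them:

```typescript
// Pair each token with its start time; missing timestamps become null.
function tokensWithTimes(result: {
  tokens: string[];
  timestamps: number[];
}): Array<[string, number | null]> {
  return result.tokens.map(
    (tok, i): [string, number | null] => [tok, result.timestamps[i] ?? null]
  );
}

const pairs = tokensWithTimes({
  tokens: ['hi', 'there'],
  timestamps: [0.12, 0.48],
});
// pairs → [['hi', 0.12], ['there', 0.48]]
```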

Performance Tips

Threading

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
  modelType: 'transducer',
  numThreads: 4,  // Use multiple cores
});

Execution Providers

const engine = await createStreamingSTT({
  modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
  modelType: 'transducer',
  provider: 'nnapi',  // Hardware acceleration (Android)
});

Chunk Size

Balance between latency and overhead:
  • Too small: Frequent bridge calls, higher CPU overhead
  • Too large: Delayed partial results
  • Recommended: 100-200ms chunks (1600-3200 samples at 16 kHz)
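
A small helper (a sketch; the function name is illustrative, not part of the library) can convert a target chunk duration into the sample count to request from your recorder:

```typescript
// Convert a target chunk duration in milliseconds to a sample count.
function samplesPerChunk(sampleRate: number, chunkMs: number): number {
  return Math.round((sampleRate * chunkMs) / 1000);
}

// 150 ms at 16 kHz -> 2400 samples, inside the recommended 1600-3200 range.
const chunkSize = samplesPerChunk(16000, 150); // 2400
```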

Error Handling

try {
  const engine = await createStreamingSTT({
    modelPath: { type: 'asset', path: 'models/streaming-zipformer' },
    modelType: 'auto',
  });
  
  const stream = await engine.createStream();
  
  await stream.acceptWaveform(samples, 16000);
  
  if (await stream.isReady()) {
    await stream.decode();
    const result = await stream.getResult();
    console.log(result.text);
  }
  
  await stream.release();
  await engine.destroy();
} catch (error) {
  console.error('Streaming STT error:', error.message);
}

Cleanup

Always release resources, even when an error occurs. After destroy() or release(), calling methods will throw:
let engine;
let stream;
try {
  engine = await createStreamingSTT({ /* ... */ });
  stream = await engine.createStream();
  
  // ... use stream ...
} catch (error) {
  // Handle errors
} finally {
  // Ensure cleanup even on error
  await stream?.release();
  await engine?.destroy();
}

Next Steps

Offline STT

Batch transcription of audio files

Model Setup

Download and configure streaming models