Overview
Moonshine Voice is designed with a simple, event-based architecture that abstracts away the complexity of voice processing. The framework provides a high-level API that lets developers focus on building applications rather than managing audio pipelines.

Design Philosophy
The basic flow is straightforward, as the sketch below illustrates:
- Create a `Transcriber` or `IntentRecognizer` object
- Attach an `EventListener` that gets called when important events occur
- Feed in audio and respond to events
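A minimal sketch of that flow, assuming an import path matching this repo's layout; the constructor arguments and the `add_listener` method name are illustrative assumptions, while `start()`, `add_audio()`, and `stop()` are the session calls described later on this page.

```python
# Minimal sketch of the basic flow. The add_listener() name and the
# constructor arguments are assumptions; start()/add_audio()/stop() are
# the session-management calls described below.
from moonshine_voice.transcriber import Transcriber

def on_event(event):
    # React to transcript events as they arrive.
    print(event)

transcriber = Transcriber()              # 1. Create a Transcriber
transcriber.add_listener(on_event)       # 2. Attach an event listener (hypothetical name)

transcriber.start()                      # 3. Feed in audio and respond to events
with open("speech.pcm", "rb") as f:      #    (raw PCM; the pipeline below normalizes it)
    transcriber.add_audio(f.read())
transcriber.stop()                       # Completes any active lines
```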
Batteries Included: Moonshine Voice includes all stages of the voice processing pipeline in a single library - microphone capture, voice activity detection, speech-to-text, speaker identification, and intent recognition.
Architecture Diagram
Traditionally, adding a voice interface required integrating multiple libraries for different processing stages. Moonshine Voice consolidates these into one framework.

Core Components
Transcriber
The `Transcriber` (defined in python/src/moonshine_voice/transcriber.py:68) is the main entry point for speech-to-text. It is responsible for:
- Loading and managing Moonshine ASR models
- Coordinating the processing pipeline
- Managing audio streams
- Emitting events to listeners
Stream
A `Stream` (defined in python/src/moonshine_voice/transcriber.py:321) handles real-time audio input:
- Multiple streams per transcriber for multiple audio sources
- Independent transcripts per stream
- Automatic periodic updates based on `update_interval`
- Event emission when the transcript changes (see the sketch after this list)
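A hypothetical sketch of driving two audio sources through one transcriber; `create_stream()` and its `update_interval` keyword are assumed names for illustration, not confirmed API.

```python
# Hypothetical sketch: two independent streams on one Transcriber, e.g.
# one per microphone. create_stream() is an assumed method name.
from moonshine_voice.transcriber import Transcriber

transcriber = Transcriber()
mic_a = transcriber.create_stream(update_interval=0.5)  # periodic updates every 0.5 s
mic_b = transcriber.create_stream(update_interval=0.5)

chunk = b"\x00" * 3200        # 0.1 s of 16-bit silence at 16 kHz (placeholder audio)
mic_a.add_audio(chunk)        # each stream keeps its own transcript
mic_b.add_audio(chunk)        # and emits its own events
```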
Event Listeners
The event system (python/src/moonshine_voice/transcriber.py:290) provides reactive updates:
- `LineStarted` - New speech segment detected
- `LineUpdated` - Any line property changed
- `LineTextChanged` - Transcription text updated
- `LineCompleted` - Speech segment finished
- `Error` - Processing error occurred
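As a sketch, a listener might dispatch on these event types like so; the event names are the ones listed above, but the attribute names (`kind`, `line`, `text`, `error`) are assumptions.

```python
# Sketch of a listener dispatching on the event types above. The event
# object's attribute names (kind, line, text, error) are assumptions.
def on_event(event):
    if event.kind == "LineStarted":
        print("new speech segment detected")
    elif event.kind == "LineTextChanged":
        print("partial text:", event.line.text)
    elif event.kind == "LineCompleted":
        print("final text:", event.line.text)
    elif event.kind == "Error":
        print("processing error:", event.error)
```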
Processing Pipeline
When audio is added via `add_audio()`, it flows through this pipeline:
1. Audio Buffering
Raw PCM audio is converted to 16kHz mono format internally, regardless of the input sample rate.
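For intuition, that normalization behaves roughly like the following standalone sketch; this is not the library's internal code, just the same idea in plain numpy.

```python
# Standalone illustration of normalizing arbitrary input to 16 kHz mono.
import numpy as np

def to_16k_mono(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    if samples.ndim == 2:                        # multichannel -> mono
        samples = samples.mean(axis=1)
    if sample_rate != 16000:                     # crude linear resample
        duration = len(samples) / sample_rate
        n_out = int(duration * 16000)
        old_t = np.linspace(0.0, duration, num=len(samples), endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        samples = np.interp(new_t, old_t, samples)
    return samples.astype(np.float32)

stereo_48k = np.zeros((4800, 2))                 # 0.1 s of 48 kHz stereo silence
print(to_16k_mono(stereo_48k, 48000).shape)      # (1600,)
```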
2. Voice Activity Detection (VAD)
The Silero VAD model (core/silero-vad.h) segments continuous audio into speech phrases:
- Runs every 30ms on audio chunks
- Averages results over a window (default 0.5s) for stability
- Uses a threshold (default 0.5) to distinguish speech from silence
- Adds padding and look-behind to avoid clipping speech
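The smoothing-plus-threshold decision can be pictured with this toy sketch; the chunk size, window, and threshold mirror the defaults above, and the per-chunk probabilities stand in for the Silero model's real output.

```python
# Toy model of the VAD decision: average per-chunk speech probabilities
# over a ~0.5 s window, then compare against the threshold.
from collections import deque

def vad_decisions(probs, window_s=0.5, chunk_s=0.03, threshold=0.5):
    window = deque(maxlen=max(1, int(window_s / chunk_s)))
    for p in probs:                       # one probability per 30 ms chunk
        window.append(p)
        yield sum(window) / len(window) >= threshold

silence, speech = [0.05] * 5, [0.95] * 20
print(list(vad_decisions(silence + speech + silence)))
```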
3. Speech-to-Text
Moonshine ASR models transcribe segmented audio:
- Non-streaming models: Process complete segments
- Streaming models: Cache encoder output and decoder state for incremental updates
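Conceptually, the streaming strategy looks like this sketch; the class and method names are illustrative, not the library's API.

```python
# Conceptual sketch of the streaming-model strategy: cache encoder output
# and decoder state so each update only pays for the new audio.
class StreamingDecoderSketch:
    def __init__(self):
        self.encoder_cache = []    # encoder output for all audio seen so far
        self.decoder_state = None  # decoder state carried across updates

    def update(self, new_frames):
        self.encoder_cache.extend(new_frames)        # encode only the new audio
        self.decoder_state, text = self._decode_step(self.decoder_state)
        return text                                  # incrementally updated text

    def _decode_step(self, state):
        # Placeholder for the real incremental decode against the cache.
        return state, f"<partial transcript over {len(self.encoder_cache)} frames>"
```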
4. Speaker Identification (Optional)
A Pyannote embedding model identifies speakers for diarization:
- Requires sufficient audio data per segment
- Assigns a unique `speaker_id` (64-bit integer)
- Provides a `speaker_index` for “Speaker 1”, “Speaker 2” labeling
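As a small sketch of how `speaker_index` supports labeling, assuming a zero-based index; the assign-on-first-appearance policy here is illustrative.

```python
# Sketch: map stable 64-bit speaker IDs to display labels. Assumes a
# zero-based speaker_index; the first-appearance policy is illustrative.
seen: dict[int, int] = {}          # speaker_id -> speaker_index

def index_for(speaker_id: int) -> int:
    return seen.setdefault(speaker_id, len(seen))

def label_for(speaker_index: int) -> str:
    return f"Speaker {speaker_index + 1}"

print(label_for(index_for(0x1DEA)))   # Speaker 1
print(label_for(index_for(0xBEEF)))   # Speaker 2
print(label_for(index_for(0x1DEA)))   # Speaker 1 again
```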
5. Event Emission
The stream analyzes transcript changes and emits appropriate events to all registered listeners.

Cross-Platform Architecture
Moonshine Voice runs consistently across platforms: the shared core (core/) is the single source of truth, with language-specific bindings providing idiomatic interfaces.
Thread Safety
From core/moonshine-c-api.h:64-66:
All API calls are thread-safe, so you can call them from multiple threads concurrently. However, calculations on a single transcriber are serialized, so latency will be affected for calls from other threads while the transcriber is busy.
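In practice that guarantee means code like this sketch is safe, though the two calls are processed one after the other; the constructor arguments are omitted for illustration.

```python
# Safe per the guarantee above: concurrent calls from two threads.
# Work on the single transcriber is still serialized, so one call may
# block while the other is being processed.
import threading
from moonshine_voice.transcriber import Transcriber

transcriber = Transcriber()               # arguments omitted for illustration
silence = b"\x00" * 32000                 # 1 s of 16-bit silence at 16 kHz

t1 = threading.Thread(target=transcriber.add_audio, args=(silence,))
t2 = threading.Thread(target=transcriber.add_audio, args=(silence,))
t1.start(); t2.start()
t1.join(); t2.join()
```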
Session Management
Sessions define the lifecycle of transcription:
- `start()` - Begins a new session and resets the transcript
- `add_audio()` - Feeds audio into the active session
- `stop()` - Ends the session and completes any active lines

`start()` resets the transcript, so save any data you need beforehand (see the sketch below).
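A sketch of the lifecycle, including saving results before restarting; `get_transcript()` is a hypothetical name for reading the transcript, while the other calls are the ones listed above.

```python
# Session lifecycle sketch. get_transcript() is a hypothetical accessor;
# start()/add_audio()/stop() are the calls described above.
from moonshine_voice.transcriber import Transcriber

transcriber = Transcriber()             # arguments omitted for illustration
chunk = b"\x00" * 3200                  # placeholder audio

transcriber.start()                     # begins a session, resets the transcript
transcriber.add_audio(chunk)            # feeds the active session
transcriber.stop()                      # completes any active lines

saved = transcriber.get_transcript()    # save what you need BEFORE restarting...
transcriber.start()                     # ...because start() resets the transcript
```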
Resource Management
The architecture uses handle-based resource management, sketched below.
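The idea can be pictured in Python as a thin wrapper that owns an opaque handle and guarantees release; everything here is illustrative rather than the bindings' actual code.

```python
# Illustration of the handle pattern: the core hands back an opaque
# handle, and the binding wraps it so release is guaranteed. All names
# here are hypothetical.
class HandleWrapper:
    def __init__(self, raw_handle: int):
        self._raw = raw_handle
        self._open = True

    def close(self):
        if self._open:
            self._open = False   # a real binding would call the core's
                                 # release function on self._raw here

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()             # deterministic cleanup via `with`

with HandleWrapper(raw_handle=42):
    pass                         # handle released when the block exits
```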
Next Steps
- Transcription: Understand the speech-to-text pipeline
- Streaming: Learn how streaming reduces latency
- Intent Recognition: Build voice command interfaces
- Models: Explore model architectures