## High-Level Flow

## System Diagram

## Core Components

### Daemon and State Machine
| Component | Location | Purpose |
|---|---|---|
| Daemon | src/daemon.rs | Main event loop with tokio::select!, state coordination |
| State | src/state.rs | State machine: Idle → Recording → Transcribing → Outputting |
| CLI | src/cli.rs | Command definitions and argument parsing |
| Config | src/config.rs | TOML parsing, defaults, validation |
The daemon coordinates these components from a single event loop built on `tokio::select!`.
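The four states from the table can be sketched as a plain enum with an explicit transition function. This is an illustrative sketch, not Voxtype's actual types in `src/state.rs`:

```rust
// Illustrative sketch of the daemon's state machine; the real types in
// src/state.rs may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Recording,
    Transcribing,
    Outputting,
}

/// Advance to the next state on the happy path:
/// Idle -> Recording -> Transcribing -> Outputting -> Idle.
pub fn next_state(s: State) -> State {
    match s {
        State::Idle => State::Recording,
        State::Recording => State::Transcribing,
        State::Transcribing => State::Outputting,
        State::Outputting => State::Idle,
    }
}

fn main() {
    let mut s = State::Idle;
    for _ in 0..4 {
        println!("{:?}", s);
        s = next_state(s);
    }
}
```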
### Hotkey Detection

Preferred: Compositor keybindings (Hyprland, Sway, River)
- Native integration with the compositor
- No special permissions needed
- Supports key-release events for push-to-talk
- Voxtype provides `voxtype record start/stop/toggle` commands

Fallback: evdev
- Kernel-level input detection via `/dev/input/event*`
- Works on X11 and as a universal fallback
- Requires the user to be in the `input` group
- Direct access to the Linux input subsystem
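For the compositor route, push-to-talk can be wired with two bindings: one on key press and one on key release. A hypothetical Hyprland fragment (the key choice is an example; `bindr` fires on release):

```ini
# Hypothetical Hyprland bindings for push-to-talk (pick your own key).
# `bindr` triggers on key release, giving press-to-record / release-to-stop.
bind  = SUPER, R, exec, voxtype record start
bindr = SUPER, R, exec, voxtype record stop
```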
### Audio Capture

Implementation: `src/audio/cpal_capture.rs`
- Uses cpal (Cross-Platform Audio Library) for audio input
- Supports PipeWire, PulseAudio, and ALSA backends
- Streams audio data via mpsc channels to avoid blocking
- Records at 16kHz sample rate (Whisper’s native rate)
- Configurable max duration and device selection
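The channel-based streaming described above can be sketched with std threads and `std::sync::mpsc`; in the real implementation the producer is cpal's input-stream callback, stubbed out here with a thread sending fixed chunks:

```rust
use std::sync::mpsc;
use std::thread;

/// Producer side: in Voxtype this is cpal's input callback pushing captured
/// samples; here a thread sending fixed chunks stands in for it.
pub fn capture_fake_audio(tx: mpsc::Sender<Vec<f32>>, chunks: usize) {
    thread::spawn(move || {
        for _ in 0..chunks {
            // A 160-sample chunk = 10 ms of audio at 16 kHz.
            let chunk = vec![0.0f32; 160];
            if tx.send(chunk).is_err() {
                break; // receiver dropped; stop capturing
            }
        }
    });
}

/// Consumer side: drain the channel; the producer never blocks on us.
pub fn collect_samples(rx: mpsc::Receiver<Vec<f32>>) -> usize {
    rx.iter().map(|chunk| chunk.len()).sum()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    capture_fake_audio(tx, 10);
    println!("received {} samples", collect_samples(rx));
}
```

The channel decouples the real-time capture callback from transcription, so a slow consumer never causes dropped audio frames in the callback itself.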
### Transcription Engines

Voxtype supports multiple transcription engines across two runtime backends.

#### Whisper Backend (default)

Local in-process (`whisper.rs`):
- OpenAI’s Whisper model via whisper.cpp bindings
- 99 languages supported
- Optional GPU acceleration (Vulkan, CUDA, ROCm, Metal)
- Model loading hidden behind recording time with the `prepare()` method
- Optional on-demand loading to save memory
Subprocess (`subprocess.rs`):
- Runs `whisper-cli` as an external process
- Isolates GPU memory (released after transcription)
- More stable on some systems (glibc 2.42+ compatibility)
Remote (`remote.rs`):
- OpenAI-compatible HTTP endpoints
- Offload transcription to remote server
- Useful for low-power devices
#### ONNX Engines

Engine options:
- Parakeet - Fast English transcription (FastConformer TDT)
- Moonshine - Edge devices, low memory (encoder-decoder)
- SenseVoice - Chinese, Japanese, Korean (CTC encoder)
- Paraformer - Chinese-English bilingual (non-autoregressive)
- Dolphin - 40 languages + Chinese dialects, no English (CTC E-Branchformer)
- Omnilingual - 1600+ languages (wav2vec2 CTC)
Choosing an engine:
- Whisper: General-purpose, best for English and European languages
- SenseVoice/Paraformer: Native CJK support
- Parakeet: Fastest English-only option
- Omnilingual: Rare and low-resource languages

Select an ONNX engine with `voxtype setup onnx` or via the config file.
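A hypothetical config fragment showing engine selection; the table and key names here are assumptions, not Voxtype's documented schema:

```toml
# Hypothetical fragment of ~/.config/voxtype/config.toml --
# the actual key names may differ.
[transcription]
backend = "onnx"
engine  = "parakeet"   # or "moonshine", "sensevoice", "paraformer", ...
```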
### Text Processing

Implementation: `src/text/mod.rs`

- Word replacements: Fix commonly misheard words, e.g. mapping a misheard phrase back to `function()`
- Post-processing command: Pipe transcriptions through external tools (LLMs, custom scripts)
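A minimal sketch of a word-replacement pass, assuming simple whole-word substitution; the real rules come from config and the implementation in `src/text/mod.rs` may differ:

```rust
use std::collections::HashMap;

/// Apply whole-word replacements to a transcription.
/// Illustrative only; Voxtype's real logic lives in src/text/mod.rs.
pub fn apply_replacements(text: &str, rules: &HashMap<&str, &str>) -> String {
    text.split_whitespace()
        .map(|w| *rules.get(w).unwrap_or(&w))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let mut rules = HashMap::new();
    // A commonly misheard word mapped back to the intended spelling.
    rules.insert("funk", "func");
    println!("{}", apply_replacements("the funk returns", &rules));
}
```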
### Output Driver Chain

Why multiple drivers? No single output method works everywhere:
- wtype needs Wayland + the virtual-keyboard protocol
- KDE/GNOME don't support the virtual-keyboard protocol
- ydotool needs a daemon + doesn't support keyboard layouts
- dotool supports layouts but needs uinput access

Fallback order:
1. wtype
2. dotool
3. ydotool
4. clipboard
Implementation: `src/output/wtype.rs`
- Wayland-native via the virtual-keyboard protocol
- Best Unicode/CJK support
- No daemon required
- Works on: Hyprland, Sway, River, wlroots compositors
- Doesn’t work on: KDE Plasma, GNOME (protocol not implemented)
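The driver chain boils down to "try each output method in priority order until one succeeds." A simplified trait-object sketch, with stand-in drivers rather than Voxtype's actual trait:

```rust
/// Simplified stand-in for Voxtype's text-output trait.
pub trait TextOutput {
    fn name(&self) -> &str;
    fn type_text(&self, text: &str) -> Result<(), String>;
}

/// Stand-in for a driver whose protocol the compositor lacks.
struct AlwaysFails;
impl TextOutput for AlwaysFails {
    fn name(&self) -> &str { "wtype" }
    fn type_text(&self, _t: &str) -> Result<(), String> {
        Err("virtual-keyboard protocol not supported".into())
    }
}

/// Stand-in for the clipboard fallback, which always works.
struct Clipboard;
impl TextOutput for Clipboard {
    fn name(&self) -> &str { "clipboard" }
    fn type_text(&self, _t: &str) -> Result<(), String> { Ok(()) }
}

/// Walk the chain in priority order; return the first driver that works.
pub fn output_via_chain(drivers: &[Box<dyn TextOutput>], text: &str) -> Option<String> {
    for d in drivers {
        if d.type_text(text).is_ok() {
            return Some(d.name().to_string());
        }
    }
    None
}

fn main() {
    let chain: Vec<Box<dyn TextOutput>> = vec![Box::new(AlwaysFails), Box::new(Clipboard)];
    println!("{:?}", output_via_chain(&chain, "hello"));
}
```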
## Design Decisions

### Trait-Based Extensibility

Each major component defines a trait, allowing multiple implementations:

| Trait | Implementations | Extension Point |
|---|---|---|
| HotkeyListener | EvdevListener | Add libinput, compositor-specific listeners |
| AudioCapture | CpalCapture | Add JACK, direct ALSA support |
| Transcriber | WhisperTranscriber, RemoteTranscriber, SubprocessTranscriber | Add new ASR backends |
| TextOutput | WtypeOutput, DotoolOutput, YdotoolOutput, ClipboardOutput | Add X11, compositor-specific output |
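The pattern behind the table: each stage is a trait, so new backends drop in without touching the daemon. A simplified sketch with assumed signatures (Voxtype's real traits are likely async and richer):

```rust
/// Simplified stand-in for Voxtype's transcriber trait; the real
/// signature is an assumption for illustration.
pub trait Transcriber {
    fn transcribe(&self, samples: &[f32]) -> Result<String, String>;
}

/// A trivial backend used only to show the extension point.
pub struct EchoTranscriber;
impl Transcriber for EchoTranscriber {
    fn transcribe(&self, samples: &[f32]) -> Result<String, String> {
        Ok(format!("{} samples transcribed", samples.len()))
    }
}

/// The daemon only sees the trait object, so backends are interchangeable.
pub fn run_pipeline(t: &dyn Transcriber, samples: &[f32]) -> String {
    t.transcribe(samples).unwrap_or_else(|e| format!("error: {e}"))
}

fn main() {
    println!("{}", run_pipeline(&EchoTranscriber, &[0.0; 160]));
}
```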
### Configuration Layering

Priority (highest wins):
1. CLI arguments (`--model base.en`)
2. Environment variables (`VOXTYPE_WHISPER_MODEL`)
3. Config file (`~/.config/voxtype/config.toml`)
4. Built-in defaults
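The layering rule is "first `Some` wins, scanning from highest priority." A sketch of resolving one setting (the function and sources are illustrative):

```rust
/// Resolve one setting across layers: CLI arg, env var, config file,
/// built-in default. Highest-priority Some wins. Illustrative only.
pub fn resolve(cli: Option<&str>, env: Option<&str>, file: Option<&str>, default: &str) -> String {
    cli.or(env).or(file).unwrap_or(default).to_string()
}

fn main() {
    // Env var set, no CLI flag: the env value beats file and default.
    let model = resolve(None, Some("base.en"), Some("small"), "tiny");
    println!("model = {model}");
}
```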
### GPU Memory Management

Trade-off: GPU memory isn't released after in-process transcription, causing memory growth over time. Options:
- Default (in-process): Keep the model loaded for faster subsequent transcriptions
- GPU isolation (`gpu_isolation = true`): Spawn a child process that exits after transcription, releasing GPU memory
- CLI backend: Always uses a subprocess, isolating GPU memory automatically
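GPU isolation works because process exit forces the OS (and driver) to reclaim everything the child held, including GPU allocations; the pattern is just spawn, wait, collect output. A sketch using `echo` as a stand-in for `whisper-cli`:

```rust
use std::process::Command;

/// Run transcription in a child process; when it exits, the OS reclaims
/// all of its resources, including GPU memory. `echo` stands in for
/// `whisper-cli` here so the sketch is runnable anywhere.
pub fn transcribe_isolated(cmd: &str, args: &[&str]) -> Result<String, String> {
    let out = Command::new(cmd)
        .args(args)
        .output()
        .map_err(|e| e.to_string())?;
    if out.status.success() {
        Ok(String::from_utf8_lossy(&out.stdout).trim().to_string())
    } else {
        Err(String::from_utf8_lossy(&out.stderr).to_string())
    }
}

fn main() {
    println!("{:?}", transcribe_isolated("echo", &["transcribed text"]));
}
```

The cost of this design is paying model-load time on every transcription, which is why in-process is the default.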
### CPU Compatibility

Problem: Binaries built on modern CPUs can contain AVX-512/GFNI instructions that crash on older CPUs.

Solution: Install a SIGILL handler via an `.init_array` constructor (which runs before `main()`). If triggered, it displays a helpful message instead of crashing silently.
Release binaries are built in Docker containers with Ubuntu 22.04/24.04 to ensure clean toolchains without modern CPU instructions.
## Module Structure
## Performance Considerations

- Avoid allocations in the hot path: Hotkey detection and audio streaming are allocation-free
- Async I/O: Long-running operations use `spawn_blocking` to avoid blocking the event loop
- Model loading optimization: The `prepare()` method loads the model during recording time
- Streaming over buffering: Audio data streams via channels; transcription outputs incrementally
- On-demand loading: Optional model loading only when recording starts (saves memory)
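Hiding model-load latency behind recording time is plain concurrency: kick off the load when recording starts, join before transcribing. A thread-based sketch of the idea (Voxtype itself uses async tasks, and these names are illustrative):

```rust
use std::thread;
use std::time::Duration;

/// Start loading the model as soon as recording begins; by the time the
/// user stops talking, the load has usually already finished.
pub fn record_with_prepare() -> (Vec<f32>, &'static str) {
    // prepare(): model load kicked off in the background.
    let loader = thread::spawn(|| {
        thread::sleep(Duration::from_millis(10)); // stand-in for model load
        "model ready"
    });

    // Meanwhile, "record" audio on this thread.
    thread::sleep(Duration::from_millis(20)); // stand-in for recording
    let audio = vec![0.0f32; 320];

    // Join: loading overlapped with recording, so this rarely waits.
    let model = loader.join().expect("loader panicked");
    (audio, model)
}

fn main() {
    let (audio, model) = record_with_prepare();
    println!("{} samples, {}", audio.len(), model);
}
```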