Overview
The daemon command starts Voxtype in its primary mode, running as a foreground process that listens for hotkey events and performs voice-to-text transcription. When no subcommand is specified, voxtype defaults to voxtype daemon.
Hotkey detection
The daemon supports two hotkey detection modes:
- Built-in evdev listener (default) - Kernel-level hotkey detection via evdev. Requires the user to be in the input group.
- Compositor keybindings - Use your window manager’s native keybinding system (recommended for Wayland). Disable built-in detection with --no-hotkey and configure your compositor to call voxtype record start/stop/toggle.
Activation modes
- Push-to-talk (default) - Hold the hotkey to record, release to transcribe.
- Toggle - Press the hotkey once to start recording, press again to stop.
Configuration
All daemon settings can be configured via:
- CLI flags (highest priority) - Override any setting for the current session
- Environment variables - Use the VOXTYPE_* prefix (e.g., VOXTYPE_MODEL=large-v3-turbo)
- Config file - ~/.config/voxtype/config.toml
- Defaults - Built-in sensible defaults
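The layering described above can be illustrated with a minimal sketch. It assumes the sources are listed from highest to lowest priority and that an environment variable name is the VOXTYPE_ prefix plus the upper-cased setting name, as in the VOXTYPE_MODEL example. The function resolve_setting and the placeholder default value are hypothetical and are not part of Voxtype.

```python
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli_flags: Mapping[str, str],
    environ: Mapping[str, str],
    config_file: Mapping[str, str],
    defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found, checking sources from highest to lowest priority."""
    env_key = "VOXTYPE_" + name.upper()  # e.g. "model" -> "VOXTYPE_MODEL"
    if name in cli_flags:
        return cli_flags[name]
    if env_key in environ:
        return environ[env_key]
    if name in config_file:
        return config_file[name]
    return defaults.get(name)


# Placeholder values for illustration; these are not Voxtype's real defaults.
defaults = {"model": "placeholder-default"}
config_file = {"model": "small.en"}
environ = {"VOXTYPE_MODEL": "large-v3-turbo"}

# The environment variable beats the config file...
print(resolve_setting("model", {}, environ, config_file, defaults))
# ...and a CLI flag beats both.
print(resolve_setting("model", {"model": "tiny"}, environ, config_file, defaults))
```

In this sketch a value set on the command line only affects the current call, which mirrors the "for the current session" wording above.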
Configuration file
Path to configuration file. Defaults to ~/.config/voxtype/config.toml.
Verbosity and logging
Increase logging verbosity. Use -v for debug output, -vv for trace output.
Suppress all output except errors.
Hotkey configuration
Override the hotkey for recording. Examples: SCROLLLOCK, PAUSE, F13, MEDIA, WEV_234, EVTEST_226. Use wev (Wayland) or evtest (X11/Wayland) to discover key codes.
Use toggle mode instead of push-to-talk. Press the hotkey once to start recording, press again to stop.
Disable built-in hotkey detection. Use this when relying on compositor keybindings. Then configure your compositor's keybindings to call voxtype record start/stop/toggle.
Key to abort recording or transcription without outputting. Examples: ESC, BACKSPACE, F12.
Modifier key for selecting the secondary model during recording. Hold this key while activating the hotkey to use secondary_model.
Transcription engine
Override transcription engine. Options: whisper, parakeet, moonshine, sensevoice, paraformer, dolphin, omnilingual.
Override model for transcription.
Whisper models: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v3, large-v3-turbo
Parakeet models: parakeet-tdt-0.6b-v3, parakeet-tdt-0.6b-v3-int8
Whisper options
Provide context to guide transcription style, terminology, or formatting. Hints at proper nouns and conventions.
Language for transcription. Use auto for detection, or specify code(s): en, fr, es, etc. Supports a comma-separated list for multilingual use: en,fr,de.
Translate non-English speech to English during transcription.
Number of CPU threads for inference. Default is automatic based on CPU cores.
Run transcription in a subprocess that exits after completion, releasing GPU memory. Useful for preventing VRAM accumulation over multiple recordings.
Load model when recording starts instead of keeping it loaded in memory. Reduces idle memory usage at the cost of slower first transcription.
Disable automatic context window optimization for short recordings. By default, Voxtype uses smaller context windows for recordings under 10 seconds to improve speed.
Whisper execution mode: local (in-process), remote (API), or cli (external binary).
Model to use when holding the model_modifier key. Useful for switching to a larger/more accurate model for difficult audio.
Start transcribing audio chunks while recording continues. Experimental feature for faster perceived response.
Remote Whisper options
API endpoint URL for remote Whisper mode. Supports OpenAI-compatible APIs.
Model name to send to remote API.
API key for remote server. Can also use the VOXTYPE_WHISPER_API_KEY environment variable.
Audio configuration
Audio input device name. Use default for the system default, or specify a device name from pactl list sources (PulseAudio/PipeWire) or arecord -L (ALSA).
Maximum recording duration in seconds (safety limit). Default is 300 seconds (5 minutes).
Enable audio feedback sounds (beeps when recording starts/stops).
Disable audio feedback sounds.
Output configuration
Force clipboard-only output mode. Transcribed text is copied to clipboard without typing.
Force paste mode: copy to clipboard and simulate Ctrl+V keystroke.
Save clipboard content before paste mode and restore it after paste completes. Preserves your previous clipboard state.
Delay in milliseconds after paste before restoring clipboard. Default is 200ms. Increase if paste target hasn’t processed the content yet.
Delay in milliseconds before typing starts. Helps prevent first character drop in some applications. Default is 0.
Delay between typed characters in milliseconds. Use for applications that can’t handle fast typing. Default is 0 (fastest).
Text to append after each transcription. Applied before auto_submit. Useful for adding trailing spaces or punctuation.
Output driver order for type mode (comma-separated). Available: wtype, dotool, ydotool, clipboard. Default order: wtype → dotool → ydotool → clipboard.
Automatically press Enter after outputting transcribed text.
Disable auto-submit (overrides config setting).
Convert newlines in transcription to Shift+Enter instead of regular Enter. Useful for chat applications that send on Enter.
Disable Shift+Enter newlines (overrides config).
Fall back to clipboard if typing fails.
Disable clipboard fallback.
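The driver order and clipboard fallback described above amount to a "try each driver until one works" chain. The sketch below shows that idea only; it is not Voxtype's implementation. The functions parse_driver_order and output_text are hypothetical, and the drivers are stub callables standing in for wtype, dotool, ydotool, and the clipboard.

```python
from typing import Callable, Dict, List, Optional


def parse_driver_order(value: str) -> List[str]:
    """Split a comma-separated driver list, ignoring blanks and whitespace."""
    return [part.strip() for part in value.split(",") if part.strip()]


def output_text(
    text: str,
    driver_order: List[str],
    drivers: Dict[str, Callable[[str], bool]],
) -> Optional[str]:
    """Try each driver in order; return the name of the first one that succeeds."""
    for name in driver_order:
        driver = drivers.get(name)
        if driver is None:
            continue  # driver not available in this sketch
        try:
            if driver(text):
                return name
        except OSError:
            pass  # a failing driver means "try the next one"
    return None


# Stub drivers for illustration: typing drivers fail, the clipboard succeeds.
stub_drivers = {
    "wtype": lambda text: False,
    "dotool": lambda text: False,
    "ydotool": lambda text: False,
    "clipboard": lambda text: True,
}

order = parse_driver_order("wtype, dotool, ydotool, clipboard")
print(output_text("hello", order, stub_drivers))
```

Leaving clipboard out of the order in this sketch corresponds to disabling the fallback: if every typing driver fails, nothing is output and the function returns None.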
Enable spoken punctuation conversion. Say “period”, “comma”, “question mark” to insert punctuation.
Keystroke combination for paste mode. Examples: ctrl+v, shift+insert, ctrl+shift+v.
Keyboard layout for dotool output (e.g., de, fr). Used when dotool is active.
Keyboard layout variant for dotool (e.g., nodeadkeys).
File path for file output mode. Used with --file in record commands or when output mode is set to file.
File write mode: overwrite or append.
Command to run before typing output. Useful for compositor submap switching.
Command to run after typing output.
Command to run when recording starts. Useful for visual indicators or compositor state changes.
Voice Activity Detection (VAD)
Enable Voice Activity Detection to filter silence before transcription. Prevents Whisper hallucinations on silence-only recordings.
Speech detection threshold (0.0-1.0). Lower values are more sensitive. Default is 0.5.
VAD backend to use: auto, energy, whisper.
- auto - Whisper VAD for Whisper engine, Energy for ONNX engines
- energy - Simple RMS-based detection, no model needed
- whisper - Silero model via whisper-rs, requires model download
Minimum speech duration in milliseconds for VAD to consider audio as containing speech.
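To make the energy backend's "simple RMS-based detection" more concrete, here is a minimal sketch of the general technique. It is an illustration under stated assumptions, not Voxtype's code: samples are assumed to be floats normalized to the range -1.0 to 1.0, the threshold is compared directly against each frame's RMS, and the 20 ms frame size is arbitrary. How Voxtype maps its 0.0-1.0 threshold onto signal energy is not specified here, so the values below are not recommendations.

```python
import math
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of normalized samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def contains_speech(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float,
    min_speech_ms: float,
    frame_ms: int = 20,  # assumed frame size for this sketch
) -> bool:
    """Return True if enough frames exceed the RMS threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    speech_ms = 0.0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= threshold:
            speech_ms += len(frame) * 1000 / sample_rate
    return speech_ms >= min_speech_ms


rate = 16000
silence = [0.0] * rate  # one second of silence
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]

print(contains_speech(silence, rate, threshold=0.1, min_speech_ms=100))
print(contains_speech(tone, rate, threshold=0.1, min_speech_ms=100))
```

In this sketch, lowering the threshold lets quieter frames count as speech, which matches the note above that lower values are more sensitive.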
Systemd service mode
Run Voxtype as a systemd user service for automatic startup. The service runs voxtype daemon with settings from your config file.
Example configurations
- Basic usage
- Remote Whisper
- Compositor bindings
- Paste mode with clipboard restore
- Multilingual
- GPU memory optimization
Default settings with scroll lock hotkey:
See also
- voxtype record - Control recording from external sources
- voxtype transcribe - Transcribe audio files
- Configuration guide - Detailed config file documentation