Overview

The transcribe command converts an audio file to text using the configured transcription engine. It outputs the transcribed text to stdout, making it suitable for shell scripts and pipelines. Unlike the daemon mode, this command:
  • Processes a single file and exits
  • Does not type or paste output (prints to stdout)
  • Does not require hotkey detection
  • Can be used in batch processing workflows

Basic usage

voxtype transcribe <file>

Example

voxtype transcribe recording.wav
Output:
Loading audio file: "recording.wav"
Audio format: 44100 Hz, 2 channel(s), Int
Resampling from 44100 Hz to 16000 Hz...
Processing 32000 samples (2.00s)...

This is the transcribed text from the audio file.

Audio format requirements

Voxtype automatically handles audio conversion, but for best results:
  • Format: WAV (.wav)
  • Sample rate: 16000 Hz (16kHz) - other rates will be resampled
  • Channels: Mono (1 channel) - stereo will be mixed to mono
  • Bit depth: 16-bit or 32-bit integer, or 32-bit float
Use ffmpeg to convert audio files to the optimal format:
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
voxtype transcribe output.wav
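
For a directory of compressed recordings, the conversion and transcription steps can be combined. A sketch, assuming ffmpeg is installed and the inputs are MP3s (the recordings/ path and file names are illustrative):

```shell
#!/bin/bash
# Convert each .mp3 to the optimal 16 kHz mono WAV, then transcribe it.
for src in recordings/*.mp3; do
  wav="${src%.mp3}.wav"                 # recordings/foo.mp3 -> recordings/foo.wav
  ffmpeg -y -i "$src" -ar 16000 -ac 1 "$wav"
  voxtype transcribe "$wav" > "${wav%.wav}.txt"
done
```

The `${var%suffix}` expansions keep each transcript next to its source file.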

Engine override

--engine
string
Override the transcription engine for this file. Options: whisper, parakeet, moonshine, sensevoice, paraformer, dolphin, omnilingual.
voxtype transcribe audio.wav --engine parakeet
Defaults to the engine configured in config.toml.

Configuration inheritance

The transcribe command uses your daemon configuration from config.toml:
  • Whisper model selection ([whisper] model)
  • Language settings ([whisper] language)
  • Engine selection (engine = "whisper")
  • GPU acceleration settings
  • VAD configuration ([vad])
You can override most settings via CLI flags:
voxtype --model large-v3-turbo \
        --language en \
        --threads 4 \
        transcribe audio.wav
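
Assembled from the settings listed above, a minimal config.toml might look like the following sketch. The `engine`, `[whisper] model`, `[whisper] language`, and `[vad]` names come from the list above; the values and the `enabled` key are illustrative assumptions:

```toml
# config.toml (sketch; values are illustrative)
engine = "whisper"

[whisper]
model = "large-v3-turbo"
language = "en"

[vad]
enabled = true   # key name is an assumption, not confirmed by the docs above
```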

Voice Activity Detection (VAD)

If VAD is enabled in your config, it will filter silence before transcription:
voxtype --vad transcribe recording.wav
Example output with VAD:
Loading audio file: "recording.wav"
Audio format: 16000 Hz, 1 channel(s), Int
Processing 80000 samples (5.00s)...
VAD: 3.24s speech (64.8% of audio)

The actual transcribed speech content.
If no speech is detected, transcription is skipped:
VAD: 0.12s speech (2.4% of audio)
No speech detected, skipping transcription.
VAD is especially useful for batch processing to skip silence-only files and prevent Whisper hallucinations.

Shell integration examples

Save to file

voxtype transcribe recording.wav > transcript.txt

Batch processing

for file in recordings/*.wav; do
  echo "Processing $file..."
  voxtype transcribe "$file" > "${file%.wav}.txt"
done

Filter with VAD

#!/bin/bash
# Only transcribe files with speech.
# Diagnostics (including "No speech detected") go to stderr,
# so stdout is empty when VAD skips transcription.

for file in *.wav; do
  text=$(voxtype --vad transcribe "$file" 2>/dev/null)
  if [ -n "$text" ]; then
    printf '%s\n' "$text" > "${file%.wav}.txt"
  else
    echo "Skipping $file (no speech)"
  fi
done

This runs voxtype once per file; checking for speech and transcribing in separate invocations would load the model twice.

Pipeline with post-processing

# Transcribe and format with LLM
voxtype transcribe meeting.wav | \
  llm -m gpt-4o-mini 'Format as meeting notes with bullet points' > notes.md

Combine multiple recordings

# Concatenate transcripts from multiple files
for file in part*.wav; do
  printf '\n## %s\n' "$(basename "$file" .wav)"
  voxtype transcribe "$file"
done > full-transcript.txt

Extract audio from video

# Extract audio from video and transcribe
ffmpeg -i video.mp4 -ar 16000 -ac 1 -f wav - | \
  voxtype transcribe /dev/stdin > video-transcript.txt
Reading from /dev/stdin requires WAV format. Use ffmpeg’s -f wav output.

Performance considerations

Model loading

The first transcription loads the model into memory, which can take 1-5 seconds depending on model size. Subsequent calls in the same process reuse the loaded model. For batch processing, it’s more efficient to keep the daemon running and use the API than to call transcribe repeatedly:
# Inefficient: Loads model for each file
for file in *.wav; do
  voxtype transcribe "$file"
done

# Better: Use daemon with record commands or implement batch API

GPU memory

GPU memory is allocated when the model loads and persists until the process exits. For long-running transcription jobs:
# Release GPU memory after each file
for file in *.wav; do
  voxtype --gpu-isolation transcribe "$file" > "${file%.wav}.txt"
done
The --gpu-isolation flag runs transcription in a subprocess that exits after completion, releasing VRAM.

Output format

The transcribed text is printed to stdout without additional formatting. Diagnostic messages go to stderr, so you can safely redirect output:
# Only transcript goes to file
voxtype transcribe audio.wav > transcript.txt

# Both transcript and diagnostics
voxtype transcribe audio.wav &> full-output.txt

# Transcript to file, filtered diagnostics to terminal
# (2>&1 before the file redirect sends stderr to the pipe, stdout to the file)
voxtype transcribe audio.wav 2>&1 > transcript.txt | grep -v "Loading"
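
The stream separation can be verified with a stand-in function that mimics voxtype's layout (diagnostics on stderr, transcript on stdout). `fake_transcribe` is a hypothetical stand-in, not part of voxtype:

```shell
#!/bin/bash
# Stand-in with voxtype's stream layout: diagnostics -> stderr, text -> stdout.
fake_transcribe() {
  echo 'Loading audio file: "demo.wav"' >&2   # diagnostic
  echo "This is the transcript."              # transcript
}

# Only the transcript is captured; the diagnostic stays on stderr.
transcript=$(fake_transcribe 2>/dev/null)
echo "$transcript"    # -> This is the transcript.
```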

Troubleshooting

Unsupported audio format

Error: failed to decode WAV file
Solution: Convert to WAV with ffmpeg:
ffmpeg -i audio.mp3 -ar 16000 -ac 1 audio.wav
voxtype transcribe audio.wav

Model not found

Error: Model 'large-v3-turbo' not found
Solution: Download the model:
voxtype setup --download --model large-v3-turbo

VAD model missing

VAD warning: Model not found at ~/.local/share/voxtype/models/ggml-silero-vad.bin
Solution: Download VAD model:
voxtype setup vad

Hallucinations on silence

Whisper may generate nonsensical text when processing silence-only audio. Solution: Enable VAD to filter silence:
voxtype --vad transcribe audio.wav

Advanced examples

Multilingual transcription

# Auto-detect language
voxtype transcribe meeting.wav

# Force specific languages
voxtype --language "en,fr,es" transcribe meeting.wav

# Translate to English
voxtype --translate transcribe french-audio.wav

Different models for different content

#!/bin/bash
# Use appropriate model based on audio length

duration=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 "$1")

if (( $(echo "$duration < 30" | bc -l) )); then
  # Short clips: use fast model
  voxtype --model base.en transcribe "$1"
else
  # Long recordings: use accurate model
  voxtype --model large-v3-turbo transcribe "$1"
fi

Custom processing pipeline

#!/bin/bash
# Transcribe, clean up, and format

voxtype transcribe "$1" |
  # Remove filler words (GNU sed: -E enables the alternation, I makes it case-insensitive)
  sed -E 's/\b(um|uh|like|you know)\b//Ig' |
  # Capitalize the first letter of each line
  awk '{print toupper(substr($0,1,1)) substr($0,2)}' |
  # Format with LLM
  llm -m gpt-4o-mini 'Fix punctuation and format as paragraphs' > output.txt

Timestamp extraction (with custom processing)

While voxtype transcribe doesn’t output timestamps, you can use Whisper’s CLI for this:
# Use the whisper.cpp CLI for timestamped output
whisper-cli --model ~/.local/share/voxtype/models/ggml-large-v3-turbo.bin \
  --output-format srt audio.wav

# Or use OpenAI's whisper directly
whisper audio.wav --output_format srt
