Overview
The Audio backend processes audio files and converts speech to text using Automatic Speech Recognition (ASR) models. It supports various audio formats and uses Whisper-based models for high-quality transcription.
Features
- Multi-format support - MP3, WAV, FLAC, OGG, M4A, WEBM
- Video audio tracks - Extracts and transcribes audio from video files
- Whisper models - Multiple model sizes for speed/accuracy tradeoffs
- Timestamp information - Preserves timing data in transcription
- Speaker diarization - Optional speaker identification (model-dependent)
- Language detection - Automatic language identification
- Multi-language support - Supports 90+ languages
Supported Formats
Audio formats
- MP3 (.mp3)
- WAV (.wav)
- FLAC (.flac)
- OGG (.ogg)
- M4A (.m4a)
- OPUS (.opus)
Video formats (audio track)
- MP4 (.mp4)
- WEBM (.webm)
- MKV (.mkv)
- AVI (.avi)
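The two lists above can be checked before a file is handed to the converter. The following is a minimal sketch using only the standard library; `classify_media` and the two constant names are hypothetical helpers for illustration, not part of Docling.

```python
from pathlib import Path

# Extensions copied from the "Audio formats" and "Video formats" lists above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".opus"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".mkv", ".avi"}


def classify_media(path: str) -> str:
    """Return "audio", "video", or "unsupported" based on the file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return "unsupported"
```

The check looks only at the extension, so a mislabeled file would still pass; it is meant as a cheap pre-filter, not as format validation.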
Usage
Basic Transcription
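A basic transcription might look like the sketch below. The import paths, `AudioFormatOption`, `AsrPipeline`, and `asr_model_specs` are assumptions based on Docling's published ASR example and may differ between versions; `transcribe` is a hypothetical wrapper name.

```python
from pathlib import Path


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file and return the transcript as Markdown."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such audio file: {path}")

    # Imported lazily so this module loads even where Docling is not installed.
    # ASSUMPTION: these module paths match the installed Docling version.
    from docling.datamodel import asr_model_specs
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import AsrPipelineOptions
    from docling.document_converter import AudioFormatOption, DocumentConverter
    from docling.pipeline.asr_pipeline import AsrPipeline

    # Select one of the pre-configured Whisper models.
    options = AsrPipelineOptions()
    options.asr_options = asr_model_specs.WHISPER_BASE

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=options,
            )
        }
    )
    result = converter.convert(path)
    return result.document.export_to_markdown()
```

Calling `transcribe("meeting.mp3")` would then return the transcript text. The model is loaded on each call in this sketch; a longer-running program would probably build the converter once and reuse it.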
With Pipeline Options
AsrPipelineOptions
Configuration options for the Automatic Speech Recognition pipeline.
Parameters
Automatic Speech Recognition (ASR) model configuration for audio transcription. Specifies which ASR model to use (e.g., Whisper variants) and model-specific parameters for speech-to-text conversion.
- document_timeout
- accelerator_options
- enable_remote_services
- allow_external_plugins
- artifacts_path
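To illustrate how `document_timeout` and `accelerator_options` might be set together, here is a hedged sketch. `build_asr_options` is a hypothetical helper; the module paths and the `AcceleratorDevice` values ("auto", "cpu", "cuda", "mps") are assumptions that should be checked against the installed Docling version.

```python
def build_asr_options(timeout_s: float = 600.0, device: str = "auto", num_threads: int = 4):
    """Build AsrPipelineOptions with a timeout and an accelerator preference."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    if device not in {"auto", "cpu", "cuda", "mps"}:
        raise ValueError(f"unknown device: {device!r}")

    # Lazy imports; ASSUMPTION: module paths may differ between Docling versions.
    from docling.datamodel import asr_model_specs
    from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
    from docling.datamodel.pipeline_options import AsrPipelineOptions

    options = AsrPipelineOptions()
    options.asr_options = asr_model_specs.WHISPER_SMALL
    options.document_timeout = timeout_s  # seconds before a conversion is abandoned
    options.accelerator_options = AcceleratorOptions(
        num_threads=num_threads,
        device=AcceleratorDevice(device),
    )
    return options
```

The returned object would be passed to the converter as the pipeline options for the audio input format. Input validation runs before the imports so that bad arguments fail fast.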
Available Whisper Models
Docling provides pre-configured Whisper model options:
Model Selection Guide
WHISPER_TINY
Best for:
- Quick drafts or previews
- Resource-constrained environments
- Real-time transcription
- Simple audio with clear speech
- Size: ~39M parameters
- Speed: Very fast
- Accuracy: Basic
- VRAM: ~1GB
WHISPER_BASE
Best for:
- General-purpose transcription
- Meetings and interviews
- Balanced speed and accuracy
- Size: ~74M parameters
- Speed: Fast
- Accuracy: Good
- VRAM: ~1-2GB
WHISPER_SMALL
Best for:
- Production transcription
- Podcasts and lectures
- Better accuracy needed
- Size: ~240M parameters
- Speed: Medium
- Accuracy: Very good
- VRAM: ~2GB
WHISPER_MEDIUM
Best for:
- Professional transcription
- Complex audio environments
- Multiple speakers
- Size: ~760M parameters
- Speed: Slow
- Accuracy: Excellent
- VRAM: ~5GB
WHISPER_LARGE
Best for:
- Maximum accuracy requirements
- Difficult audio conditions
- Multi-lingual content
- Publication-quality transcripts
- Size: ~1.5B parameters
- Speed: Very slow
- Accuracy: Best
- VRAM: ~10GB
Language Support
Whisper supports 90+ languages with automatic detection.
GPU Acceleration
Enable GPU for faster transcription.
Output Structure
Transcription appears as text items in the document.
Advanced Usage
Batch Audio Processing
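One way to process several files is a small worker pool that records failures instead of stopping at the first one. This is a sketch under stated assumptions: `transcribe_batch` is a hypothetical helper, and `transcribe_one` stands for whatever single-file transcription function you already have.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Union


def transcribe_batch(
    paths: Iterable[str],
    transcribe_one: Callable[[str], str],
    max_workers: int = 2,
) -> Dict[str, Union[str, Exception]]:
    """Run transcribe_one over paths; map each path to its text or its error."""
    path_list = list(paths)

    def safe(path: str) -> Union[str, Exception]:
        try:
            return transcribe_one(path)
        except Exception as exc:  # keep going when one file fails
            return exc

    results: Dict[str, Union[str, Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        for path, outcome in zip(path_list, pool.map(safe, path_list)):
            results[path] = outcome
    return results
```

`max_workers` defaults to a small value because parallel workers on one GPU may compete for the same VRAM; on CPU a higher value may be reasonable.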
Extract from Video
Long Audio Files
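Long recordings can be split into overlapping segments that are transcribed one at a time. The sketch below only computes the segment boundaries; `chunk_spans` and its default values (10-minute chunks, 5-second overlap) are illustrative assumptions, not Docling settings.

```python
from typing import List, Tuple


def chunk_spans(
    duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into (start, end) spans of at most chunk_s seconds."""
    duration_s = float(duration_s)
    if duration_s <= 0:
        return []
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    spans: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        # Step back by the overlap so words at a boundary are not cut in half.
        start = end - overlap_s
    return spans
```

Each span's start value is also the offset to add to timestamps from that chunk when the partial transcripts are merged back together.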
Performance Optimization
Model Selection
Choose model based on requirements:
| Use Case | Model | Speed | Accuracy |
|---|---|---|---|
| Draft/Preview | TINY | ⚡⚡⚡⚡ | ⭐⭐ |
| General | BASE | ⚡⚡⚡ | ⭐⭐⭐ |
| Production | SMALL | ⚡⚡ | ⭐⭐⭐⭐ |
| Professional | MEDIUM | ⚡ | ⭐⭐⭐⭐⭐ |
| Maximum Quality | LARGE | 🐌 | ⭐⭐⭐⭐⭐⭐ |
GPU Acceleration
Enable GPU for significant speedup.
Requirements:
- NVIDIA GPU with CUDA support
- Or Apple Silicon with MPS support
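Whether either requirement is met can be probed at runtime. This sketch assumes PyTorch is the runtime behind the ASR model, which the page does not state; `pick_device` is a hypothetical helper that falls back to CPU when PyTorch is absent.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed: fall back to CPU

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can be used to decide which device to request in the accelerator options.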
Memory Management
Larger models require more VRAM:
- TINY/BASE: 1-2GB VRAM
- SMALL: 2-3GB VRAM
- MEDIUM: 5-6GB VRAM
- LARGE: 10GB+ VRAM
If VRAM is limited:
- Use a smaller model
- Process on CPU (slower)
- Chunk long audio files
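The VRAM figures above can be turned into a simple selection helper. This is a sketch: `largest_model_for` is a hypothetical function, and the table uses the upper bound of each range listed above (10 GB for LARGE, the stated minimum).

```python
from typing import Optional

# Approximate VRAM needs in GB, ordered from smallest to largest model.
VRAM_GB = {
    "WHISPER_TINY": 2,
    "WHISPER_BASE": 2,
    "WHISPER_SMALL": 3,
    "WHISPER_MEDIUM": 6,
    "WHISPER_LARGE": 10,
}


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest model whose VRAM need fits, or None if none fits."""
    fitting = [name for name, need in VRAM_GB.items() if need <= vram_gb]
    # Dicts preserve insertion order, so the last fitting entry is the largest.
    return fitting[-1] if fitting else None
```

A result of `None` suggests falling back to CPU processing.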
Limitations
Troubleshooting
Poor transcription quality
Solutions:
- Use larger model (MEDIUM or LARGE)
- Ensure audio is clear (reduce background noise)
- Check correct language is detected
- Verify audio format is supported
Out of memory
Solutions:
- Use smaller model (TINY or BASE)
- Process on CPU instead of GPU
- Split long audio into chunks
Slow processing
Optimizations:
- Enable GPU acceleration
- Use smaller model
- Process shorter segments
- Use WHISPER_TINY for drafts
Unsupported format
Solution: Convert audio to supported format (WAV, MP3, FLAC)
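The page does not name a conversion tool. One common option is ffmpeg, sketched below on the assumption that the `ffmpeg` binary is installed and on `PATH`. Both function names are hypothetical; the 16 kHz default follows the sample-rate advice under Best Practices.

```python
import subprocess
from typing import List


def ffmpeg_to_wav_command(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes a mono WAV at the given sample rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file (audio or video)
        "-vn",                    # ignore any video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert_to_wav(src: str, dst: str, sample_rate: int = 16000) -> str:
    """Run ffmpeg to convert src into a WAV file at dst and return dst."""
    if src == dst:
        raise ValueError("src and dst must be different files")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(ffmpeg_to_wav_command(src, dst, sample_rate), check=True)
    return dst
```

Building the command separately from running it makes the arguments easy to inspect or log before any file is touched.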
Best Practices
- Audio quality: Use high-quality recordings (16kHz+ sample rate)
- Model selection: Start with SMALL, adjust based on results
- GPU usage: Enable GPU for faster processing
- Batch processing: Process multiple files in parallel
- Timeout protection: Set document_timeout for long files
- Format conversion: Convert to WAV for best compatibility
Use Cases
Meeting Transcription
Convert recorded meetings to searchable text
Interview Processing
Transcribe interviews for qualitative research
Podcast/Lecture
Create text versions of audio content
Voice Notes
Convert voice memos to text
See Also
- Backends Overview - Backend architecture
- Pipeline Options - ASR configuration
- AcceleratorOptions - GPU acceleration
- DocumentConverter - Main conversion API