Overview
AsrPipeline converts audio files to text using Automatic Speech Recognition (ASR) models. It supports multiple Whisper implementations and provides timestamped transcriptions with optional speaker identification.
Class Signature
Parameters
Configuration options for the ASR pipeline
ASR Options
Native Whisper Options
Uses OpenAI’s native Whisper implementation:
- Whisper model size: tiny, base, small, medium, large, large-v2, large-v3
- Sampling temperature for generation
- Maximum number of tokens to generate
- Enable verbose output during transcription
- Include segment-level timestamps
- Include word-level timestamps
MLX Whisper Options
Uses MLX-optimized Whisper for Apple Silicon:
- Hugging Face repository ID for the MLX Whisper model
- Language code (e.g., “en”, “es”, “fr”); auto-detected if None
- Task type: transcribe or translate (to English)
- Include word-level timestamps
- Threshold for detecting silence
- Log probability threshold for filtering
- Compression ratio threshold for quality filtering
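The three thresholds above are easier to reason about with a small example. The following is a minimal sketch of how Whisper-style decoders commonly apply them, not the code this pipeline runs. The function names are hypothetical, and the default values (0.6, -1.0, 2.4) are assumptions borrowed from OpenAI's reference Whisper `transcribe` function, since the defaults are not listed here.

```python
import zlib
from typing import Optional


def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def is_silence(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: Optional[float] = 0.6,   # assumed default
    logprob_threshold: Optional[float] = -1.0,    # assumed default
) -> bool:
    """Treat a segment as silence when the no-speech probability is high,
    unless the decoder was still confident in the text it produced."""
    if no_speech_threshold is None or no_speech_prob <= no_speech_threshold:
        return False
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        return False
    return True


def needs_retry(
    text: str,
    avg_logprob: float,
    compression_ratio_threshold: Optional[float] = 2.4,  # assumed default
    logprob_threshold: Optional[float] = -1.0,           # assumed default
) -> bool:
    """Flag a decode as low quality: too repetitive or too unlikely."""
    if (
        compression_ratio_threshold is not None
        and compression_ratio(text) > compression_ratio_threshold
    ):
        return True
    if logprob_threshold is not None and avg_logprob < logprob_threshold:
        return True
    return False
```

A looping hallucination such as "la la la ..." compresses very well, so its ratio is far above the threshold and it is flagged, while ordinary sentences are not.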
Methods
execute
Executes the pipeline on an input audio file.
Parameters:
- Input audio file to transcribe (WAV, MP3, etc.)
- If True, raises exceptions on errors; otherwise captures them in ConversionResult
Returns: Conversion result containing transcribed text with timestamps
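The raise-or-capture behaviour described for execute can be illustrated with a small standalone sketch. Everything here is hypothetical: `SketchResult` stands in for ConversionResult, `raises_on_error` is an assumed name for the flag (the parameter name is not given above), and `transcribe` is any callable that returns text for a path.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SketchResult:
    """Hypothetical stand-in for ConversionResult."""
    text: str = ""
    errors: List[str] = field(default_factory=list)


def run_transcription(
    path: str,
    transcribe: Callable[[str], str],
    raises_on_error: bool = True,  # assumed flag name
) -> SketchResult:
    """Run `transcribe`, either propagating or capturing failures."""
    try:
        return SketchResult(text=transcribe(path))
    except Exception as exc:
        if raises_on_error:
            raise  # the caller handles the exception
        # Otherwise record the error on the result and return normally.
        return SketchResult(errors=[f"{type(exc).__name__}: {exc}"])
```

Capturing errors is convenient when processing many files, because one bad input does not stop the run; raising is usually easier while debugging a single file.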
get_default_options
Returns default pipeline options.
Returns: Default configuration for AsrPipeline
is_backend_supported
Checks if a backend is supported by this pipeline.
Parameters:
- Backend instance to check
Returns: True if backend is NoOpBackend, False otherwise
Output Format
The pipeline produces a DoclingDocument with text items containing:
- text: Transcribed content
- source: TrackSource with timing information
  - start_time: Segment start in seconds
  - end_time: Segment end in seconds
  - voice: Speaker identifier (if available)
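To make the shape of a text item concrete, here is a minimal sketch using plain dataclasses. `SketchTrackSource` and `SketchTextItem` are hypothetical stand-ins that only mirror the fields listed above; they are not the real Docling types, and `format_item` is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTrackSource:
    """Hypothetical mirror of the timing fields described above."""
    start_time: float            # segment start in seconds
    end_time: float              # segment end in seconds
    voice: Optional[str] = None  # speaker identifier, if available


@dataclass
class SketchTextItem:
    text: str
    source: SketchTrackSource


def format_item(item: SketchTextItem) -> str:
    """Render one item as a timestamped transcript line."""
    src = item.source
    speaker = f" {src.voice}:" if src.voice else ""
    return f"[{src.start_time:.2f}s - {src.end_time:.2f}s]{speaker} {item.text}"
```

For example, an item with text "Hello there." spanning 0.0 to 1.5 seconds and no speaker renders as `[0.00s - 1.50s] Hello there.`.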
Usage Examples
Native Whisper
MLX Whisper (Apple Silicon)
Export to Different Formats
Supported Audio Formats
The pipeline supports common audio formats:
- WAV
- MP3
- FLAC
- M4A
- OGG
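When collecting input files, it can help to filter on the formats listed above before handing them to the pipeline. The sketch below checks file extensions only. The suffix set is taken from this list and may be incomplete, and the helper name is hypothetical; the pipeline itself may detect formats differently.

```python
from pathlib import Path

# Suffixes for the formats listed above (assumed, possibly incomplete).
SUPPORTED_SUFFIXES = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension matches a listed audio format."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES
```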
Error Handling
Zero-Duration Segments
The pipeline automatically handles zero-duration ASR segments.
Failed Transcription
If transcription fails:
- Status is set to ConversionStatus.FAILURE
- Error details are captured in ConversionResult.errors
- An empty document results in ConversionStatus.PARTIAL_SUCCESS
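The error-handling rules in this section can be sketched as two small functions. This is an illustration under stated assumptions, not the pipeline's code: `SketchStatus` is a hypothetical stand-in for ConversionStatus, and dropping zero-duration segments is an assumption, since the text does not say whether such segments are dropped or adjusted.

```python
from enum import Enum
from typing import List, Tuple


class SketchStatus(Enum):
    """Hypothetical stand-in for ConversionStatus."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


Segment = Tuple[float, float, str]  # (start_time, end_time, text)


def drop_zero_duration(segments: List[Segment]) -> List[Segment]:
    """Assumption: segments whose end is not after their start are dropped."""
    return [seg for seg in segments if seg[1] > seg[0]]


def resolve_status(segments: List[Segment], errors: List[str]) -> SketchStatus:
    """Map the outcome of a run to a status, following the rules above."""
    if errors:
        return SketchStatus.FAILURE          # transcription failed
    if not segments:
        return SketchStatus.PARTIAL_SUCCESS  # ran, but the document is empty
    return SketchStatus.SUCCESS
```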
Installation Requirements
Native Whisper
MLX Whisper
MLX Whisper is only available on Apple Silicon (M1/M2/M3) devices.
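Because of this restriction, code that chooses between the two backends may want to check the platform first. A minimal sketch using only the standard library follows; the helper name is hypothetical, and the optional arguments exist only so the check can be exercised on any machine.

```python
import platform
from typing import Optional


def is_apple_silicon(
    system: Optional[str] = None, machine: Optional[str] = None
) -> bool:
    """Return True on macOS running natively on an ARM64 (M-series) CPU."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # A Python interpreter running under Rosetta reports "x86_64" here,
    # so it is treated as unsupported even on Apple Silicon hardware.
    return system == "Darwin" and machine == "arm64"
```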
Performance Considerations
- Model Size: Larger models (medium, large) are more accurate but slower
- Word Timestamps: Enabling them gives detailed timing but increases processing time
- GPU Acceleration: Use CUDA/MLX for faster inference
- Batch Processing: Process multiple files sequentially
- Language Specification: Setting the language explicitly skips auto-detection and improves speed
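To compare these trade-offs on real data, one option is to process files one at a time and record how long each takes. The sketch below assumes a hypothetical `transcribe` callable that maps a path to text; it does not call the pipeline itself.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def transcribe_sequentially(
    paths: Iterable[str],
    transcribe: Callable[[str], str],  # hypothetical: path -> transcript
) -> Dict[str, Tuple[str, float]]:
    """Process files in order, returning (text, seconds taken) per path."""
    results: Dict[str, Tuple[str, float]] = {}
    for path in paths:
        started = time.perf_counter()
        text = transcribe(path)
        results[path] = (text, time.perf_counter() - started)
    return results
```

Running the same file set with two model sizes and comparing the recorded times gives a rough picture of the accuracy versus speed trade-off on a given machine.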
Model Recommendations
| Use Case | Native Whisper | MLX Whisper |
|---|---|---|
| Quick transcription | tiny or base | whisper-tiny |
| Balanced quality | small or medium | whisper-small |
| High accuracy | large-v3 | whisper-large-v3 |
| Production | large-v3 | whisper-large-v3-turbo |
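The table can also be encoded as a small lookup for scripts that pick a model by use case. The dictionary below copies the table's values verbatim; the key names and the `recommend` helper are hypothetical.

```python
from typing import Dict, List

# Values copied from the table above; keys are illustrative.
RECOMMENDED_MODELS: Dict[str, Dict[str, List[str]]] = {
    "quick transcription": {"native": ["tiny", "base"], "mlx": ["whisper-tiny"]},
    "balanced quality": {"native": ["small", "medium"], "mlx": ["whisper-small"]},
    "high accuracy": {"native": ["large-v3"], "mlx": ["whisper-large-v3"]},
    "production": {"native": ["large-v3"], "mlx": ["whisper-large-v3-turbo"]},
}


def recommend(use_case: str, backend: str) -> List[str]:
    """Return the recommended model names for a use case and backend."""
    try:
        return RECOMMENDED_MODELS[use_case.lower()][backend.lower()]
    except KeyError:
        raise ValueError(f"unknown use case or backend: {use_case!r}, {backend!r}")
```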