
Overview

Docling’s ASR (Automatic Speech Recognition) pipeline converts audio and video files into structured DoclingDocument objects, just like PDFs or Word documents. This enables:
  • Unified format: Audio transcripts as DoclingDocument for consistent processing
  • Multiple exports: Markdown, JSON, HTML, DocTags from audio/video
  • RAG integration: Direct use with LangChain and LlamaIndex for searchable archives
  • Automatic hardware optimization: MLX Whisper on Apple Silicon, native Whisper elsewhere

Supported Formats

Audio Formats

  • WAV
  • MP3
  • M4A
  • AAC
  • OGG
  • FLAC

Video Formats

  • MP4
  • AVI
  • MOV
The audio track is extracted automatically.

FFmpeg is required for M4A, AAC, OGG, FLAC, and all video formats. Install it with your package manager:
  • macOS: brew install ffmpeg
  • Ubuntu/Debian: apt-get install ffmpeg
  • Windows: Download from ffmpeg.org
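Before converting compressed audio or video, you can confirm FFmpeg is actually on your PATH with a quick standard-library check (an illustrative helper, not part of Docling):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if the ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("FFmpeg not found; M4A, AAC, OGG, FLAC, and video inputs will fail.")
```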

Installation

Install Docling with ASR extras:
pip install "docling[asr]"

Basic Usage

Transcribe an audio file:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

# Configure ASR pipeline
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Convert audio file
result = converter.convert(Path("recording.mp3"))
doc = result.document

# Export to Markdown
print(doc.export_to_markdown())

Video Transcription

The same code works for video files:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Audio track is automatically extracted
result = converter.convert(Path("meeting.mp4"))
print(result.document.export_to_markdown())
Docling automatically extracts the audio track from video files. You don’t need to run FFmpeg manually.

Output Format

Transcripts are generated as paragraph-level Markdown with timestamps:
[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain.

[time: 10.0-15.5]  The complete text of the essay appears in the book Intentions...
Each segment includes:
  • Timestamp range: Start and end time in seconds
  • Transcribed text: Recognized speech content
  • Paragraph breaks: Automatic segmentation
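Because each segment follows the `[time: start-end] text` pattern shown above, the Markdown export can be parsed back into structured segments with a small helper (a sketch against the example format, not a Docling API):

```python
import re
from typing import NamedTuple

class Segment(NamedTuple):
    start: float
    end: float
    text: str

# Matches lines like "[time: 5.28-9.96]  This is a LibriVox recording."
SEGMENT_RE = re.compile(r"\[time: ([\d.]+)-([\d.]+)\]\s+(.*)")

def parse_transcript(markdown: str) -> list[Segment]:
    """Extract (start, end, text) segments from a Docling ASR Markdown export."""
    segments = []
    for line in markdown.splitlines():
        m = SEGMENT_RE.match(line.strip())
        if m:
            segments.append(Segment(float(m.group(1)), float(m.group(2)), m.group(3)))
    return segments
```

This is handy for feeding segments into downstream tools that want explicit timestamps rather than prose.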

Available Models

Docling supports multiple Whisper model sizes:
from docling.datamodel import asr_model_specs

# Available models (smallest to largest)
pipeline_options.asr_options = asr_model_specs.WHISPER_TINY      # Fastest, lowest accuracy
pipeline_options.asr_options = asr_model_specs.WHISPER_BASE      # Fast
pipeline_options.asr_options = asr_model_specs.WHISPER_SMALL     # Balanced
pipeline_options.asr_options = asr_model_specs.WHISPER_MEDIUM    # Better accuracy
pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3  # Best accuracy
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO     # Recommended (default)

Model Comparison

| Model | Speed | Accuracy | Memory | Use Case |
|-------|-------|----------|--------|----------|
| WHISPER_TURBO | Fast | Excellent | Low | Recommended: best balance |
| WHISPER_TINY | Fastest | Basic | Minimal | Quick previews, low-resource |
| WHISPER_BASE | Very fast | Good | Low | Simple audio, speed priority |
| WHISPER_SMALL | Fast | Good | Medium | General purpose |
| WHISPER_MEDIUM | Medium | Better | Medium | Higher accuracy needs |
| WHISPER_LARGE_V3 | Slow | Best | High | Maximum accuracy |
WHISPER_TURBO is the default and recommended for most use cases. It provides excellent accuracy with fast processing.

Hardware Acceleration

Docling automatically selects the best Whisper implementation for your hardware. On Apple Silicon (M1/M2/M3) Macs, it uses MLX Whisper, which is typically 5-10x faster:
from docling.datamodel import asr_model_specs

# Automatically uses mlx-whisper on Apple Silicon
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
Installation (optional):
pip install mlx-whisper
MLX Whisper is automatically detected and used if installed.

Export Formats

Export transcripts to various formats:
# Paragraph-level Markdown with timestamps
markdown = doc.export_to_markdown()
print(markdown)

# Save to file
doc.save_as_markdown("transcript.md")

Batch Transcription

Process multiple audio/video files:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat, ConversionStatus
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Find all audio files
audio_dir = Path("recordings/")
audio_files = list(audio_dir.glob("*.mp3")) + list(audio_dir.glob("*.mp4"))

print(f"Found {len(audio_files)} files")

# Batch process (ensure the output directory exists)
Path("transcripts").mkdir(exist_ok=True)
for result in converter.convert_all(audio_files, raises_on_error=False):
    if result.status == ConversionStatus.SUCCESS:
        output_path = f"transcripts/{result.input.file.stem}.md"
        result.document.save_as_markdown(output_path)
        print(f"Transcribed: {result.input.file.name} -> {output_path}")
    else:
        print(f"Failed: {result.input.file.name}")

RAG Integration

Build searchable knowledge bases from audio archives:

LangChain Integration

from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load and chunk all audio files in a directory
loader = DoclingLoader("recordings/")
docs = loader.load()

# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Query in natural language
results = retriever.invoke("What did we decide about the auth service?")
for doc in results:
    print(doc.page_content)

Standalone Transcription Script

Complete example for processing a directory:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat, ConversionStatus
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

def transcribe_directory(input_dir: Path, output_dir: Path):
    """Transcribe all audio/video files in a directory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
    
    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    
    # Find all supported files
    files = []
    for pattern in ["*.mp3", "*.wav", "*.m4a", "*.mp4", "*.mov"]:
        files.extend(input_dir.glob(pattern))
    
    print(f"Found {len(files)} files to transcribe")
    
    for result in converter.convert_all(files, raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            output_path = output_dir / f"{result.input.file.stem}.md"
            result.document.save_as_markdown(output_path)
            print(f"✓ {result.input.file.name}")
        else:
            print(f"✗ {result.input.file.name}")

if __name__ == "__main__":
    transcribe_directory(
        input_dir=Path("recordings"),
        output_dir=Path("transcripts"),
    )

Use Cases

Meeting Archives

Make recorded meetings searchable. Process company all-hands, customer calls, and design reviews into a queryable knowledge base.

Podcast Transcription

Generate transcripts for podcasts, enabling search, SEO, and accessibility.

Lecture Notes

Convert recorded lectures to searchable text for study materials and accessibility.

Video Documentation

Extract text from tutorial videos, demos, and documentation for text-based search.

Limitations

| Limitation | Workaround |
|------------|------------|
| No SRT/WebVTT subtitle output | Use the openai-whisper CLI: whisper audio.mp3 --output_format srt |
| No speaker diarization | Use pyannote-audio for pre- or post-processing |
| No word-level timestamps | Not available in current export formats |
| Paragraph-level segments only | Sufficient for RAG, search, and summarization |
For subtitle generation workflows, use the openai-whisper CLI directly. Docling’s ASR pipeline is optimized for knowledge retrieval (RAG, search, summarization).
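That said, if you want basic subtitles without leaving Python, the paragraph-level timestamps in the Markdown export can be reshaped into SRT cues. A hedged sketch based on the `[time: start-end] text` format shown under Output Format (this is not a Docling feature):

```python
import re

def markdown_to_srt(markdown: str) -> str:
    """Convert Docling's '[time: start-end] text' lines into SRT cues."""
    pattern = re.compile(r"\[time: ([\d.]+)-([\d.]+)\]\s+(.*)")

    def fmt(seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues = []
    for i, m in enumerate(pattern.finditer(markdown), start=1):
        start, end, text = float(m.group(1)), float(m.group(2)), m.group(3)
        cues.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n")
    return "\n".join(cues)
```

Note that cues will be paragraph-length, not the short lines typical of subtitles; for proper subtitle timing, the openai-whisper CLI remains the better tool.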

Troubleshooting

FFmpeg not found

Install FFmpeg:
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Verify installation
ffmpeg -version
Slow transcription

  • Use a smaller model: asr_model_specs.WHISPER_BASE
  • Install MLX Whisper on Apple Silicon: pip install mlx-whisper
  • Enable GPU acceleration on CUDA systems
  • Process files in parallel batches
Poor accuracy

  • Use a larger model: asr_model_specs.WHISPER_LARGE_V3
  • Ensure audio quality is good (clear speech, minimal background noise)
  • For multilingual audio, Whisper automatically detects languages
  • Check that audio format is supported
Out-of-memory errors

  • Use a smaller Whisper model
  • Process fewer files concurrently
  • Split long audio files into chunks
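For the last tip, long recordings can be split into fixed-length chunks with FFmpeg's segment muxer before transcription. A sketch that only builds the command (chunk length and output naming are illustrative; run it with subprocess once FFmpeg is installed):

```python
def ffmpeg_split_command(input_file: str, chunk_seconds: int = 600) -> list[str]:
    """Build an ffmpeg command that splits audio into chunk_seconds-long pieces.

    Uses the segment muxer with stream copy (-c copy), so no re-encoding occurs.
    """
    stem = input_file.rsplit(".", 1)[0]
    return [
        "ffmpeg", "-i", input_file,
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-c", "copy",
        f"{stem}_%03d.mp3",
    ]
```

The resulting chunk files can then be fed to converter.convert_all() exactly as in the batch example above.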

Next Steps

Export Formats

Learn about all export format options

Batch Processing

Optimize processing for large audio archives

LangChain Integration

Build RAG pipelines with transcribed audio

Advanced Options

Configure hardware acceleration and timeouts
