Overview

AsrPipeline converts audio files to text using Automatic Speech Recognition (ASR) models. It supports multiple Whisper implementations and provides timestamped transcriptions with optional speaker identification.

Class Signature

class AsrPipeline(BasePipeline):
    def __init__(self, pipeline_options: AsrPipelineOptions)

Parameters

  • pipeline_options (AsrPipelineOptions, required): Configuration options for the ASR pipeline

ASR Options

Native Whisper Options

Uses OpenAI’s native Whisper implementation:
from docling.datamodel.pipeline_options_asr_model import (
    InlineAsrNativeWhisperOptions
)

asr_options = InlineAsrNativeWhisperOptions(
    repo_id="base",  # Model size: tiny, base, small, medium, large
    temperature=0.0,
    max_new_tokens=448,
    verbose=False,
    timestamps=True,
    word_timestamps=True
)
  • repo_id (str, default: "base"): Whisper model size: tiny, base, small, medium, large, large-v2, large-v3
  • temperature (float, default: 0.0): Sampling temperature for generation
  • max_new_tokens (int, default: 448): Maximum number of tokens to generate
  • verbose (bool, default: False): Enable verbose output during transcription
  • timestamps (bool, default: True): Include segment-level timestamps
  • word_timestamps (bool, default: False): Include word-level timestamps

MLX Whisper Options

Uses MLX-optimized Whisper for Apple Silicon:
from docling.datamodel.pipeline_options_asr_model import (
    InlineAsrMlxWhisperOptions
)

asr_options = InlineAsrMlxWhisperOptions(
    repo_id="mlx-community/whisper-base",
    language="en",
    task="transcribe",
    word_timestamps=True,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    compression_ratio_threshold=2.4
)
  • repo_id (str, required): Hugging Face repository ID for the MLX Whisper model
  • language (str | None, default: None): Language code (e.g., "en", "es", "fr"); auto-detected if None
  • task (str, default: "transcribe"): Task type, transcribe or translate (to English)
  • word_timestamps (bool, default: False): Include word-level timestamps
  • no_speech_threshold (float, default: 0.6): No-speech probability above which a segment is treated as silence
  • logprob_threshold (float, default: -1.0): Average log-probability below which low-confidence segments are filtered
  • compression_ratio_threshold (float, default: 2.4): Compression ratio above which repetitive, degenerate output is filtered
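These three thresholds interact the way Whisper-style decoders filter segments: a segment is dropped as silence when its no-speech probability is high and its average log-probability is low, and flagged as degenerate when its output compresses too well (a sign of repetition). A rough sketch of that logic; the function name and exact rule are illustrative, not docling's internal implementation:

```python
def keep_segment(
    avg_logprob: float,
    compression_ratio: float,
    no_speech_prob: float,
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
    no_speech_threshold: float = 0.6,
) -> bool:
    """Return True if a decoded segment should be kept."""
    # Likely silence: the model is confident there is no speech
    # AND the decoded tokens have low average log-probability.
    if no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold:
        return False
    # Repetitive, degenerate output compresses too well.
    if compression_ratio > compression_ratio_threshold:
        return False
    return True
```

Raising no_speech_threshold keeps more borderline-quiet segments; lowering compression_ratio_threshold filters repetitive output more aggressively.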

Methods

execute

Executes the pipeline on an input audio file.
def execute(
    self,
    in_doc: InputDocument,
    raises_on_error: bool
) -> ConversionResult
  • in_doc (InputDocument, required): Input audio file to transcribe (WAV, MP3, etc.)
  • raises_on_error (bool, required): If True, raises exceptions on errors; otherwise errors are captured in the ConversionResult

Returns ConversionResult: the conversion result containing transcribed text with timestamps

get_default_options

Returns default pipeline options.
@classmethod
def get_default_options(cls) -> AsrPipelineOptions
Returns AsrPipelineOptions: the default configuration for AsrPipeline

is_backend_supported

Checks if a backend is supported by this pipeline.
@classmethod
def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool
  • backend (AbstractDocumentBackend, required): Backend instance to check

Returns bool: True if the backend is a NoOpBackend, False otherwise

Output Format

The pipeline produces a DoclingDocument with text items containing:
  • text: Transcribed content
  • source: TrackSource with timing information
    • start_time: Segment start in seconds
    • end_time: Segment end in seconds
    • voice: Speaker identifier (if available)
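Since start_time and end_time are plain seconds, rendering them in a subtitle-style format is straightforward. A small hypothetical helper (not part of docling):

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"
```

Combined with each item's text, this is enough to emit an .srt transcript from the document's text items.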

Usage Examples

Native Whisper

from docling.pipeline.asr_pipeline import AsrPipeline
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel.pipeline_options_asr_model import (
    InlineAsrNativeWhisperOptions
)
from docling.datamodel.document import InputDocument
from docling.datamodel.base_models import ConversionStatus, InputFormat

# Configure pipeline
asr_options = InlineAsrNativeWhisperOptions(
    repo_id="base",
    word_timestamps=True
)

pipeline_options = AsrPipelineOptions(
    asr_options=asr_options
)

# Create pipeline
pipeline = AsrPipeline(pipeline_options=pipeline_options)

# Process audio file
input_doc = InputDocument(
    path_or_stream="meeting.wav",
    format=InputFormat.AUDIO
)
result = pipeline.execute(input_doc, raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    # Export transcript
    for item in result.document.texts:
        track = item.source
        print(f"[{track.start_time:.2f}s - {track.end_time:.2f}s]: {item.text}")

MLX Whisper (Apple Silicon)

from docling.datamodel.pipeline_options_asr_model import (
    InlineAsrMlxWhisperOptions
)

# Configure MLX pipeline
asr_options = InlineAsrMlxWhisperOptions(
    repo_id="mlx-community/whisper-large-v3",
    language="en",
    word_timestamps=True
)

pipeline_options = AsrPipelineOptions(
    asr_options=asr_options
)

pipeline = AsrPipeline(pipeline_options=pipeline_options)

# Process the audio file (reusing input_doc from the Native Whisper example)
result = pipeline.execute(input_doc, raises_on_error=False)

Export to Different Formats

if result.status == ConversionStatus.SUCCESS:
    # Export to Markdown
    markdown = result.document.export_to_markdown()
    
    # Export to text
    text = result.document.export_to_text()
    
    # Export to JSON
    json_output = result.document.export_to_json()

Supported Audio Formats

The pipeline supports common audio formats:
  • WAV
  • MP3
  • FLAC
  • M4A
  • OGG
For BytesIO streams, the format is inferred from the filename extension.

Error Handling

Zero-Duration Segments

The pipeline automatically handles zero-duration ASR segments:
# If end_time <= start_time, adds small epsilon (0.001s)
# to create valid time range
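The repair amounts to clamping the segment end time; a minimal sketch with a hypothetical helper, mirroring the 0.001 s epsilon described above:

```python
def fix_segment_times(start: float, end: float, eps: float = 0.001) -> tuple[float, float]:
    """Ensure a segment has a strictly positive duration."""
    if end <= start:
        end = start + eps  # extend zero- or negative-duration segments
    return start, end
```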

Failed Transcription

If transcription fails:
  • Status is set to ConversionStatus.FAILURE
  • Error details are captured in ConversionResult.errors
  • Empty document results in ConversionStatus.PARTIAL_SUCCESS

Installation Requirements

Native Whisper

pip install openai-whisper
# or
uv sync --extra asr

MLX Whisper

pip install mlx-whisper
# or
uv sync --extra asr
MLX Whisper is only available on Apple Silicon (M1/M2/M3) devices.

Performance Considerations

  • Model Size: Larger models (medium, large) are more accurate but slower
  • Word Timestamps: Word-level timestamps give detailed timing but increase processing time
  • GPU Acceleration: Use CUDA (native Whisper) or MLX (Apple Silicon) for faster inference
  • Batch Processing: Reuse a single pipeline instance across files so the model is loaded only once
  • Language Specification: Setting language explicitly skips auto-detection and improves speed

Model Recommendations

Use Case            | Native Whisper  | MLX Whisper
------------------- | --------------- | ----------------------
Quick transcription | tiny or base    | whisper-tiny
Balanced quality    | small or medium | whisper-small
High accuracy       | large-v3        | whisper-large-v3
Production          | large-v3        | whisper-large-v3-turbo