MarkItDown can extract metadata from audio files and transcribe speech content using Google’s speech recognition service.

Supported Formats

  • WAV: .wav (uncompressed audio)
  • MP3: .mp3 (MPEG audio)
  • M4A: .m4a (AAC audio)
  • MP4: .mp4 (video files with audio tracks)

Dependencies

Metadata Extraction (Optional)

# macOS
brew install exiftool

# Ubuntu/Debian
sudo apt-get install libimage-exiftool-perl

# Windows
# Download from https://exiftool.org/

Speech Transcription

pip install SpeechRecognition pydub
Or install with audio extras:
pip install "markitdown[audio-transcription]"
# or
pip install "markitdown[all]"
Audio Format Support: The pydub library requires ffmpeg or libav for MP3/M4A/MP4 formats:
  • macOS: brew install ffmpeg
  • Ubuntu/Debian: sudo apt-get install ffmpeg
  • Windows: Download from ffmpeg.org
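Before converting, it can help to confirm that the external tools are discoverable. The following is a minimal sketch using only the standard library; `find_tools` is a hypothetical helper (not part of MarkItDown), and it assumes the tools are installed under the names `exiftool` and `ffmpeg` on your PATH.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(["exiftool", "ffmpeg"]).items():
        print(f"{name}: {path or 'not found'}")
```

The path reported for `exiftool` could then be passed as `exiftool_path` when constructing `MarkItDown`, instead of hard-coding a location.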

Features

EXIF Metadata

Extract artist, album, genre, and technical details

Speech Transcription

Convert speech to text using Google Speech Recognition

Audio Properties

Sample rate, bit depth, and channel count

Multiple Formats

Support for WAV, MP3, M4A, and MP4

Basic Usage

from markitdown import MarkItDown

md = MarkItDown(exiftool_path="/usr/local/bin/exiftool")
result = md.convert("recording.wav")
print(result.markdown)

Output Examples

With Metadata and Transcription

Title: Interview with Jane Doe
Artist: John Smith
Album: Tech Talks Podcast
Genre: Podcast
DateTimeOriginal: 2024:02:15 10:00:00
CreateDate: 2024:02:15 10:00:00
NumChannels: 2
SampleRate: 44100
BitsPerSample: 16

### Audio Transcript:
Welcome to today's episode. Today we're talking with Jane Doe about the future of artificial intelligence. Jane, thanks for joining us. Thanks for having me. Let's start with your background in machine learning.

Transcription Only

### Audio Transcript:
This is a test recording. The quick brown fox jumps over the lazy dog. Testing one two three.

No Speech Detected

Title: Background Music
Artist: Various Artists
Genre: Instrumental

### Audio Transcript:
[No speech detected]
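The output examples above share one layout: `Key: Value` metadata lines, optionally followed by the transcript heading. Assuming that layout holds, a small helper can split the output into structured parts. `split_output` is a hypothetical name used here for illustration, not a MarkItDown function.

```python
TRANSCRIPT_MARKER = "### Audio Transcript:"


def split_output(markdown):
    """Split converter output into (metadata dict, transcript or None)."""
    head, marker, tail = markdown.partition(TRANSCRIPT_MARKER)
    metadata = {}
    for line in head.splitlines():
        # Split on the first colon only, so values such as
        # "2024:02:15 10:00:00" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    transcript = tail.strip() if marker else None
    return metadata, transcript


sample = """Title: Background Music
Artist: Various Artists
Genre: Instrumental

### Audio Transcript:
[No speech detected]"""

metadata, transcript = split_output(sample)
print(metadata)
print(transcript)
```

A transcript of `None` means the heading was absent, which is distinct from the `[No speech detected]` placeholder.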

Metadata Fields

The converter extracts the following metadata fields (when available via ExifTool):
| Field | Description | Example |
| --- | --- | --- |
| Title | Track title | Episode 42: AI Ethics |
| Artist | Performer/artist | Jane Smith |
| Author | Content author | John Doe |
| Band | Band/group name | The Tech Podcast |
| Album | Album/collection | Season 2 |
| Genre | Music genre | Podcast, Jazz, Speech |
| Track | Track number | 5/12 |
| DateTimeOriginal | Recording date/time | 2024:02:15 14:30:00 |
| CreateDate | File creation date | 2024:02:15 14:30:00 |
| NumChannels | Audio channels | 2 (stereo), 1 (mono) |
| SampleRate | Sample rate in Hz | 44100, 48000 |
| AvgBytesPerSec | Average data rate in bytes per second | 128000 |
| BitsPerSample | Bit depth | 16, 24 |
Note: Duration is not extracted when reading from memory streams due to potential inaccuracies.
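If you need a duration for an uncompressed WAV file, it can be computed separately from the frame count and sample rate. The sketch below uses Python's standard `wave` module, which only reads PCM WAV files; `wav_properties` and the `DurationSeconds` key are hypothetical names, not part of MarkItDown or ExifTool.

```python
import wave


def wav_properties(path):
    """Read basic properties of a PCM WAV file, including its duration."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        sample_width = wav.getsampwidth()  # bytes per sample
        frames = wav.getnframes()
    return {
        "NumChannels": channels,
        "SampleRate": sample_rate,
        "BitsPerSample": sample_width * 8,
        # For PCM data the byte rate follows from the fields above.
        "AvgBytesPerSec": sample_rate * channels * sample_width,
        "DurationSeconds": frames / sample_rate,
    }
```

For example, 16-bit stereo audio at 44100 Hz has a byte rate of 44100 × 2 × 2 = 176400 bytes per second.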

Speech Transcription

How It Works

  1. Format Detection: Automatically detects audio format from extension/MIME type
  2. Format Conversion: Non-WAV formats (MP3, M4A, MP4) are converted to WAV using pydub
  3. Speech Recognition: Google Speech Recognition API transcribes the audio
  4. Output: Transcript added under ### Audio Transcript: heading
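Steps 1 and 2 can be sketched as a simple routing decision. This is an illustration under stated assumptions, not the converter's actual code: `plan_transcription` is a hypothetical helper, it looks at the file extension only (MIME detection is omitted), and the two format sets are taken from the formats described in this section.

```python
import os

# Assumption: these sets mirror the formats described in this section.
DIRECT_FORMATS = {"wav", "aiff", "flac"}      # passed to the recognizer as-is
CONVERTED_FORMATS = {"mp3", "m4a", "mp4"}     # converted to WAV first


def plan_transcription(filename):
    """Return (audio_format, needs_wav_conversion) for a file name."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext in DIRECT_FORMATS:
        return ext, False
    if ext in CONVERTED_FORMATS:
        return ext, True
    raise ValueError(f"Unsupported audio format: {filename!r}")


if __name__ == "__main__":
    for name in ["recording.wav", "podcast.mp3", "video.mp4"]:
        print(name, plan_transcription(name))
```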

Supported Audio Formats

WAV files are processed directly without conversion:
md = MarkItDown()
result = md.convert("recording.wav")
Formats: AIFF and FLAC are also processed directly.
MP3 files are converted to WAV before transcription:
md = MarkItDown()
result = md.convert("podcast.mp3")  # Automatically converts to WAV
Requires: ffmpeg or libav installed.
M4A and MP4 files extract audio track and convert to WAV:
md = MarkItDown()
result = md.convert("video.mp4")  # Extracts audio and transcribes
Requires: ffmpeg or libav installed.

Transcription Limitations

  • Internet Required: Google Speech Recognition requires an internet connection
  • Language: Transcription defaults to English
  • Length: Very long audio files may fail or take considerable time
  • Quality: Transcription accuracy depends on audio quality, accent, background noise
  • API Limits: Google’s free tier has usage limits
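One way to work around the length limitation is to split a long recording into shorter pieces and convert each piece separately. The sketch below does this for PCM WAV files with the standard `wave` module; `split_wav` is a hypothetical helper, and the chunk length is an arbitrary choice rather than a documented limit.

```python
import os
import wave


def split_wav(path, chunk_seconds, out_dir):
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(path))[0]
    chunk_paths = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = os.path.join(out_dir, f"{stem}_{index:03d}.wav")
            with wave.open(chunk_path, "wb") as dst:
                # Copy the audio layout; the frame count is set on close.
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each returned path could then be passed to `md.convert()` and the transcripts joined in order. Cutting at fixed intervals may split a word across two chunks, so accuracy near the boundaries can suffer.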

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/
├── _audio_converter.py      # Main audio converter
├── _transcribe_audio.py     # Speech transcription logic
└── _exiftool.py             # ExifTool metadata extraction

Converter Class

  • Class Name: AudioConverter
  • Accepted Extensions: .wav, .mp3, .m4a, .mp4
  • MIME Types: audio/x-wav, audio/mpeg, video/mp4

Transcription Function

import io
from typing import BinaryIO

import pydub
import speech_recognition as sr


def transcribe_audio(file_stream: BinaryIO, *, audio_format: str = "wav") -> str:
    # Convert compressed formats to WAV in memory (requires ffmpeg or libav)
    if audio_format in ["mp3", "mp4"]:
        audio_segment = pydub.AudioSegment.from_file(file_stream, format=audio_format)
        audio_source = io.BytesIO()
        audio_segment.export(audio_source, format="wav")
        audio_source.seek(0)
    else:
        # Other formats are passed to the recognizer unchanged
        audio_source = file_stream

    # Transcribe with Google Speech Recognition (network call)
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_source) as source:
        audio = recognizer.record(source)
        transcript = recognizer.recognize_google(audio).strip()
        return "[No speech detected]" if transcript == "" else transcript
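Because the function above ends in a network call, transient failures are possible. A caller might wrap it in a retry loop with exponential backoff. The following is a generic sketch, not something MarkItDown provides: `call_with_retries` and its parameters are hypothetical, and which exceptions are worth retrying is left to the caller.

```python
import time


def call_with_retries(func, *, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

With the SpeechRecognition library, request failures surface as `sr.RequestError`, so `retry_on=(sr.RequestError,)` would be a reasonable choice. Retrying on `sr.UnknownValueError` (unintelligible audio) is unlikely to help, since the same audio will produce the same result.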

Advanced Examples

Batch Transcription

from markitdown import MarkItDown
import os

md = MarkItDown()
audio_dir = "recordings"

for filename in os.listdir(audio_dir):
    if filename.lower().endswith(('.wav', '.mp3', '.m4a')):
        filepath = os.path.join(audio_dir, filename)
        print(f"Processing {filename}...")

        result = md.convert(filepath)

        # Save transcript next to the source file, swapping only the final extension
        output_path = os.path.splitext(filepath)[0] + '.md'
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(result.markdown)
        
        print(f"Saved to {output_path}")

Extract Only Metadata

from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown(exiftool_path="/usr/local/bin/exiftool")

try:
    result = md.convert("music.mp3")
except MissingDependencyException:
    # convert() raised before returning, so there is no result to print
    print("Transcription dependencies are missing; no output was produced")
else:
    # Keep only the text before the transcript heading (the metadata block)
    metadata = result.markdown.split("### Audio Transcript:", 1)[0].strip()
    print(metadata)

Podcast Episode Processing

from markitdown import MarkItDown

md = MarkItDown(exiftool_path="/usr/local/bin/exiftool")
result = md.convert("podcast_episode.mp3")

# Extract title from metadata
metadata_lines = result.markdown.split('\n')
title = next((line.split(':', 1)[1].strip() for line in metadata_lines if line.startswith('Title:')), 'Unknown')

# Extract transcript
transcript_marker = '### Audio Transcript:'
if transcript_marker in result.markdown:
    transcript = result.markdown.split(transcript_marker, 1)[1].strip()
    print(f"Episode: {title}")
    print(f"Transcript length: {len(transcript)} characters")
else:
    print("No transcript generated")

Convert Video to Transcript

from markitdown import MarkItDown

md = MarkItDown()

# Extract and transcribe audio from video
result = md.convert("presentation.mp4")

# result.markdown contains transcription of spoken content
with open('presentation_transcript.md', 'w') as f:
    f.write(f"# Presentation Transcript\n\n{result.markdown}")

Error Handling

from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.markdown)
except MissingDependencyException as e:
    print("Install transcription dependencies:")
    print("pip install markitdown[audio-transcription]")
except FileNotFoundError:
    print("ffmpeg not found. Install it for MP3/M4A support:")
    print("macOS: brew install ffmpeg")
    print("Ubuntu: sudo apt-get install ffmpeg")
except Exception as e:
    print(f"Transcription failed: {e}")
    print("This may be due to:")
    print("- No speech in the audio")
    print("- Poor audio quality")
    print("- No internet connection (Google API required)")

Use Cases

  • Transcribe recorded meetings into searchable text, with file metadata where available.
  • Extract episode metadata and transcripts for podcast archives and show notes.
  • Convert audio interviews to text for analysis and quotation.
  • Transcribe voice memos and extract creation dates for organization.
  • Extract spoken content from video files for searchability.
  • Extract and catalog metadata from audio file collections.

Next Steps

Image Formats

Learn about image conversion with metadata extraction

Video Processing

Extract transcripts from YouTube videos
