MarkItDown can extract metadata from audio files and transcribe speech content using Google’s speech recognition service.
WAV: .wav (uncompressed audio)
MP3: .mp3 (MPEG audio)
M4A: .m4a (AAC audio)
MP4: .mp4 (video files with audio tracks)
Dependencies
# macOS
brew install exiftool
# Ubuntu/Debian
sudo apt-get install libimage-exiftool-perl
# Windows
# Download from https://exiftool.org/
Speech Transcription
pip install SpeechRecognition pydub
Or install with audio extras:
# Quote the extras so shells such as zsh do not expand the brackets
pip install "markitdown[audio-transcription]"
# or
pip install "markitdown[all]"
Audio Format Support: The pydub library requires ffmpeg or libav to decode MP3, M4A, and MP4 files:
macOS: brew install ffmpeg
Ubuntu/Debian: sudo apt-get install ffmpeg
Windows: Download from ffmpeg.org
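Before converting, it can help to confirm that the external tools are discoverable. The following is a minimal sketch using only the standard library. The helper name `find_audio_tools` is hypothetical and not part of MarkItDown, and it assumes the executables are named `exiftool`, `ffmpeg`, and `avconv` (the libav converter) on your PATH.

```python
import shutil


def find_audio_tools(names=("exiftool", "ffmpeg", "avconv")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report which external tools were found on this machine.
    for tool, path in find_audio_tools().items():
        print(f"{tool}: {path or 'not found'}")
```

If `exiftool` resolves to a path, that path is a reasonable candidate for the `exiftool_path` argument used in the usage examples.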
Features
EXIF Metadata: Extract artist, album, genre, and technical details
Speech Transcription: Convert speech to text using Google Speech Recognition
Audio Properties: Sample rate, bit depth, channels, duration
Multiple Formats: Support for WAV, MP3, M4A, and MP4
Basic Usage
Python (Metadata + Transcription)
from markitdown import MarkItDown

md = MarkItDown(exiftool_path="/usr/local/bin/exiftool")
result = md.convert("recording.wav")
print(result.markdown)
Output Examples
Title: Interview with Jane Doe
Artist: John Smith
Album: Tech Talks Podcast
Genre: Podcast
DateTimeOriginal: 2024:02:15 10:00:00
CreateDate: 2024:02:15 10:00:00
NumChannels: 2
SampleRate: 44100
BitsPerSample: 16
### Audio Transcript:
Welcome to today's episode. Today we're talking with Jane Doe about the future of artificial intelligence. Jane, thanks for joining us. Thanks for having me. Let's start with your background in machine learning.
Transcription Only
### Audio Transcript:
This is a test recording. The quick brown fox jumps over the lazy dog. Testing one two three.
No Speech Detected
Title: Background Music
Artist: Various Artists
Genre: Instrumental
### Audio Transcript:
[No speech detected]
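Because the output in these examples is a run of `Key: value` lines followed by the `### Audio Transcript:` heading, it can be separated into structured data with plain string handling. The sketch below assumes exactly that layout. The function name `split_audio_markdown` is hypothetical and not part of MarkItDown.

```python
TRANSCRIPT_MARKER = "### Audio Transcript:"


def split_audio_markdown(markdown: str):
    """Split converter output into (metadata dict, transcript or None)."""
    head, marker, tail = markdown.partition(TRANSCRIPT_MARKER)
    metadata = {}
    for line in head.splitlines():
        # Split on the first colon only, so values like "2024:02:15" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    transcript = tail.strip() if marker else None
    return metadata, transcript


sample = """Title: Background Music
Artist: Various Artists
Genre: Instrumental

### Audio Transcript:
[No speech detected]"""

metadata, transcript = split_audio_markdown(sample)
print(metadata["Title"])  # Background Music
print(transcript)         # [No speech detected]
```

A transcript of `None` means the heading was absent, which is distinct from the `[No speech detected]` placeholder.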
The converter extracts the following metadata fields (when available via ExifTool):
| Field | Description | Example |
| --- | --- | --- |
| Title | Track title | Episode 42: AI Ethics |
| Artist | Performer/artist | Jane Smith |
| Author | Content author | John Doe |
| Band | Band/group name | The Tech Podcast |
| Album | Album/collection | Season 2 |
| Genre | Music genre | Podcast, Jazz, Speech |
| Track | Track number | 5/12 |
| DateTimeOriginal | Recording date/time | 2024:02:15 14:30:00 |
| CreateDate | File creation date | 2024:02:15 14:30:00 |
| NumChannels | Audio channels | 2 (stereo), 1 (mono) |
| SampleRate | Sample rate in Hz | 44100, 48000 |
| AvgBytesPerSec | Average data rate (bytes per second) | 128000 |
| BitsPerSample | Bit depth | 16, 24 |

Note: Duration is not extracted when reading from memory streams, because the reported value can be inaccurate.
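The date and track values shown in the examples are plain strings, so downstream code usually needs to convert them. The sketch below assumes the colon-separated timestamp layout and the `number/total` track layout from the examples. ExifTool can emit other variants (time zones, fractional seconds) that this does not handle, and both helper names are hypothetical.

```python
from datetime import datetime


def parse_exif_datetime(value: str) -> datetime:
    """Parse an ExifTool-style timestamp such as '2024:02:15 14:30:00'."""
    return datetime.strptime(value, "%Y:%m:%d %H:%M:%S")


def parse_track(value: str):
    """Parse a track value such as '5/12' or '5' into (number, total or None)."""
    number, _, total = value.partition("/")
    return (int(number), int(total) if total else None)


recorded = parse_exif_datetime("2024:02:15 14:30:00")
print(recorded.isoformat())  # 2024-02-15T14:30:00
print(parse_track("5/12"))   # (5, 12)
```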
Speech Transcription
How It Works
Format Detection: The audio format is detected automatically from the file extension or MIME type
Format Conversion: Non-WAV formats (MP3, M4A, MP4) are converted to WAV using pydub
Speech Recognition: The Google Speech Recognition API transcribes the audio
Output: The transcript is added under the ### Audio Transcript: heading
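The detection and conversion decision in the first two steps can be sketched with simple lookup tables. This is an illustration only: the table contents, the mapping of `.m4a` to the `mp4` format tag, and the name `detect_audio_format` are assumptions, and the real converter may organize this differently.

```python
import os

# Hypothetical lookup tables, built from the extensions and MIME types listed in this page.
EXTENSION_TO_FORMAT = {".wav": "wav", ".mp3": "mp3", ".m4a": "mp4", ".mp4": "mp4"}
MIME_TO_FORMAT = {"audio/x-wav": "wav", "audio/mpeg": "mp3", "video/mp4": "mp4"}
NEEDS_CONVERSION = {"mp3", "mp4"}


def detect_audio_format(filename=None, mimetype=None):
    """Return a format tag from the extension, falling back to the MIME type."""
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        if ext in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[ext]
    if mimetype:
        return MIME_TO_FORMAT.get(mimetype.lower())
    return None


for name in ("recording.wav", "podcast.MP3", "memo.m4a"):
    fmt = detect_audio_format(filename=name)
    step = "convert to WAV" if fmt in NEEDS_CONVERSION else "use directly"
    print(f"{name}: {fmt} ({step})")
```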
WAV files are processed directly without conversion:
md = MarkItDown()
result = md.convert("recording.wav")
Other formats: AIFF and FLAC are also processed directly.
MP3 (Requires Conversion)
MP3 files are converted to WAV before transcription:
md = MarkItDown()
result = md.convert("podcast.mp3")  # Automatically converts to WAV
Requires: ffmpeg or libav installed.
M4A/MP4 (Requires Conversion)
For M4A and MP4 files, the audio track is extracted and converted to WAV:
md = MarkItDown()
result = md.convert("video.mp4")  # Extracts audio and transcribes
Requires: ffmpeg or libav installed.
Transcription Limitations
Internet Required: Google Speech Recognition requires an internet connection
Language: Only English is supported at present, because the recognizer is called with its default language
Length: Very long audio files may fail or take considerable time
Quality: Transcription accuracy depends on audio quality, accent, and background noise
API Limits: Google’s free tier has usage limits
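One possible workaround for the length limitation is to split a long WAV recording into shorter pieces before converting each one. The sketch below is not part of MarkItDown. It uses only the standard-library `wave` module, the name `split_wav` is hypothetical, and it cuts at fixed time boundaries, so a word that straddles a boundary may be transcribed poorly.

```python
import io
import wave


def split_wav(data: bytes, chunk_seconds: float = 30.0):
    """Yield WAV-encoded byte chunks of at most chunk_seconds each."""
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames_per_chunk = max(1, int(src.getframerate() * chunk_seconds))
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buffer = io.BytesIO()
            with wave.open(buffer, "wb") as dst:
                # Reuse the source parameters; the header length is corrected on write.
                dst.setparams(params)
                dst.writeframes(frames)
            yield buffer.getvalue()


# Demo: 2.5 s of 16-bit mono silence at 8 kHz, split into 1 s pieces.
demo = io.BytesIO()
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 20000)

chunks = list(split_wav(demo.getvalue(), chunk_seconds=1.0))
print(len(chunks))  # 3
```

Each chunk can be written to a temporary `.wav` file and passed to `md.convert()` in turn, with the transcripts joined afterwards.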
Implementation Details
Source Location
packages/markitdown/src/markitdown/converters/
├── _audio_converter.py # Main audio converter
├── _transcribe_audio.py # Speech transcription logic
└── _exiftool.py # ExifTool metadata extraction
Converter Class
Class Name: AudioConverter
Accepted Extensions: .wav, .mp3, .m4a, .mp4
MIME Types: audio/x-wav, audio/mpeg, video/mp4
Transcription Function
import io
from typing import BinaryIO

import pydub
import speech_recognition as sr


def transcribe_audio(file_stream: BinaryIO, *, audio_format: str = "wav") -> str:
    # Convert to WAV if needed
    if audio_format in ["mp3", "mp4"]:
        audio_segment = pydub.AudioSegment.from_file(file_stream, format=audio_format)
        audio_source = io.BytesIO()
        audio_segment.export(audio_source, format="wav")
        audio_source.seek(0)
    else:
        audio_source = file_stream

    # Transcribe with Google Speech Recognition
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_source) as source:
        audio = recognizer.record(source)
        transcript = recognizer.recognize_google(audio).strip()
        return "[No speech detected]" if transcript == "" else transcript
Advanced Examples
Batch Transcription
from markitdown import MarkItDown
import os

md = MarkItDown()
audio_dir = "recordings"

for filename in os.listdir(audio_dir):
    # Match extensions case-insensitively so "TALK.MP3" is picked up too
    if filename.lower().endswith(('.wav', '.mp3', '.m4a')):
        filepath = os.path.join(audio_dir, filename)
        print(f"Processing {filename}...")
        result = md.convert(filepath)
        # Save transcript next to the audio file, swapping only the extension
        output_path = os.path.splitext(filepath)[0] + '.md'
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(result.markdown)
        print(f"Saved to {output_path}")
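For larger batches it can be useful to see which files would be processed, and where each transcript would be written, before any conversion runs. The following is a dry-run sketch built on `pathlib`. The helper name `plan_batch` and the suffix set are assumptions for illustration and are not part of MarkItDown.

```python
import tempfile
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a", ".mp4"}


def plan_batch(audio_dir):
    """Return sorted (audio_path, markdown_path) pairs for supported files."""
    pairs = []
    for path in sorted(Path(audio_dir).iterdir()):
        # Compare suffixes in lower case so "intro.WAV" is included.
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES:
            pairs.append((path, path.with_suffix(".md")))
    return pairs


# Demo on a throwaway directory with two audio files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("intro.WAV", "talk.mp3", "notes.txt"):
        (Path(tmp) / name).touch()
    for audio, target in plan_batch(tmp):
        print(f"{audio.name} -> {target.name}")
```

Each `audio` path can then be passed to `md.convert()` and the result written to the paired `target` path.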
from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown(exiftool_path="/usr/local/bin/exiftool")
result = None
try:
    result = md.convert("music.mp3")
except MissingDependencyException:
    # Raised when transcription dependencies are missing; result stays None
    print("Transcription unavailable")
# Only print when the conversion actually produced output
if result is not None:
    print(result.markdown)
Podcast Episode Processing
from markitdown import MarkItDown

md = MarkItDown(exiftool_path="/usr/local/bin/exiftool")
result = md.convert("podcast_episode.mp3")

# Extract title from metadata
metadata_lines = result.markdown.split('\n')
title = next(
    (line.split(':', 1)[1].strip() for line in metadata_lines if line.startswith('Title:')),
    'Unknown',
)

# Extract transcript
transcript_marker = '### Audio Transcript:'
if transcript_marker in result.markdown:
    transcript = result.markdown.split(transcript_marker)[1].strip()
    print(f"Episode: {title}")
    print(f"Transcript length: {len(transcript)} characters")
else:
    print("No transcript generated")
Convert Video to Transcript
from markitdown import MarkItDown

md = MarkItDown()
# Extract and transcribe audio from video
result = md.convert("presentation.mp4")
# result.markdown contains transcription of spoken content
with open('presentation_transcript.md', 'w', encoding='utf-8') as f:
    f.write(f"# Presentation Transcript\n\n{result.markdown}")
Error Handling
from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()
try:
    result = md.convert("audio.mp3")
    print(result.markdown)
except MissingDependencyException:
    print("Install transcription dependencies:")
    print("pip install markitdown[audio-transcription]")
except FileNotFoundError:
    print("ffmpeg not found. Install it for MP3/M4A support:")
    print("macOS: brew install ffmpeg")
    print("Ubuntu: sudo apt-get install ffmpeg")
except Exception as e:
    print(f"Transcription failed: {e}")
    print("This may be due to:")
    print("- No speech in the audio")
    print("- Poor audio quality")
    print("- No internet connection (Google API required)")
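Since transcription depends on a network service, some failures may be transient. One common pattern is to retry with exponential backoff. The sketch below is generic and not part of MarkItDown: the name `with_retries` is hypothetical, and which exceptions are worth retrying is an assumption left to the caller through `retry_on` (a missing dependency, for example, will not fix itself on retry).

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call operation(); on a listed exception wait and retry, doubling the delay."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * (2 ** attempt))


# Demo with a stand-in operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


print(with_retries(flaky, retry_on=(ConnectionError,), sleep=lambda seconds: None))  # ok
```

In practice the operation would be something like `lambda: md.convert("audio.mp3")`, with `retry_on` narrowed to the network errors you actually observe.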
Use Cases
Record meetings and automatically generate searchable transcripts with speaker metadata.
Extract episode metadata and transcripts for podcast archives and show notes.
Convert audio interviews to text for analysis and quotation.
Transcribe voice memos and extract creation dates for organization.
Extract spoken content from video files for searchability.
Extract and catalog metadata from audio file collections.
Next Steps
Image Formats: Learn about image conversion with metadata extraction
Video Processing: Extract transcripts from YouTube videos