MarkItDown can integrate with Large Language Models (LLMs) to generate detailed descriptions for images during conversion, making the output more accessible and useful for text analysis.
## Overview
When an LLM client is configured, MarkItDown:
- Extracts image metadata using ExifTool (if available)
- Encodes the image as a base64 data URI
- Sends the image to the LLM with a prompt
- Includes the AI-generated description in the Markdown output
LLM integration is currently supported for JPEG and PNG images only.
## Prerequisites

### Install OpenAI Package

The `openai` package is not included in MarkItDown's dependencies and must be installed separately: `pip install openai`.

### Get API Key

Obtain an API key from OpenAI or your LLM provider.

### Optional: Install ExifTool

For enhanced image metadata extraction:

```bash
# macOS
brew install exiftool

# Ubuntu/Debian
sudo apt-get install libimage-exiftool-perl
```
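The enriched metadata shown later on this page depends on the `exiftool` binary being discoverable at runtime. A quick way to verify the install from Python (a minimal sketch using only the standard library; the messages are illustrative):

```python
import shutil


def exiftool_available() -> bool:
    """Return True if the exiftool binary is on the PATH."""
    return shutil.which("exiftool") is not None


if exiftool_available():
    print("ExifTool found: image metadata extraction will be enriched")
else:
    print("ExifTool not found: conversion still works, with less metadata")
```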
## Basic Usage

### Python API
```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key="your-api-key")

# Create MarkItDown with LLM integration
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)

# Convert an image
result = md.convert("photo.jpg")
print(result.markdown)
```
Output:

```markdown
ImageSize: 1920x1080
DateTimeOriginal: 2024:02:28 10:30:00
GPSPosition: 37.7749° N, 122.4194° W

# Description:
A scenic view of the Golden Gate Bridge during sunset, with vibrant orange and pink hues reflecting off the water. The iconic suspension bridge spans across the bay, with the city of San Francisco visible in the background.
```
## Configuration Options

### LLM Client
Any OpenAI-compatible client that supports the chat completions API:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)
```
### LLM Model
Specify the model to use for image descriptions:
```python
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",        # GPT-4o with vision
    # llm_model="gpt-4o-mini", # Faster, cheaper
    # llm_model="gpt-4-turbo", # Previous generation
)
```
Supported models (vision-capable):

- `gpt-4o` - Recommended, high quality
- `gpt-4o-mini` - Faster, more cost-effective
- `gpt-4-turbo` - Previous generation
- `gpt-4-vision-preview` - Legacy
### Custom Prompt
Customize the prompt sent to the LLM:
```python
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail, focusing on key objects, colors, and composition.",
)
```
The default prompt is: `Write a detailed caption for this image.`
### Per-Conversion Override
Override settings for individual conversions:
```python
md = MarkItDown()

# Convert with LLM settings supplied per call
result = md.convert(
    "photo.jpg",
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe the main subject of this image.",
)
```
## Implementation Details

### Image Processing

The LLM integration is implemented in `_image_converter.py`:
```python
def _get_llm_description(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    *,
    client,
    model,
    prompt=None,
) -> Union[None, str]:
    if prompt is None or prompt.strip() == "":
        prompt = "Write a detailed caption for this image."

    # Get MIME type
    content_type = stream_info.mimetype
    if not content_type:
        content_type, _ = mimetypes.guess_type(
            "_dummy" + (stream_info.extension or "")
        )
    if not content_type:
        content_type = "application/octet-stream"

    # Encode image as base64
    cur_pos = file_stream.tell()
    try:
        base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
    finally:
        file_stream.seek(cur_pos)

    # Create data URI
    data_uri = f"data:{content_type};base64,{base64_image}"

    # Call OpenAI API
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": data_uri},
                },
            ],
        }
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```
### Supported Image Types

LLM descriptions are generated for:

- `.jpg`, `.jpeg` (JPEG images)
- `.png` (PNG images)

Other image formats are not currently supported.
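When batch-converting a mixed folder, it can help to route only the supported formats through the LLM-enabled converter and skip the rest. A minimal sketch (the extension set mirrors the list above; the file names are illustrative):

```python
from pathlib import Path

# Extensions for which MarkItDown currently generates LLM descriptions
LLM_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def supports_llm_description(path: Path) -> bool:
    """True if the file's extension is eligible for an LLM description."""
    return path.suffix.lower() in LLM_IMAGE_EXTENSIONS


files = [Path("a.jpg"), Path("b.PNG"), Path("c.gif"), Path("d.webp")]
llm_candidates = [p for p in files if supports_llm_description(p)]
print(llm_candidates)  # a.jpg and b.PNG qualify; .gif and .webp do not
```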
## Advanced Examples

### Multiple Images with Different Prompts
```python
from markitdown import MarkItDown
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="your-api-key")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

image_types = {
    "product": "Describe this product image for an e-commerce catalog.",
    "portrait": "Describe this portrait photo, including mood and setting.",
    "landscape": "Describe this landscape, highlighting natural features.",
}

Path("output").mkdir(exist_ok=True)  # ensure the output directory exists

for image_file in Path("images").glob("*.jpg"):
    # Determine image type from filename or metadata
    image_type = "landscape"  # Example
    prompt = image_types.get(image_type, "Describe this image in detail.")
    result = md.convert(
        str(image_file),
        llm_prompt=prompt,
    )
    output_file = Path("output") / f"{image_file.stem}.md"
    output_file.write_text(result.markdown)
```
### Accessibility Captions
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

accessibility_prompt = """
Create an accessible description of this image for screen reader users.
Include:
- Main subject and action
- Important details and context
- Colors if relevant
- Spatial relationships
Keep it concise but informative.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt=accessibility_prompt,
)

result = md.convert("diagram.png")
print(result.markdown)
```
### Content Moderation
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

moderation_prompt = """
Describe this image and indicate if it contains:
- Inappropriate content
- Sensitive information
- Brand logos or trademarks
Provide a neutral, factual description.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt=moderation_prompt,
)

result = md.convert("user_upload.jpg")

# Simple keyword check on the generated description
if "inappropriate" in result.markdown.lower():
    print("Warning: Image may contain inappropriate content")
```
### Custom OpenAI Client Configuration
```python
from openai import OpenAI
from markitdown import MarkItDown

# Custom timeout and retry settings
client = OpenAI(
    api_key="your-api-key",
    timeout=30.0,
    max_retries=3,
)

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)
```
### Azure OpenAI
```python
from openai import AzureOpenAI
from markitdown import MarkItDown

# Azure OpenAI client
client = AzureOpenAI(
    api_key="your-azure-api-key",
    api_version="2024-02-01",
    azure_endpoint="https://your-resource.openai.azure.com/",
)

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",  # Your Azure deployment name
)

result = md.convert("image.jpg")
```
### Alternative LLM Providers
Any OpenAI-compatible API:
```python
from openai import OpenAI
from markitdown import MarkItDown

# Example: Using a compatible API endpoint
client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.alternative-provider.com/v1",
)

md = MarkItDown(
    llm_client=client,
    llm_model="vision-model-name",
)
```
## Cost Considerations
LLM API calls incur costs. Each image conversion with LLM integration makes one API request.
Cost factors:

- Model choice: `gpt-4o-mini` is more cost-effective than `gpt-4o`
- Image size: Larger images consume more tokens
- Prompt length: Longer prompts increase costs
- Frequency: Each image conversion = one API call
Estimated costs (as of 2024):

- `gpt-4o`: ~$0.01-0.02 per image
- `gpt-4o-mini`: ~$0.001-0.002 per image
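Those per-image figures can be turned into a rough budget check before a batch run. A sketch using the estimates above (the rates are the approximate 2024 figures from this page, not live pricing; check your provider's current rates before relying on them):

```python
# Approximate per-image cost in USD, taken from the 2024 estimates above
EST_COST_PER_IMAGE = {
    "gpt-4o": 0.015,        # midpoint of ~$0.01-0.02
    "gpt-4o-mini": 0.0015,  # midpoint of ~$0.001-0.002
}


def estimate_batch_cost(num_images: int, model: str = "gpt-4o-mini") -> float:
    """Rough dollar estimate for describing num_images images with one model."""
    return num_images * EST_COST_PER_IMAGE[model]


print(f"${estimate_batch_cost(500, 'gpt-4o'):.2f}")       # ~$7.50 for 500 images
print(f"${estimate_batch_cost(500, 'gpt-4o-mini'):.2f}")  # ~$0.75 for 500 images
```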
### Minimize Costs
```python
from markitdown import MarkItDown
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="your-api-key")

# Only use LLM for specific images
md_with_llm = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
md_without_llm = MarkItDown()

for image_file in Path("images").glob("*.jpg"):
    # Use LLM only for important images
    if "important" in image_file.stem:
        result = md_with_llm.convert(str(image_file))
    else:
        result = md_without_llm.convert(str(image_file))
```
## Error Handling
```python
from markitdown import MarkItDown
from openai import OpenAI, OpenAIError

client = OpenAI(api_key="your-api-key")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

try:
    result = md.convert("image.jpg")
except OpenAIError as e:
    print(f"OpenAI API error: {e}")
    # Fall back to conversion without LLM
    md_fallback = MarkItDown()
    result = md_fallback.convert("image.jpg")
except Exception as e:
    print(f"Conversion error: {e}")
```
## Without LLM Integration
If no LLM client is configured, image conversion includes only metadata:
```python
from markitdown import MarkItDown

md = MarkItDown()  # No LLM client
result = md.convert("photo.jpg")
print(result.markdown)
```
Output:

```text
ImageSize: 1920x1080
DateTimeOriginal: 2024:02:28 10:30:00
GPSPosition: 37.7749° N, 122.4194° W
```
## Troubleshooting

### OpenAI Package Not Installed

```text
ImportError: No module named 'openai'
```

Solution: install the package with `pip install openai`.
### Invalid API Key

```text
AuthenticationError: Invalid API key
```

Check:

- API key is correct
- API key has not expired
- Account has available credits
### Model Not Found

```text
NotFoundError: Model 'gpt-4o' not found
```

Ensure:

- Model name is correct
- Your account has access to the model
- You are using the correct API endpoint (OpenAI vs Azure)
### No Description Generated

If the description is missing:

```python
# Check that both LLM client and model are set
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",  # Must be specified
)

# Verify the image format is supported
result = md.convert("image.jpg")  # .jpg or .png only
```
## Best Practices

- Use `gpt-4o-mini` for cost-effective descriptions
- Customize prompts for your specific use case
- Handle API errors gracefully with fallbacks
- Consider caching results for repeated conversions
- Monitor API usage and costs
- Use appropriate image sizes (resize large images if needed)
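The caching suggestion above can be sketched with a content-hash keyed cache, so re-running a batch never re-bills unchanged images. A minimal sketch (the cache directory name is illustrative, and `md` is any configured `MarkItDown` instance):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".markitdown_cache")  # illustrative location
CACHE_DIR.mkdir(exist_ok=True)


def convert_with_cache(md, image_path: str) -> str:
    """Return cached markdown if the image bytes are unchanged, else convert."""
    # Key the cache on the file contents, not the filename
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["markdown"]
    result = md.convert(image_path)
    cache_file.write_text(json.dumps({"markdown": result.markdown}))
    return result.markdown
```

Because the key is a hash of the bytes, renaming a file still hits the cache, while editing the image forces a fresh (billed) conversion.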