MarkItDown can integrate with Large Language Models (LLMs) to generate detailed descriptions for images during conversion, making the output more accessible and useful for text analysis.

Overview

When an LLM client is configured, MarkItDown:
  1. Extracts image metadata using ExifTool (if available)
  2. Encodes the image as a base64 data URI
  3. Sends the image to the LLM with a prompt
  4. Includes the AI-generated description in the Markdown output
LLM integration is currently supported for JPEG and PNG images only.

Prerequisites

1. Install OpenAI Package

pip install openai
The openai package is not included in MarkItDown’s dependencies and must be installed separately.
2. Get API Key

Obtain an API key from OpenAI or from your LLM provider.
3. Optional: Install ExifTool

For enhanced image metadata extraction:
# macOS
brew install exiftool

# Ubuntu/Debian
sudo apt-get install libimage-exiftool-perl
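
After installing, you can confirm ExifTool is on your PATH with a quick stdlib check (this helper is not part of MarkItDown; MarkItDown simply skips ExifTool metadata when the tool is absent):

```python
import shutil

# Look up the exiftool binary on PATH; returns None if it is not installed.
exiftool_path = shutil.which("exiftool")
if exiftool_path:
    print(f"ExifTool found at {exiftool_path}")
else:
    print("ExifTool not found; image metadata extraction will be limited")
```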

Basic Usage

Python API

from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key="your-api-key")

# Create MarkItDown with LLM integration
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

# Convert an image
result = md.convert("photo.jpg")
print(result.markdown)
Output:
ImageSize: 1920x1080
DateTimeOriginal: 2024:02:28 10:30:00
GPSPosition: 37.7749° N, 122.4194° W

# Description:
A scenic view of the Golden Gate Bridge during sunset, with vibrant orange and pink hues reflecting off the water. The iconic suspension bridge spans across the bay, with the city of San Francisco visible in the background.

Configuration Options

LLM Client

Any OpenAI-compatible client that supports the chat completions API:
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

LLM Model

Specify the model to use for image descriptions:
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",              # GPT-4 with vision
    # llm_model="gpt-4o-mini",       # Faster, cheaper
    # llm_model="gpt-4-turbo",       # Previous generation
)
Supported models (vision-capable):
  • gpt-4o - Recommended, high quality
  • gpt-4o-mini - Faster, more cost-effective
  • gpt-4-turbo - Previous generation
  • gpt-4-vision-preview - Legacy

Custom Prompt

Customize the prompt sent to the LLM:
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail, focusing on key objects, colors, and composition."
)
Default prompt:
Write a detailed caption for this image.

Per-Conversion Override

Override settings for individual conversions:
md = MarkItDown()

# Convert with LLM
result = md.convert(
    "photo.jpg",
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe the main subject of this image."
)

Implementation Details

Image Processing

The description is generated by _get_llm_description() in _image_converter.py:
def _get_llm_description(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    *,
    client,
    model,
    prompt=None,
) -> Union[None, str]:
    if prompt is None or prompt.strip() == "":
        prompt = "Write a detailed caption for this image."

    # Get MIME type
    content_type = stream_info.mimetype
    if not content_type:
        content_type, _ = mimetypes.guess_type("_dummy" + (stream_info.extension or ""))
    if not content_type:
        content_type = "application/octet-stream"

    # Encode image as base64
    cur_pos = file_stream.tell()
    try:
        base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
    finally:
        file_stream.seek(cur_pos)

    # Create data URI
    data_uri = f"data:{content_type};base64,{base64_image}"

    # Call OpenAI API
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": data_uri},
                },
            ],
        }
    ]

    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

Supported Image Types

LLM descriptions are generated for:
  • .jpg, .jpeg (JPEG images)
  • .png (PNG images)
Other image formats are not currently supported.
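
Since only JPEG and PNG receive descriptions, it can be useful to filter files up front. A small helper sketch (SUPPORTED and gets_llm_description are defined here for illustration, not exported by MarkItDown):

```python
from pathlib import Path

# Extensions that currently receive LLM-generated descriptions.
SUPPORTED = {".jpg", ".jpeg", ".png"}

def gets_llm_description(path: str) -> bool:
    return Path(path).suffix.lower() in SUPPORTED

print(gets_llm_description("photo.JPG"))  # True
print(gets_llm_description("icon.gif"))   # False
```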

Advanced Examples

Multiple Images with Different Prompts

from markitdown import MarkItDown
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="your-api-key")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

image_types = {
    "product": "Describe this product image for an e-commerce catalog.",
    "portrait": "Describe this portrait photo, including mood and setting.",
    "landscape": "Describe this landscape, highlighting natural features.",
}

for image_file in Path("images").glob("*.jpg"):
    # Determine image type from filename or metadata
    image_type = "landscape"  # Example
    prompt = image_types.get(image_type, "Describe this image in detail.")
    
    result = md.convert(
        str(image_file),
        llm_prompt=prompt
    )
    
    output_file = Path("output") / f"{image_file.stem}.md"
    output_file.write_text(result.markdown)

Accessibility Captions

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

accessibility_prompt = """
Create an accessible description of this image for screen reader users.
Include:
- Main subject and action
- Important details and context
- Colors if relevant
- Spatial relationships
Keep it concise but informative.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt=accessibility_prompt
)

result = md.convert("diagram.png")
print(result.markdown)

Content Moderation

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

moderation_prompt = """
Describe this image and indicate if it contains:
- Inappropriate content
- Sensitive information
- Brand logos or trademarks
Provide a neutral, factual description.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt=moderation_prompt
)

result = md.convert("user_upload.jpg")
if "inappropriate" in result.markdown.lower():
    print("Warning: Image may contain inappropriate content")

Custom OpenAI Client Configuration

from openai import OpenAI
from markitdown import MarkItDown

# Custom timeout and retry settings
client = OpenAI(
    api_key="your-api-key",
    timeout=30.0,
    max_retries=3,
)

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

Azure OpenAI

from openai import AzureOpenAI
from markitdown import MarkItDown

# Azure OpenAI client
client = AzureOpenAI(
    api_key="your-azure-api-key",
    api_version="2024-02-01",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"  # Your Azure deployment name
)

result = md.convert("image.jpg")

Alternative LLM Providers

Any OpenAI-compatible API:
from openai import OpenAI
from markitdown import MarkItDown

# Example: Using a compatible API endpoint
client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.alternative-provider.com/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="vision-model-name"
)

Cost Considerations

LLM API calls incur costs. Each image conversion with LLM integration makes one API request.
Cost factors:
  • Model choice: gpt-4o-mini is more cost-effective than gpt-4o
  • Image size: Larger images consume more tokens
  • Prompt length: Longer prompts increase costs
  • Frequency: Each image conversion = one API call
Estimated costs (as of 2024):
  • gpt-4o: ~$0.01-0.02 per image
  • gpt-4o-mini: ~$0.001-0.002 per image
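
Using the rough per-image midpoints above (assumed figures; check current pricing before budgeting), a batch cost can be estimated before running a job:

```python
# Assumed per-image cost midpoints from the estimates above, in USD.
RATES = {"gpt-4o": 0.015, "gpt-4o-mini": 0.0015}

def estimate_cost(num_images: int, model: str) -> float:
    return num_images * RATES[model]

print(f"${estimate_cost(1000, 'gpt-4o-mini'):.2f}")  # $1.50 for 1,000 images
```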

Minimize Costs

from markitdown import MarkItDown
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="your-api-key")

# Only use LLM for specific images
md_with_llm = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
md_without_llm = MarkItDown()

for image_file in Path("images").glob("*.jpg"):
    # Use LLM only for important images
    if "important" in image_file.stem:
        result = md_with_llm.convert(str(image_file))
    else:
        result = md_without_llm.convert(str(image_file))

Error Handling

from markitdown import MarkItDown
from openai import OpenAI, OpenAIError

client = OpenAI(api_key="your-api-key")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

try:
    result = md.convert("image.jpg")
except OpenAIError as e:
    print(f"OpenAI API error: {e}")
    # Fall back to conversion without LLM
    md_fallback = MarkItDown()
    result = md_fallback.convert("image.jpg")
except Exception as e:
    print(f"Conversion error: {e}")

Without LLM Integration

If no LLM client is configured, image conversion includes only metadata:
from markitdown import MarkItDown

md = MarkItDown()  # No LLM client
result = md.convert("photo.jpg")
print(result.markdown)
Output:
ImageSize: 1920x1080
DateTimeOriginal: 2024:02:28 10:30:00
GPSPosition: 37.7749° N, 122.4194° W

Troubleshooting

OpenAI Package Not Installed

ImportError: No module named 'openai'
Solution:
pip install openai

Invalid API Key

AuthenticationError: Invalid API key
Check:
  • API key is correct
  • API key has not expired
  • Account has available credits

Model Not Found

NotFoundError: Model 'gpt-4o' not found
Ensure:
  • Model name is correct
  • Your account has access to the model
  • Using correct API endpoint (OpenAI vs Azure)

No Description Generated

If the description is missing:
# Check if LLM client and model are set
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"  # Must be specified
)

# Verify image format is supported
result = md.convert("image.jpg")  # .jpg or .png only

Best Practices

  • Use gpt-4o-mini for cost-effective descriptions
  • Customize prompts for your specific use case
  • Handle API errors gracefully with fallbacks
  • Consider caching results for repeated conversions
  • Monitor API usage and costs
  • Use appropriate image sizes (resize large images if needed)
