Custom Converters

Custom converters let you extend MarkItDown to support additional file formats. By subclassing DocumentConverter and implementing its two methods, you can add conversion logic for any file type.

DocumentConverter Base Class

All converters inherit from the DocumentConverter abstract base class located in _base_converter.py:42.

from markitdown import DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class DocumentConverter:
    """Abstract superclass of all DocumentConverters."""
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        """Determine if this converter can handle the document."""
        raise NotImplementedError()
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        """Convert the document to Markdown."""
        raise NotImplementedError()

Creating a Custom Converter

Inherit from DocumentConverter

Create a new class that inherits from DocumentConverter:

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class MyCustomConverter(DocumentConverter):
    """Converts custom file format to Markdown."""
    pass

Implement the accepts() Method

The accepts() method determines whether your converter should handle a given file. It receives:

file_stream: The file-like object (must support seek(), tell(), and read())

stream_info: Metadata about the file (mimetype, extension, charset, url, etc.)

**kwargs: Additional keyword arguments

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool:
    mimetype = (stream_info.mimetype or "").lower()
    extension = (stream_info.extension or "").lower()
    
    # Check by extension
    if extension in [".custom", ".cst"]:
        return True
    
    # Check by mimetype
    if mimetype == "application/x-custom":
        return True
    
    return False

Important: If you need to read from file_stream to make a determination, you must reset the stream position before returning:

cur_pos = file_stream.tell()  # Save current position
data = file_stream.read(100)   # Peek at first 100 bytes
file_stream.seek(cur_pos)      # Reset to original position
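
The save-and-restore pattern above can be wrapped in a small helper so that the position is restored even if the read raises. This is a minimal sketch using only the standard library; the helper name starts_with_magic and the CUSTOM1 signature are hypothetical, chosen purely for illustration.

```python
import io
from typing import BinaryIO


def starts_with_magic(file_stream: BinaryIO, magic: bytes) -> bool:
    """Return True if the stream starts with `magic` at its current position.

    The stream position is restored before returning, even if read() raises.
    """
    cur_pos = file_stream.tell()  # Save current position
    try:
        return file_stream.read(len(magic)) == magic
    finally:
        file_stream.seek(cur_pos)  # Always reset to the original position


# Usage with an in-memory stream; b"CUSTOM1" is a made-up file signature.
stream = io.BytesIO(b"CUSTOM1\nbody of the document")
matched = starts_with_magic(stream, b"CUSTOM1")
position_after = stream.tell()
```

An accepts() implementation could call such a helper after the extension and mimetype checks, when neither is conclusive.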

Implement the convert() Method

The convert() method performs the actual conversion. It receives the same parameters as accepts() and must return a DocumentConverterResult:

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    # Read and decode the file
    encoding = stream_info.charset or "utf-8"
    content = file_stream.read().decode(encoding)
    
    # Convert to Markdown
    markdown = self._convert_to_markdown(content)
    
    # Extract title if available
    title = self._extract_title(content)
    
    return DocumentConverterResult(
        markdown=markdown,
        title=title
    )
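
The convert() method above calls two helpers, _convert_to_markdown() and _extract_title(), that the converter class is expected to supply. As a sketch, assuming a made-up line-oriented format in which TITLE: and SECTION: prefixes mark headings, they might look like the following. They are written as plain functions so the block runs on its own; in a converter they would be methods taking self.

```python
from typing import Optional


def extract_title(content: str) -> Optional[str]:
    """Return the text of the first 'TITLE:' line, or None if there is none."""
    for line in content.splitlines():
        if line.startswith("TITLE:"):
            return line[len("TITLE:"):].strip() or None
    return None


def convert_to_markdown(content: str) -> str:
    """Map 'TITLE:' and 'SECTION:' lines to headings; keep other lines as-is."""
    out = []
    for line in content.splitlines():
        if line.startswith("TITLE:"):
            out.append("# " + line[len("TITLE:"):].strip())
        elif line.startswith("SECTION:"):
            out.append("## " + line[len("SECTION:"):].strip())
        else:
            out.append(line)
    return "\n".join(out)


# The sample input uses the hypothetical format described above.
sample = "TITLE: Quarterly Report\nSECTION: Summary\nRevenue grew."
title = extract_title(sample)
markdown = convert_to_markdown(sample)
```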

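Register the Converter

Register an instance of your converter with a MarkItDown object so that it is considered during conversion:

from markitdown import MarkItDown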

md = MarkItDown()
md.register_converter(MyCustomConverter())

# Use it
result = md.convert("document.custom")
print(result.markdown)

Complete Example

Here’s a complete example based on the plain text converter (converters/_plain_text_converter.py:33):

from typing import BinaryIO, Any
from charset_normalizer import from_bytes
from markitdown import (
    DocumentConverter,
    DocumentConverterResult,
    StreamInfo
)

ACCEPTED_MIME_TYPE_PREFIXES = [
    "text/",
    "application/json",
    "application/markdown",
]

ACCEPTED_FILE_EXTENSIONS = [
    ".txt",
    ".text",
    ".md",
    ".markdown",
    ".json",
    ".jsonl",
]

class PlainTextConverter(DocumentConverter):
    """Converts plain text files to Markdown."""
    
    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        mimetype = (stream_info.mimetype or "").lower()
        extension = (stream_info.extension or "").lower()
        
        # If we have a charset, safely assume it's text
        if stream_info.charset is not None:
            return True
        
        # Check extension
        if extension in ACCEPTED_FILE_EXTENSIONS:
            return True
        
        # Check mimetype prefix
        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
            if mimetype.startswith(prefix):
                return True
        
        return False
    
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        if stream_info.charset:
            text_content = file_stream.read().decode(stream_info.charset)
        else:
            text_content = str(from_bytes(file_stream.read()).best())
        
        return DocumentConverterResult(markdown=text_content)

StreamInfo Object

The StreamInfo dataclass (_stream_info.py:6) provides metadata about the file being converted:

from dataclasses import dataclass
from typing import Optional

@dataclass(kw_only=True, frozen=True)
class StreamInfo:
    mimetype: Optional[str] = None      # MIME type (e.g., "application/pdf")
    extension: Optional[str] = None     # File extension (e.g., ".pdf")
    charset: Optional[str] = None       # Character encoding (e.g., "utf-8")
    filename: Optional[str] = None      # Filename from path or URL
    local_path: Optional[str] = None    # Full local file path
    url: Optional[str] = None           # URL if fetched from web

DocumentConverterResult

The DocumentConverterResult class (_base_converter.py:5) wraps the conversion output:

from markitdown import DocumentConverterResult

result = DocumentConverterResult(
    markdown="# Converted Content\n\nThis is the converted text.",
    title="Optional Document Title"
)

# Access the markdown
print(result.markdown)
print(str(result))  # Same as markdown

# Access the title
print(result.title)

Converter Priority

When registering converters, you can specify a priority to control the order in which they are tried:

from markitdown import MarkItDown

# SpecificConverter and GenericConverter are placeholders for your own
# DocumentConverter subclasses.
md = MarkItDown()

# Lower value: tried first
md.register_converter(SpecificConverter(), priority=0.0)

# Higher value: tried later
md.register_converter(GenericConverter(), priority=10.0)

Priority values (_markitdown.py:54):

0.0 - PRIORITY_SPECIFIC_FILE_FORMAT (specific converters like PDF, DOCX)
10.0 - PRIORITY_GENERIC_FILE_FORMAT (generic converters like HTML, plain text)

Lower values are tried first. Among converters with the same priority, the most recently registered one is tried first.
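
The ordering rule can be illustrated with a small standalone model: each new registration goes to the front of a list, and a stable sort by priority then decides the order of attempts. This is a sketch of the rule as described here, not MarkItDown's implementation; the register and attempt_order helpers and the converter names are hypothetical.

```python
from typing import List, Tuple

PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0

# Each entry is (priority, converter name); names stand in for instances.
registry: List[Tuple[float, str]] = []


def register(name: str, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT) -> None:
    """Insert at the front so the most recent registration comes first."""
    registry.insert(0, (priority, name))


def attempt_order() -> List[str]:
    """Sort by priority only; sorted() is stable, so ties keep list order."""
    return [name for _, name in sorted(registry, key=lambda entry: entry[0])]


register("plain_text", PRIORITY_GENERIC_FILE_FORMAT)
register("pdf")
register("html", PRIORITY_GENERIC_FILE_FORMAT)
register("my_custom")

order = attempt_order()
```

In this model, my_custom is attempted before pdf because both share priority 0.0 and my_custom was registered later; both precede the two generic entries.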

Best Practices

Stream Position Management: Always reset file_stream position after reading in accepts(). The convert() method expects the stream to be at the original position.

Charset Detection: Use stream_info.charset when available, or employ libraries like charset_normalizer to detect encoding automatically.

Dependency Handling: For optional dependencies, catch import errors gracefully and raise MissingDependencyException during conversion if needed.
