Create custom document converters to extend MarkItDown’s file format support
Custom converters allow you to extend MarkItDown to support additional file formats. By subclassing the DocumentConverter base class, you can add conversion logic for any file type.
Create a new class that inherits from DocumentConverter:
    from typing import Any, BinaryIO

    from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo


    class MyCustomConverter(DocumentConverter):
        """Converts custom file format to Markdown."""

        pass
Implement the accepts() Method
The accepts() method determines whether your converter should handle a given file. It receives:
file_stream: The file-like object (must support seek(), tell(), and read())
stream_info: Metadata about the file (mimetype, extension, charset, url, etc.)
**kwargs: Additional keyword arguments
    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        mimetype = (stream_info.mimetype or "").lower()
        extension = (stream_info.extension or "").lower()

        # Check by extension
        if extension in [".custom", ".cst"]:
            return True

        # Check by mimetype
        if mimetype == "application/x-custom":
            return True

        return False
Important: If you need to read from file_stream to make a determination, you must reset the stream position before returning:
    cur_pos = file_stream.tell()   # Save current position
    data = file_stream.read(100)   # Peek at first 100 bytes
    file_stream.seek(cur_pos)      # Reset to original position
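The peek-and-reset pattern can be wrapped in a small helper so that the stream position is restored even if the read raises. The following is a minimal sketch using only the standard library; the `MAGIC` signature and the `starts_with_magic` name are hypothetical and stand in for whatever header your own format uses.

```python
import io
from typing import BinaryIO

# Hypothetical signature for an imaginary ".custom" format (an assumption).
MAGIC = b"CSTM"


def starts_with_magic(file_stream: BinaryIO, magic: bytes = MAGIC) -> bool:
    """Check whether the bytes at the current position match `magic`.

    The stream position is restored before returning, even on error.
    """
    cur_pos = file_stream.tell()        # Save current position
    try:
        header = file_stream.read(len(magic))
    finally:
        file_stream.seek(cur_pos)       # Always reset to original position
    return header == magic


stream = io.BytesIO(b"CSTM\ntitle: Demo\n")
print(starts_with_magic(stream))  # True
print(stream.tell())              # 0
```

An accepts() implementation could call a helper like this after the extension and mimetype checks fail, so that content sniffing happens only when the cheaper metadata checks are inconclusive.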
Implement the convert() Method
The convert() method performs the actual conversion. It receives the same parameters as accepts() and must return a DocumentConverterResult:
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        # Read and decode the file
        encoding = stream_info.charset or "utf-8"
        content = file_stream.read().decode(encoding)

        # _convert_to_markdown() and _extract_title() are format-specific
        # helper methods that you define on your converter class.

        # Convert to Markdown
        markdown = self._convert_to_markdown(content)

        # Extract title if available
        title = self._extract_title(content)

        return DocumentConverterResult(
            markdown=markdown,
            title=title,
        )
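The convert() example relies on two helpers, _convert_to_markdown() and _extract_title(), whose bodies depend entirely on your format. As an illustration only, here is a sketch of what they might do for an invented text format in which an optional first line of the form `title: ...` carries the document title. The format and the function names are assumptions, written as plain functions so they can run without MarkItDown installed.

```python
from typing import Optional


def extract_title(content: str) -> Optional[str]:
    """Return the text of a leading 'title:' line, or None if absent or empty."""
    lines = content.splitlines()
    first_line = lines[0] if lines else ""
    if first_line.lower().startswith("title:"):
        return first_line[len("title:"):].strip() or None
    return None


def convert_to_markdown(content: str) -> str:
    """Render the title (if any) as a heading; keep the remaining lines as body."""
    title = extract_title(content)
    if title is None:
        return content.strip()
    body = "\n".join(content.splitlines()[1:]).strip()
    return f"# {title}\n\n{body}".strip()


sample = "title: Demo\nHello\nWorld"
print(extract_title(sample))        # Demo
print(convert_to_markdown(sample))
```

In a real converter these would typically be methods on the class, and the parsing logic would be replaced by whatever your file format requires.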
Register the Converter
Register your converter with a MarkItDown instance:
    from markitdown import MarkItDown

    md = MarkItDown()
    md.register_converter(MyCustomConverter())

    # Use it
    result = md.convert("document.custom")
    print(result.markdown)
The DocumentConverterResult class (_base_converter.py:5) wraps the conversion output:
    result = DocumentConverterResult(
        markdown="# Converted Content\n\nThis is the converted text.",
        title="Optional Document Title",
    )

    # Access the markdown
    print(result.markdown)
    print(str(result))  # Same as markdown

    # Access the title
    print(result.title)
Stream Position Management: Always reset file_stream position after reading in accepts(). The convert() method expects the stream to be at the original position.
Charset Detection: Use stream_info.charset when available, or employ libraries like charset_normalizer to detect encoding automatically.
Dependency Handling: For optional dependencies, catch import errors gracefully and raise MissingDependencyException during conversion if needed.
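The charset and dependency notes can be combined in one decoding helper. The sketch below treats charset_normalizer as an optional import: it prefers an explicitly supplied charset, tries detection only if the library is installed, and otherwise falls back to UTF-8 with replacement characters. The `decode_bytes` name and the fallback policy are assumptions for illustration. A converter that cannot work at all without its dependency would instead raise MissingDependencyException from convert(), which is not shown here.

```python
from typing import Optional

try:
    # Optional dependency: only used when no charset is supplied.
    from charset_normalizer import from_bytes
except ImportError:
    from_bytes = None  # Detection is skipped if the library is missing


def decode_bytes(data: bytes, charset: Optional[str] = None) -> str:
    """Decode `data`, preferring an explicit charset over detection."""
    if charset:
        return data.decode(charset)

    if from_bytes is not None:
        best = from_bytes(data).best()  # Best guess, or None if undetermined
        if best is not None:
            return str(best)            # str() of a match is the decoded text

    # Last resort: UTF-8, replacing undecodable bytes instead of failing.
    return data.decode("utf-8", errors="replace")


print(decode_bytes("héllo".encode("latin-1"), "latin-1"))  # héllo
print(decode_bytes(b"hello"))                              # hello
```

Inside convert(), a call such as `decode_bytes(file_stream.read(), stream_info.charset)` could replace the hard-coded UTF-8 default.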