DocumentContent
Represents a document that has been loaded and is ready for translation.
pages: Individual pages of the document ready for translation.
- Text-based content (TXT, text-based PDFs, DOCX): Contains strings
- Image-based content (scanned PDFs): Contains PNG image bytes
content_type: MIME type of the content. Must match pattern ^(text|image)/.+$.
- "text/plain": Text content
- "image/png": Image content (scanned PDFs)
metadata: Document metadata including:
- file_type: The detected file type (FileType enum)
- total_pages: Total number of pages
- title: Document title (if available)
- author: Document author (if available)
- creation_date: Creation date string (if available)
- modification_date: Last modification date (if available)
DocumentContent is immutable (frozen=True). Once created, its fields cannot be modified.
DocumentMetadata
Metadata about a processed document.
file_type: The type of file. One of:
- FileType.PDF: PDF documents
- FileType.DOCX: Word documents
- FileType.TXT: Plain text files
total_pages: Total number of pages in the document. Must be >= 1.
title: Document title extracted from metadata, if available.
author: Document author extracted from metadata, if available.
creation_date: Document creation date as a string, if available.
modification_date: Last modification date as a string, if available.
Additional processor-specific metadata.
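The field constraints described above can be sketched with standard-library dataclasses. This is an illustrative stand-in only: the real classes may be built differently (for example with a validation library), and the FileType string values and the validation code are assumptions. Only the field names, the content_type pattern, the total_pages >= 1 rule, and the frozen behavior come from the reference.

```python
import re
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum
from typing import List, Optional, Union


class FileType(Enum):
    # Member names follow the reference; the string values are assumed.
    PDF = "pdf"
    DOCX = "docx"
    TXT = "txt"


@dataclass(frozen=True)
class DocumentMetadata:
    file_type: FileType
    total_pages: int
    title: Optional[str] = None
    author: Optional[str] = None
    creation_date: Optional[str] = None
    modification_date: Optional[str] = None

    def __post_init__(self) -> None:
        # Mirrors the documented constraint: total_pages must be >= 1.
        if self.total_pages < 1:
            raise ValueError("total_pages must be >= 1")


# Pattern quoted in the content_type description.
CONTENT_TYPE_PATTERN = re.compile(r"^(text|image)/.+$")


@dataclass(frozen=True)
class DocumentContent:
    pages: List[Union[str, bytes]]  # strings for text, PNG bytes for images
    content_type: str
    metadata: DocumentMetadata

    def __post_init__(self) -> None:
        if not CONTENT_TYPE_PATTERN.match(self.content_type):
            raise ValueError(f"invalid content_type: {self.content_type!r}")


meta = DocumentMetadata(file_type=FileType.TXT, total_pages=1, title="Notes")
doc = DocumentContent(
    pages=["Hello, world."], content_type="text/plain", metadata=meta
)

try:
    doc.content_type = "image/png"  # rejected because the instance is frozen
except FrozenInstanceError:
    print("DocumentContent fields cannot be reassigned")
```

Freezing blocks reassignment of fields; in this sketch the pages list itself is still a mutable object, so a stricter model might store a tuple instead.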
Processor Functions
get_processor_for_file_type
Get the appropriate document processor for a specific file type.
The file type to get a processor for. Must be one of:
- FileType.PDF
- FileType.DOCX
- FileType.TXT
Optional processor-specific settings.
For PDF processors:
- dpi: Resolution for rendering PDF pages to images (default: 200)
An instance implementing the DocumentProcessor protocol:
- PdfProcessor for PDFs
- DocxProcessor for DOCX files
- TextProcessor for TXT files
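The mapping from file type to processor described above could be sketched as a simple lookup table. Everything in this block is a stand-in: pick_processor is a hypothetical name, the three processor classes are empty placeholders, and passing settings as keyword arguments is an assumption, since the reference does not show how settings are supplied.

```python
from enum import Enum
from typing import Any, Dict


class FileType(Enum):
    # Member names follow the reference; the string values are assumed.
    PDF = "pdf"
    DOCX = "docx"
    TXT = "txt"


class PdfProcessor:
    """Placeholder that only records the rendering resolution."""

    def __init__(self, dpi: int = 200) -> None:
        self.dpi = dpi


class DocxProcessor:
    """Placeholder with no settings."""


class TextProcessor:
    """Placeholder with no settings."""


# One processor class per supported file type.
_PROCESSORS: Dict[FileType, type] = {
    FileType.PDF: PdfProcessor,
    FileType.DOCX: DocxProcessor,
    FileType.TXT: TextProcessor,
}


def pick_processor(file_type: FileType, **settings: Any) -> Any:
    """Hypothetical dispatcher: look up the class, then apply the settings."""
    processor_cls = _PROCESSORS[file_type]
    return processor_cls(**settings)


pdf_default = pick_processor(FileType.PDF)          # uses dpi=200
pdf_sharp = pick_processor(FileType.PDF, dpi=300)   # overrides the default
print(pdf_default.dpi, pdf_sharp.dpi)
```

A table keeps the dispatch in one place, so supporting another format in this sketch means adding one entry rather than another branch.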
load_document
High-level function to load any supported document format.
Path to the document to load. File type is automatically detected from extension.
Supported extensions:
- .pdf: PDF documents
- .docx: Microsoft Word documents
- .txt: Plain text files
Optional settings to pass to the document processor.
For PDF files, you can specify rendering settings such as dpi.
A DocumentContent instance containing:
- pages: List of page contents (strings for text, bytes for images)
- content_type: MIME type ("text/plain" or "image/png")
- metadata: Document metadata including file type, page count, title, author, etc.
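Detecting the file type from the extension, as load_document is described as doing, might look roughly like the sketch below. detect_file_type is a hypothetical helper, not part of the documented API. Treating extensions case-insensitively and raising ValueError for unknown extensions are both assumptions made for this example.

```python
from enum import Enum
from pathlib import Path
from typing import Dict, Union


class FileType(Enum):
    # Member names follow the reference; the string values are assumed.
    PDF = "pdf"
    DOCX = "docx"
    TXT = "txt"


# The three extensions listed as supported.
_EXTENSIONS: Dict[str, FileType] = {
    ".pdf": FileType.PDF,
    ".docx": FileType.DOCX,
    ".txt": FileType.TXT,
}


def detect_file_type(path: Union[str, Path]) -> FileType:
    """Hypothetical helper: map a path's extension to a FileType."""
    suffix = Path(path).suffix.lower()  # assumed: extension case is ignored
    try:
        return _EXTENSIONS[suffix]
    except KeyError:
        # Stand-in error type; the library may report this differently.
        raise ValueError(f"unsupported extension: {suffix!r}") from None


print(detect_file_type("reports/annual.pdf"))
print(detect_file_type("notes.TXT"))
```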
DocumentProcessor Protocol
The DocumentProcessor protocol defines the interface that all document processors must implement.
supported_types
Set of file types this processor can handle.
get_metadata
Path to the document.
Extracted document metadata.
extract_content
Path to the document.
First page to extract (1-indexed).
Last page to extract (inclusive), or None to extract all remaining pages.
Async iterator yielding page contents. Text pages yield strings, image pages yield PNG bytes.
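The page-range semantics of extract_content (1-indexed first page, inclusive last page, None for all remaining pages) can be illustrated with a toy processor. ToyProcessor, the parameter names start_page and end_page, and the collect helper are all hypothetical. The sketch also serves pages from memory rather than from a path, so it shows only the range handling and the async iteration, not real file parsing.

```python
import asyncio
from typing import AsyncIterator, List, Optional


class ToyProcessor:
    """Hypothetical stand-in that yields pages held in memory."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    async def extract_content(
        self, start_page: int = 1, end_page: Optional[int] = None
    ) -> AsyncIterator[str]:
        if start_page < 1:
            raise ValueError("start_page is 1-indexed and must be >= 1")
        # Convert the 1-indexed, inclusive range into a Python slice.
        last = len(self._pages) if end_page is None else end_page
        for page in self._pages[start_page - 1:last]:
            yield page


async def collect(
    processor: ToyProcessor, start_page: int = 1, end_page: Optional[int] = None
) -> List[str]:
    """Drain the async iterator into a list."""
    return [page async for page in processor.extract_content(start_page, end_page)]


processor = ToyProcessor(["page 1", "page 2", "page 3", "page 4"])
print(asyncio.run(collect(processor, 2, 3)))  # ['page 2', 'page 3']
print(asyncio.run(collect(processor, 3)))     # ['page 3', 'page 4']
```

Yielding pages one at a time lets a caller start working on early pages before later ones have been extracted, which is presumably why the interface returns an async iterator rather than a list.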
ProcessingError
Exception raised when document processing fails.
Typical causes include:
- File not found
- Unsupported file type
- Corrupted or invalid document
- Missing dependencies (e.g., poppler for PDFs)
- Insufficient permissions
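Callers would typically wrap loading in a try/except for ProcessingError. The sketch below defines its own stand-in exception class so it runs without the library installed. check_loadable and describe are hypothetical helpers that reproduce only two of the causes listed above (unsupported file type and file not found), and the message wording is invented for the example.

```python
from pathlib import Path


class ProcessingError(Exception):
    """Stand-in for the library exception of the same name."""


SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def check_loadable(path: str) -> Path:
    """Hypothetical pre-flight check that fails early with ProcessingError."""
    candidate = Path(path)
    if candidate.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ProcessingError(
            f"Unsupported file type: {candidate.suffix or '(none)'}"
        )
    if not candidate.is_file():
        raise ProcessingError(f"File not found: {candidate}")
    return candidate


def describe(path: str) -> str:
    """Report the outcome as text instead of letting the error propagate."""
    try:
        check_loadable(path)
    except ProcessingError as exc:
        return f"could not process {path}: {exc}"
    return f"{path} looks loadable"


print(describe("notes.md"))
print(describe("no-such-file-for-this-example.txt"))
```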