Tinbox uses specialized processors to extract content from different document types. Each processor is optimized for its file format and handles metadata extraction, content parsing, and error handling.
## Supported File Types

| Type | Processing approach |
|------|---------------------|
| PDF  | Converts pages to images for vision model translation |
| DOCX | Extracts text from Word documents with paragraph structure |
| TXT  | Reads plain text files with UTF-8 encoding detection |
## Document Processing Architecture

All processors implement the `DocumentProcessor` protocol defined in `processor/__init__.py:50-93`:

```python
class DocumentProcessor(Protocol):
    """Protocol for document processors."""

    @property
    def supported_types(self) -> set[FileType]:
        """Get the file types supported by this processor."""
        ...

    async def get_metadata(self, file_path: Path) -> DocumentMetadata:
        """Extract metadata from a document."""
        ...

    async def extract_content(
        self, file_path: Path, *, start_page: int = 1, end_page: int | None = None
    ) -> AsyncIterator[str | bytes]:
        """Extract content from a document."""
        ...
```
## Document Content Model

From `processor/__init__.py:18-34`:

```python
class DocumentContent(BaseModel):
    """Represents a document ready for translation."""

    pages: list[str | bytes]  # Individual pages for translation
    content_type: str = Field(pattern=r"^(text|image)/.+$")
    metadata: dict[str, Any] = Field(default_factory=dict)
```
Pages can contain either text strings or image bytes, allowing Tinbox to handle both text-based and vision-based translations.
## PDF Processing

PDF documents are converted to images for translation using vision models. This approach preserves layout and formatting, and handles complex PDFs containing images and tables.
### How It Works

1. **Metadata Extraction**: Uses `pypdf` to read PDF metadata and count pages
2. **Page Conversion**: Converts each page to a PNG image using `pdf2image`
3. **Image Generation**: Default 200 DPI (configurable via settings)
4. **Lazy Loading**: Pages are yielded from an async iterator for memory efficiency
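The lazy-loading step can be illustrated with a stand-in async generator (hypothetical, no `pdf2image` involved): each page is produced only when the consumer asks for it, so at most one rendered page needs to be in flight at a time.

```python
import asyncio

async def fake_page_stream(num_pages: int):
    """Stand-in for extract_content: yields one 'rendered page' at a time."""
    for i in range(1, num_pages + 1):
        # In the real processor, this is where a page becomes PNG bytes
        yield f"page-{i}".encode()

async def collect() -> list[bytes]:
    pages = []
    async for page in fake_page_stream(3):
        pages.append(page)
    return pages

pages = asyncio.run(collect())  # [b"page-1", b"page-2", b"page-3"]
```

A caller that translates pages one at a time can simply consume the iterator without building the full list, which is where the memory saving comes from.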
### Code Example

From `processor/pdf.py:118-175`:

```python
class PdfProcessor(BaseDocumentProcessor):
    def __init__(self, settings: dict | None = None):
        super().__init__()
        self.settings = settings or {}
        self.dpi = self.settings.get("dpi", 200)  # Default DPI

    async def extract_content(
        self, file_path: Path, *, start_page: int = 1, end_page: int | None = None
    ) -> AsyncIterator[str | bytes]:
        """Extract content from a PDF document."""
        # Check if poppler is available
        _check_poppler_available()

        # Convert pages to images
        convert_from_path = _get_convert_from_path()
        pages = convert_from_path(
            file_path,
            first_page=start_page,
            last_page=end_page,
            dpi=self.dpi,
        )

        for page in pages:
            with io.BytesIO() as bio:
                page.save(bio, format="PNG")
                yield bio.getvalue()
```
### System Requirements

PDF processing requires two dependencies:

- **Python package**: `pdf2image` (installed with `tinbox[pdf]` or `tinbox[all]`)
- **System dependency**: `poppler-utils`
Install poppler with your platform's package manager:

- **macOS**: `brew install poppler`
- **Ubuntu/Debian**: `sudo apt-get install poppler-utils`
- **CentOS/RHEL**: `sudo yum install poppler-utils`
- **Fedora**: `sudo dnf install poppler-utils`
### Configuration

You can customize PDF processing settings:

```python
from tinbox.core.processor import get_processor_for_file_type
from tinbox.core.types import FileType

# Create a processor with custom DPI
processor = get_processor_for_file_type(
    FileType.PDF,
    settings={"dpi": 300},  # Higher quality images
)
```
Higher DPI values (e.g., 300) produce better quality images but increase file size and processing time. Use 200 DPI for most documents.
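To make the tradeoff concrete: pixel dimensions scale linearly with DPI, so pixel count (and roughly memory and upload size) scales quadratically. A quick back-of-the-envelope calculation for a US Letter page (8.5 x 11 inches):

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at a given DPI."""
    return (round(width_in * dpi), round(height_in * dpi))

at_200 = page_pixels(8.5, 11, 200)  # (1700, 2200)
at_300 = page_pixels(8.5, 11, 300)  # (2550, 3300)

# 300 DPI renders (300 / 200) ** 2 = 2.25x as many pixels as 200 DPI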
From `processor/pdf.py:78-116`:

```python
async def get_metadata(self, file_path: Path) -> DocumentMetadata:
    with open(file_path, "rb") as f:
        pdf = pypdf.PdfReader(f)
        info = pdf.metadata
        title = info.get("/Title") or file_path.name

        return DocumentMetadata(
            file_type=FileType.PDF,
            total_pages=len(pdf.pages),
            title=title,
            author=info.get("/Author", None),
            creation_date=info.get("/CreationDate", None),
            modification_date=info.get("/ModDate", None),
            custom_metadata={"pdf_info": dict(info) if info else {}},
        )
```
## DOCX Processing

Word documents are processed as single-page text documents, with paragraph extraction and RTL detection.
### How It Works

1. **Validation**: Checks file extension and ZIP structure
2. **Text Extraction**: Extracts all paragraph text, filtering empty paragraphs
3. **RTL Detection**: Scans for Hebrew/Arabic characters
4. **Metadata**: Reads core properties (author, dates, keywords)
### Code Example

From `processor/docx.py:47-60`:

```python
def _extract_text(doc: Document) -> str:
    """Extract all text from a Word document."""
    paragraphs = []
    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            paragraphs.append(paragraph.text.strip())
    return "\n".join(paragraphs)
```
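To see the extraction rule in isolation, the same logic can be run against a minimal stub standing in for a python-docx `Document` (the `Fake*` classes here are hypothetical, for illustration only):

```python
class FakeParagraph:
    def __init__(self, text: str):
        self.text = text

class FakeDocument:
    """Stub exposing the one attribute the extraction logic relies on."""
    def __init__(self, texts: list[str]):
        self.paragraphs = [FakeParagraph(t) for t in texts]

def extract_text(doc) -> str:
    # Same rule as _extract_text: strip whitespace, drop empty paragraphs
    return "\n".join(p.text.strip() for p in doc.paragraphs if p.text.strip())

doc = FakeDocument(["  Hello  ", "", "World"])
result = extract_text(doc)  # "Hello\nWorld" (the empty paragraph is dropped)
```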
### RTL Language Detection

From `processor/docx.py:23-44`:

```python
def detect_rtl(text: str) -> bool:
    """Detect if text contains RTL characters."""
    rtl_ranges = [
        (0x0590, 0x05FF),  # Hebrew
        (0x0600, 0x06FF),  # Arabic
        (0x0750, 0x077F),  # Arabic Supplement
        (0x08A0, 0x08FF),  # Arabic Extended-A
        (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
        (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
    ]
    return any(
        any(start <= ord(char) <= end for start, end in rtl_ranges)
        for char in text
    )
```
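Because `detect_rtl` is a pure function, it is easy to exercise directly. The ranges are reproduced here so the check runs without Tinbox installed:

```python
# Same ranges as detect_rtl, reproduced for a standalone check
RTL_RANGES = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]

def detect_rtl(text: str) -> bool:
    return any(
        any(start <= ord(ch) <= end for start, end in RTL_RANGES)
        for ch in text
    )

plain = detect_rtl("Hello, world")       # False
hebrew = detect_rtl("שלום")              # True: Hebrew block
arabic = detect_rtl("Price: 100 دينار")  # True: a single RTL char suffices
```

Note that any RTL character anywhere in the text flips the flag, so mixed-direction documents are marked `contains_rtl: True`.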
From `processor/docx.py:86-124`:

```python
async def get_metadata(self, file_path: Path) -> DocumentMetadata:
    doc = Document(file_path)
    core_props = doc.core_properties
    text = _extract_text(doc)

    return DocumentMetadata(
        file_type=FileType.DOCX,
        total_pages=1,  # Treat as a single text file
        title=file_path.name,
        author=core_props.author,
        creation_date=str(core_props.created) if core_props.created else None,
        modification_date=str(core_props.modified) if core_props.modified else None,
        custom_metadata={
            "category": core_props.category,
            "comments": core_props.comments,
            "keywords": core_props.keywords,
            "language": core_props.language,
            "subject": core_props.subject,
            "contains_rtl": detect_rtl(text),
        },
    )
```
DOCX files are treated as single-page documents. Use sliding-window or context-aware algorithms for best results with Word documents.
## TXT Processing

Plain text files are the simplest document type, with UTF-8 encoding detection and RTL support.
### How It Works

1. **Encoding Detection**: Attempts UTF-8 decoding, raising an error on failure
2. **Content Reading**: Reads the entire file as a single string
3. **RTL Detection**: Scans for Hebrew/Arabic characters
4. **Metadata**: Extracts file stats (size, timestamps)
### Code Example

From `processor/text.py:43-61`:

```python
def detect_encoding(file_path: Path) -> str:
    """Detect the encoding of a text file."""
    try:
        with open(file_path, encoding="utf-8") as f:
            f.read()
        return "utf-8"
    except UnicodeDecodeError:
        raise ProcessingError("Invalid UTF-8 encoding")
```
From `processor/text.py:120-152`:

```python
async def extract_content(
    self, file_path: Path, *, start_page: int = 1, end_page: int | None = None
) -> AsyncIterator[str | bytes]:
    """Extract content from a text document."""
    if start_page != 1 or (end_page is not None and end_page != 1):
        raise ProcessingError("Text files only support page 1")

    encoding = detect_encoding(file_path)
    text = file_path.read_text(encoding=encoding)
    yield text
```
From `processor/text.py:75-118`:

```python
async def get_metadata(self, file_path: Path) -> DocumentMetadata:
    encoding = detect_encoding(file_path)
    text = file_path.read_text(encoding=encoding)
    stats = file_path.stat()

    return DocumentMetadata(
        file_type=FileType.TXT,
        total_pages=1,
        title=file_path.name,
        author=None,
        creation_date=str(stats.st_ctime),
        modification_date=str(stats.st_mtime),
        custom_metadata={
            "size_bytes": stats.st_size,
            "encoding": encoding,
            "contains_rtl": detect_rtl(text),
        },
    )
```
TXT files must be UTF-8 encoded. Other encodings will raise a `ProcessingError`.
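The strict UTF-8 behavior can be demonstrated with a standalone copy of `detect_encoding` (reproduced so the snippet runs without Tinbox) and a file written in Latin-1:

```python
import tempfile
from pathlib import Path

class ProcessingError(Exception):
    pass

def detect_encoding(file_path: Path) -> str:
    """Standalone copy of the detect_encoding shown above."""
    try:
        with open(file_path, encoding="utf-8") as f:
            f.read()
        return "utf-8"
    except UnicodeDecodeError:
        raise ProcessingError("Invalid UTF-8 encoding")

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    good.write_text("héllo", encoding="utf-8")
    detected = detect_encoding(good)  # "utf-8"

    bad = Path(tmp) / "bad.txt"
    bad.write_bytes("héllo".encode("latin-1"))  # lone 0xE9 byte: invalid UTF-8
    error = None
    try:
        detect_encoding(bad)
    except ProcessingError as exc:
        error = str(exc)  # "Invalid UTF-8 encoding"
```

Convert legacy files first (for example with `iconv -f latin1 -t utf-8`) before passing them to Tinbox.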
## Loading Documents

The `load_document` function is the main entry point for document processing.

From `processor/__init__.py:189-228`:

```python
async def load_document(
    file_path: Path,
    *,
    processor_settings: dict[str, Any] | None = None,
) -> DocumentContent:
    """Load a document and prepare it for translation."""
    file_type = FileType(file_path.suffix.lstrip(".").lower())
    processor = get_processor_for_file_type(file_type, settings=processor_settings)

    # Get metadata first
    metadata = await processor.get_metadata(file_path)

    # Extract all pages
    pages = []
    async for page in processor.extract_content(file_path):
        pages.append(page)

    # Determine content type based on the first page
    content_type = "image/png" if isinstance(pages[0], bytes) else "text/plain"

    return DocumentContent(
        pages=pages,
        content_type=content_type,
        metadata={"file_type": file_type, **metadata.model_dump()},
    )
```
## Error Handling

All processors inherit from `BaseDocumentProcessor` and raise `ProcessingError` for consistent error handling:

```python
class ProcessingError(Exception):
    """Error during document processing."""

    pass
```
### Common Errors

**`ProcessingError: File not found: document.pdf`**
The specified file path doesn't exist.

**`ProcessingError: pdf2image is not installed. PDF image conversion requires the PDF extras.`**
Install with `pip install tinbox[pdf]`.

**`ProcessingError: Invalid UTF-8 encoding`**
The TXT file is not UTF-8 encoded.
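Because every failure surfaces as a `ProcessingError`, call sites can catch one exception type for all document formats. A minimal sketch (the exception class is reproduced and `require_file` is a hypothetical guard mirroring the processors' file-existence check):

```python
from pathlib import Path

class ProcessingError(Exception):
    """Error during document processing."""

def require_file(path: Path) -> None:
    # Hypothetical guard mirroring the processors' validation step
    if not path.exists():
        raise ProcessingError(f"File not found: {path}")

error_message = None
try:
    require_file(Path("missing-document.pdf"))
except ProcessingError as exc:
    error_message = str(exc)  # "File not found: missing-document.pdf"
```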
## Page Range Support

All processors support page range extraction:

```python
# Extract a specific page range
pages = []
async for page in processor.extract_content(
    file_path,
    start_page=5,
    end_page=10,
):
    pages.append(page)
```
TXT and DOCX files only support `start_page=1` and `end_page=1` (or `None`), since they're treated as single-page documents.
## File Type Detection

File types are detected by extension:

```python
from pathlib import Path

from tinbox.core.types import FileType

file_path = Path("document.pdf")
file_type = FileType(file_path.suffix.lstrip(".").lower())
# FileType.PDF
```

Supported extensions: `.pdf`, `.docx`, `.txt` (case-insensitive)
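Because the lookup is a plain enum-by-value call, an unsupported extension raises `ValueError`. This can be verified with a self-contained mimic of the enum (the real `FileType` lives in `tinbox.core.types`; the class below is a stand-in):

```python
from enum import Enum
from pathlib import Path

class FileType(str, Enum):
    """Stand-in for tinbox.core.types.FileType."""
    PDF = "pdf"
    DOCX = "docx"
    TXT = "txt"

def detect_file_type(path: Path) -> FileType:
    # Same expression load_document uses: strip the dot, lowercase, look up
    return FileType(path.suffix.lstrip(".").lower())

pdf_type = detect_file_type(Path("Report.PDF"))  # case-insensitive match
txt_type = detect_file_type(Path("notes.txt"))

unsupported = False
try:
    detect_file_type(Path("image.jpg"))
except ValueError:
    unsupported = True  # .jpg is not a supported extension
```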