Overview
The OCR feature enables:
- Text extraction from images in 70+ languages
- Scanned document processing
- PDF page-by-page OCR
- Multi-format image support
- Indexed and searchable OCR results
Powered by Tesseract 5
IPED uses Tesseract version 5, the latest major release of the open-source OCR engine:
- Improved accuracy over previous versions
- LSTM neural network-based recognition
- Better handling of complex layouts
- Support for multiple languages simultaneously
Supported Image Formats
Direct Support (Native Tesseract)
- PNG
- JPEG
- TIFF (multi-page)
- BMP
- PBM, PGM, PPM (Portable formats)
Extended Support (with conversion)
- GIF
- WebP
- JPEG 2000 (JP2, JPX)
- HEIC/HEIF
- Adobe Photoshop (PSD)
- RAW camera formats
- SVG, EMF, WMF vector formats
Non-Image Formats
- PDF - Each page converted to image for OCR
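The three groups above amount to a routing decision per file. A minimal sketch of that decision, assuming it is made by file extension (the extension sets and the function name `ocr_route` are illustrative, not taken from IPED; RAW camera extensions are left out because they vary by vendor):

```python
from pathlib import Path

# Extension sets mirror the lists above; the exact extensions are assumptions.
DIRECT = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pbm", ".pgm", ".ppm"}
CONVERT = {".gif", ".webp", ".jp2", ".jpx", ".heic", ".heif", ".psd", ".svg", ".emf", ".wmf"}


def ocr_route(filename: str) -> str:
    """Classify a file as 'direct', 'convert', 'pdf' or 'skip' by its extension."""
    ext = Path(filename).suffix.lower()
    if ext in DIRECT:
        return "direct"   # read by Tesseract as-is
    if ext in CONVERT:
        return "convert"  # converted to a standard format first
    if ext == ".pdf":
        return "pdf"      # each page rendered to an image, then OCRed
    return "skip"
```

A real implementation would more likely rely on detected media types than on extensions, since file names in forensic evidence are often misleading.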
Configuration
OCR is configured in conf/OCRConfig.txt.
Language Models
IPED supports all Tesseract language models:
Pre-installed
- Portuguese (por)
- English (eng)
Additional Languages
Download from Tesseract models:
- Spanish (spa)
- French (fra)
- German (deu)
- Italian (ita)
- Chinese (chi_sim, chi_tra)
- Arabic (ara)
- Russian (rus)
- Japanese (jpn)
- And 60+ more languages
Place the downloaded .traineddata files in Tesseract’s tessdata directory.
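To check which language models are actually available, it can help to list the `.traineddata` files in the tessdata directory. The sketch below assumes the directory path is known to the caller; the helper names `installed_languages` and `missing_languages` are hypothetical.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(wanted, tessdata_dir):
    """Return the codes from `wanted` with no model file in tessdata_dir."""
    have = set(installed_languages(tessdata_dir))
    return [lang for lang in wanted if lang not in have]
```

For example, `missing_languages(["por", "eng", "spa"], tessdata_dir)` returns the codes that still need to be downloaded. Note that the listing also includes non-language models such as `osd` when they are present.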
Page Segmentation Modes
The pageSegMode parameter controls how Tesseract analyzes page layout:
- 0 - Orientation and script detection only
- 1 - Automatic page segmentation with OSD (default)
- 3 - Fully automatic without OSD
- 4 - Single column of text
- 6 - Single uniform block of text
- 7 - Single text line
- 11 - Sparse text (find as much text as possible)
- 13 - Raw line (treats image as single text line)
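On the Tesseract command line, the page segmentation mode is passed with `--psm` and the languages with `-l`, joined by `+`. The following is a sketch of how such a command could be assembled; it is not IPED's actual invocation, and `build_tesseract_command` is a hypothetical helper.

```python
def build_tesseract_command(image_path, languages, psm, output_base="stdout"):
    """Build an argument list for the tesseract CLI.

    `languages` is a list of model codes such as ["por", "eng"];
    `psm` is a page segmentation mode between 0 and 13.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"invalid page segmentation mode: {psm}")
    if not languages:
        raise ValueError("at least one language is required")
    return [
        "tesseract", str(image_path), output_base,
        "-l", "+".join(languages),
        "--psm", str(psm),
    ]
```

Passing the list to `subprocess.run` avoids shell quoting problems with unusual file names. Using `stdout` as the output base makes Tesseract write the recognized text to standard output.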
Implementation Details
OCR processing flow:
1. File Filtering
2. Image Preparation
For standard images:
- Read directly into Tesseract
For formats that need conversion:
- Convert to standard format using ImageMagick or internal converters
- Resize if exceeds maximum dimensions
- OCR converted image
3. Tesseract Execution
4. Result Storage
OCR results are stored in an SQLite database:
- Fast lookup for duplicate files
- Persistent across processing sessions
- Reduced reprocessing time
- Efficient storage
Performance Optimization
Image Preprocessing
- Downsampling - Large images resized to 3000x3000 max
- Format conversion - Non-standard formats converted once
- Page-by-page - Large documents processed incrementally
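The downsampling step can be illustrated with a small calculation. This sketch assumes the aspect ratio is preserved and that images already within the limit are left untouched; neither detail is confirmed by the text above, and `fit_within` is a hypothetical helper.

```python
def fit_within(width, height, max_side=3000):
    """Return (width, height) scaled down to fit in max_side x max_side.

    The aspect ratio is kept and images are never enlarged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For a 6000x4000 photo the limiting side is the width, so the scale factor is 0.5 and the result is 3000x2000.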
Multithreading
- OpenMP is disabled to prevent thread conflicts
- Each file is processed in a separate thread by the IPED task manager
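One common way to keep Tesseract from spawning its own OpenMP threads is to set the `OMP_THREAD_LIMIT` environment variable to `1` for the child process. Whether IPED uses exactly this mechanism is an assumption; the sketch only shows the general technique, and `tesseract_env` is a hypothetical helper.

```python
import os


def tesseract_env(base=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base is None else base)
    # Assumption: the limit is applied through OMP_THREAD_LIMIT.
    env["OMP_THREAD_LIMIT"] = "1"
    return env
```

The result can be passed as `env=tesseract_env()` to `subprocess.run`, so that parallelism comes only from running several Tesseract processes at once.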
Caching
- Results cached in database by file hash
- Duplicate files skip OCR entirely
- Significant speedup for datasets with duplicates
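The caching idea can be sketched with Python's built-in `sqlite3` module. The table layout, the choice of SHA-256 and the function names are assumptions made for illustration; IPED's actual schema is not described here.

```python
import hashlib
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with one row per file hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_cache ("
        "hash TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def ocr_cached(conn, data: bytes, run_ocr):
    """Return cached text for `data`, calling run_ocr only on a cache miss."""
    digest = hashlib.sha256(data).hexdigest()
    row = conn.execute(
        "SELECT text FROM ocr_cache WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]  # duplicate file: skip OCR entirely
    text = run_ocr(data)
    conn.execute(
        "INSERT INTO ocr_cache (hash, text) VALUES (?, ?)", (digest, text)
    )
    conn.commit()
    return text
```

Because the key is the content hash rather than the file name, two copies of the same image in different folders trigger only one OCR run.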
Timeouts
Timeouts prevent OCR from hanging on problematic images:
- Configurable timeout per operation
- Automatic retry with conversion for corrupted images
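A timeout with a fallback attempt might look like the sketch below. It assumes OCR runs as an external process and that the retry is simply a second command (for example, one that converts the image first); both function names are hypothetical.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout, or None on timeout or non-zero exit."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None  # the child process is killed by subprocess.run
    return done.stdout if done.returncode == 0 else None


def ocr_with_retry(first_cmd, fallback_cmd, timeout_s):
    """Try first_cmd; if it fails or times out, try fallback_cmd once."""
    text = run_with_timeout(first_cmd, timeout_s)
    if text is None:
        text = run_with_timeout(fallback_cmd, timeout_s)
    return text
```

Retrying only once keeps a single corrupted file from stalling the worker for more than two timeout periods.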
Subset Processing
OCR can be limited to specific bookmarks or categories:
- Faster initial processing
- Two-pass workflow (process all, then OCR subset)
- Resource management for large cases
Quality Metrics
IPED tracks OCR quality:
- Character count stored as metadata
- Helps identify successful vs. failed OCR
- Useful for quality assessment
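As a rough illustration of how a character count can separate successful from failed OCR, the sketch below counts non-whitespace characters and flags very short results. How IPED counts characters is not specified here, and the threshold of 10 is an arbitrary assumption.

```python
def ocr_char_count(text):
    """Count the non-whitespace characters in OCR output."""
    return sum(1 for ch in text if not ch.isspace())


def looks_failed(text, min_chars=10):
    """Flag OCR output with fewer than min_chars visible characters."""
    return ocr_char_count(text) < min_chars
```

A count near zero on a file that visibly contains text usually points to a wrong language model or an unsuitable page segmentation mode.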
Integration Features
Indexed Text
- OCR text added to full-text index
- Searchable like native document text
- Hit highlighting in OCR content
Result Display
- OCR text shown in text viewer
- Formatted as plain text
- Clearly marked as OCR content
Export
- OCR text included in reports
- Available in HTML and CSV exports
- Preserved in portable cases
Use Cases
Scanned Documents
Extract text from:
- Scanned contracts and agreements
- Historical documents
- Medical records
- Financial statements
Screenshots
Make searchable:
- Chat application screenshots
- Web page captures
- Email screenshots
- Error messages and logs
Photo Evidence
Extract text from:
- License plates
- Street signs
- Document photographs
- Whiteboard images
Memes and Graphics
Process:
- Image macros with overlaid text
- Infographics
- Social media content
Accuracy Considerations
Factors Affecting Accuracy
Image Quality
- Resolution (300+ DPI recommended)
- Contrast and brightness
- Focus and sharpness
- Font type and size
- Text orientation
- Language and script
- Noise and artifacts
- Complex backgrounds
- Skew and rotation
Improving Results
- Choose appropriate language model - Match document language
- Adjust page segmentation - Try different modes for layout types
- Preprocess images - Enhance contrast, deskew, denoise
- Use higher DPI - Scan documents at 300 DPI or higher
Troubleshooting
Low Accuracy
- Verify correct language model selected
- Check image quality and resolution
- Try different page segmentation modes
- Preprocess images to enhance text
Performance Issues
- Reduce maxFileSize to skip very large images
- Enable skipKnownFiles to avoid processing known content
- Use subsetToOcr for selective processing
- Increase processing threads if CPU underutilized
Missing Text
- Verify OCR is enabled in configuration
- Check file size within min/max limits
- Ensure Tesseract installed and in PATH
- Review OCR character count in metadata
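Checking that Tesseract is installed and in PATH can be scripted. This is a minimal sketch using only the standard library; it assumes the executable is named `tesseract`, and the helper names are hypothetical.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None."""
    return shutil.which("tesseract")


def tool_version(executable):
    """Return the first line printed by `<executable> --version`, or None."""
    done = subprocess.run(
        [executable, "--version"], capture_output=True, text=True
    )
    # Some tools print their version on stderr instead of stdout.
    lines = (done.stdout or done.stderr).splitlines()
    return lines[0] if lines else None
```

If `find_tesseract()` returns `None`, the binary is not on the PATH seen by the current process. Otherwise, `tool_version(find_tesseract())` shows which version will be used.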
External Dependencies
Required:
- Tesseract 5.x - OCR engine binary
- ImageMagick - Extended format support
- Ghostscript - PDF to image conversion
- libgif, libopenjp2, libwebp - Additional image format libraries