Optical Character Recognition (OCR)

IPED integrates the Tesseract 5 OCR engine to extract text from images, scanned documents, and PDFs without embedded text, making visual content searchable.

Overview

The OCR feature enables:

Text extraction from images in 70+ languages
Scanned document processing
PDF page-by-page OCR
Multi-format image support
Indexed and searchable OCR results

Powered by Tesseract 5

IPED uses version 5 of Tesseract, an open-source OCR engine, which provides:

Improved accuracy over previous versions
LSTM neural network-based recognition
Better handling of complex layouts
Support for multiple languages simultaneously

// Detected at startup
LOGGER.info("Detected Tesseract " + tessVersion);

Supported Image Formats

Direct Support (Native Tesseract)

PNG
JPEG
TIFF (multi-page)
BMP
PBM, PGM, PPM (Portable formats)

Extended Support (with conversion)

GIF
WebP
JPEG 2000 (JP2, JPX)
HEIC/HEIF
Adobe Photoshop (PSD)
RAW camera formats
SVG, EMF, WMF vector formats

Non-Image Formats

PDF - Each page converted to image for OCR

Configuration

OCR is configured in conf/OCRConfig.txt:

# Enable OCR processing
tesseract.enabled = true

# Tesseract binary path (leave empty if in PATH)
tesseract.path = 

# Language model (por, eng, spa, etc.)
ocr.language = por

# Page segmentation mode (0-13)
ocr.pageSegMode = 1

# Skip files found in hash database
ocr.skipKnownFiles = false

# File size limits (bytes)
ocr.minFileSize = 10000
ocr.maxFileSize = 100000000

# Process non-standard image formats
ocr.processNonStandard = true

# Maximum image dimension for conversion
ocr.maxConvImageSize = 3000
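
As an illustration of how a key = value file like this can be read, here is a minimal sketch using java.util.Properties. The class name OcrConfigSketch, the parse method, and the fallback defaults are assumptions made for this example and are not taken from IPED's own configuration loader; only the property keys come from the file above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for a few of the OCR settings shown above.
class OcrConfigSketch {
    final boolean enabled;
    final String language;
    final int pageSegMode;
    final long minFileSize;
    final long maxFileSize;

    OcrConfigSketch(Properties p) {
        // Fallback values are assumptions for this sketch, not IPED defaults.
        enabled = Boolean.parseBoolean(p.getProperty("tesseract.enabled", "false").trim());
        language = p.getProperty("ocr.language", "eng").trim();
        pageSegMode = Integer.parseInt(p.getProperty("ocr.pageSegMode", "1").trim());
        minFileSize = Long.parseLong(p.getProperty("ocr.minFileSize", "10000").trim());
        maxFileSize = Long.parseLong(p.getProperty("ocr.maxFileSize", "100000000").trim());
        // Tesseract accepts page segmentation modes 0 through 13.
        if (pageSegMode < 0 || pageSegMode > 13) {
            throw new IllegalArgumentException("ocr.pageSegMode out of range: " + pageSegMode);
        }
    }

    // Parses configuration text; '#' comment lines are ignored by Properties.load.
    static OcrConfigSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new OcrConfigSketch(p);
    }
}
```

To read the real file, the same Properties.load call can be given a reader opened on conf/OCRConfig.txt instead of a StringReader.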

Language Models

IPED supports all Tesseract language models:

Pre-installed

Portuguese (por)
English (eng)

Additional Languages

Download from the Tesseract model repositories:

Spanish (spa)
French (fra)
German (deu)
Italian (ita)
Chinese (chi_sim, chi_tra)
Arabic (ara)
Russian (rus)
Japanese (jpn)
And 60+ more languages

Place downloaded .traineddata files in Tesseract’s tessdata directory.

Page Segmentation Modes

The ocr.pageSegMode parameter controls how Tesseract analyzes page layout:

0 - Orientation and script detection only
1 - Automatic page segmentation with OSD (the value set in the configuration above)
3 - Fully automatic without OSD
4 - Single column of text
6 - Single uniform block of text
7 - Single text line
11 - Sparse text (find as much text as possible)
13 - Raw line (treats image as single text line)

Implementation Details

OCR processing flow:

1. File Filtering

if (size >= MIN_SIZE && size <= MAX_SIZE
    && !(SKIP_KNOWN_FILES && itemInfo.isKnown())) {
    // Process OCR (known files are skipped only when ocr.skipKnownFiles is true)
}
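
The filter can also be written as a small standalone predicate, which makes the intended behavior easy to check: a file is processed when its size is within the limits and it is not a known file being skipped. This is a sketch only; the class and parameter names are hypothetical, and the isKnown flag stands in for the itemInfo.isKnown() call.

```java
// Hypothetical standalone version of the OCR file filter.
class OcrFilterSketch {
    static boolean shouldOcr(long size, boolean isKnown,
                             long minSize, long maxSize, boolean skipKnownFiles) {
        // Size must fall inside the configured [min, max] window.
        boolean sizeOk = size >= minSize && size <= maxSize;
        // A file is skipped as "known" only when the option is on AND the hash matched.
        boolean skippedAsKnown = skipKnownFiles && isKnown;
        return sizeOk && !skippedAsKnown;
    }
}
```

With the sample limits, a 50,000-byte unknown file passes the filter, while a 500-byte file or a known file with skipping enabled does not.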

2. Image Preparation

For standard images:

Read directly into Tesseract

For PDFs:

private void parsePDF(XHTMLContentHandler xhtml, File input, File output) {
    PDFToImage pdfConverter = new PDFToImage();
    pdfConverter.load(input);
    for (int page = 0; page < pdfConverter.getNumPages(); page++) {
        // Convert page to image
        // OCR the image
        // Aggregate results
    }
}

For multi-page TIFF:

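private void parseTiff(XHTMLContentHandler xhtml, File input) throws IOException {
    // javax.imageio: open the file as a stream and attach the first matching reader
    try (ImageInputStream iis = ImageIO.createImageInputStream(input)) {
        ImageReader reader = ImageIO.getImageReaders(iis).next();
        reader.setInput(iis);
        int numPages = reader.getNumImages(true);
        // OCR each page separately to avoid timeouts
    }
}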

For non-standard formats:

Convert to standard format using ImageMagick or internal converters
Resize if exceeds maximum dimensions
OCR converted image
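
The resize step can be illustrated with a small calculation that scales an image so its longest side does not exceed the configured maximum (ocr.maxConvImageSize). This sketch assumes the aspect ratio is preserved and that smaller images are left untouched; the class and method names are hypothetical and the actual converter may behave differently.

```java
// Hypothetical helper that computes target dimensions for an oversized image.
class ResizeSketch {
    // Returns {width, height} scaled to fit within maxDim on the longest side.
    static int[] fitWithin(int width, int height, int maxDim) {
        int longest = Math.max(width, height);
        if (longest <= maxDim) {
            return new int[] { width, height }; // already small enough
        }
        double scale = (double) maxDim / longest;
        int w = Math.max(1, (int) Math.round(width * scale));
        int h = Math.max(1, (int) Math.round(height * scale));
        return new int[] { w, h };
    }
}
```

For example, a 6000x4000 image with a limit of 3000 would be scaled by one half, to 3000x2000.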

3. Tesseract Execution

String[] cmd = { 
    tesseractPath, 
    INPUT_FILE_TOKEN,   // Input image
    OUTPUT_FILE_TOKEN,  // Output text file
    "-l", LANGUAGE,      // Language model
    "--psm", PAGESEGMODE // Page segmentation mode
};

Process process = new ProcessBuilder(cmd).start();
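
A slightly fuller sketch of this step is shown below: it builds the argument list, limits OpenMP to one thread, and bounds the run with a timeout. The class name, method names, and the timeoutSeconds parameter are assumptions for illustration. The command-line shape (input image, output base name, -l, --psm) follows the Tesseract CLI, which appends .txt to the output base name.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around a single Tesseract invocation.
class TesseractCommandSketch {
    static List<String> buildCommand(String tesseractPath, String input,
                                     String outputBase, String language, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        cmd.add(input);        // input image
        cmd.add(outputBase);   // Tesseract writes outputBase + ".txt"
        cmd.add("-l");
        cmd.add(language);     // language model, e.g. "por"
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        return cmd;
    }

    // Runs the command; returns true only if it exits with code 0 before the timeout.
    static boolean run(List<String> cmd, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("OMP_THREAD_LIMIT", "1"); // one OpenMP thread per process
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD); // avoid blocking on a full pipe
        Process process = pb.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // timed out: kill the stuck process
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

Discarding the process output is a design choice for this sketch; a real caller may prefer to capture it for logging.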

4. Result Storage

OCR results are stored in an SQLite database:

CREATE TABLE IF NOT EXISTS ocr(
    id TEXT PRIMARY KEY,     -- Hash + page number
    text TEXT                -- Extracted text
);

Benefits:

Fast lookup for duplicate files
Persistent across processing sessions
Reduced reprocessing time
Efficient storage
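
The lookup pattern behind these benefits can be sketched with an in-memory map standing in for the SQLite table. The key format (hash, underscore, page number) and all names here are assumptions for illustration; the source only states that the key combines the hash and the page number.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical hash-keyed OCR cache; a map replaces the SQLite table for brevity.
class OcrCacheSketch {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    final AtomicInteger ocrRuns = new AtomicInteger(); // counts real OCR executions

    // Assumed key layout: "<hash>_<page>".
    static String key(String hash, int page) {
        return hash + "_" + page;
    }

    // Returns cached text if present; otherwise runs OCR once and stores the result.
    String getOrCompute(String hash, int page, Supplier<String> ocr) {
        return store.computeIfAbsent(key(hash, page), k -> {
            ocrRuns.incrementAndGet();
            return ocr.get();
        });
    }
}
```

Two files with the same hash resolve to the same key, so the second one reuses the stored text and the OCR supplier is never called for it.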

Performance Optimization

Image Preprocessing

Downsampling - Large images resized to 3000x3000 max
Format conversion - Non-standard formats converted once
Page-by-page - Large documents processed incrementally

Multithreading

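Tesseract's internal OpenMP threading is limited to one thread per process to prevent thread conflicts:

env.put("OMP_THREAD_LIMIT", "1");

Parallelism comes from IPED instead: each file is processed in a separate thread by the IPED task manager.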

Caching

Results cached in database by file hash
Duplicate files skip OCR entirely
Significant speedup for datasets with duplicates

Timeouts

Timeouts prevent OCR from hanging on problematic images:

Configurable timeout per operation
Automatic retry with conversion for corrupted images
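
The retry behavior can be pictured as a simple fallback: try OCR on the original image, and if that yields nothing, convert the image and try once more. The sketch below is generic and hypothetical; the ocr and convert functions are placeholders supplied by the caller, not IPED APIs.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical "retry with conversion" wrapper.
class OcrFallbackSketch {
    // ocr returns Optional.empty() on failure (for example a timeout or a decode error).
    static <T> Optional<String> ocrWithFallback(T image,
                                                Function<T, Optional<String>> ocr,
                                                UnaryOperator<T> convert) {
        Optional<String> direct = ocr.apply(image);
        if (direct.isPresent()) {
            return direct; // original image worked, no conversion needed
        }
        // Second and final attempt on the converted image.
        return ocr.apply(convert.apply(image));
    }
}
```

Limiting the fallback to a single retry keeps the worst-case time per file bounded at roughly two timeouts plus one conversion.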

Subset Processing

OCR can be limited to specific bookmarks or categories:

subsetToOcr = Important_#_Bookmarked

This enables:

Faster initial processing
Two-pass workflow (process all, then OCR subset)
Resource management for large cases

Quality Metrics

IPED tracks OCR quality:

metadata.set(OCRParser.OCR_CHAR_COUNT, Integer.toString(charCount));

Character count stored as metadata
Helps identify successful vs. failed OCR
Useful for quality assessment
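
One way such a count could be computed and used is sketched below. How IPED actually counts characters is not stated here, so counting non-whitespace code points and the minChars threshold are assumptions for illustration, as are the class and method names.

```java
// Hypothetical character-count helper for judging OCR output.
class OcrQualitySketch {
    // Counts non-whitespace Unicode code points in the OCR text.
    static int countChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isWhitespace(cp)) {
                count++;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return count;
    }

    // Flags results with fewer than minChars characters as likely failed OCR.
    static boolean looksEmpty(String text, int minChars) {
        return countChars(text) < minChars;
    }
}
```

A reviewer could then sort or filter items by this metadata value to find images where OCR probably produced no useful text.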

Integration Features

Indexed Text

OCR text added to full-text index
Searchable like native document text
Hit highlighting in OCR content

Result Display

OCR text shown in text viewer
Formatted as plain text
Clearly marked as OCR content

Export

OCR text included in reports
Available in HTML and CSV exports
Preserved in portable cases

Use Cases

Scanned Documents

Extract text from:

Scanned contracts and agreements
Historical documents
Medical records
Financial statements

Screenshots

Make searchable:

Chat application screenshots
Web page captures
Email screenshots
Error messages and logs

Photo Evidence

Extract text from:

License plates
Street signs
Document photographs
Whiteboard images

Memes and Graphics

Process:

Image macros with overlaid text
Infographics
Social media content

Accuracy Considerations

Factors Affecting Accuracy

Image Quality

Resolution (300+ DPI recommended)
Contrast and brightness
Focus and sharpness

Text Characteristics

Font type and size
Text orientation
Language and script

Background

Noise and artifacts
Complex backgrounds
Skew and rotation

Improving Results

Choose appropriate language model - Match document language
Adjust page segmentation - Try different modes for layout types
Preprocess images - Enhance contrast, deskew, denoise
Use higher DPI - Scan documents at 300 DPI or higher
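
As one example of preprocessing, the sketch below binarizes an image with a fixed global threshold, which can raise contrast between text and background before OCR. This is not something IPED is stated to do; the class name and the threshold parameter are hypothetical, and adaptive methods usually work better on unevenly lit photos.

```java
import java.awt.image.BufferedImage;

// Hypothetical preprocessing step: global-threshold binarization.
class BinarizeSketch {
    // Pixels with luma >= threshold become white, all others black.
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (ITU-R BT.601 weights).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

The result can be written to a PNG with ImageIO.write and passed to OCR in place of the original image.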

Troubleshooting

Low Accuracy

Verify correct language model selected
Check image quality and resolution
Try different page segmentation modes
Preprocess images to enhance text

Performance Issues

Reduce ocr.maxFileSize to skip very large images
Set ocr.skipKnownFiles = true to avoid processing known content
Use subsetToOcr for selective processing
Increase processing threads if the CPU is underutilized

Missing Text

Verify OCR is enabled in configuration
Check file size within min/max limits
Ensure Tesseract installed and in PATH
Review OCR character count in metadata

External Dependencies

Required:

Tesseract 5.x - OCR engine binary

Optional:

ImageMagick - Extended format support
Ghostscript - PDF to image conversion
libgif, libopenjp2, libwebp - Additional image format libraries
