
Overview

The system uses a 3-tier hybrid OCR strategy to extract text from any PDF document — whether digitally created, scanned, or using legacy font encodings. The intelligent extraction pipeline automatically selects the optimal method based on document characteristics.
Supports 80+ languages including all major Indian languages (Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, Assamese, Urdu, Sanskrit, Nepali).

Extraction Pipeline

Step 1: PyPDF2 (Text-Based PDFs)

Speed: Fastest (<1 second)
Use Case: Digitally created PDFs with embedded text
How It Works:
import PyPDF2

pdf_reader = PyPDF2.PdfReader(pdf_file)
text = ""
for page in pdf_reader.pages:
    text += page.extract_text() or ""
Success Criteria:
  • ≥150 characters per page extracted
  • No garbled encoding detected
  • ≥50 words per page
  • ≥30% substantive lines (≥40 chars)
If any criterion fails, system automatically triggers OCR fallback.
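The criteria above can be sketched as a single check. This is an illustrative, self-contained version (the function name and exact structure are assumptions, not the actual implementation):

```python
def passes_text_criteria(text: str, num_pages: int) -> bool:
    """Return True when PyPDF2 output looks like genuine embedded text."""
    if num_pages == 0 or not text.strip():
        return False
    chars_per_page = len(text) / num_pages
    words_per_page = len(text.split()) / num_pages
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    substantive_ratio = (
        sum(1 for ln in lines if len(ln) >= 40) / len(lines) if lines else 0.0
    )
    # Latin Extended range used by legacy Indian font encodings (see below)
    garbled_ratio = sum(1 for c in text if "\u00a0" <= c <= "\u024f") / len(text)
    return (
        chars_per_page >= 150
        and words_per_page >= 50
        and substantive_ratio >= 0.3
        and garbled_ratio <= 0.25
    )
```

A False result here is what routes the document into the OCR tiers that follow.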
Step 2: Tesseract OCR (Fast OCR)

Speed: Fast (2-5 seconds per page at 300 DPI)
Use Case: Standard scanned documents, Hindi/English content
Languages: Hindi (hin), English (eng), Kannada (kan), Tamil (tam), Telugu (tel), Bengali (ben), and more
Process:
  1. Convert PDF to images (300 DPI)
  2. Grayscale conversion
  3. Contrast enhancement (2.0x)
  4. Tesseract OCR with language pack
  5. Confidence scoring per word
Configuration:
pytesseract.image_to_string(
    image,
    lang='hin+eng',      # Multi-language
    config='--psm 3 --oem 1'  # Auto segmentation, LSTM engine
)
Confidence Threshold: ≥60%
If confidence is <60% or fewer than 100 characters are extracted, the system falls back to EasyOCR.
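The handoff rule can be expressed as a one-line predicate (the function name is illustrative, not the actual implementation):

```python
def should_fallback_to_easyocr(avg_confidence: float, text: str) -> bool:
    """Fall back when Tesseract is unsure (<60%) or produced almost no text."""
    return avg_confidence < 60 or len(text.strip()) < 100
```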
Step 3: EasyOCR (High-Accuracy OCR)

Speed: Slower (10-30 seconds per page)
Use Case: Complex Indian scripts, low-quality scans, or when Tesseract fails
Languages: 80+ including all Indian languages
Technology: CNN + LSTM neural networks
Process:
  1. Load EasyOCR reader (lazy-loaded, GPU optional)
  2. Convert PDF to images (resized to max 1500px)
  3. Grayscale + contrast enhancement
  4. Deep learning OCR inference
  5. Confidence scoring
Subprocess Isolation:
# Runs in separate process to prevent OOM crashes
process = multiprocessing.Process(
    target=easyocr_worker,
    args=(pdf_bytes, num_pages, languages),
    daemon=True
)
process.start()
process.join(timeout=120)  # 120-second timeout
Memory Protection:
  • Max image dimension: 1500px
  • Subprocess timeout: 120 seconds
  • OOM detection: Exit code -9 or 137
If EasyOCR subprocess is OOM-killed, document processing fails gracefully without crashing the server.
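The OOM detection reduces to an exit-code check: `multiprocessing` reports a negative `exitcode` when the child was killed by a signal (-9 for SIGKILL), while 137 (128 + 9) is the shell-style encoding of the same kill. A minimal sketch, with an assumed helper name:

```python
def was_oom_killed(exitcode: int) -> bool:
    """Detect a SIGKILL'd child, the usual signature of the Linux OOM killer."""
    return exitcode in (-9, 137)
```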
Step 4: PyMuPDF Fallback (Last Resort)

Speed: Medium (5-10 seconds per page)
Use Case: PDFs with unusual image embeddings that pdf2image can’t render
Triggers:
  • All pdf2image-based OCR paths return 0 chars
  • Common with eOffice PDFs, JBIG2/CCITT compression, unusual XObjects
How It Works:
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

# PyMuPDF renders where Poppler fails
doc = fitz.open(stream=pdf_bytes, filetype="pdf")
zoom = dpi / 72.0
mat = fitz.Matrix(zoom, zoom)

text = ""
for page in doc:
    pix = page.get_pixmap(matrix=mat)
    image = Image.open(io.BytesIO(pix.tobytes("png")))

    # Then OCR with Tesseract
    text += pytesseract.image_to_string(image)
Advantages:
  • Handles JBIG2 compression (common in government documents)
  • Renders CCITT Group 4 fax encoding
  • Supports unusual XObject structures

Automatic OCR Triggers

1. Low Text Density

Threshold: <150 characters per page
avg_chars_per_page = text_length / pages_extracted
if avg_chars_per_page < 150:
    # Trigger OCR
Example:
PyPDF2 extracted 342 chars from 3 pages (avg: 114 chars/page)
OCR needed: true (low density)

2. Garbled Encoding Detection

Problem: Legacy Indian PDFs (Krutidev, ISM fonts) map Devanagari glyphs to the Latin Extended Unicode range (U+00A0–U+024F).
Detection:
garbled_chars = sum(1 for c in text if '\u00a0' <= c <= '\u024f')
garbled_ratio = garbled_chars / len(text)

if garbled_ratio > 0.25:
    # Trigger OCR - text is actually font-mapping garbage
Example:
PyPDF2 output: "ºÉÚSÉxÉÉ ¨ÉÆjÉɱɪÉ"
Garbled encoding detected (47% Latin-Extended chars)
OCR needed: true
Legacy Indian document management systems used custom font encodings:
  • Krutidev: Maps Devanagari to ASCII range
  • ISM: Maps Indian scripts to extended Latin
  • eOffice: Custom font encodings
PyPDF2 reads the Unicode code points directly (e.g., \u00c9 = É), not the visual glyphs. This produces garbage text like “ºÉÚSÉxÉÉ” instead of the intended Hindi.
Solution: OCR reads the visual glyphs directly, bypassing font encoding issues.
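The heuristic is easy to verify on the sample output above (the helper name is illustrative):

```python
def garbled_ratio(text: str) -> float:
    """Fraction of characters in the Latin Extended range U+00A0-U+024F."""
    if not text:
        return 0.0
    garbled = sum(1 for c in text if "\u00a0" <= c <= "\u024f")
    return garbled / len(text)

sample = "ºÉÚSÉxÉÉ ¨ÉÆjÉɱɪÉ"  # Krutidev-style PyPDF2 output from above
ratio = garbled_ratio(sample)
```

The sample scores roughly 0.78, far above the 0.25 trigger, while ordinary ASCII English scores 0.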

3. Sparse Content

Thresholds:
  • Total text: <500 characters
  • Words per page: <50
  • Substantive lines (≥40 chars): <30%
if text_length < 500:
    # Trigger OCR

words_per_page = len(words) / pages_extracted
if words_per_page < 50 and substantive_ratio < 0.3:
    # Trigger OCR

4. No Text Lines Found

lines = [line.strip() for line in text.split('\n') if line.strip()]
if not lines:
    # Trigger OCR

Language Detection

Primary: langdetect

Library: langdetect (a Python port of Nakatani Shuyo’s language-detection library)
Process:
from langdetect import detect

# Use first 1000 chars for faster detection
sample = text[:1000]
detected_lang = detect(sample)  # Returns 'hi', 'en', 'kn', etc.
Supported Languages:
  • hi: Hindi
  • en: English
  • kn: Kannada
  • ta: Tamil
  • te: Telugu
  • bn: Bengali
  • gu: Gujarati
  • ml: Malayalam
  • mr: Marathi
  • pa: Punjabi
  • or: Odia
  • as: Assamese
  • ur: Urdu
  • sa: Sanskrit
  • ne: Nepali

Fallback: Script Detection

If language detection fails or returns unsupported language, use script analysis:
def detect_language_by_script(text):
    # Detect Unicode script ranges
    scripts = {
        'Devanagari': 0,  # U+0900-U+097F
        'Bengali': 0,     # U+0980-U+09FF
        'Kannada': 0,     # U+0C80-U+0CFF
        'Tamil': 0,       # U+0B80-U+0BFF
        'Telugu': 0,      # U+0C00-U+0C7F
        # ... more scripts
    }
    
    for char in text:
        if '\u0900' <= char <= '\u097F':
            scripts['Devanagari'] += 1
        elif '\u0980' <= char <= '\u09FF':
            scripts['Bengali'] += 1
        # ... more ranges
    
    # Return language for most common script
    primary_script = max(scripts.items(), key=lambda x: x[1])[0]
    return SCRIPT_TO_LANGUAGE_MAP[primary_script]
Script-to-Language Mapping:
  • Devanagari → Hindi (best Devanagari support)
  • Bengali/Assamese → Bengali
  • Kannada → Kannada
  • Tamil → Tamil
  • Telugu → Telugu
  • Gujarati → Gujarati
  • Malayalam → Malayalam
  • Gurmukhi → Punjabi
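The mapping above can be combined with the range-counting loop into one self-contained fallback. This is a simplified sketch covering a subset of scripts (names like `SCRIPT_RANGES` are assumptions; the real code handles more ranges):

```python
SCRIPT_RANGES = {
    "Devanagari": ("\u0900", "\u097f"),
    "Bengali": ("\u0980", "\u09ff"),
    "Tamil": ("\u0b80", "\u0bff"),
    "Telugu": ("\u0c00", "\u0c7f"),
    "Kannada": ("\u0c80", "\u0cff"),
}
SCRIPT_TO_LANGUAGE_MAP = {
    "Devanagari": "hi",
    "Bengali": "bn",
    "Tamil": "ta",
    "Telugu": "te",
    "Kannada": "kn",
}

def detect_language_by_script(text: str):
    """Count characters per Unicode script block; map the winner to a language."""
    counts = {script: 0 for script in SCRIPT_RANGES}
    for char in text:
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= char <= hi:
                counts[script] += 1
                break
    primary_script, count = max(counts.items(), key=lambda kv: kv[1])
    if count == 0:
        return None  # no supported Indic script found
    return SCRIPT_TO_LANGUAGE_MAP[primary_script]
```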

OCR Configuration by Language

Hindi:
  • Tesseract: hin+eng
  • EasyOCR: ['hi', 'en']
  • Script: Devanagari (U+0900–U+097F)
  • Strengths: Best support across all OCR engines

Confidence Scoring

Tesseract Confidence

ocr_data = pytesseract.image_to_data(
    image, 
    lang=languages,
    output_type=pytesseract.Output.DICT
)

# Extract valid confidence scores; conf is -1 for non-text boxes
valid_confidences = [int(c) for c in ocr_data['conf'] if int(c) > 0]
avg_confidence = (
    sum(valid_confidences) / len(valid_confidences) if valid_confidences else 0.0
)
Interpretation:
  • 80-100%: Excellent quality, use Tesseract
  • 60-79%: Good quality, use Tesseract
  • <60%: Poor quality, fallback to EasyOCR

EasyOCR Confidence

reader = easyocr.Reader(['hi', 'en'])
results = reader.readtext(image, detail=1)

# results = [(bbox, text, confidence), ...]
confidences = [conf * 100 for (bbox, text, conf) in results]
avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0
Interpretation:
  • 90-100%: Excellent quality
  • 70-89%: Good quality
  • 50-69%: Acceptable quality
  • <50%: Poor quality (check scan quality)
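The interpretation bands above map directly to quality labels (the function and label names here are illustrative):

```python
def easyocr_quality_tier(avg_confidence: float) -> str:
    """Translate an average EasyOCR confidence (0-100) into a quality label."""
    if avg_confidence >= 90:
        return "excellent"
    if avg_confidence >= 70:
        return "good"
    if avg_confidence >= 50:
        return "acceptable"
    return "poor"
```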

Quality Assessment

Document Quality Tiers

def assess_document_quality(text, page_count, ocr_confidence):
    chars_per_page = len(text) / page_count
    
    if chars_per_page >= 150:
        # Digital document
        return {
            'type': 'digital',
            'quality_tier': 'high',
            'recommended_engine': 'pypdf2'
        }
    elif ocr_confidence >= 80:
        return {
            'type': 'scanned',
            'quality_tier': 'high',
            'recommended_engine': 'tesseract'
        }
    elif ocr_confidence >= 60:
        return {
            'type': 'scanned',
            'quality_tier': 'medium',
            'recommended_engine': 'tesseract'
        }
    else:
        return {
            'type': 'scanned',
            'quality_tier': 'low',
            'recommended_engine': 'easyocr'
        }

Response Metadata

{
  "is_scanned": true,
  "extraction_method": "easyocr",
  "ocr_confidence": 87.5,
  "detected_language": "hi",
  "language_name": "Hindi",
  "detection_method": "language_detection",
  "quality_info": {
    "type": "scanned",
    "text_density": 142.3,
    "ocr_confidence": 87.5,
    "quality_tier": "high",
    "recommended_engine": "tesseract"
  }
}

Image Preprocessing

Contrast Enhancement

from PIL import ImageEnhance

# Convert to grayscale
image = image.convert('L')

# Enhance contrast (2.0x)
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(2.0)
Before/After:
  • Improves OCR accuracy on low-contrast scans
  • Reduces noise in background
  • Sharpens text edges

Image Resizing (EasyOCR)

# Limit max dimension to prevent OOM
max_dimension = 1500
w, h = image.size
max_dim = max(w, h)

if max_dim > max_dimension:
    scale = max_dimension / max_dim
    new_w = int(w * scale)
    new_h = int(h * scale)
    image = image.resize((new_w, new_h), Image.LANCZOS)
Why:
  • EasyOCR can OOM on large images (>2000px)
  • Resizing to 1500px maintains quality while reducing memory
  • Subprocess isolation prevents server crashes

Error Handling

Missing OCR Libraries

Symptoms:
{
  "extraction_method": "pypdf2_no_ocr",
  "error": "Document needs OCR but OCR libraries not installed"
}
Solution:
# Install Tesseract
apt-get install tesseract-ocr tesseract-ocr-hin tesseract-ocr-eng

# Install Python libraries
pip install pytesseract pdf2image Pillow

# Install EasyOCR (optional, for high accuracy)
pip install easyocr torch torchvision
EasyOCR Out of Memory

Symptoms:
{
  "extraction_method": "easyocr_failed",
  "error": "EasyOCR subprocess crashed (OOM killed)"
}
Cause: Large image plus exhausted GPU/CPU memory
Solution:
  • Reduce DPI (300 → 200)
  • Reduce max image dimension (1500 → 1000)
  • Falls back to Tesseract automatically
PyMuPDF Not Installed

Symptoms:
{
  "extraction_method": "pymupdf_tesseract_failed",
  "error": "PyMuPDF not installed"
}
Solution:
pip install pymupdf
Impact: Can’t render eOffice/JBIG2 PDFs (rare edge case)
Gibberish OCR Output

Symptoms:
⚠️ Detected gibberish in Tesseract OCR output
Causes:
  • Very poor scan quality
  • Wrong language pack selected
  • Document uses unsupported script
Solution:
  • System automatically tries EasyOCR
  • Check extracted text preview
  • Try preprocessing PDF (increase contrast, denoise)

Performance Benchmarks

PyPDF2

Speed: <1 second
Accuracy: 100% (for text PDFs)
Use: 70% of documents

Tesseract OCR

Speed: 2-5 seconds/page
Accuracy: 85-95% (Hindi/English)
Use: 25% of documents

EasyOCR

Speed: 10-30 seconds/page
Accuracy: 95-99% (all languages)
Use: 5% of documents

Best Practices

For best OCR results:
  • Use 300 DPI scans (higher DPI = slower but more accurate)
  • Ensure good lighting and contrast in original scans
  • Remove watermarks and background noise
  • Use color scans if possible (better quality retention)
For faster processing:
  • Reduce num_pages (1-2 pages usually sufficient for tagging)
  • Provide text-based PDFs when possible
  • Use lower DPI (200) for simple documents
Avoid common mistakes:
  • Don’t use very low DPI (<150) — OCR accuracy drops significantly
  • Don’t process handwritten documents (OCR designed for printed text)
  • Don’t expect 100% accuracy on degraded scans
