
Overview

The system uses a 3-tier hybrid OCR strategy to extract text from any PDF document — whether digitally created, scanned, or using legacy font encodings. The intelligent extraction pipeline automatically selects the optimal method based on document characteristics.
Supports 80+ languages including all major Indian languages (Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, Assamese, Urdu, Sanskrit, Nepali).

Extraction Pipeline

Step 1: PyPDF2 (Text-Based PDFs)

Speed: Fastest (<1 second)
Use Case: Digitally created PDFs with embedded text
How It Works:
import PyPDF2

pdf_reader = PyPDF2.PdfReader(pdf_file)
text = ""
for page in pdf_reader.pages:
    text += page.extract_text() or ""
Success Criteria:
  • ≥150 characters per page extracted
  • No garbled encoding detected
  • ≥50 words per page
  • ≥30% substantive lines (≥40 chars)
If any criterion fails, system automatically triggers OCR fallback.
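The criteria above can be sketched as a single check. This is an illustrative, self-contained version (the function name and exact structure are assumptions, not the actual implementation):

```python
def passes_text_criteria(text: str, num_pages: int) -> bool:
    """Return True when PyPDF2 output looks like genuine embedded text."""
    if num_pages == 0 or not text.strip():
        return False
    chars_per_page = len(text) / num_pages
    words_per_page = len(text.split()) / num_pages
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    substantive_ratio = (
        sum(1 for ln in lines if len(ln) >= 40) / len(lines) if lines else 0.0
    )
    # Latin Extended range used by legacy Indian font encodings (see below)
    garbled_ratio = sum(1 for c in text if "\u00a0" <= c <= "\u024f") / len(text)
    return (
        chars_per_page >= 150
        and words_per_page >= 50
        and substantive_ratio >= 0.3
        and garbled_ratio <= 0.25
    )
```

A False result here is what routes the document into the OCR tiers that follow.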
Step 2: Tesseract OCR (Fast OCR)

Speed: Fast (2-5 seconds per page at 300 DPI)
Use Case: Standard scanned documents, Hindi/English content
Languages: Hindi (hin), English (eng), Kannada (kan), Tamil (tam), Telugu (tel), Bengali (ben), and more
Process:
  1. Convert PDF to images (300 DPI)
  2. Grayscale conversion
  3. Contrast enhancement (2.0x)
  4. Tesseract OCR with language pack
  5. Confidence scoring per word
Configuration:
pytesseract.image_to_string(
    image,
    lang='hin+eng',      # Multi-language
    config='--psm 3 --oem 1'  # Auto segmentation, LSTM engine
)
Confidence Threshold: ≥60%
If confidence is <60% or fewer than 100 characters are extracted, the system falls back to EasyOCR.
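The handoff rule can be expressed as a one-line predicate (the function name is illustrative, not the actual implementation):

```python
def should_fallback_to_easyocr(avg_confidence: float, text: str) -> bool:
    """Fall back when Tesseract is unsure (<60%) or produced almost no text."""
    return avg_confidence < 60 or len(text.strip()) < 100
```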
Step 3: EasyOCR (High-Accuracy OCR)

Speed: Slower (10-30 seconds per page)
Use Case: Complex Indian scripts, low-quality scans, or when Tesseract fails
Languages: 80+ including all Indian languages
Technology: CNN + LSTM neural networks
Process:
  1. Load EasyOCR reader (lazy-loaded, GPU optional)
  2. Convert PDF to images (resized to max 1500px)
  3. Grayscale + contrast enhancement
  4. Deep learning OCR inference
  5. Confidence scoring
Subprocess Isolation:
# Runs in separate process to prevent OOM crashes
process = multiprocessing.Process(
    target=easyocr_worker,
    args=(pdf_bytes, num_pages, languages),
    daemon=True
)
process.start()
process.join(timeout=120)  # 120-second timeout
Memory Protection:
  • Max image dimension: 1500px
  • Subprocess timeout: 120 seconds
  • OOM detection: Exit code -9 or 137
If EasyOCR subprocess is OOM-killed, document processing fails gracefully without crashing the server.
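The OOM detection reduces to an exit-code check: `multiprocessing` reports a negative `exitcode` when the child was killed by a signal (-9 for SIGKILL), while 137 (128 + 9) is the shell-style encoding of the same kill. A minimal sketch, with an assumed helper name:

```python
def was_oom_killed(exitcode: int) -> bool:
    """Detect a SIGKILL'd child, the usual signature of the Linux OOM killer."""
    return exitcode in (-9, 137)
```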
Step 4: PyMuPDF Fallback (Last Resort)

Speed: Medium (5-10 seconds per page)
Use Case: PDFs with unusual image embeddings that pdf2image can’t render
Triggers:
  • All pdf2image-based OCR paths return 0 chars
  • Common with eOffice PDFs, JBIG2/CCITT compression, unusual XObjects
How It Works:
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

# PyMuPDF renders where Poppler fails
doc = fitz.open(stream=pdf_bytes, filetype="pdf")
zoom = dpi / 72.0
mat = fitz.Matrix(zoom, zoom)

text = ""
for page in doc:
    pix = page.get_pixmap(matrix=mat)
    image = Image.open(io.BytesIO(pix.tobytes("png")))

    # Then OCR with Tesseract
    text += pytesseract.image_to_string(image)
Advantages:
  • Handles JBIG2 compression (common in government documents)
  • Renders CCITT Group 4 fax encoding
  • Supports unusual XObject structures

Automatic OCR Triggers

1. Low Text Density

Threshold: <150 characters per page
avg_chars_per_page = text_length / pages_extracted
if avg_chars_per_page < 150:
    # Trigger OCR
Example:
PyPDF2 extracted 342 chars from 3 pages (avg: 114 chars/page)
OCR needed: true (low density)

2. Garbled Encoding Detection

Problem: Legacy Indian PDFs (Krutidev, ISM fonts) map Devanagari glyphs to the Latin Extended Unicode range (U+00A0–U+024F).
Detection:
garbled_chars = sum(1 for c in text if '\u00a0' <= c <= '\u024f')
garbled_ratio = garbled_chars / len(text)

if garbled_ratio > 0.25:
    # Trigger OCR - text is actually font-mapping garbage
Example:
PyPDF2 output: "ºÉÚSÉxÉÉ ¨ÉÆjÉɱɪÉ"
Garbled encoding detected (47% Latin-Extended chars)
OCR needed: true
Legacy Indian document management systems used custom font encodings:
  • Krutidev: Maps Devanagari to ASCII range
  • ISM: Maps Indian scripts to extended Latin
  • eOffice: Custom font encodings
PyPDF2 reads the Unicode code points directly (e.g., \u00c9 = É), not the visual glyphs. This produces garbage text like “ºÉÚSÉxÉÉ” instead of the intended Hindi.
Solution: OCR reads the visual glyphs directly, bypassing font encoding issues.
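The heuristic is easy to verify on the sample output above (the helper name is illustrative):

```python
def garbled_ratio(text: str) -> float:
    """Fraction of characters in the Latin Extended range U+00A0-U+024F."""
    if not text:
        return 0.0
    garbled = sum(1 for c in text if "\u00a0" <= c <= "\u024f")
    return garbled / len(text)

sample = "ºÉÚSÉxÉÉ ¨ÉÆjÉɱɪÉ"  # Krutidev-style PyPDF2 output from above
ratio = garbled_ratio(sample)
```

The sample scores roughly 0.78, far above the 0.25 trigger, while ordinary ASCII English scores 0.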

3. Sparse Content

Thresholds:
  • Total text: <500 characters
  • Words per page: <50
  • Substantive lines (≥40 chars): <30%
if text_length < 500:
    # Trigger OCR

words_per_page = len(words) / pages_extracted
if words_per_page < 50 and substantive_ratio < 0.3:
    # Trigger OCR

4. No Text Lines Found

lines = [line.strip() for line in text.split('\n') if line.strip()]
if not lines:
    # Trigger OCR

Language Detection

Primary: langdetect

Library: langdetect (a Python port of Nakatani Shuyo’s language-detection library)
Process:
from langdetect import detect

# Use first 1000 chars for faster detection
sample = text[:1000]
detected_lang = detect(sample)  # Returns 'hi', 'en', 'kn', etc.
Supported Languages:
  • hi: Hindi
  • en: English
  • kn: Kannada
  • ta: Tamil
  • te: Telugu
  • bn: Bengali
  • gu: Gujarati
  • ml: Malayalam
  • mr: Marathi
  • pa: Punjabi
  • or: Odia
  • as: Assamese
  • ur: Urdu
  • sa: Sanskrit
  • ne: Nepali

Fallback: Script Detection

If language detection fails or returns unsupported language, use script analysis:
def detect_language_by_script(text):
    # Detect Unicode script ranges
    scripts = {
        'Devanagari': 0,  # U+0900-U+097F
        'Bengali': 0,     # U+0980-U+09FF
        'Kannada': 0,     # U+0C80-U+0CFF
        'Tamil': 0,       # U+0B80-U+0BFF
        'Telugu': 0,      # U+0C00-U+0C7F
        # ... more scripts
    }
    
    for char in text:
        if '\u0900' <= char <= '\u097F':
            scripts['Devanagari'] += 1
        elif '\u0980' <= char <= '\u09FF':
            scripts['Bengali'] += 1
        # ... more ranges
    
    # Return language for most common script
    primary_script = max(scripts.items(), key=lambda x: x[1])[0]
    return SCRIPT_TO_LANGUAGE_MAP[primary_script]
Script-to-Language Mapping:
  • Devanagari → Hindi (best Devanagari support)
  • Bengali/Assamese → Bengali
  • Kannada → Kannada
  • Tamil → Tamil
  • Telugu → Telugu
  • Gujarati → Gujarati
  • Malayalam → Malayalam
  • Gurmukhi → Punjabi
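The mapping above can be combined with the range-counting loop into one self-contained fallback. This is a simplified sketch covering a subset of scripts (names like `SCRIPT_RANGES` are assumptions; the real code handles more ranges):

```python
SCRIPT_RANGES = {
    "Devanagari": ("\u0900", "\u097f"),
    "Bengali": ("\u0980", "\u09ff"),
    "Tamil": ("\u0b80", "\u0bff"),
    "Telugu": ("\u0c00", "\u0c7f"),
    "Kannada": ("\u0c80", "\u0cff"),
}
SCRIPT_TO_LANGUAGE_MAP = {
    "Devanagari": "hi",
    "Bengali": "bn",
    "Tamil": "ta",
    "Telugu": "te",
    "Kannada": "kn",
}

def detect_language_by_script(text: str):
    """Count characters per Unicode script block; map the winner to a language."""
    counts = {script: 0 for script in SCRIPT_RANGES}
    for char in text:
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= char <= hi:
                counts[script] += 1
                break
    primary_script, count = max(counts.items(), key=lambda kv: kv[1])
    if count == 0:
        return None  # no supported Indic script found
    return SCRIPT_TO_LANGUAGE_MAP[primary_script]
```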

OCR Configuration by Language

Hindi:
  • Tesseract: hin+eng
  • EasyOCR: ['hi', 'en']
  • Script: Devanagari (U+0900–U+097F)
  • Strengths: Best support across all OCR engines

Confidence Scoring

Tesseract Confidence

ocr_data = pytesseract.image_to_data(
    image, 
    lang=languages,
    output_type=pytesseract.Output.DICT
)

# Extract valid confidence scores; conf is -1 for non-text boxes
valid_confidences = [int(c) for c in ocr_data['conf'] if int(c) > 0]
avg_confidence = (
    sum(valid_confidences) / len(valid_confidences) if valid_confidences else 0.0
)
Interpretation:
  • 80-100%: Excellent quality, use Tesseract
  • 60-79%: Good quality, use Tesseract
  • <60%: Poor quality, fallback to EasyOCR

EasyOCR Confidence

reader = easyocr.Reader(['hi', 'en'])
results = reader.readtext(image, detail=1)

# results = [(bbox, text, confidence), ...]
confidences = [conf * 100 for (bbox, text, conf) in results]
avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0
Interpretation:
  • 90-100%: Excellent quality
  • 70-89%: Good quality
  • 50-69%: Acceptable quality
  • <50%: Poor quality (check scan quality)
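The interpretation bands above map directly to quality labels (the function and label names here are illustrative):

```python
def easyocr_quality_tier(avg_confidence: float) -> str:
    """Translate an average EasyOCR confidence (0-100) into a quality label."""
    if avg_confidence >= 90:
        return "excellent"
    if avg_confidence >= 70:
        return "good"
    if avg_confidence >= 50:
        return "acceptable"
    return "poor"
```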

Quality Assessment

Document Quality Tiers

def assess_document_quality(text, page_count, ocr_confidence):
    chars_per_page = len(text) / page_count
    
    if chars_per_page >= 150:
        # Digital document
        return {
            'type': 'digital',
            'quality_tier': 'high',
            'recommended_engine': 'pypdf2'
        }
    elif ocr_confidence >= 80:
        return {
            'type': 'scanned',
            'quality_tier': 'high',
            'recommended_engine': 'tesseract'
        }
    elif ocr_confidence >= 60:
        return {
            'type': 'scanned',
            'quality_tier': 'medium',
            'recommended_engine': 'tesseract'
        }
    else:
        return {
            'type': 'scanned',
            'quality_tier': 'low',
            'recommended_engine': 'easyocr'
        }

Response Metadata

{
  "is_scanned": true,
  "extraction_method": "easyocr",
  "ocr_confidence": 87.5,
  "detected_language": "hi",
  "language_name": "Hindi",
  "detection_method": "language_detection",
  "quality_info": {
    "type": "scanned",
    "text_density": 142.3,
    "ocr_confidence": 87.5,
    "quality_tier": "high",
    "recommended_engine": "tesseract"
  }
}

Image Preprocessing

Contrast Enhancement

from PIL import ImageEnhance

# Convert to grayscale
image = image.convert('L')

# Enhance contrast (2.0x)
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(2.0)
Before/After:
  • Improves OCR accuracy on low-contrast scans
  • Reduces noise in background
  • Sharpens text edges

Image Resizing (EasyOCR)

# Limit max dimension to prevent OOM
max_dimension = 1500
w, h = image.size
max_dim = max(w, h)

if max_dim > max_dimension:
    scale = max_dimension / max_dim
    new_w = int(w * scale)
    new_h = int(h * scale)
    image = image.resize((new_w, new_h), Image.LANCZOS)
Why:
  • EasyOCR can OOM on large images (>2000px)
  • Resizing to 1500px maintains quality while reducing memory
  • Subprocess isolation prevents server crashes

Error Handling

Missing OCR Libraries

Symptoms:
{
  "extraction_method": "pypdf2_no_ocr",
  "error": "Document needs OCR but OCR libraries not installed"
}
Solution:
# Install Tesseract
apt-get install tesseract-ocr tesseract-ocr-hin tesseract-ocr-eng

# Install Python libraries
pip install pytesseract pdf2image Pillow

# Install EasyOCR (optional, for high accuracy)
pip install easyocr torch torchvision
EasyOCR Out of Memory

Symptoms:
{
  "extraction_method": "easyocr_failed",
  "error": "EasyOCR subprocess crashed (OOM killed)"
}
Cause: Large image plus exhausted GPU/CPU memory
Solution:
  • Reduce DPI (300 → 200)
  • Reduce max image dimension (1500 → 1000)
  • Falls back to Tesseract automatically
PyMuPDF Not Installed

Symptoms:
{
  "extraction_method": "pymupdf_tesseract_failed",
  "error": "PyMuPDF not installed"
}
Solution:
pip install pymupdf
Impact: Can’t render eOffice/JBIG2 PDFs (rare edge case)
Gibberish OCR Output

Symptoms:
⚠️ Detected gibberish in Tesseract OCR output
Causes:
  • Very poor scan quality
  • Wrong language pack selected
  • Document uses unsupported script
Solution:
  • System automatically tries EasyOCR
  • Check extracted text preview
  • Try preprocessing PDF (increase contrast, denoise)

Performance Benchmarks

PyPDF2

Speed: <1 second
Accuracy: 100% (for text PDFs)
Use: 70% of documents

Tesseract OCR

Speed: 2-5 seconds/page
Accuracy: 85-95% (Hindi/English)
Use: 25% of documents

EasyOCR

Speed: 10-30 seconds/page
Accuracy: 95-99% (all languages)
Use: 5% of documents

Best Practices

For best OCR results:
  • Use 300 DPI scans (higher DPI = slower but more accurate)
  • Ensure good lighting and contrast in original scans
  • Remove watermarks and background noise
  • Use color scans if possible (better quality retention)
For faster processing:
  • Reduce num_pages (1-2 pages usually sufficient for tagging)
  • Provide text-based PDFs when possible
  • Use lower DPI (200) for simple documents
Avoid common mistakes:
  • Don’t use very low DPI (<150) — OCR accuracy drops significantly
  • Don’t process handwritten documents (OCR designed for printed text)
  • Don’t expect 100% accuracy on degraded scans
