Overview
The system uses a 3-tier hybrid OCR strategy to extract text from any PDF document, whether digitally created, scanned, or using legacy font encodings. The intelligent extraction pipeline automatically selects the optimal method based on document characteristics.
It supports 80+ languages, including all major Indian languages (Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, Assamese, Urdu, Sanskrit, Nepali).
Extraction Pipeline
PyPDF2 (Text-Based PDFs)
Speed: Fastest (< 1 second)
Use Case: Digitally created PDFs with embedded text
How It Works: PyPDF2 reads the text layer embedded in the PDF directly, with no rendering or OCR.
Success Criteria:
- ≥150 characters per page extracted
- No garbled encoding detected
- ≥50 words per page
- ≥30% substantive lines (≥40 chars)
If any criterion fails, the system automatically triggers the OCR fallback.
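The success criteria above can be sketched as a single check. This is a minimal illustration, not the system's actual code: the function name `passes_text_criteria` is hypothetical, the thresholds are averaged over the whole document (an assumption, since the source does not say whether they apply per page or on average), and the garbled-encoding result is passed in as a flag computed elsewhere.

```python
MIN_CHARS_PER_PAGE = 150
MIN_WORDS_PER_PAGE = 50
MIN_SUBSTANTIVE_RATIO = 0.30   # share of lines with >= 40 characters
SUBSTANTIVE_LINE_CHARS = 40


def passes_text_criteria(pages, garbled=False):
    """Return True when directly extracted text looks usable.

    pages   -- list of strings, one per PDF page (hypothetical input shape)
    garbled -- result of a separate garbled-encoding check
    """
    if not pages or garbled:
        return False

    n_pages = len(pages)
    chars_per_page = sum(len(p) for p in pages) / n_pages
    words_per_page = sum(len(p.split()) for p in pages) / n_pages

    # Collect non-empty lines across all pages.
    lines = [ln.strip() for p in pages for ln in p.splitlines() if ln.strip()]
    if not lines:
        return False
    substantive = sum(1 for ln in lines if len(ln) >= SUBSTANTIVE_LINE_CHARS)

    return (
        chars_per_page >= MIN_CHARS_PER_PAGE
        and words_per_page >= MIN_WORDS_PER_PAGE
        and substantive / len(lines) >= MIN_SUBSTANTIVE_RATIO
    )
```

A caller would route the document to OCR whenever this returns False.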
Tesseract OCR (Fast OCR)
Speed: Fast (2-5 seconds per page at 300 DPI)
Use Case: Standard scanned documents, Hindi/English content
Languages: Hindi (hin), English (eng), Kannada (kan), Tamil (tam), Telugu (tel), Bengali (ben), and more
Process:
- Convert PDF to images (300 DPI)
- Grayscale conversion
- Contrast enhancement (2.0x)
- Tesseract OCR with language pack
- Confidence scoring per word
Confidence Threshold: ≥60%
If confidence <60% or <100 chars extracted → EasyOCR fallback
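The fallback rule can be illustrated with a small helper that works on word-level OCR output. This is a sketch under assumptions: the input is a dict with parallel `text` and `conf` lists, the shape that pytesseract's `image_to_data` produces in dict mode, where non-word boxes carry a confidence of -1. The function names are hypothetical, and the use of a plain mean over words is an assumption about how the per-word scores are combined.

```python
def mean_confidence(data):
    """Average word confidence (0-100), ignoring empty boxes and conf == -1."""
    scores = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions report confidences as strings
        if word.strip() and conf >= 0:
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def needs_easyocr(data, min_conf=60.0, min_chars=100):
    """Apply the documented rule: confidence < 60% or < 100 chars extracted."""
    text = " ".join(w for w in data["text"] if w.strip())
    return mean_confidence(data) < min_conf or len(text) < min_chars
```

Keeping the decision separate from the OCR call makes the thresholds easy to test without Tesseract installed.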
EasyOCR (High-Accuracy OCR)
Speed: Slower (10-30 seconds per page)
Use Case: Complex Indian scripts, low-quality scans, or when Tesseract fails
Languages: 80+ including all Indian languages
Technology: CNN + LSTM neural networks
Process:
- Load EasyOCR reader (lazy-loaded, GPU optional)
- Convert PDF to images (resized to max 1500px)
- Grayscale + contrast enhancement
- Deep learning OCR inference
- Confidence scoring
Memory Protection:
- Max image dimension: 1500px
- Subprocess timeout: 120 seconds
- OOM detection: Exit code -9 or 137
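The subprocess timeout and OOM detection could look roughly like the following. This is a sketch only: `run_isolated` and its status strings are hypothetical names, and the worker command is whatever script performs the EasyOCR inference. It relies on the standard behaviour that a process killed by SIGKILL reports return code -9 to Python's `subprocess` module, or 137 when the code is relayed through a shell.

```python
import subprocess
import sys

OOM_EXIT_CODES = {-9, 137}   # SIGKILL seen directly / relayed by a shell
TIMEOUT_SECONDS = 120


def run_isolated(args, timeout=TIMEOUT_SECONDS):
    """Run an OCR worker command in a child process.

    Returns (status, output) where status is one of
    "ok", "timeout", "oom" or "error" (hypothetical labels).
    """
    try:
        proc = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "timeout", ""
    if proc.returncode in OOM_EXIT_CODES:
        return "oom", ""
    if proc.returncode != 0:
        return "error", proc.stderr
    return "ok", proc.stdout


# Example with a trivial stand-in for the real worker script.
status, output = run_isolated([sys.executable, "-c", "print('extracted text')"])
```

Because the inference runs in a child process, an out-of-memory kill ends only the worker, and the parent can fall back to another engine.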
PyMuPDF Fallback (Last Resort)
Speed: Medium (5-10 seconds per page)
Use Case: PDFs with unusual image embeddings that pdf2image can’t render
Triggers:
- All pdf2image-based OCR paths return 0 chars
- Common with eOffice PDFs, JBIG2/CCITT compression, unusual XObjects
Advantages:
- Handles JBIG2 compression (common in government documents)
- Renders CCITT Group 4 fax encoding
- Supports unusual XObject structures
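The ordering of the extraction methods in this section can be pictured as a chain of callables tried in sequence. The sketch below assumes a hypothetical interface in which each tier takes a PDF path and returns text, or None when its own acceptance criteria fail; the stub lambdas stand in for the real engines.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical tier signature: path in, text (or None on failure) out.
Extractor = Callable[[str], Optional[str]]


def run_pipeline(pdf_path: str, tiers: List[Tuple[str, Extractor]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, text) for the first success."""
    for name, extract in tiers:
        text = extract(pdf_path)
        if text and text.strip():
            return name, text
    return "none", ""


# Stub tiers illustrating a document that only the last resort can read.
tiers = [
    ("pypdf2", lambda path: None),      # text layer failed the criteria
    ("tesseract", lambda path: None),   # confidence below threshold
    ("easyocr", lambda path: ""),       # 0 chars extracted
    ("pymupdf", lambda path: "rendered and OCRed text"),
]
method, text = run_pipeline("scan.pdf", tiers)
```

Returning the tier name alongside the text lets the caller report which method produced the result.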
Automatic OCR Triggers
1. Low Text Density
Threshold: <150 characters per page
2. Garbled Encoding Detection
Problem: Legacy Indian PDFs (Krutidev, ISM fonts) map Devanagari glyphs to the Latin Extended Unicode range (U+00A0–U+024F).
Detection: The extracted text is checked for characters from this range.
Why This Happens
Legacy Indian document management systems used custom font encodings:
- Krutidev: Maps Devanagari to ASCII range
- ISM: Maps Indian scripts to extended Latin
- eOffice: Custom font encodings
PDF text extraction returns the stored character codes (e.g. \u00c9 = É), not the visual glyphs. This produces garbage text like “ºÉÚSÉxÉÉ” instead of the intended Hindi.
Solution: OCR reads the visual glyphs directly, bypassing font encoding issues.
3. Sparse Content
Thresholds:
- Total text: <500 characters
- Words per page: <50
- Substantive lines (≥40 chars): <30%
4. No Text Lines Found
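Trigger 2 (garbled encoding) lends itself to a short illustration: measure how much of the extracted text falls in U+00A0–U+024F. The sketch below is an assumption about how such a check might work, not the system's actual detector; the function names and the 0.3 cut-off are hypothetical, and a real threshold would need tuning so that ordinary accented Latin text is not flagged.

```python
LATIN_EXT_START = 0x00A0
LATIN_EXT_END = 0x024F


def latin_extended_ratio(text):
    """Share of non-whitespace characters in U+00A0..U+024F."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if LATIN_EXT_START <= ord(c) <= LATIN_EXT_END)
    return hits / len(chars)


def looks_garbled(text, threshold=0.3):
    """Flag text dominated by Latin Extended characters (assumed threshold)."""
    return latin_extended_ratio(text) >= threshold


# The sample from the text above: 6 of its 8 characters are in the range.
sample_ratio = latin_extended_ratio("ºÉÚSÉxÉÉ")
```

Correctly encoded Devanagari sits in U+0900–U+097F, so genuine Hindi text scores 0 on this measure.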
Language Detection
Primary: langdetect
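Library: langdetect (a Python port of Google’s language-detection library)
Process: Detect the language of the extracted text and map the returned code to a supported language:
- hi: Hindi
- en: English
- kn: Kannada
- ta: Tamil
- te: Telugu
- bn: Bengali
- gu: Gujarati
- ml: Malayalam
- mr: Marathi
- pa: Punjabi
- or: Odia
- as: Assamese
- ur: Urdu
- sa: Sanskrit
- ne: Nepali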
Fallback: Script Detection
If language detection fails or returns an unsupported language, use script analysis:
- Devanagari → Hindi (best Devanagari support)
- Bengali/Assamese → Bengali
- Kannada → Kannada
- Tamil → Tamil
- Telugu → Telugu
- Gujarati → Gujarati
- Malayalam → Malayalam
- Gurmukhi → Punjabi
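The script fallback can be sketched by counting characters per Unicode block and picking the dominant script. The block ranges are the standard Unicode ranges for each script; the function name `language_from_script` and the majority-vote rule are assumptions made for illustration.

```python
from collections import Counter

# (start, end, language code) for each script in the fallback list.
SCRIPT_RANGES = [
    (0x0900, 0x097F, "hi"),  # Devanagari -> Hindi
    (0x0980, 0x09FF, "bn"),  # Bengali/Assamese -> Bengali
    (0x0A00, 0x0A7F, "pa"),  # Gurmukhi -> Punjabi
    (0x0A80, 0x0AFF, "gu"),  # Gujarati
    (0x0B80, 0x0BFF, "ta"),  # Tamil
    (0x0C00, 0x0C7F, "te"),  # Telugu
    (0x0C80, 0x0CFF, "kn"),  # Kannada
    (0x0D00, 0x0D7F, "ml"),  # Malayalam
]


def language_from_script(text):
    """Return the language code of the dominant Indic script, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for start, end, lang in SCRIPT_RANGES:
            if start <= cp <= end:
                counts[lang] += 1
                break
    return counts.most_common(1)[0][0] if counts else None
```

Text containing none of these scripts yields None, leaving the caller to choose a default such as English.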
OCR Configuration by Language
- Hindi
- Tamil
- Telugu
- Bengali
- Kannada
Hindi configuration:
Tesseract: hin+eng
EasyOCR: ['hi', 'en']
Script: Devanagari (U+0900-U+097F)
Strengths: Best support across all OCR engines
Confidence Scoring
Tesseract Confidence
- 80-100%: Excellent quality, use Tesseract
- 60-79%: Good quality, use Tesseract
- <60%: Poor quality, fallback to EasyOCR
EasyOCR Confidence
- 90-100%: Excellent quality
- 70-89%: Good quality
- 50-69%: Acceptable quality
- <50%: Poor quality (check scan quality)
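The two confidence scales above translate directly into lookup helpers. This is an illustrative sketch with hypothetical function names and tier labels; it assumes scores are expressed as percentages, so EasyOCR's native 0-1 confidences would be multiplied by 100 before being passed in.

```python
def tesseract_tier(conf_pct):
    """Map a Tesseract confidence (0-100) to a quality tier."""
    if conf_pct >= 80:
        return "excellent"
    if conf_pct >= 60:
        return "good"
    return "poor"  # below 60%: fall back to EasyOCR


def easyocr_tier(conf_pct):
    """Map an EasyOCR confidence (0-100) to a quality tier."""
    if conf_pct >= 90:
        return "excellent"
    if conf_pct >= 70:
        return "good"
    if conf_pct >= 50:
        return "acceptable"
    return "poor"  # below 50%: check scan quality
```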
Quality Assessment
Document Quality Tiers
Response Metadata
Image Preprocessing
Contrast Enhancement
- Improves OCR accuracy on low-contrast scans
- Reduces noise in background
- Sharpens text edges
Image Resizing (EasyOCR)
- EasyOCR can OOM on large images (>2000px)
- Resizing to 1500px maintains quality while reducing memory
- Subprocess isolation prevents server crashes
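The resize step amounts to scaling the longer side down to the cap while preserving the aspect ratio. The helper below is a sketch with a hypothetical name, `fit_within`; it only computes the target size, which would then be handed to whatever imaging library performs the actual resize.

```python
def fit_within(width, height, max_dim=1500):
    """Return (width, height) scaled so the longer side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough, leave untouched
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# An A4 page rendered at 300 DPI is 2480 x 3508 pixels.
a4_size = fit_within(2480, 3508)
```

Images that are already within the limit are returned unchanged, so small scans are never upscaled.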
Error Handling
OCR Libraries Not Installed
Symptoms:
Solution:
EasyOCR OOM (Out of Memory)
Symptoms: The EasyOCR subprocess exits with code -9 or 137.
Cause: Large image combined with exhausted GPU/CPU memory
Solution:
- Reduce DPI (300 → 200)
- Reduce max image dimension (1500 → 1000)
- Falls back to Tesseract automatically
PyMuPDF Not Available
Symptoms:
Solution:
Impact: Can’t render eOffice/JBIG2 PDFs (rare edge case)
Gibberish in OCR Output
Symptoms:
Causes:
- Very poor scan quality
- Wrong language pack selected
- Document uses unsupported script
Solutions:
- System automatically tries EasyOCR
- Check extracted text preview
- Try preprocessing PDF (increase contrast, denoise)
Performance Benchmarks
PyPDF2
Speed: <1 second
Accuracy: 100% (for text PDFs)
Use: 70% of documents
Tesseract OCR
Speed: 2-5 seconds/page
Accuracy: 85-95% (Hindi/English)
Use: 25% of documents
EasyOCR
Speed: 10-30 seconds/page
Accuracy: 95-99% (all languages)
Use: 5% of documents