Overview
Invoice OCR supports PDF processing via OpenRouter’s file-parser plugin. The plugin extracts text and structure from PDFs before vision model analysis, enabling accurate extraction from multi-page invoices.
PDF support is available in all three OCR modes: /api/ocr, /api/ocr-structured, and /api/ocr-structured-v4.
Quick Start
Choose Engine
Configure the PDF parsing engine (optional), either globally or per request:
- Environment Variable (Global): set in .env.local
- Request Body (Per-Request)
File-Parser Plugin
What It Does
The file-parser plugin pre-processes PDFs before they reach the vision model:
- Text Extraction: Pulls embedded text from PDF structure
- Layout Analysis: Detects tables, headers, line items
- Page Segmentation: Splits multi-page documents into logical sections
- Metadata Enrichment: Adds bounding boxes, confidence scores, page numbers
This pre-processing significantly improves extraction quality for complex invoices with tables, multiple pages, and dense layouts.
Configuration
PDF Engines
- pdf-text (Default): Fast text extraction from PDF structure
- mistral-ocr (High Accuracy): OCR-based extraction for scanned PDFs
- native (Model-Specific): Uses vision model’s native PDF support
Engine Comparison
| Engine | Best For | Speed | Quality | Cost |
|---|---|---|---|---|
| pdf-text | Digital PDFs with embedded text | ⚡ Fast | ⭐⭐⭐⭐ (digital) ⭐⭐ (scanned) | 💰 Low |
| mistral-ocr | Scanned PDFs, images of documents | 🐌 Slow | ⭐⭐⭐⭐⭐ | 💰💰💰 High |
| native | Models with built-in PDF support | ⚡⚡ Very Fast | ⭐⭐⭐⭐ (varies) | 💰💰 Medium |
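As a rough illustration of how an engine choice could be turned into a request, the sketch below builds a `plugins` entry in the shape used by OpenRouter's PDF documentation (`id: "file-parser"` with a `pdf.engine` field). The helper name `buildFileParserPlugin`, the placeholder model slug, and the fallback-to-default behavior are assumptions for illustration; how Invoice OCR assembles this internally is not shown on this page.

```typescript
// Engine names as listed in the comparison table.
type PdfEngine = "pdf-text" | "mistral-ocr" | "native";

interface FileParserPlugin {
  id: "file-parser";
  pdf: { engine: PdfEngine };
}

const KNOWN_ENGINES: PdfEngine[] = ["pdf-text", "mistral-ocr", "native"];

// Hypothetical helper: returns a plugin entry for the requested engine,
// using "pdf-text" (the documented default) for missing or unknown values.
function buildFileParserPlugin(engine?: string): FileParserPlugin {
  const isKnown = KNOWN_ENGINES.indexOf(engine as PdfEngine) >= 0;
  const selected: PdfEngine = isKnown ? (engine as PdfEngine) : "pdf-text";
  return { id: "file-parser", pdf: { engine: selected } };
}

// Example request fragment. The model slug is a placeholder, not a real model.
const requestBody = {
  model: "your-vision-model",
  plugins: [buildFileParserPlugin("mistral-ocr")],
};
```

Defaulting unknown values to pdf-text keeps a typo in configuration from silently selecting the slowest and most expensive engine.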
When to Use Each
pdf-text (Default)
Use Cases:
- Invoices generated from billing software (QuickBooks, SAP, etc.)
- PDFs exported from Word/Excel
- Digital documents with selectable text
Advantages:
- Fast processing (< 2 seconds per page)
- Low cost
- Preserves text structure and formatting
Limitations:
- Poor quality on scanned/photographed documents
- May miss handwritten annotations
- Requires embedded text in PDF
Input Formats
1. Base64 (Recommended for Browser Uploads)
The API accepts both:
- Data URLs: "data:application/pdf;base64,JVBERi0xLjQK..."
- Raw base64: "JVBERi0xLjQK..."
2. Public URL (Recommended for Server-Side)
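Because the API accepts a public URL, a data URL, or raw base64, a caller may want to classify and normalize the input before sending it. The sketch below is one way to do that; `classifyPdfInput` and the `PdfInput` type are hypothetical names, and the `looksLikePdf` flag is only a heuristic (a PDF header `%PDF-` encodes to base64 beginning with `JVBERi`).

```typescript
type PdfInput =
  | { kind: "url"; url: string }
  | { kind: "base64"; data: string; looksLikePdf: boolean };

const DATA_URL_PREFIX = "data:application/pdf;base64,";

// Hypothetical helper: distinguishes a public URL from base64 input and
// strips the data-URL prefix so both base64 forms end up as raw base64.
function classifyPdfInput(input: string): PdfInput {
  const value = input.trim();
  if (/^https?:\/\//i.test(value)) {
    return { kind: "url", url: value };
  }
  const data =
    value.indexOf(DATA_URL_PREFIX) === 0
      ? value.slice(DATA_URL_PREFIX.length)
      : value;
  // Heuristic only: "%PDF-" at byte 0 encodes to a "JVBERi" prefix.
  const looksLikePdf = data.indexOf("JVBERi") === 0;
  return { kind: "base64", data, looksLikePdf };
}
```

For example, `classifyPdfInput("data:application/pdf;base64,JVBERi0xLjQK")` yields a `base64` result whose `data` is `"JVBERi0xLjQK"`.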
Multi-Page Handling
How It Works
The V4 system prompt explicitly instructs models to process all pages.
Extracting Across Pages
Validating Page Count
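A minimal sketch of a page-count check, assuming the response exposes `meta.pages_processed` as referenced elsewhere on this page. The `OcrResult` interface is a reduced, hypothetical view of the response, and `expectedPages` is a value the caller must supply (for example from a PDF library or upload metadata).

```typescript
// Reduced, hypothetical view of the response: only the fields used here.
interface OcrResult {
  items: unknown[];
  meta: { pages_processed: number };
}

interface PageCheck {
  ok: boolean;
  missingPages: number;
}

// Compares the pages the model reports processing with the expected count.
function checkPageCount(result: OcrResult, expectedPages: number): PageCheck {
  const processed = result.meta.pages_processed;
  const missingPages = Math.max(0, expectedPages - processed);
  return { ok: missingPages === 0, missingPages };
}
```

A result with `missingPages > 0` is a signal to retry with another engine or to flag the invoice for review, since line items on the skipped pages will be absent.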
Performance Optimization
1. Use Annotations for Re-Processing
OpenRouter returns annotations after the first parse. Pass them back to skip re-parsing.
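One way to reuse annotations is to keep them in a small cache keyed by a stable identifier for the PDF. The sketch below is an in-memory illustration only: `AnnotationCache` and `pdfKey` (for example a content hash) are hypothetical, annotations are treated as opaque JSON, and a real deployment would persist them in a database.

```typescript
// Annotations are stored and replayed unchanged, so their shape is left opaque.
type Annotations = unknown;

class AnnotationCache {
  private entries: { [pdfKey: string]: Annotations } = {};

  // Returns cached annotations for this PDF, or undefined on first sight.
  get(pdfKey: string): Annotations | undefined {
    return Object.prototype.hasOwnProperty.call(this.entries, pdfKey)
      ? this.entries[pdfKey]
      : undefined;
  }

  // Call after the first successful parse to remember the annotations.
  set(pdfKey: string, annotations: Annotations): void {
    this.entries[pdfKey] = annotations;
  }
}
```

Before a request, look up `cache.get(pdfKey)` and include the result if present; after a first-time parse, call `cache.set(pdfKey, annotations)` with whatever the response returned.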
2. Choose the Right Engine
| Document Type | Recommended Engine | Avg. Time (5 pages) |
|---|---|---|
| Digital invoices (software-generated) | pdf-text | 3-5 seconds |
| Scanned invoices (clear) | mistral-ocr | 20-30 seconds |
| Scanned invoices (poor quality) | mistral-ocr + retry | 30-45 seconds |
| Mixed (some digital, some scanned) | pdf-text → mistral-ocr fallback | 5-30 seconds |
3. Implement Smart Fallback
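The fallback row in the table above can be sketched as a decision function: start with pdf-text, and retry with mistral-ocr only when the result looks incomplete. The incompleteness heuristic (no items, or fewer pages processed than expected) is an assumption drawn from the troubleshooting symptoms on this page, and `OcrResult`, `looksIncomplete`, and `nextEngine` are hypothetical names.

```typescript
type PdfEngine = "pdf-text" | "mistral-ocr";

// Reduced, hypothetical view of the response: only the fields used here.
interface OcrResult {
  items: unknown[];
  meta: { pages_processed: number };
}

// Heuristic: treat a result as incomplete when it has no line items
// or reports fewer processed pages than the caller expected.
function looksIncomplete(result: OcrResult, expectedPages: number): boolean {
  return result.items.length === 0 || result.meta.pages_processed < expectedPages;
}

// Returns the engine to retry with, or null when the result is acceptable
// or there is no further engine to fall back to.
function nextEngine(
  current: PdfEngine,
  result: OcrResult,
  expectedPages: number
): PdfEngine | null {
  if (!looksIncomplete(result, expectedPages)) return null;
  return current === "pdf-text" ? "mistral-ocr" : null;
}
```

In use, the caller sends the first request with pdf-text, passes the response to `nextEngine`, and repeats the request once if it returns a non-null engine. Keeping the decision separate from the network call makes the retry policy easy to test and adjust.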
Troubleshooting
Missing Items from Later Pages
Symptoms:
- items.length is less than expected
- meta.pages_processed < actual page count
- Reconciliation error is very large
Solutions:
- Check if the PDF has embedded text. If output is garbled/empty → use mistral-ocr
- Verify page count in response
- Try native mode (model-dependent)
HSN Table Not Detected
Symptoms:
- printed.hsn_tax_table is an empty array
- Reconciliation uses fallback logic (worse accuracy)
Solutions:
- Ensure the HSN table is on a page that was processed
- Try mistral-ocr for better table detection
- Check if the table is labeled (“HSN/SAC Summary”, “Tax Details”):
  - Models look for these keywords
  - If missing, the table might not be recognized
Slow Processing (> 60 seconds)
Causes:
- Large PDF (> 10 pages)
- Using mistral-ocr on digital PDF (overkill)
- Model timeout/overload
Solutions:
- Split PDF into smaller chunks
- Switch to pdf-text for digital PDFs
- Use annotations for subsequent requests
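For the chunking step, the page ranges can be computed up front. This is a small sketch, assuming a chunk size of 10 pages (taken from the "> 10 pages" threshold listed under Causes); `chunkPageRanges` is a hypothetical helper, and actually cutting the PDF requires a PDF library that is not shown here.

```typescript
// Page ranges are 1-indexed and inclusive.
interface PageRange {
  start: number;
  end: number;
}

// Splits a document of `totalPages` into consecutive ranges of at most
// `maxPerChunk` pages. The last range may be shorter.
function chunkPageRanges(totalPages: number, maxPerChunk: number = 10): PageRange[] {
  if (totalPages < 0 || maxPerChunk < 1) {
    throw new RangeError("totalPages must be >= 0 and maxPerChunk >= 1");
  }
  const ranges: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += maxPerChunk) {
    ranges.push({ start, end: Math.min(start + maxPerChunk - 1, totalPages) });
  }
  return ranges;
}
```

A 23-page PDF with the default size produces the ranges 1-10, 11-20, and 21-23. Results from each chunk then need to be merged by the caller before reconciliation.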
Poor Quality on Scanned PDFs
Symptoms:
- Gibberish text in items
- Numbers are incorrect
- Confidence scores < 0.7
Solutions:
- Force mistral-ocr
- Pre-process PDF:
  - Deskew rotated pages
  - Increase DPI (300+ recommended)
  - Convert to grayscale
  - Remove noise/artifacts
- Try a different model. Some models handle OCR output better than others.
Best Practices
Use pdf-text by Default
Start with pdf-text for speed and cost. Fall back to mistral-ocr only if extraction quality is poor.
Cache Annotations
Store OpenRouter annotations in your database. Reuse them for re-processing the same PDF with different schemas.
Validate Page Count
Always check that meta.pages_processed matches the expected page count. Missing pages = missing data.
Monitor Reconciliation Errors
Track reconciliation.error_absolute across PDFs. A sudden increase may indicate PDF quality issues.
Related Topics
- OCR Modes: Choose between Raw, Structured, and V4 modes
- Reconciliation Engine: How multi-page HSN tables anchor reconciliation
