Overview

Invoice OCR supports PDF processing via OpenRouter’s file-parser plugin. The plugin extracts text and structure from PDFs before vision model analysis, enabling accurate extraction from multi-page invoices.
PDF support is available in all three OCR modes: /api/ocr, /api/ocr-structured, and /api/ocr-structured-v4.

Quick Start

1. Upload PDF

Send PDF as base64 or public URL:
const pdfBase64 = await fileToBase64(pdfFile);

const response = await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pdfBase64,
    filename: 'invoice.pdf'
  })
});

2. Choose Engine

Configure PDF parsing engine (optional):
Set in .env.local:
OPENROUTER_PDF_ENGINE=pdf-text

3. Process Response

Handle multi-page metadata:
const doc: V4Doc = await response.json();

console.log(`Processed ${doc.meta.pages_processed} pages`);
console.log(`Total items: ${doc.items.length}`);
console.log(`HSN table rows: ${doc.printed.hsn_tax_table.length}`);

File-Parser Plugin

What It Does

The file-parser plugin pre-processes PDFs before they reach the vision model:
  1. Text Extraction: Pulls embedded text from PDF structure
  2. Layout Analysis: Detects tables, headers, line items
  3. Page Segmentation: Splits multi-page documents into logical sections
  4. Metadata Enrichment: Adds bounding boxes, confidence scores, page numbers
This pre-processing significantly improves extraction quality for complex invoices with tables, multiple pages, and dense layouts.

Configuration

// Default configuration (applied automatically)
const defaultPlugin = {
  id: "file-parser",
  pdf: { engine: "pdf-text" }
};

// Custom configuration
const customPlugin = {
  id: "file-parser",
  pdf: {
    engine: "mistral-ocr", // or "native"
    // Additional options (OpenRouter-specific)
  }
};
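
To show where the plugin object sits in a full request, here is a minimal sketch of an OpenRouter chat-completions body. The helper name `buildPdfRequestBody`, the prompt text, and the default model are placeholders for illustration, not this project's code. The `file` content part follows OpenRouter's PDF input format as commonly documented, so check it against the current API reference before relying on it.

```typescript
// Sketch only: a hypothetical helper showing where `plugins` goes in the request body.
type PdfEngine = "pdf-text" | "mistral-ocr" | "native";

interface FileParserPlugin {
  id: "file-parser";
  pdf: { engine: PdfEngine };
}

function buildPdfRequestBody(
  pdfDataUrl: string,
  filename: string,
  engine: PdfEngine = "pdf-text", // same default as the configuration above
  model: string = "google/gemini-2.0-flash", // placeholder model
) {
  const plugin: FileParserPlugin = { id: "file-parser", pdf: { engine } };
  return {
    model,
    messages: [
      {
        role: "user" as const,
        content: [
          // Placeholder prompt; the real system prompt is not shown here.
          { type: "text" as const, text: "Extract the invoice fields." },
          // Assumed shape of the PDF attachment part.
          { type: "file" as const, file: { filename, file_data: pdfDataUrl } },
        ],
      },
    ],
    plugins: [plugin],
  };
}
```

Calling `buildPdfRequestBody(dataUrl, "invoice.pdf", "mistral-ocr")` yields a body whose `plugins[0].pdf.engine` is `"mistral-ocr"`. Omitting the engine keeps the `pdf-text` default.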

PDF Engines

pdf-text (Default): Fast text extraction from PDF structure

mistral-ocr (High Accuracy): OCR-based extraction for scanned PDFs

native (Model-Specific): Uses the vision model’s native PDF support

Engine Comparison

| Engine | Best For | Speed | Quality | Cost |
|---|---|---|---|---|
| pdf-text | Digital PDFs with embedded text | ⚡ Fast | ⭐⭐⭐⭐ (digital), ⭐⭐ (scanned) | 💰 Low |
| mistral-ocr | Scanned PDFs, images of documents | 🐌 Slow | ⭐⭐⭐⭐⭐ | 💰💰💰 High |
| native | Models with built-in PDF support | ⚡⚡ Very Fast | ⭐⭐⭐⭐ (varies) | 💰💰 Medium |

When to Use Each

pdf-text

Use Cases:
  • Invoices generated from billing software (QuickBooks, SAP, etc.)
  • PDFs exported from Word/Excel
  • Digital documents with selectable text
Advantages:
  • Fast processing (< 2 seconds per page)
  • Low cost
  • Preserves text structure and formatting
Limitations:
  • Poor quality on scanned/photographed documents
  • May miss handwritten annotations
  • Requires embedded text in PDF
Example:
# Set in .env.local
OPENROUTER_PDF_ENGINE=pdf-text
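
One way a server might read this setting is sketched below. The function name `resolvePdfEngine` and the fall-back-to-default behaviour for unknown values are assumptions for illustration. Only the three engine names and the `pdf-text` default come from this page.

```typescript
// Hypothetical helper: validates the OPENROUTER_PDF_ENGINE value.
const PDF_ENGINES = ["pdf-text", "mistral-ocr", "native"] as const;
type PdfEngine = (typeof PDF_ENGINES)[number];

function resolvePdfEngine(raw: string | undefined): PdfEngine {
  const value = (raw ?? "").trim().toLowerCase();
  // Assumed behaviour: missing or unrecognized values fall back to the default engine.
  return (PDF_ENGINES as readonly string[]).includes(value)
    ? (value as PdfEngine)
    : "pdf-text";
}

// Typical call site on the server:
//   const engine = resolvePdfEngine(process.env.OPENROUTER_PDF_ENGINE);
```

Falling back silently keeps requests working when the variable is mistyped. A stricter variant could throw instead so that configuration errors surface early.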

Input Formats

Base64:

function fileToBase64(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => {
      const result = reader.result as string;
      // Result is already a data URL: "data:application/pdf;base64,..."
      resolve(result);
    };
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });
}

const pdfBase64 = await fileToBase64(pdfFile);

await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfBase64, // "data:application/pdf;base64,..."
    filename: pdfFile.name
  })
});
The API accepts both:
  • Data URLs: "data:application/pdf;base64,JVBERi0xLjQK..."
  • Raw base64: "JVBERi0xLjQK..."
Both are automatically normalized to data URLs internally.
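
The normalization step could look roughly like the sketch below. The name `toPdfDataUrl` and the whitespace stripping are assumptions, since the actual server-side code is not shown on this page.

```typescript
// Sketch of normalizing either accepted input form to a PDF data URL.
const PDF_DATA_URL_PREFIX = "data:application/pdf;base64,";

function toPdfDataUrl(input: string): string {
  const trimmed = input.trim();
  if (trimmed.startsWith("data:")) {
    // Already a data URL: pass it through unchanged.
    return trimmed;
  }
  // Raw base64: drop any line breaks an encoder may have inserted, then add the prefix.
  return PDF_DATA_URL_PREFIX + trimmed.replace(/\s+/g, "");
}
```

With this helper, `toPdfDataUrl("JVBERi0xLjQK")` and `toPdfDataUrl("data:application/pdf;base64,JVBERi0xLjQK")` produce the same string.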

Public URL:

// For publicly accessible PDFs
await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfUrl: 'https://example.com/invoices/2024/january/invoice-001.pdf',
    model: 'google/gemini-2.0-flash'
  })
});
The URL must be publicly accessible (no authentication required), because OpenRouter servers fetch the PDF directly. For private PDFs, use base64 encoding instead.

Multi-Page Handling

How It Works

The V4 system prompt explicitly instructs models to process all pages:
const SYSTEM_PROMPT = `
# Response Rules
- If a PDF is provided, consider ALL pages.
- Prefer HSN tables and tax summaries found on any page as anchors.
...
`;

Extracting Across Pages

1. Page 1: Line Items

{
  "items": [
    { "name": "Laptop", "qty": 2, "rate_ex_tax": 45000 },
    { "name": "Mouse", "qty": 10, "rate_ex_tax": 500 }
  ]
}

2. Page 2: HSN Tax Table

{
  "printed": {
    "hsn_tax_table": [
      {
        "hsn": "8471",
        "taxable_value": 95000,
        "cgst_rate": 9,
        "sgst_rate": 9,
        "cgst_amount": 8550,
        "sgst_amount": 8550
      }
    ]
  }
}

3. Reconciliation

The engine uses the HSN table from Page 2 to anchor items from Page 1:
// Scale items to match HSN table taxable value
const printedTaxable = 95000;
const computedTaxable = 2 * 45000 + 10 * 500; // = 95000 ✓

// No scaling needed, perfect match
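
The same check can be written as a small function. This is a sketch under one assumption: that "anchoring" means comparing the summed line-item value with the summed HSN taxable value and scaling proportionally. The real reconciliation engine is not shown here and may do more. The names `LineItem`, `HsnRow`, and `taxableScaleFactor` are illustrative.

```typescript
// Field names mirror the JSON examples above.
interface LineItem {
  name: string;
  qty: number;
  rate_ex_tax: number;
}

interface HsnRow {
  hsn: string;
  taxable_value: number;
}

// Returns the factor by which item values would need scaling to match the printed table.
function taxableScaleFactor(items: LineItem[], hsnRows: HsnRow[]): number {
  const computed = items.reduce((sum, item) => sum + item.qty * item.rate_ex_tax, 0);
  const printed = hsnRows.reduce((sum, row) => sum + row.taxable_value, 0);
  // Without a usable anchor on either side, leave the items unscaled.
  if (computed === 0 || printed === 0) return 1;
  return printed / computed;
}

const exampleItems: LineItem[] = [
  { name: "Laptop", qty: 2, rate_ex_tax: 45000 },
  { name: "Mouse", qty: 10, rate_ex_tax: 500 },
];
const exampleFactor = taxableScaleFactor(exampleItems, [{ hsn: "8471", taxable_value: 95000 }]);
// exampleFactor is 1: the items already match the printed taxable value.
```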

Validating Page Count

const doc: V4Doc = await response.json();

if (doc.meta.pages_processed < expectedPageCount) {
  console.warn('Not all pages were processed');
  console.log(`Expected: ${expectedPageCount}, Got: ${doc.meta.pages_processed}`);
  
  // Try with mistral-ocr engine
  const retryResponse = await fetch('/api/ocr-structured-v4', {
    method: 'POST',
    body: JSON.stringify({
      pdfBase64,
      plugins: [{ id: 'file-parser', pdf: { engine: 'mistral-ocr' } }]
    })
  });
}

Performance Optimization

1. Use Annotations for Re-Processing

OpenRouter returns annotations after the first parse. Pass them back to skip re-parsing:
// First request: Full parse
const firstResponse = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'google/gemini-2.0-flash',
    messages: [...],
    plugins: [{ id: 'file-parser', pdf: { engine: 'pdf-text' } }]
  })
});

const firstData = await firstResponse.json();
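// Annotations are attached to the assistant message, not the top-level response
const annotations = firstData.choices?.[0]?.message?.annotations; // Save this!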

// Second request: Use annotations to skip re-parse
const secondResponse = await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfBase64: samePdf,
    annotations, // Reuse parse results
    // Different prompt/schema can be used without re-parsing
  })
});
Annotations save 5-10 seconds and reduce costs by 50-80% for large PDFs. Store them if you need to re-process the same PDF with different parameters.
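
A minimal sketch of storing annotations per PDF follows. It is an in-memory cache keyed by a content hash. The class name `AnnotationCache` is hypothetical, and the FNV-1a hash is used only to keep the example dependency-free. A production version would more likely use SHA-256 and a database.

```typescript
// FNV-1a (32-bit): a small non-cryptographic hash, used here only to build a cache key.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class AnnotationCache<T = unknown> {
  private store = new Map<string, T>();

  // Combine length and hash to lower the chance of key collisions.
  private key(pdfBase64: string): string {
    return `${pdfBase64.length}:${fnv1a(pdfBase64)}`;
  }

  get(pdfBase64: string): T | undefined {
    return this.store.get(this.key(pdfBase64));
  }

  set(pdfBase64: string, annotations: T): void {
    this.store.set(this.key(pdfBase64), annotations);
  }
}
```

Before the second request, look up `cache.get(pdfBase64)` and include the result as `annotations` only when it is defined.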

2. Choose the Right Engine

| Document Type | Recommended Engine | Avg. Time (5 pages) |
|---|---|---|
| Digital invoices (software-generated) | pdf-text | 3-5 seconds |
| Scanned invoices (clear) | mistral-ocr | 20-30 seconds |
| Scanned invoices (poor quality) | mistral-ocr + retry | 30-45 seconds |
| Mixed (some digital, some scanned) | pdf-text, with mistral-ocr fallback | 5-30 seconds |
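
The table can be expressed as an ordered list of engines to try per document type. This is a sketch only: the `DocumentKind` labels and the `enginePlan` function are illustrative names, and how a document gets classified is left to the caller.

```typescript
type PdfEngine = "pdf-text" | "mistral-ocr" | "native";
type DocumentKind = "digital" | "scanned-clear" | "scanned-poor" | "mixed";

// Returns the engines to try, in order, for a given document type.
function enginePlan(kind: DocumentKind): PdfEngine[] {
  switch (kind) {
    case "digital":
      return ["pdf-text"];
    case "scanned-clear":
      return ["mistral-ocr"];
    case "scanned-poor":
      // "mistral-ocr + retry": the same engine, attempted twice.
      return ["mistral-ocr", "mistral-ocr"];
    case "mixed":
      // Fast engine first, OCR only as a fallback.
      return ["pdf-text", "mistral-ocr"];
  }
}
```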

3. Implement Smart Fallback

async function extractInvoice(pdfBase64: string): Promise<V4Doc> {
  // Try fast engine first
  let response = await fetch('/api/ocr-structured-v4', {
    method: 'POST',
    body: JSON.stringify({
      pdfBase64,
      plugins: [{ id: 'file-parser', pdf: { engine: 'pdf-text' } }]
    })
  });
  
  let doc: V4Doc = await response.json();
  
  // Check extraction quality
  if (doc.items.length === 0 || doc.reconciliation.error_absolute > 100) {
    console.log('Poor quality with pdf-text, retrying with mistral-ocr...');
    
    response = await fetch('/api/ocr-structured-v4', {
      method: 'POST',
      body: JSON.stringify({
        pdfBase64,
        plugins: [{ id: 'file-parser', pdf: { engine: 'mistral-ocr' } }]
      })
    });
    
    doc = await response.json();
  }
  
  return doc;
}

Troubleshooting

Missing items or pages

Symptoms:
  • items.length is less than expected
  • meta.pages_processed < actual page count
  • Reconciliation error is very large
Solutions:
  1. Check if PDF has embedded text:
    pdftotext invoice.pdf - | head
    
    If output is garbled/empty → use mistral-ocr
  2. Verify page count in response:
    if (doc.meta.pages_processed < pdfPageCount) {
      // Try mistral-ocr or native
    }
    
  3. Try native mode (model-dependent):
    {
      "plugins": [{ "id": "file-parser", "pdf": { "engine": "native" } }]
    }
    
HSN table not detected

Symptoms:
  • printed.hsn_tax_table is empty array
  • Reconciliation uses fallback logic (worse accuracy)
Solutions:
  1. Ensure HSN table is on a page that was processed:
    console.log(`Pages processed: ${doc.meta.pages_processed}`);
    
  2. Try mistral-ocr for better table detection:
    { "plugins": [{ "id": "file-parser", "pdf": { "engine": "mistral-ocr" } }] }
    
  3. Check if table is labeled (“HSN/SAC Summary”, “Tax Details”):
    • Models look for these keywords
    • If missing, table might not be recognized

Slow processing

Causes:
  • Large PDF (> 10 pages)
  • Using mistral-ocr on digital PDF (overkill)
  • Model timeout/overload
Solutions:
  1. Split PDF into smaller chunks:
    // Process invoice page + HSN table page separately
    const mainPages = await extractPages(pdf, [0, 1]);
    const tablePage = await extractPages(pdf, [2]);
    
  2. Switch to pdf-text for digital PDFs:
    OPENROUTER_PDF_ENGINE=pdf-text
    
  3. Use annotations for subsequent requests:
    const { annotations } = await firstRequest();
    await secondRequest({ annotations }); // Much faster
    
Garbled or inaccurate text

Symptoms:
  • Gibberish text in items
  • Numbers are incorrect
  • Confidence scores < 0.7
Solutions:
  1. Force mistral-ocr:
    { "plugins": [{ "id": "file-parser", "pdf": { "engine": "mistral-ocr" } }] }
    
  2. Pre-process PDF:
    • Deskew rotated pages
    • Increase DPI (300+ recommended)
    • Convert to grayscale
    • Remove noise/artifacts
  3. Try different model:
    {
      "model": "anthropic/claude-3.5-sonnet",
      "plugins": [{ "id": "file-parser", "pdf": { "engine": "mistral-ocr" } }]
    }
    
    Some models handle OCR output better than others.

Best Practices

Use pdf-text by Default

Start with pdf-text for speed and cost. Fall back to mistral-ocr only if extraction quality is poor.

Cache Annotations

Store OpenRouter annotations in your database. Reuse them for re-processing the same PDF with different schemas.

Validate Page Count

Always check that meta.pages_processed matches the expected page count. Missing pages mean missing data.

Monitor Reconciliation Errors

Track reconciliation.error_absolute across PDFs. A sudden increase may indicate PDF quality issues.
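
One simple way to detect a sudden increase is to compare each new error with a rolling average of recent ones. The sketch below assumes that approach. The class name `ErrorMonitor`, the window size, the 3x spike factor, and the minimum threshold of 1 are all illustrative choices, not values from this project.

```typescript
// Flags a document whose reconciliation error is far above the recent average.
class ErrorMonitor {
  private recent: number[] = [];
  private readonly windowSize: number;
  private readonly spikeFactor: number;
  private readonly minSamples: number;

  constructor(windowSize = 20, spikeFactor = 3, minSamples = 5) {
    this.windowSize = windowSize;
    this.spikeFactor = spikeFactor;
    this.minSamples = minSamples;
  }

  // Records one error_absolute value; returns true if it looks like a sudden increase.
  record(errorAbsolute: number): boolean {
    const n = this.recent.length;
    const mean = n > 0 ? this.recent.reduce((a, b) => a + b, 0) / n : 0;
    // Require some history, and a floor of 1 so tiny averages do not trigger alerts.
    const isSpike =
      n >= this.minSamples && errorAbsolute > Math.max(mean * this.spikeFactor, 1);
    this.recent.push(errorAbsolute);
    if (this.recent.length > this.windowSize) this.recent.shift();
    return isSpike;
  }
}
```

Call `monitor.record(doc.reconciliation.error_absolute)` after each extraction and log or alert when it returns `true`.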
