PDF Support

Overview

Invoice OCR supports PDF processing via OpenRouter’s file-parser plugin. The plugin extracts text and structure from PDFs before vision model analysis, enabling accurate extraction from multi-page invoices.

PDF support is available in all three OCR modes: /api/ocr, /api/ocr-structured, and /api/ocr-structured-v4.

Quick Start

Upload PDF

Send the PDF as base64 or as a public URL:

const pdfBase64 = await fileToBase64(pdfFile);

const response = await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pdfBase64,
    filename: 'invoice.pdf'
  })
});

Choose Engine

Configure the PDF parsing engine (optional):

Environment Variable (Global)

Set in .env.local:

OPENROUTER_PDF_ENGINE=pdf-text

Request Body (Per-Request)

Override per request:

{
  "pdfBase64": "...",
  "plugins": [
    {
      "id": "file-parser",
      "pdf": { "engine": "mistral-ocr" }
    }
  ]
}

Process Response

Handle multi-page metadata:

const doc: V4Doc = await response.json();

console.log(`Processed ${doc.meta.pages_processed} pages`);
console.log(`Total items: ${doc.items.length}`);
console.log(`HSN table rows: ${doc.printed.hsn_tax_table.length}`);
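The V4Doc type is used above but not defined on this page. As a rough sketch, the shape below covers only the fields this page references (meta.pages_processed, items, printed.hsn_tax_table, reconciliation.error_absolute). The names V4DocSketch and isV4DocSketch are hypothetical, and the real type very likely carries more fields.

```typescript
// Hypothetical, partial shape: only the fields referenced on this page.
interface V4DocSketch {
  meta: { pages_processed: number };
  items: Array<{ name: string; qty: number; rate_ex_tax: number }>;
  printed: {
    hsn_tax_table: Array<{
      hsn: string;
      taxable_value: number;
      cgst_rate: number;
      sgst_rate: number;
      cgst_amount: number;
      sgst_amount: number;
    }>;
  };
  reconciliation?: { error_absolute: number };
}

// Runtime guard so a malformed response fails early instead of throwing
// "cannot read properties of undefined" further down the pipeline.
function isV4DocSketch(value: unknown): value is V4DocSketch {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, any>;
  return (
    typeof v.meta?.pages_processed === "number" &&
    Array.isArray(v.items) &&
    Array.isArray(v.printed?.hsn_tax_table)
  );
}
```

Checking the parsed JSON with a guard like this before reading doc.items keeps error handling in one place.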

File-Parser Plugin

What It Does

The file-parser plugin pre-processes PDFs before they reach the vision model:

Text Extraction: Pulls embedded text from PDF structure
Layout Analysis: Detects tables, headers, line items
Page Segmentation: Splits multi-page documents into logical sections
Metadata Enrichment: Adds bounding boxes, confidence scores, page numbers

This pre-processing significantly improves extraction quality for complex invoices with tables, multiple pages, and dense layouts.

Configuration

// Default configuration (applied automatically)
const defaultPlugin = {
  id: "file-parser",
  pdf: { engine: "pdf-text" }
};

// Custom configuration
const customPlugin = {
  id: "file-parser",
  pdf: {
    engine: "mistral-ocr", // or "native"
    // Additional options (OpenRouter-specific)
  }
};
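How the three configuration sources combine is not shown on this page. A minimal sketch, assuming the precedence implied by the "Global" and "Per-Request" labels (request body first, then OPENROUTER_PDF_ENGINE, then the pdf-text default); resolvePdfEngine and the types below are hypothetical names, not part of the API:

```typescript
type PdfEngine = "pdf-text" | "mistral-ocr" | "native";

interface FileParserPlugin {
  id: string;
  pdf?: { engine?: string };
}

const KNOWN_ENGINES: readonly string[] = ["pdf-text", "mistral-ocr", "native"];

function isPdfEngine(value: unknown): value is PdfEngine {
  return typeof value === "string" && KNOWN_ENGINES.includes(value);
}

// Assumed precedence: request body > environment variable > "pdf-text".
// Unknown engine names are ignored rather than forwarded.
function resolvePdfEngine(
  requestPlugins: FileParserPlugin[] | undefined,
  envEngine: string | undefined,
): PdfEngine {
  const fromRequest = requestPlugins?.find((p) => p.id === "file-parser")?.pdf?.engine;
  if (isPdfEngine(fromRequest)) return fromRequest;
  if (isPdfEngine(envEngine)) return envEngine;
  return "pdf-text";
}
```

On the server this could be called as resolvePdfEngine(body.plugins, process.env.OPENROUTER_PDF_ENGINE).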

PDF Engines

pdf-text

Default. Fast text extraction from PDF structure.

mistral-ocr

High accuracy. OCR-based extraction for scanned PDFs.

native

Model-specific. Uses the vision model’s native PDF support.

Engine Comparison

| Engine | Best For | Speed | Quality | Cost |
| --- | --- | --- | --- | --- |
| pdf-text | Digital PDFs with embedded text | ⚡ Fast | ⭐⭐⭐⭐ (digital), ⭐⭐ (scanned) | 💰 Low |
| mistral-ocr | Scanned PDFs, images of documents | 🐌 Slow | ⭐⭐⭐⭐⭐ | 💰💰💰 High |
| native | Models with built-in PDF support | ⚡⚡ Very Fast | ⭐⭐⭐⭐ (varies) | 💰💰 Medium |

When to Use Each

pdf-text (Default)

Use Cases:

Invoices generated from billing software (QuickBooks, SAP, etc.)
PDFs exported from Word/Excel
Digital documents with selectable text

Advantages:

Fast processing (< 2 seconds per page)
Low cost
Preserves text structure and formatting

Limitations:

Poor quality on scanned/photographed documents
May miss handwritten annotations
Requires embedded text in PDF

Example:

# Set in .env.local
OPENROUTER_PDF_ENGINE=pdf-text

mistral-ocr

Use Cases:

Scanned paper invoices
Photos/screenshots of invoices
Low-quality PDFs with poor text embedding
Handwritten or mixed content

Advantages:

Highest accuracy on scanned documents
Handles rotated/skewed pages
Recognizes handwritten notes

Limitations:

Slow (5-10 seconds per page)
Higher cost (3-5× vs. pdf-text)
May introduce OCR errors

Example:

{
  "pdfBase64": "...",
  "plugins": [
    {
      "id": "file-parser",
      "pdf": { "engine": "mistral-ocr" }
    }
  ]
}

native

Use Cases:

Models that natively support PDF input (e.g., Claude, some Gemini variants)
When you want to bypass file-parser for debugging
Cost optimization (skips plugin fees)

Advantages:

Very fast (model-dependent)
No plugin overhead
May leverage model-specific optimizations

Limitations:

Not all models support native mode
Quality varies by model
Less control over pre-processing

Example:

const response = await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfUrl: 'https://example.com/invoice.pdf',
    model: 'anthropic/claude-3.5-sonnet',
    plugins: [{ id: 'file-parser', pdf: { engine: 'native' } }]
  })
});

Input Formats

1. Base64 (Recommended for Browser Uploads)

function fileToBase64(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => {
      const result = reader.result as string;
      // Result is already a data URL: "data:application/pdf;base64,..."
      resolve(result);
    };
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });
}

const pdfBase64 = await fileToBase64(pdfFile);

await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfBase64, // "data:application/pdf;base64,..."
    filename: pdfFile.name
  })
});

The API accepts both:

Data URLs: "data:application/pdf;base64,JVBERi0xLjQK..."
Raw base64: "JVBERi0xLjQK..."

Both are automatically normalized to data URLs internally.
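The server-side normalization code is not shown here. A minimal sketch of what such a step might look like, assuming raw base64 only needs the application/pdf data URL prefix; toPdfDataUrl is a hypothetical helper name:

```typescript
const PDF_DATA_URL_PREFIX = "data:application/pdf;base64,";

// Accepts either a data URL or raw base64 and always returns a data URL.
function toPdfDataUrl(input: string): string {
  const trimmed = input.trim();
  if (trimmed.startsWith("data:")) return trimmed; // already a data URL
  // Some encoders wrap base64 output across lines; remove that whitespace.
  return PDF_DATA_URL_PREFIX + trimmed.replace(/\s+/g, "");
}
```

Normalizing on the client as well is harmless and makes logged request bodies easier to compare.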

2. Public URL (Recommended for Server-Side)

// For publicly accessible PDFs
await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfUrl: 'https://example.com/invoices/2024/january/invoice-001.pdf',
    model: 'google/gemini-2.0-flash'
  })
});

The URL must be publicly accessible (no authentication required), because OpenRouter servers fetch the PDF directly. For private PDFs, use base64 encoding instead.
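Before sending pdfUrl, a client can screen out URLs that clearly cannot be fetched from outside its network. The check below is a heuristic sketch only: it cannot prove that a URL is reachable or unauthenticated, and shouldSendAsBase64 is a hypothetical helper name.

```typescript
// Returns true when the URL is obviously not fetchable by a remote server,
// in which case base64 upload is the safer choice.
function shouldSendAsBase64(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // not a valid absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  if (url.username !== "" || url.password !== "") return true; // embedded credentials

  const host = url.hostname;
  if (host === "localhost" || host.endsWith(".local")) return true;

  // Private and loopback IPv4 ranges (RFC 1918 plus 127.0.0.0/8).
  const isIpv4 = /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
  if (!isIpv4) return false;
  return (
    /^127\./.test(host) ||
    /^10\./.test(host) ||
    /^192\.168\./.test(host) ||
    /^172\.(1[6-9]|2\d|3[01])\./.test(host)
  );
}
```

A URL that passes this check can still sit behind a login or a signed-URL expiry, so a failed fetch on the OpenRouter side remains possible.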

Multi-Page Handling

How It Works

The V4 system prompt explicitly instructs models to process all pages:

const SYSTEM_PROMPT = `
# Response Rules
- If a PDF is provided, consider ALL pages.
- Prefer HSN tables and tax summaries found on any page as anchors.
...
`;

Extracting Across Pages

Page 1: Line Items

{
  "items": [
    { "name": "Laptop", "qty": 2, "rate_ex_tax": 45000 },
    { "name": "Mouse", "qty": 10, "rate_ex_tax": 500 }
  ]
}

Page 2: HSN Tax Table

{
  "printed": {
    "hsn_tax_table": [
      {
        "hsn": "8471",
        "taxable_value": 95000,
        "cgst_rate": 9,
        "sgst_rate": 9,
        "cgst_amount": 8550,
        "sgst_amount": 8550
      }
    ]
  }
}

Reconciliation

The engine uses the HSN table from Page 2 to anchor items from Page 1:

// Scale items to match HSN table taxable value
const printedTaxable = 95000;
const computedTaxable = 2 * 45000 + 10 * 500; // = 95000 ✓

// No scaling needed, perfect match
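The reconciliation engine's actual scaling rule is not shown on this page. As an illustration only, the sketch below assumes simple proportional scaling (every rate multiplied by printed / computed); computeTaxable and scaleToPrinted are hypothetical names.

```typescript
interface LineItem {
  name: string;
  qty: number;
  rate_ex_tax: number;
}

// Sum of qty * rate over all line items.
function computeTaxable(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + item.qty * item.rate_ex_tax, 0);
}

// Assumed approach: scale every rate by printed / computed so that the
// line items add up to the taxable value printed in the HSN table.
function scaleToPrinted(items: LineItem[], printedTaxable: number): LineItem[] {
  const computed = computeTaxable(items);
  if (computed === 0) return items; // nothing to scale against
  const factor = printedTaxable / computed;
  return items.map((item) => ({ ...item, rate_ex_tax: item.rate_ex_tax * factor }));
}

const items: LineItem[] = [
  { name: "Laptop", qty: 2, rate_ex_tax: 45000 },
  { name: "Mouse", qty: 10, rate_ex_tax: 500 },
];

console.log(computeTaxable(items)); // 95000, so the factor is 1 and nothing changes
console.log((95000 * 9) / 100);     // 8550, matching cgst_amount and sgst_amount
```

If the printed taxable value were 90250 instead, the factor would be 0.95 and each rate would shrink by 5%.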

Validating Page Count

const doc: V4Doc = await response.json();

if (doc.meta.pages_processed < expectedPageCount) {
  console.warn('Not all pages were processed');
  console.log(`Expected: ${expectedPageCount}, Got: ${doc.meta.pages_processed}`);
  
  // Try with mistral-ocr engine
  const retryResponse = await fetch('/api/ocr-structured-v4', {
    method: 'POST',
    body: JSON.stringify({
      pdfBase64,
      plugins: [{ id: 'file-parser', pdf: { engine: 'mistral-ocr' } }]
    })
  });
}

Performance Optimization

1. Use Annotations for Re-Processing

OpenRouter returns annotations after the first parse. Pass them back to skip re-parsing:

// First request: Full parse
const firstResponse = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'google/gemini-2.0-flash',
    messages: [...],
    plugins: [{ id: 'file-parser', pdf: { engine: 'pdf-text' } }]
  })
});

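const firstData = await firstResponse.json();
// OpenRouter attaches file annotations to the assistant message, not to the top level of the response
const annotations = firstData.choices?.[0]?.message?.annotations; // Save this!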

// Second request: Use annotations to skip re-parse
const secondResponse = await fetch('/api/ocr-structured-v4', {
  method: 'POST',
  body: JSON.stringify({
    pdfBase64: samePdf,
    annotations, // Reuse parse results
    // Different prompt/schema can be used without re-parsing
  })
});

Annotations save 5-10 seconds and reduce costs by 50-80% for large PDFs. Store them if you need to re-process the same PDF with different parameters.
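One way to store annotations is to key them on the PDF content rather than the filename. The sketch below is an in-memory stand-in for a database table; AnnotationCache is a hypothetical class, and hashing with SHA-256 is a design choice made for this example, not something the API requires.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory cache; swap the Map for your database of choice.
class AnnotationCache<T = unknown> {
  private readonly store = new Map<string, T>();

  // Key on the base64 payload so a data URL and its raw base64 form,
  // or a renamed copy of the same file, map to the same entry.
  static keyFor(pdfBase64: string): string {
    const payload = pdfBase64.replace(/^data:[^,]*,/, "");
    return createHash("sha256").update(payload).digest("hex");
  }

  get(pdfBase64: string): T | undefined {
    return this.store.get(AnnotationCache.keyFor(pdfBase64));
  }

  set(pdfBase64: string, annotations: T): void {
    this.store.set(AnnotationCache.keyFor(pdfBase64), annotations);
  }
}
```

If annotations turn out to differ between engines, including the engine name in the key would avoid mixing them; that behavior is not confirmed here.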

2. Choose the Right Engine

| Document Type | Recommended Engine | Avg. Time (5 pages) |
| --- | --- | --- |
| Digital invoices (software-generated) | `pdf-text` | 3-5 seconds |
| Scanned invoices (clear) | `mistral-ocr` | 20-30 seconds |
| Scanned invoices (poor quality) | `mistral-ocr` + retry | 30-45 seconds |
| Mixed (some digital, some scanned) | `pdf-text` → `mistral-ocr` fallback | 5-30 seconds |

3. Implement Smart Fallback

async function extractInvoice(pdfBase64: string): Promise<V4Doc> {
  // Try fast engine first
  let response = await fetch('/api/ocr-structured-v4', {
    method: 'POST',
    body: JSON.stringify({
      pdfBase64,
      plugins: [{ id: 'file-parser', pdf: { engine: 'pdf-text' } }]
    })
  });
  
  let doc: V4Doc = await response.json();
  
  // Check extraction quality
  if (doc.items.length === 0 || doc.reconciliation.error_absolute > 100) {
    console.log('Poor quality with pdf-text, retrying with mistral-ocr...');
    
    response = await fetch('/api/ocr-structured-v4', {
      method: 'POST',
      body: JSON.stringify({
        pdfBase64,
        plugins: [{ id: 'file-parser', pdf: { engine: 'mistral-ocr' } }]
      })
    });
    
    doc = await response.json();
  }
  
  return doc;
}

Troubleshooting

Missing Items from Later Pages

Symptoms:

items.length is less than expected
meta.pages_processed < actual page count
Reconciliation error is very large

Solutions:

Check if PDF has embedded text:
```
pdftotext invoice.pdf - | head
```
If output is garbled/empty → use mistral-ocr

Verify page count in response:

if (doc.meta.pages_processed < pdfPageCount) {
  // Try mistral-ocr or native
}

Try native mode (model-dependent):

{
  "plugins": [{ "id": "file-parser", "pdf": { "engine": "native" } }]
}
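The embedded-text check in the first solution can also be automated once the text has been extracted (for example, by capturing the pdftotext output). The sketch below is a heuristic under stated assumptions: hasUsableTextLayer is a hypothetical helper, and the 50-character and 0.8 thresholds are illustrative defaults, not tuned values.

```typescript
// Decides whether extracted text looks usable, or whether the PDF is
// probably a scan and should go through OCR instead.
function hasUsableTextLayer(text: string, minChars = 50, minCleanRatio = 0.8): boolean {
  const visible = text.replace(/\s+/g, "");
  if (visible.length < minChars) return false; // empty or nearly empty

  // Count letters, digits, and common invoice punctuation as "clean".
  const clean = visible.match(/[\p{L}\p{N}.,:;%\/()#&@'"₹$€£+-]/gu)?.length ?? 0;
  return clean / visible.length >= minCleanRatio;
}
```

A false result suggests mistral-ocr; a true result means pdf-text is worth trying first.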

HSN Table Not Detected

Symptoms:

printed.hsn_tax_table is empty array
Reconciliation uses fallback logic (worse accuracy)

Solutions:

Ensure HSN table is on a page that was processed:

console.log(`Pages processed: ${doc.meta.pages_processed}`);

Try mistral-ocr for better table detection:

{ "plugins": [{ "id": "file-parser", "pdf": { "engine": "mistral-ocr" } }] }

Check if table is labeled (“HSN/SAC Summary”, “Tax Details”):
- Models look for these keywords
- If missing, table might not be recognized

Slow Processing (> 60 seconds)

Causes:

Large PDF (> 10 pages)
Using mistral-ocr on digital PDF (overkill)
Model timeout/overload

Solutions:

Split PDF into smaller chunks:

// Process invoice page + HSN table page separately
const mainPages = await extractPages(pdf, [0, 1]);
const tablePage = await extractPages(pdf, [2]);

Switch to pdf-text for digital PDFs:
```
OPENROUTER_PDF_ENGINE=pdf-text
```

Use annotations for subsequent requests:

const { annotations } = await firstRequest();
await secondRequest({ annotations }); // Much faster
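For the splitting approach, the page ranges can be computed up front and then handed to whatever page-extraction helper is in use (extractPages in the snippet above is not defined on this page). A minimal sketch, with chunkPageIndices as a hypothetical helper:

```typescript
// Splits zero-based page indices [0, totalPages) into consecutive chunks
// of at most pagesPerChunk pages each.
function chunkPageIndices(totalPages: number, pagesPerChunk: number): number[][] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(pagesPerChunk) || pagesPerChunk < 1) {
    throw new RangeError("pagesPerChunk must be a positive integer");
  }
  const chunks: number[][] = [];
  for (let start = 0; start < totalPages; start += pagesPerChunk) {
    const end = Math.min(start + pagesPerChunk, totalPages);
    chunks.push(Array.from({ length: end - start }, (_, i) => start + i));
  }
  return chunks;
}

console.log(chunkPageIndices(12, 5)); // [[0,1,2,3,4],[5,6,7,8,9],[10,11]]
```

Splitting can separate line items from the HSN table that anchors them, so the per-chunk results need to be merged before reconciliation is trusted.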

Poor Quality on Scanned PDFs

Symptoms:

Gibberish text in items
Numbers are incorrect
Confidence scores < 0.7

Solutions:

Force mistral-ocr:

{ "plugins": [{ "id": "file-parser", "pdf": { "engine": "mistral-ocr" } }] }

Pre-process PDF:
- Deskew rotated pages
- Increase DPI (300+ recommended)
- Convert to grayscale
- Remove noise/artifacts

Try different model:

{
  "model": "anthropic/claude-3.5-sonnet",
  "plugins": [{ "id": "file-parser", "pdf": { "engine": "mistral-ocr" } }]
}

Some models handle OCR output better than others.

Best Practices

Use pdf-text by Default

Start with pdf-text for speed and cost. Fall back to mistral-ocr only if extraction quality is poor.

Cache Annotations

Store OpenRouter annotations in your database. Reuse them for re-processing the same PDF with different schemas.

Validate Page Count

Always check that meta.pages_processed matches the expected page count. Missing pages mean missing data.

Monitor Reconciliation Errors

Track reconciliation.error_absolute across PDFs. A sudden increase may indicate PDF quality issues.
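What counts as a "sudden increase" is left open here. One simple option, sketched below, compares each new reading against the mean of recent readings; isErrorSpike is a hypothetical helper, and the window of 20 and factor of 3 are illustrative defaults.

```typescript
// Flags a reading as a spike when it exceeds `factor` times the mean of
// the previous `window` readings of reconciliation.error_absolute.
function isErrorSpike(history: number[], latest: number, window = 20, factor = 3): boolean {
  const recent = history.slice(-window);
  if (recent.length === 0) return false; // no baseline yet

  const mean = recent.reduce((sum, v) => sum + v, 0) / recent.length;
  // Floor the baseline at 1 so a run of perfect matches (mean 0) does not
  // make every tiny error look like a spike.
  const baseline = Math.max(mean, 1);
  return latest > baseline * factor;
}
```

A spike on a single PDF points at that document; spikes across many PDFs from one supplier more likely point at a template or scan-quality change.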

Related Topics

OCR Modes: Choose between Raw, Structured, and V4 modes
Reconciliation Engine: How multi-page HSN tables anchor reconciliation
