Model comparison

The Model Comparison tool (/compare) lets you process the same invoice with multiple vision models simultaneously and compare their accuracy, speed, and output quality.

Overview

Compare Page Features

✓ Run up to 10 models concurrently
✓ Real-time progress tracking
✓ Accuracy comparison (grand total, GST total, item count)
✓ Performance metrics (processing time)
✓ Side-by-side result inspection
✓ Popular model quick-select

When to Use

Model Selection

Determine which model works best for your invoice format

Accuracy Testing

Verify extraction accuracy across different models

Performance Benchmarking

Compare processing speeds for different models

Cost Optimization

Balance accuracy vs cost for your use case

Accessing the Tool

Navigate to /compare in your browser:

http://localhost:3000/compare

Or from the main page, use the “Compare Models” button (if available in UI).

How to Use

Select an Invoice

Upload an invoice image or PDF using the file picker or drag-and-drop.

The same invoice will be sent to all selected models for comparison.

Choose Models to Compare

Select 2-10 models from the list.

Quick Select - Popular Models:

Google Gemini 2.5 Flash
Google Gemini 2.5 Pro
Google Gemini 2.0 Flash
OpenAI GPT-4o
OpenAI GPT-4o Mini
Anthropic Claude 3.5 Sonnet

Or browse all available vision models:

Click “Browse all models” to see the full catalog
Filter by provider, context length, or pricing

Start Comparison

Click “Run Comparison” to process the invoice with all selected models.

Models run in parallel, displaying real-time status:

⏳ Pending (queued)
⚡ Running (processing)
✅ Success (completed)
❌ Error (failed)
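The parallel run and its four statuses can be sketched roughly as follows. This is a minimal illustration, not the code in app/compare/page.tsx: `runComparison`, `ModelRun`, and the `processInvoice` callback are hypothetical names, and the callback stands in for whatever request the page actually sends for each model.

```typescript
// Status values mirror the four states listed above.
type Status = "pending" | "running" | "success" | "error";

interface ModelRun<T> {
  model: string;
  status: Status;
  durationMs?: number;
  result?: T;
  error?: string;
}

// `processInvoice` is a hypothetical callback standing in for the real API call.
// `onUpdate` receives a snapshot after every status change (e.g. to refresh the UI).
async function runComparison<T>(
  models: string[],
  processInvoice: (model: string) => Promise<T>,
  onUpdate: (runs: ModelRun<T>[]) => void = () => {},
): Promise<ModelRun<T>[]> {
  const runs = models.map((model): ModelRun<T> => ({ model, status: "pending" }));
  onUpdate([...runs]);

  // Start every model at once; each task catches its own error so one
  // failure does not abort the others.
  await Promise.all(
    runs.map(async (run) => {
      const start = Date.now();
      run.status = "running";
      onUpdate([...runs]);
      try {
        run.result = await processInvoice(run.model);
        run.status = "success";
      } catch (err) {
        run.error = err instanceof Error ? err.message : String(err);
        run.status = "error";
      } finally {
        run.durationMs = Date.now() - start;
        onUpdate([...runs]);
      }
    }),
  );
  return runs;
}
```

Because each task handles its own failure, a model that errors out still ends up as a row in the results with status `error` instead of cancelling the comparison.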

Review Results

Compare the results side-by-side in a table view.

Metrics Compared:

Grand Total Match: Whether computed total equals printed total
Item Count: Number of line items extracted
GST Total Match: Whether GST calculation matches printed GST
Error Absolute: Difference between computed and printed totals (₹)
Processing Time: Time taken to process (milliseconds)

Click on any result row to expand and view the full extracted data.

Result Interpretation

Accuracy Indicators

Perfect Match

✅ Grand Total: Matched
✅ GST Total: Matched
Error: ₹0.00

Model extracted all data correctly and reconciliation succeeded.

Close Match

⚠️ Grand Total: ₹0.50 off
✅ GST Total: Matched
Error: ₹0.50

Minor rounding difference, still acceptable (within ₹1 tolerance).

Significant Error

❌ Grand Total: ₹50.00 off
❌ GST Total: Mismatched
Error: ₹50.00

Model failed to extract correctly. Check the raw output for details.
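One way to read the three indicators is as two thresholds on the absolute error. The sketch below assumes the ₹0.05 match threshold from the reconciliation check under Technical Details and the ₹1 tolerance mentioned above. The function and constant names are hypothetical, and the page may derive its badges differently.

```typescript
type AccuracyLevel = "matched" | "close" | "significant";

const MATCH_TOLERANCE = 0.05; // assumed from the reconciliation check (₹)
const CLOSE_TOLERANCE = 1.0;  // assumed from the "within ₹1 tolerance" note (₹)

// Map an absolute reconciliation error (in rupees) to one of the three indicators.
function classifyError(errorAbsolute: number): AccuracyLevel {
  const err = Math.abs(errorAbsolute);
  if (err <= MATCH_TOLERANCE) return "matched";
  if (err <= CLOSE_TOLERANCE) return "close";
  return "significant";
}
```

With these thresholds, the three examples above (₹0.00, ₹0.50, ₹50.00) fall into `matched`, `close`, and `significant` respectively.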

Performance Metrics

Fast: < 3 seconds
Medium: 3-8 seconds
Slow: > 8 seconds

Processing time includes network latency to OpenRouter. First request may be slower due to cold start.
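Since the results table reports time in milliseconds while the buckets above are in seconds, a small helper can do the conversion. This is a sketch with a hypothetical function name. It treats exactly 3 and 8 seconds as Medium, which the ranges above imply but do not state outright.

```typescript
type SpeedBucket = "fast" | "medium" | "slow";

// Bucket a processing time (milliseconds) using the ranges above.
function speedBucket(durationMs: number): SpeedBucket {
  const seconds = durationMs / 1000;
  if (seconds < 3) return "fast";
  if (seconds <= 8) return "medium";
  return "slow";
}
```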

Popular Model Combinations

Budget Testing

Test cost-effective models:

openai/gpt-4o-mini
anthropic/claude-3-haiku
google/gemini-2.0-flash

Accuracy Validation

Compare premium models:

openai/gpt-4o
anthropic/claude-3.5-sonnet
google/gemini-2.5-pro

Speed vs Quality

Balance speed and accuracy:

google/gemini-2.0-flash (fast, accurate)
openai/gpt-4o-mini (fast, good)
openai/gpt-4o (slower, best)

Use Cases

1. Choosing Your Production Model

Scenario: Selecting a model for 1000 invoices/month

Goal: Balance accuracy and cost

Steps:

Upload a representative sample invoice
Compare models across different price points, for example:
- openai/gpt-4o-mini ($0.001/invoice)
- google/gemini-2.0-flash ($0.002/invoice)
- openai/gpt-4o ($0.020/invoice)
- anthropic/claude-3.5-sonnet ($0.025/invoice)
Review accuracy metrics
Calculate monthly cost: cost per invoice × 1000 invoices
Choose model with best accuracy/cost ratio

Example Result:

gemini-2.0-flash: 98% accuracy, $2/month → Selected
gpt-4o: 99% accuracy, $20/month → Too expensive for 1% gain

2. Debugging Extraction Issues

Scenario: Model extracts incorrect discount amount

Goal: Find which model handles this invoice type best

Steps:

Upload the problematic invoice
Compare 3-4 different models
Inspect which model correctly identifies the discount structure
Use that model for similar invoices going forward

Insight: Some models better understand regional invoice formats or complex discount structures.

3. Validating New Model Versions

Scenario: A new model such as Gemini 2.5 Pro becomes available on OpenRouter

Goal: Compare new model vs current production model

Steps:

Upload 5 representative invoices
Compare current model vs new model
Check if accuracy improved
Verify processing time is acceptable
Update OPENROUTER_MODEL if beneficial

Decision Criteria:

Accuracy improvement > 2%
Processing time increase < 50%
Cost increase < 100%
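These criteria can be expressed as a single check. The sketch below assumes that "accuracy improvement" means percentage points and that time and cost increases are measured relative to the current model. `ModelStats` and `shouldSwitch` are hypothetical names.

```typescript
interface ModelStats {
  accuracyPct: number;    // e.g. 98 for 98%
  avgTimeMs: number;      // average processing time
  costPerInvoice: number; // USD
}

// Returns true only if all three decision criteria above are met.
function shouldSwitch(current: ModelStats, candidate: ModelStats): boolean {
  const accuracyGain = candidate.accuracyPct - current.accuracyPct; // percentage points
  const timeIncrease = (candidate.avgTimeMs - current.avgTimeMs) / current.avgTimeMs;
  const costIncrease =
    (candidate.costPerInvoice - current.costPerInvoice) / current.costPerInvoice;
  return accuracyGain > 2 && timeIncrease < 0.5 && costIncrease < 1.0;
}
```

For example, a candidate that gains 3 accuracy points while being 25% slower and 50% more expensive passes, whereas one that gains only 1 point does not.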

Understanding the Results Table

Model (string): Model name and ID

Status (enum):

pending: Waiting to start
running: Currently processing
success: Completed successfully
error: Failed with error

Time (number): Processing duration in milliseconds

Items (number): Number of line items extracted

Grand Total (string): Whether computed total matches printed total

✅ “Matched” - Perfect match
⚠️ “₹X.XX off” - Close (within tolerance)
❌ “₹X.XX off” - Significant error

GST Total (string): Whether GST calculation is correct

Error (number): Absolute difference in rupees

Actions (button):

View Details: Expand to see full JSON output
Copy JSON: Copy extraction result to clipboard

Technical Details

Implementation: app/compare/page.tsx

API Calls:

Fetches available models from /api/models
Processes invoice via /api/ocr-structured-v4 for each selected model
Runs requests in parallel (concurrent processing)

State Management:

Real-time status updates for each model
Tracks start time and duration
Computes accuracy metrics from reconciliation results

Reconciliation Check:

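const accuracy = {
  // Grand total counts as matched when the reconciliation error is within ₹0.05
  grandTotalMatched: Math.abs(data.reconciliation.error_absolute) <= 0.05,
  // GST total must equal the printed GST exactly (strict equality, no tolerance)
  gstTotalMatched: data.totals.gst_total === data.printed.gst_total,
  // Number of line items the model extracted
  itemCount: data.items.length,
  // Difference between computed and printed totals, as reported by reconciliation
  errorAbsolute: data.reconciliation.error_absolute
};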
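Note that the GST check above uses strict equality, so two totals that differ only by floating-point rounding would be reported as a mismatch. If that turns out to be a problem, a tolerance-based comparison is one possible variant. The helper below is a hypothetical sketch, and reusing the 0.05 threshold from the grand total check is an assumption.

```typescript
// Compare two amounts with a small tolerance instead of strict equality.
// The default of 0.05 mirrors the grand total check; this is an assumed choice.
function approxEqual(a: number, b: number, tolerance: number = 0.05): boolean {
  return Math.abs(a - b) <= tolerance;
}

// Usage in place of the strict comparison:
//   gstTotalMatched: approxEqual(data.totals.gst_total, data.printed.gst_total)
```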

Tips & Best Practices

Select 3-5 models per test

More than 5 models can be overwhelming to compare. Focus on 3-5 targeted models.

Test with diverse invoices

Compare models on multiple invoice types:

Simple single-page invoices
Multi-page PDFs
Scanned images
Invoices with complex discounts

Consider context window

For multi-page PDFs, choose models with larger context windows (100k+ tokens).

Check for consistency

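Run the same invoice multiple times to check if results are stable. Model output can vary between runs, so a model that only sometimes gets an invoice right is a weaker candidate than its best result suggests.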
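A repeat-run check could be sketched as follows, comparing each run against the first on item count and reconciliation error. `RunSummary` and `isStable` are hypothetical names, and the 0.05 tolerance is an assumed default.

```typescript
interface RunSummary {
  itemCount: number;     // line items extracted in this run
  errorAbsolute: number; // reconciliation error in rupees
}

// True if every run agrees with the first on item count and on the
// reconciliation error (within `tolerance`).
function isStable(runs: RunSummary[], tolerance: number = 0.05): boolean {
  if (runs.length < 2) return true;
  const first = runs[0];
  return runs.every(
    (r) =>
      r.itemCount === first.itemCount &&
      Math.abs(r.errorAbsolute - first.errorAbsolute) <= tolerance,
  );
}
```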

Document your findings

Keep a spreadsheet of model performance on different invoice types to inform future decisions.

Limitations

Concurrent Rate Limits: Running many models simultaneously may hit OpenRouter rate limits. Start with 3-4 models at a time.
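If you script comparisons yourself, one way to stay under rate limits is to process models in small batches rather than all at once. The sketch below is not how the compare page works. `runInBatches` and `Outcome` are hypothetical names, and `task` stands in for whatever call processes one model.

```typescript
type Outcome<R> = { ok: true; value: R } | { ok: false; error: string };

// Run `task` over `items`, at most `batchSize` at a time. Each batch must
// finish before the next starts; failures are recorded, not thrown.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<R>,
): Promise<Outcome<R>[]> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const outcomes: Outcome<R>[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const settled = await Promise.all(
      batch.map((item) =>
        task(item).then(
          (value): Outcome<R> => ({ ok: true, value }),
          (err: unknown): Outcome<R> => ({ ok: false, error: String(err) }),
        ),
      ),
    );
    outcomes.push(...settled);
  }
  return outcomes;
}
```

A batch size of 3 or 4 matches the starting point suggested above. Outcomes are returned in the same order as the input models.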

Cost Consideration: Each comparison costs the same as processing N separate invoices (N = number of models). Use selectively.

Related

Model Selection Guide

Learn about available models and their characteristics

GET /api/models

API endpoint for fetching available models

OCR Modes

Understand different extraction modes

Reconciliation Engine

How accuracy is measured
