The Model Comparison tool (/compare) lets you process the same invoice with multiple vision models simultaneously and compare their accuracy, speed, and output quality.

Overview

Compare Page Features
  • ✓ Run up to 10 models concurrently
  • ✓ Real-time progress tracking
  • ✓ Accuracy comparison (grand total, GST total, item count)
  • ✓ Performance metrics (processing time)
  • ✓ Side-by-side result inspection
  • ✓ Popular model quick-select

When to Use

Model Selection

Determine which model works best for your invoice format

Accuracy Testing

Verify extraction accuracy across different models

Performance Benchmarking

Compare processing speeds for different models

Cost Optimization

Balance accuracy vs cost for your use case

Accessing the Tool

Navigate to /compare in your browser:
http://localhost:3000/compare
Alternatively, use the “Compare Models” button on the main page, if it is available in your UI.

How to Use

1. Select an Invoice

Upload an invoice image or PDF using the file picker or drag-and-drop.
The same invoice will be sent to all selected models for comparison.

2. Choose Models to Compare

Select 2-10 models from the list.

Quick Select - Popular Models:
  • Google Gemini 2.5 Flash
  • Google Gemini 2.5 Pro
  • Google Gemini 2.0 Flash
  • OpenAI GPT-4o
  • OpenAI GPT-4o Mini
  • Anthropic Claude 3.5 Sonnet
Or browse all available vision models:
  • Click “Browse all models” to see the full catalog
  • Filter by provider, context length, or pricing

3. Start Comparison

Click “Run Comparison” to process the invoice with all selected models.
Models run in parallel, displaying real-time status:
  • ⏳ Pending (queued)
  • ⚡ Running (processing)
  • ✅ Success (completed)
  • ❌ Error (failed)

4. Review Results

Compare the results side by side in a table view.

Metrics Compared:
  • Grand Total Match: Whether computed total equals printed total
  • Item Count: Number of line items extracted
  • GST Total Match: Whether GST calculation matches printed GST
  • Error Absolute: Difference between computed and printed totals (₹)
  • Processing Time: Time taken to process (milliseconds)
Click on any result row to expand and view the full extracted data.

Result Interpretation

Accuracy Indicators

✅ Grand Total: Matched
✅ GST Total: Matched
Error: ₹0.00
Model extracted all data correctly and reconciliation succeeded.

Performance Metrics

  • Fast: < 3 seconds
  • Medium: 3-8 seconds
  • Slow: > 8 seconds
Processing time includes network latency to OpenRouter. First request may be slower due to cold start.
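
The speed buckets above can be expressed as a small helper. This is a minimal sketch: the function name `classifySpeed` and the `SpeedBucket` type are hypothetical and not part of the tool, and the handling of the exact 3 s and 8 s boundaries (both counted as medium) is an assumption.

```typescript
// Hypothetical helper, not part of the compare page itself.
type SpeedBucket = "fast" | "medium" | "slow";

// Thresholds follow the list above: < 3 s fast, 3-8 s medium, > 8 s slow.
// The Time column is reported in milliseconds, so the input is in ms too.
function classifySpeed(durationMs: number): SpeedBucket {
  if (durationMs < 3000) return "fast";
  if (durationMs <= 8000) return "medium";
  return "slow";
}

console.log(classifySpeed(2400)); // "fast"
console.log(classifySpeed(9100)); // "slow"
```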

Budget Testing

Test cost-effective models:
  • openai/gpt-4o-mini
  • anthropic/claude-3-haiku
  • google/gemini-2.0-flash

Accuracy Validation

Compare premium models:
  • openai/gpt-4o
  • anthropic/claude-3.5-sonnet
  • google/gemini-2.5-pro

Speed vs Quality

Balance speed and accuracy:
  • google/gemini-2.0-flash (fast, accurate)
  • openai/gpt-4o-mini (fast, good)
  • openai/gpt-4o (slower, best)

Use Cases

1. Choosing Your Production Model

Goal: Balance accuracy and cost
Steps:
  1. Upload a representative sample invoice
  2. Compare several models across different price points, for example:
    • openai/gpt-4o-mini ($0.001/invoice)
    • google/gemini-2.0-flash ($0.002/invoice)
    • openai/gpt-4o ($0.020/invoice)
    • anthropic/claude-3.5-sonnet ($0.025/invoice)
  3. Review accuracy metrics
  4. Calculate monthly cost: per-invoice cost × 1000 invoices
  5. Choose the model with the best accuracy/cost ratio
Example Result:
  • gemini-2.0-flash: 98% accuracy, $2/month → Selected
  • gpt-4o: 99% accuracy, $20/month → Too expensive for 1% gain
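
The arithmetic behind this example can be sketched as follows. The per-invoice prices and accuracy figures are the illustrative ones quoted above; the `Candidate` type and both function names are hypothetical helpers, not part of the tool.

```typescript
// Figures taken from the example result above; accuracy is in percent.
interface Candidate {
  model: string;
  costPerInvoiceUsd: number;
  accuracyPct: number;
}

// Monthly cost = per-invoice cost × invoices processed per month.
function monthlyCostUsd(c: Candidate, invoicesPerMonth: number): number {
  return c.costPerInvoiceUsd * invoicesPerMonth;
}

// Extra monthly spend per additional percentage point of accuracy
// when moving from `cheaper` to `pricier`.
// Returns Infinity or NaN if both candidates have the same accuracy.
function extraCostPerAccuracyPoint(
  cheaper: Candidate,
  pricier: Candidate,
  invoicesPerMonth: number,
): number {
  const extraCost =
    monthlyCostUsd(pricier, invoicesPerMonth) - monthlyCostUsd(cheaper, invoicesPerMonth);
  const extraAccuracy = pricier.accuracyPct - cheaper.accuracyPct;
  return extraCost / extraAccuracy;
}

const flash: Candidate = { model: "google/gemini-2.0-flash", costPerInvoiceUsd: 0.002, accuracyPct: 98 };
const gpt4o: Candidate = { model: "openai/gpt-4o", costPerInvoiceUsd: 0.02, accuracyPct: 99 };

console.log(monthlyCostUsd(flash, 1000).toFixed(2)); // "2.00"
console.log(monthlyCostUsd(gpt4o, 1000).toFixed(2)); // "20.00"
console.log(extraCostPerAccuracyPoint(flash, gpt4o, 1000).toFixed(2)); // "18.00"
```

At 1000 invoices per month, the one extra point of accuracy costs about $18 per month in this example, which is the trade-off the "too expensive for 1% gain" verdict reflects.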

2. Debugging Extraction Issues

Goal: Find which model handles this invoice type best
Steps:
  1. Upload the problematic invoice
  2. Compare 3-4 different models
  3. Inspect which model correctly identifies the discount structure
  4. Use that model for similar invoices going forward
Insight: Some models handle regional invoice formats or complex discount structures better than others.

3. Validating New Model Versions

Goal: Compare new model vs current production model
Steps:
  1. Select 5 representative invoices and run each one through the comparison
  2. Compare current model vs new model
  3. Check if accuracy improved
  4. Verify processing time is acceptable
  5. Update OPENROUTER_MODEL if beneficial
Decision Criteria:
  • Accuracy improvement > 2%
  • Processing time increase < 50%
  • Cost increase < 100%
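
The decision criteria above can be encoded as a single check. This is a sketch under stated assumptions: `ModelStats` and `shouldSwitch` are hypothetical names, "accuracy improvement > 2%" is read as more than 2 percentage points, and the time and cost limits are read as relative increases over the current model.

```typescript
// Hypothetical summary of one model's results over the test invoices.
interface ModelStats {
  accuracyPct: number;       // e.g. 97.5 means 97.5%
  avgTimeMs: number;         // average processing time
  costPerInvoiceUsd: number; // average cost per invoice
}

// Returns true only when all three criteria from the list above hold.
function shouldSwitch(current: ModelStats, candidate: ModelStats): boolean {
  // Assumption: the 2% threshold is measured in percentage points.
  const accuracyGain = candidate.accuracyPct - current.accuracyPct;
  // Relative increases: 0.5 means 50% slower, 1.0 means twice the cost.
  const timeIncrease = (candidate.avgTimeMs - current.avgTimeMs) / current.avgTimeMs;
  const costIncrease =
    (candidate.costPerInvoiceUsd - current.costPerInvoiceUsd) / current.costPerInvoiceUsd;
  return accuracyGain > 2 && timeIncrease < 0.5 && costIncrease < 1.0;
}
```

A candidate that is faster or cheaper than the current model produces a negative increase and passes that criterion automatically.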

Understanding the Results Table

Model
string
Model name and ID
Status
enum
  • pending: Waiting to start
  • running: Currently processing
  • success: Completed successfully
  • error: Failed with error
Time
number
Processing duration in milliseconds
Items
number
Number of line items extracted
Grand Total
string
Whether computed total matches printed total
  • ✅ “Matched” - Perfect match
  • ⚠️ “₹X.XX off” - Close (within tolerance)
  • ❌ “₹X.XX off” - Significant error
GST Total
string
Whether GST calculation is correct
Error
number
Absolute difference in rupees
Actions
button
  • View Details: Expand to see full JSON output
  • Copy JSON: Copy extraction result to clipboard

Technical Details

Implementation: app/compare/page.tsx
API Calls:
  • Fetches available models from /api/models
  • Processes invoice via /api/ocr-structured-v4 for each selected model
  • Runs requests in parallel (concurrent processing)
State Management:
  • Real-time status updates for each model
  • Tracks start time and duration
  • Computes accuracy metrics from reconciliation results
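
The parallel run and per-model status tracking described above could look roughly like the sketch below. It is not the code in app/compare/page.tsx: `runComparison`, `ModelRun`, and the injected `processInvoice` callback are hypothetical, and the callback stands in for whatever request the page sends to /api/ocr-structured-v4 (the request shape is not documented here, so it is left to the caller).

```typescript
type Status = "pending" | "running" | "success" | "error";

interface ModelRun<T> {
  model: string;
  status: Status;
  durationMs?: number;
  result?: T;
  error?: string;
}

// Runs `processInvoice` once per model, all in parallel, and reports a
// snapshot of every run to `onUpdate` whenever a status changes.
async function runComparison<T>(
  models: string[],
  processInvoice: (model: string) => Promise<T>, // assumption: wraps the OCR API call
  onUpdate: (runs: ModelRun<T>[]) => void = () => {},
): Promise<ModelRun<T>[]> {
  const runs = models.map((model): ModelRun<T> => ({ model, status: "pending" }));
  const notify = () => onUpdate(runs.map((r) => ({ ...r })));

  await Promise.all(
    runs.map(async (run) => {
      const startedAt = Date.now();
      run.status = "running";
      notify();
      try {
        run.result = await processInvoice(run.model);
        run.status = "success";
      } catch (err) {
        // A failure in one model must not abort the others.
        run.status = "error";
        run.error = err instanceof Error ? err.message : String(err);
      } finally {
        run.durationMs = Date.now() - startedAt;
        notify();
      }
    }),
  );
  return runs;
}
```

Catching errors inside each task keeps `Promise.all` from rejecting early, so one failed model still leaves the remaining rows in the table populated.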
Reconciliation Check:
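// Derive the accuracy metrics shown in the results table from the API response
const accuracy = {
  // Grand total counts as matched when the reconciliation error is within ₹0.05
  grandTotalMatched: Math.abs(data.reconciliation.error_absolute) <= 0.05,
  // GST total must equal the printed GST exactly
  gstTotalMatched: data.totals.gst_total === data.printed.gst_total,
  itemCount: data.items.length,
  errorAbsolute: data.reconciliation.error_absolute
};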
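
The snippet above compares the GST totals with strict equality, which can report a mismatch when the two values differ only by floating-point rounding. The variation below is a sketch of a tolerance-based alternative, not the tool's actual behavior. The `ExtractionResult` interface is an assumption inferred from the field names used above (all totals are assumed to be numbers), and `computeAccuracy` is a hypothetical helper.

```typescript
// Assumed response shape, inferred from the fields the snippet above reads.
interface ExtractionResult {
  items: unknown[];
  totals: { gst_total: number };
  printed: { gst_total: number };
  reconciliation: { error_absolute: number };
}

// Same ₹0.05 threshold the original check uses for the grand total.
const TOLERANCE_RUPEES = 0.05;

function computeAccuracy(data: ExtractionResult, tolerance: number = TOLERANCE_RUPEES) {
  return {
    grandTotalMatched: Math.abs(data.reconciliation.error_absolute) <= tolerance,
    // Variation: allow the same tolerance for GST instead of strict equality.
    gstTotalMatched: Math.abs(data.totals.gst_total - data.printed.gst_total) <= tolerance,
    itemCount: data.items.length,
    errorAbsolute: data.reconciliation.error_absolute,
  };
}
```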

Tips & Best Practices

More than 5 models can be overwhelming to compare. Focus on 3-5 targeted models.
Compare models on multiple invoice types:
  • Simple single-page invoices
  • Multi-page PDFs
  • Scanned images
  • Invoices with complex discounts
For multi-page PDFs, choose models with larger context windows (100k+ tokens).
Run the same invoice multiple times to check whether results are stable. Model output is not guaranteed to be deterministic, so agreement across repeated runs is a stronger signal than a single pass.
Keep a spreadsheet of model performance on different invoice types to inform future decisions.

Limitations

Concurrent Rate Limits: Running many models simultaneously may hit OpenRouter rate limits. Start with 3-4 models at a time.
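
One way to stay under rate limits when you want results from many models is to process them in small batches rather than all at once. The sketch below illustrates the idea only; it is not how the compare page schedules requests. `runInBatches`, `BatchOutcome`, and the `processInvoice` callback are hypothetical names, and the callback stands in for the per-model OCR request.

```typescript
interface BatchOutcome<T> {
  model: string;
  ok: boolean;
  value?: T;
  error?: string;
}

// Processes models `batchSize` at a time; the next batch starts only after
// every request in the current batch has finished (successfully or not).
async function runInBatches<T>(
  models: string[],
  batchSize: number,
  processInvoice: (model: string) => Promise<T>, // assumption: wraps the OCR API call
): Promise<BatchOutcome<T>[]> {
  const outcomes: BatchOutcome<T>[] = [];
  for (let i = 0; i < models.length; i += batchSize) {
    const batch = models.slice(i, i + batchSize);
    const settled = await Promise.all(
      batch.map(async (model): Promise<BatchOutcome<T>> => {
        try {
          return { model, ok: true, value: await processInvoice(model) };
        } catch (err) {
          return { model, ok: false, error: err instanceof Error ? err.message : String(err) };
        }
      }),
    );
    outcomes.push(...settled);
  }
  return outcomes;
}
```

With a batch size of 3 or 4, this matches the recommendation above while still covering a longer model list; the cost is a longer total wall-clock time.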
Cost Consideration: Each comparison costs the same as processing N separate invoices (N = number of models). Use selectively.

Model Selection Guide

Learn about available models and their characteristics

GET /api/models

API endpoint for fetching available models

OCR Modes

Understand different extraction modes

Reconciliation Engine

How accuracy is measured
