The Compare page (/compare) lets you process the same invoice with multiple vision models simultaneously and compare their accuracy, speed, and output quality.
Overview
Compare Page Features
- ✓ Run up to 10 models concurrently
- ✓ Real-time progress tracking
- ✓ Accuracy comparison (grand total, GST total, item count)
- ✓ Performance metrics (processing time)
- ✓ Side-by-side result inspection
- ✓ Popular model quick-select
When to Use
Model Selection
Determine which model works best for your invoice format
Accuracy Testing
Verify extraction accuracy across different models
Performance Benchmarking
Compare processing speeds for different models
Cost Optimization
Balance accuracy vs cost for your use case
Accessing the Tool
Navigate to /compare in your browser.
How to Use
Select an Invoice
Upload an invoice image or PDF using the file picker or drag-and-drop.
The same invoice will be sent to all selected models for comparison.
Choose Models to Compare
Select 2-10 models from the list.
Quick Select - Popular Models:
- Google Gemini 2.5 Flash
- Google Gemini 2.5 Pro
- Google Gemini 2.0 Flash
- OpenAI GPT-4o
- OpenAI GPT-4o Mini
- Anthropic Claude 3.5 Sonnet
- Click “Browse all models” to see the full catalog
- Filter by provider, context length, or pricing
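The catalog filters can be pictured as a simple predicate over model entries. The sketch below is illustrative only: the `CatalogModel` shape, the `filterCatalog` helper, and the context-length figures are assumptions, not the actual response format of /api/models. The per-invoice prices are the example figures used later on this page.

```typescript
// Hypothetical catalog entry; the real /api/models response may look different.
interface CatalogModel {
  id: string; // "provider/model", e.g. "openai/gpt-4o-mini"
  contextLength: number; // in tokens (illustrative values below)
  pricePerInvoiceUsd: number; // assumed pre-computed estimate
}

interface CatalogFilter {
  provider?: string;
  minContextLength?: number;
  maxPricePerInvoiceUsd?: number;
}

// Keep only the models that satisfy every filter that is set.
function filterCatalog(models: CatalogModel[], filter: CatalogFilter): CatalogModel[] {
  return models.filter((m) => {
    const provider = m.id.split("/")[0];
    if (filter.provider !== undefined && provider !== filter.provider) return false;
    if (filter.minContextLength !== undefined && m.contextLength < filter.minContextLength) return false;
    if (filter.maxPricePerInvoiceUsd !== undefined && m.pricePerInvoiceUsd > filter.maxPricePerInvoiceUsd) return false;
    return true;
  });
}

// Sample data for illustration only.
const sampleCatalog: CatalogModel[] = [
  { id: "openai/gpt-4o-mini", contextLength: 128000, pricePerInvoiceUsd: 0.001 },
  { id: "google/gemini-2.0-flash", contextLength: 1000000, pricePerInvoiceUsd: 0.002 },
  { id: "openai/gpt-4o", contextLength: 128000, pricePerInvoiceUsd: 0.02 },
];

const cheapOpenAi = filterCatalog(sampleCatalog, { provider: "openai", maxPricePerInvoiceUsd: 0.005 });
console.log(cheapOpenAi.map((m) => m.id));
```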
Start Comparison
Click “Run Comparison” to process the invoice with all selected models.
Models run in parallel, displaying real-time status:
- ⏳ Pending (queued)
- ⚡ Running (processing)
- ✅ Success (completed)
- ❌ Error (failed)
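The status flow above could be implemented roughly as follows. This is a minimal sketch, not the code in app/compare/page.tsx: `runComparison`, `ModelRun`, and the `processInvoice` callback are hypothetical names, and the callback stands in for the per-model request, whose payload is not described here.

```typescript
type RunStatus = "pending" | "running" | "success" | "error";

interface ModelRun<T> {
  modelId: string;
  status: RunStatus;
  durationMs?: number;
  result?: T;
  error?: string;
}

// Runs every selected model in parallel and reports each status change.
// processInvoice is an assumed stand-in for the real per-model request.
async function runComparison<T>(
  modelIds: string[],
  processInvoice: (modelId: string) => Promise<T>,
  onUpdate: (runs: ModelRun<T>[]) => void = () => {},
): Promise<ModelRun<T>[]> {
  if (modelIds.length < 2 || modelIds.length > 10) {
    throw new Error("Select between 2 and 10 models");
  }
  const runs = modelIds.map((modelId): ModelRun<T> => ({ modelId, status: "pending" }));
  // Publish a snapshot so listeners never see later mutations.
  const publish = () => onUpdate(runs.map((r) => ({ ...r })));
  publish();

  await Promise.all(
    runs.map(async (run) => {
      const startedAt = Date.now();
      run.status = "running";
      publish();
      try {
        run.result = await processInvoice(run.modelId);
        run.status = "success";
      } catch (err) {
        // One failing model must not abort the others.
        run.error = err instanceof Error ? err.message : String(err);
        run.status = "error";
      } finally {
        run.durationMs = Date.now() - startedAt;
        publish();
      }
    }),
  );
  return runs;
}
```

Because each task catches its own error, `Promise.all` never rejects here, so a single failed model still leaves the other results available for comparison.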
Review Results
Compare the results side-by-side in a table view.
Metrics Compared:
- Grand Total Match: Whether computed total equals printed total
- Item Count: Number of line items extracted
- GST Total Match: Whether GST calculation matches printed GST
- Error Absolute: Difference between computed and printed totals (₹)
- Processing Time: Time taken to process (milliseconds)
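As a rough illustration of how the accuracy metrics relate to one another, the sketch below derives them from an extraction result. The `Extraction` shape and the `computeMetrics` helper are assumptions for illustration; the actual reconciliation engine may handle discounts and rounding differently.

```typescript
// Hypothetical shapes; the real extraction output may differ.
interface LineItem {
  description: string;
  amount: number; // taxable amount in rupees
  gst: number; // GST for this line in rupees
}

interface Extraction {
  items: LineItem[];
  printedGrandTotal: number;
  printedGstTotal: number;
}

interface ComparisonMetrics {
  itemCount: number;
  grandTotalMatch: boolean;
  gstTotalMatch: boolean;
  errorAbsolute: number; // rupees
}

// Round to paise to avoid floating-point noise in comparisons.
const round2 = (n: number): number => Math.round(n * 100) / 100;

function computeMetrics(e: Extraction): ComparisonMetrics {
  const computedGst = round2(e.items.reduce((sum, i) => sum + i.gst, 0));
  const computedTotal = round2(e.items.reduce((sum, i) => sum + i.amount + i.gst, 0));
  const errorAbsolute = round2(Math.abs(computedTotal - e.printedGrandTotal));
  return {
    itemCount: e.items.length,
    grandTotalMatch: errorAbsolute === 0,
    gstTotalMatch: round2(Math.abs(computedGst - e.printedGstTotal)) === 0,
    errorAbsolute,
  };
}
```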
Result Interpretation
Accuracy Indicators
- Perfect Match
- Close Match
- Significant Error
Performance Metrics
- Fast: < 3 seconds
- Medium: 3-8 seconds
- Slow: > 8 seconds
Processing time includes network latency to OpenRouter. First request may be slower due to cold start.
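The speed tiers above map directly onto the millisecond durations shown in the results. A small sketch, where `classifySpeed` is a hypothetical helper name:

```typescript
type SpeedTier = "fast" | "medium" | "slow";

// Thresholds follow the tiers listed above: < 3 s, 3-8 s, > 8 s.
function classifySpeed(durationMs: number): SpeedTier {
  if (durationMs < 3000) return "fast";
  if (durationMs <= 8000) return "medium";
  return "slow";
}

console.log(classifySpeed(2450)); // "fast"
console.log(classifySpeed(9100)); // "slow"
```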
Popular Model Combinations
Budget Testing
Test cost-effective models:
- openai/gpt-4o-mini
- anthropic/claude-3-haiku
- google/gemini-2.0-flash
Accuracy Validation
Compare premium models:
- openai/gpt-4o
- anthropic/claude-3.5-sonnet
- google/gemini-2.5-pro
Speed vs Quality
Balance speed and accuracy:
- google/gemini-2.0-flash (fast, accurate)
- openai/gpt-4o-mini (fast, good)
- openai/gpt-4o (slower, best)
Use Cases
1. Choosing Your Production Model
Scenario: Selecting a model for 1000 invoices/month
Goal: Balance accuracy and cost
Steps:
- Upload a representative sample invoice
- Compare models across different price points, for example:
  - openai/gpt-4o-mini ($0.001/invoice)
  - google/gemini-2.0-flash ($0.002/invoice)
  - openai/gpt-4o ($0.020/invoice)
  - anthropic/claude-3.5-sonnet ($0.025/invoice)
- Review accuracy metrics
- Calculate monthly cost: cost per invoice × 1000 invoices
- Choose model with best accuracy/cost ratio
gemini-2.0-flash: 98% accuracy, $2/month → Selected
gpt-4o: 99% accuracy, $20/month → Too expensive for 1% gain
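The arithmetic behind this example is the per-invoice cost multiplied by the monthly volume. The sketch below reproduces it with the figures quoted in this scenario. `monthlyCostUsd` and `extraCostPerAccuracyPoint` are hypothetical helper names, and the accuracy values are the example numbers above, not measurements.

```typescript
interface Candidate {
  modelId: string;
  costPerInvoiceUsd: number;
  accuracyPct: number;
}

// Monthly cost = cost per invoice × invoices per month, rounded to cents.
function monthlyCostUsd(costPerInvoiceUsd: number, invoicesPerMonth: number): number {
  return Math.round(costPerInvoiceUsd * invoicesPerMonth * 100) / 100;
}

// Extra monthly spend for each additional accuracy point when upgrading.
function extraCostPerAccuracyPoint(cheap: Candidate, premium: Candidate, invoicesPerMonth: number): number {
  const extraCost =
    monthlyCostUsd(premium.costPerInvoiceUsd, invoicesPerMonth) -
    monthlyCostUsd(cheap.costPerInvoiceUsd, invoicesPerMonth);
  return extraCost / (premium.accuracyPct - cheap.accuracyPct);
}

const flash: Candidate = { modelId: "google/gemini-2.0-flash", costPerInvoiceUsd: 0.002, accuracyPct: 98 };
const gpt4o: Candidate = { modelId: "openai/gpt-4o", costPerInvoiceUsd: 0.02, accuracyPct: 99 };

console.log(monthlyCostUsd(flash.costPerInvoiceUsd, 1000)); // 2
console.log(monthlyCostUsd(gpt4o.costPerInvoiceUsd, 1000)); // 20
console.log(extraCostPerAccuracyPoint(flash, gpt4o, 1000)); // 18
```

Here the premium model costs an extra $18 per month for one additional accuracy point, which is the trade-off the scenario rejects.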
2. Debugging Extraction Issues
Scenario: Model extracts incorrect discount amount
Goal: Find which model handles this invoice type best
Steps:
- Upload the problematic invoice
- Compare 3-4 different models
- Inspect which model correctly identifies the discount structure
- Use that model for similar invoices going forward
3. Validating New Model Versions
Scenario: A new model, such as Gemini 2.5 Pro, becomes available on OpenRouter
Goal: Compare new model vs current production model
Steps:
- Upload 5 representative invoices
- Compare current model vs new model
- Check if accuracy improved
- Verify processing time is acceptable
- Update OPENROUTER_MODEL if beneficial
- Accuracy improvement > 2%
- Processing time increase < 50%
- Cost increase < 100%
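The three thresholds above can be combined into a single check. This sketch assumes that all three conditions must hold and that the accuracy improvement is measured in percentage points. Neither is stated explicitly, and `shouldSwitch` and `ModelStats` are hypothetical names.

```typescript
interface ModelStats {
  accuracyPct: number; // e.g. 97 for 97%
  avgDurationMs: number;
  costPerInvoiceUsd: number;
}

// Relative increase of `next` over `base`, in percent.
const increasePct = (base: number, next: number): number => ((next - base) / base) * 100;

// Assumption: switch only when every threshold is satisfied.
function shouldSwitch(current: ModelStats, candidate: ModelStats): boolean {
  const accuracyGain = candidate.accuracyPct - current.accuracyPct; // percentage points
  const timeIncrease = increasePct(current.avgDurationMs, candidate.avgDurationMs);
  const costIncrease = increasePct(current.costPerInvoiceUsd, candidate.costPerInvoiceUsd);
  return accuracyGain > 2 && timeIncrease < 50 && costIncrease < 100;
}

const production: ModelStats = { accuracyPct: 95, avgDurationMs: 4000, costPerInvoiceUsd: 0.002 };
const newer: ModelStats = { accuracyPct: 98, avgDurationMs: 5000, costPerInvoiceUsd: 0.003 };
console.log(shouldSwitch(production, newer)); // true: +3 points, +25% time, about +50% cost
```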
Understanding the Results Table
Model name and ID
- pending: Waiting to start
- running: Currently processing
- success: Completed successfully
- error: Failed with error
Processing duration in milliseconds
Number of line items extracted
Whether computed total matches printed total
- ✅ “Matched” - Perfect match
- ⚠️ “₹X.XX off” - Close (within tolerance)
- ❌ “₹X.XX off” - Significant error
Whether GST calculation is correct
Absolute difference in rupees
- View Details: Expand to see full JSON output
- Copy JSON: Copy extraction result to clipboard
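The three grand total states described above can be derived from the absolute error. In this sketch the tolerance of ₹1 is purely an assumed placeholder, since the actual tolerance is not given here, and `grandTotalLabel` is a hypothetical helper name.

```typescript
// Assumed tolerance in rupees; the real value used by the page is not documented here.
const TOLERANCE_RUPEES = 1;

// Turns the absolute error into the label shown in the grand total column.
function grandTotalLabel(errorAbsolute: number, tolerance: number = TOLERANCE_RUPEES): string {
  if (errorAbsolute === 0) return "✅ Matched";
  const icon = errorAbsolute <= tolerance ? "⚠️" : "❌";
  return `${icon} ₹${errorAbsolute.toFixed(2)} off`;
}

console.log(grandTotalLabel(0));
console.log(grandTotalLabel(0.5));
console.log(grandTotalLabel(12.5));
```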
Technical Details
Implementation: app/compare/page.tsx
API Calls:
- Fetches available models from /api/models
- Processes invoice via /api/ocr-structured-v4 for each selected model
- Runs requests in parallel (concurrent processing)
- Real-time status updates for each model
- Tracks start time and duration
- Computes accuracy metrics from reconciliation results
Tips & Best Practices
Select 3-5 models per test
More than 5 models can be overwhelming to compare. Focus on 3-5 targeted models.
Test with diverse invoices
Compare models on multiple invoice types:
- Simple single-page invoices
- Multi-page PDFs
- Scanned images
- Invoices with complex discounts
Consider context window
For multi-page PDFs, choose models with larger context windows (100k+ tokens).
Check for consistency
Run the same invoice multiple times to check whether results are stable. Model output is not guaranteed to be identical across runs, so large variations between runs are a warning sign.
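One simple way to quantify stability is to compare the grand totals extracted across repeated runs of the same model. A minimal sketch, where `isStable` is a hypothetical helper and the optional tolerance is an assumption:

```typescript
// True when the spread between the lowest and highest extracted
// grand total stays within the given tolerance (in rupees).
function isStable(grandTotals: number[], toleranceRupees: number = 0): boolean {
  if (grandTotals.length < 2) return true; // nothing to compare
  const spread = Math.max(...grandTotals) - Math.min(...grandTotals);
  return spread <= toleranceRupees;
}

console.log(isStable([1770, 1770, 1770])); // true
console.log(isStable([1770, 1775, 1770])); // false
```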
Document your findings
Keep a spreadsheet of model performance on different invoice types to inform future decisions.
Limitations
Cost Consideration: Each comparison costs the same as processing N separate invoices (N = number of models). Use selectively.
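To estimate what one comparison run costs, add up the per-invoice cost of each selected model. The sketch below uses the example prices quoted earlier on this page, and `comparisonCostUsd` is a hypothetical helper name.

```typescript
// One comparison run = one invoice processed by each selected model.
function comparisonCostUsd(perInvoiceCostsUsd: number[]): number {
  const total = perInvoiceCostsUsd.reduce((sum, cost) => sum + cost, 0);
  return Math.round(total * 1000) / 1000; // round to a tenth of a cent
}

// gpt-4o-mini, gemini-2.0-flash, gpt-4o, claude-3.5-sonnet (example prices)
console.log(comparisonCostUsd([0.001, 0.002, 0.02, 0.025])); // 0.048
```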
Related
Model Selection Guide
Learn about available models and their characteristics
GET /api/models
API endpoint for fetching available models
OCR Modes
Understand different extraction modes
Reconciliation Engine
How accuracy is measured
