Current Status: The batch evaluation feature is referenced in the codebase and planned for the web application. Based on the API structure, batch evaluation capabilities are being developed to support:
- Dataset upload (CSV/JSONL formats)
- Multi-model comparison
- Real-time progress tracking via WebSocket
- Comprehensive METEOR scoring and evaluation metrics
Overview
Batch Evaluation is designed for researchers and practitioners who need to:
- Normalize large datasets of social media claims
- Compare multiple AI models on the same dataset
- Evaluate normalization quality with automated metrics
- Export results for further analysis
- Test different prompting strategies at scale
Supported Features
Dataset Formats
The system supports two primary dataset formats: CSV and JSONL.

CSV Format

Required Fields:
- id: Unique identifier for each claim
- claim: The noisy, unstructured social media post to normalize
- reference (optional): Ground truth normalization for evaluation metrics
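For illustration, a minimal dataset row in each format might look like this (the field names match the requirements above; the values are made-up examples):

```
id,claim,reference
1,"omg did u hear the president BANNED coffee??","The president has banned coffee."
```

```
{"id": "1", "claim": "omg did u hear the president BANNED coffee??", "reference": "The president has banned coffee."}
```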
Evaluation Metrics
The batch evaluation system uses multiple metrics to assess normalization quality.

METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- Primary metric for claim normalization quality
- Measures semantic similarity between generated and reference normalizations
- Accounts for synonyms, stemming, and paraphrasing
- Score range: 0.0 (poor) to 1.0 (perfect match)

Additional quality dimensions (assessed via G-Eval):
- Verifiability: How easily the normalized claim can be fact-checked
- Check-Worthiness: Importance and urgency of verifying this claim
- Factual Consistency: Accuracy without distortion or hallucination
- Clarity: Understandability and lack of ambiguity
- Relevance: Connection to current events and public discourse
Batch Evaluation Workflow
Prepare Your Dataset
Create a CSV or JSONL file with your claims.

Best Practices:
- Include unique IDs for tracking
- Ensure claims are properly escaped (especially if they contain commas or quotes)
- Add reference normalizations for metric calculation
- Keep individual claims under 500 characters for optimal processing
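The escaping and length guidelines above can be enforced automatically. A minimal Python sketch using only the standard library (the helper name and row shape are illustrative, not part of the product):

```python
import csv
import io

def write_claims_csv(rows, max_claim_len=500):
    """Write claims to CSV text with proper quoting/escaping.

    `rows` is a list of dicts with keys: id, claim, and optional reference.
    Claims longer than `max_claim_len` are skipped and reported.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "claim", "reference"],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    skipped = []
    for row in rows:
        if len(row["claim"]) > max_claim_len:
            skipped.append(row["id"])
            continue
        writer.writerow({"id": row["id"],
                         "claim": row["claim"],
                         "reference": row.get("reference", "")})
    return buf.getvalue(), skipped

csv_text, skipped = write_claims_csv([
    {"id": "1", "claim": 'He said, "ban it", twice',
     "reference": "He called for a ban twice."},
])
```

The `csv` module doubles embedded quotes automatically, so claims containing commas or quotation marks survive a round trip without manual escaping.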
Upload Dataset
Navigate to the Batch Evaluation section and upload your dataset file.

File Requirements:
- Maximum file size: 10MB
- Supported formats: .csv, .jsonl
- UTF-8 encoding recommended
- Column headers required for CSV files
Configure Evaluation Settings
Select your evaluation parameters.

Model Selection:
- Choose one or multiple AI models to compare
- Available models depend on your API key configuration

Prompting Strategy:
- Zero-shot: Direct normalization
- Few-shot: Example-based learning
- Chain-of-Thought: Step-by-step reasoning
- Self-Refine: Iterative improvement
- Cross-Refine: Multi-model collaborative refinement

Metrics:
- METEOR scoring (default)
- Verifiability assessment
- Check-worthiness scoring
- Factual consistency check
- Clarity evaluation
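Putting these choices together, an evaluation configuration might be expressed as a JSON payload like the following (the field names are hypothetical, not a documented schema):

```python
import json

# Hypothetical configuration payload; actual field names may differ.
config = {
    "models": ["gpt-4o-mini", "claude-sonnet"],  # one or more models to compare
    "strategy": "few-shot",  # zero-shot | few-shot | chain-of-thought | self-refine | cross-refine
    "metrics": ["meteor", "verifiability", "check_worthiness",
                "factual_consistency", "clarity"],
}

payload = json.dumps(config)
```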
Monitor Progress
Track real-time progress via the WebSocket connection.

Progress Indicators:
- Claims processed (e.g., 15/100)
- Current model (e.g., GPT-4o-mini)
- Estimated time remaining
- Current claim being processed
The WebSocket connection provides live updates without requiring page refreshes. You can leave the page and return; progress is saved on the server.
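On the client side, each incoming WebSocket message can be turned into a status line. A sketch, assuming a JSON payload with `processed`, `total`, `model`, and `eta_seconds` fields (the actual message schema may differ):

```python
import json

def parse_progress(message: str) -> str:
    """Render a one-line status from a WebSocket progress message.

    The message shape assumed here (processed/total/model/eta_seconds)
    is illustrative; consult the API for the real payload format.
    """
    data = json.loads(message)
    pct = 100 * data["processed"] // data["total"]
    return (f'{data["processed"]}/{data["total"]} ({pct}%) '
            f'on {data["model"]}, ~{data["eta_seconds"]}s left')

status = parse_progress(
    '{"processed": 15, "total": 100, "model": "GPT-4o-mini", "eta_seconds": 230}'
)
```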
Review Results
Once processing completes, review comprehensive results.

Results Dashboard:
- Overall METEOR score
- Per-claim normalization results
- Metric breakdown by evaluation criteria
- Model comparison (if multiple models used)
- Detailed error reports for failed claims
Understanding Evaluation Metrics
METEOR Scoring
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is the primary quality metric.

Score Interpretation:
- 0.0 - 0.3: Poor normalization, significant semantic drift
- 0.3 - 0.5: Moderate quality, captures some key information
- 0.5 - 0.7: Good normalization, preserves main claims
- 0.7 - 0.9: Excellent normalization, high semantic similarity
- 0.9 - 1.0: Near-perfect match with reference

The score is computed from:
- Unigram matching between normalized and reference text
- Synonym and stemming alignment
- Word order preservation
- Penalty for fragmentation
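The interpretation bands above can be applied mechanically. A small helper (a sketch for reporting purposes, not part of the product API; band boundaries follow the table above, with ties assigned to the lower band):

```python
def interpret_meteor(score: float) -> str:
    """Map a METEOR score to the quality band described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("METEOR scores fall in [0.0, 1.0]")
    bands = [
        (0.3, "poor normalization, significant semantic drift"),
        (0.5, "moderate quality, captures some key information"),
        (0.7, "good normalization, preserves main claims"),
        (0.9, "excellent normalization, high semantic similarity"),
        (1.0, "near-perfect match with reference"),
    ]
    for upper, label in bands:
        if score <= upper:
            return label
    return bands[-1][1]
```

In practice the score itself can be computed with `nltk.translate.meteor_score` against tokenized reference and candidate normalizations.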
DeepEval Metrics
The evaluation service uses G-Eval metrics for comprehensive assessment.

Verifiability (0.0 - 1.0)

Evaluates how easily the claim can be fact-checked:
- Contains specific, factual assertions
- Evidence can be found to support/refute
- Not overly vague or opinion-based
- Includes relevant time/location context

Check-Worthiness (0.0 - 1.0)

Evaluates the importance and urgency of verifying the claim:
- Potential harm if false
- Reach and influence potential
- Public interest level
- Impact on vulnerable populations

Factual Consistency (0.0 - 1.0)

Evaluates accuracy without distortion or hallucination:
- No hallucinated information
- Maintains original context
- Doesn’t misrepresent source
- Avoids introducing new claims

Clarity (0.0 - 1.0)

Evaluates understandability and lack of ambiguity:
- Clear, simple language
- Lacks ambiguous terms
- Self-contained statement
- Concise yet comprehensive
Model Comparison
When evaluating with multiple models, the results dashboard provides:

Side-by-Side Comparison Table:

| Claim | GPT-4o | Claude Opus | Gemini Pro | Best METEOR |
|---|---|---|---|---|
| "OMG the president…" | 0.85 | 0.88 | 0.82 | Claude Opus |
| "Scientists say coffee…" | 0.79 | 0.81 | 0.84 | Gemini Pro |

Aggregate Statistics:
- Average METEOR score per model
- Consistency across claims
- Processing time per model
- Cost per model (based on token usage)
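Aggregating per-claim scores into a per-model ranking is straightforward. A sketch using the illustrative scores from the comparison table above (real runs would use many more claims):

```python
from statistics import mean

# Per-claim METEOR scores keyed by model (numbers from the example table).
results = {
    "GPT-4o":      [0.85, 0.79],
    "Claude Opus": [0.88, 0.81],
    "Gemini Pro":  [0.82, 0.84],
}

averages = {model: mean(scores) for model, scores in results.items()}
best_model = max(averages, key=averages.get)
```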
API Integration
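A batch-evaluation request might be constructed as sketched below, using only the Python standard library. The endpoint path and payload fields here are hypothetical, not a documented schema:

```python
import json
import urllib.request

# Hypothetical endpoint; consult the API documentation for real routes.
API_BASE = "https://example.com/api"

def build_batch_request(dataset_path: str, models: list, strategy: str):
    """Construct (but do not send) a batch-evaluation POST request."""
    payload = {
        "dataset": dataset_path,
        "models": models,
        "strategy": strategy,
        "metrics": ["meteor"],
    }
    return urllib.request.Request(
        f"{API_BASE}/batch-evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_batch_request("claims.csv", ["gpt-4o-mini"], "zero-shot")
```

Sending the request (e.g., with `urllib.request.urlopen`) and subscribing to the WebSocket for progress would follow the same pattern as the interactive workflow above.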
For programmatic access to batch evaluation, consult the API reference for the exact endpoints and schemas.

Best Practices
Dataset Preparation
- Clean your data: Remove duplicates and invalid entries
- Balance your dataset: Include diverse claim types
- Add quality references: Better references = more accurate metrics
- Test with small batches: Validate with 10-20 claims first
Model Selection
- Start with mid-tier models: Test with GPT-4o-mini or Claude Sonnet
- Use premium for final runs: Reserve GPT-5/Opus 4.1 for production
- Compare 2-3 models: Find the best fit for your use case
- Consider cost vs quality: Balance budget with accuracy needs
Optimization
- Batch size: Keep batches under 1000 claims for manageable processing
- Concurrent processing: The system automatically parallelizes when possible
- Error handling: Failed claims are retried automatically (up to 3 attempts)
- Rate limits: Respect API rate limits by spacing requests
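The retry behavior described above can be mirrored client-side. A sketch with exponential backoff (the documented limit is 3 attempts; the backoff schedule itself is an assumption):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry a claim-processing call with exponential backoff.

    Mirrors the documented limit of 3 attempts; the delay schedule
    (base_delay * 2 ** attempt) is illustrative, not the service's.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Demo: a call that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky, base_delay=0.0)
```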
Troubleshooting
Upload Errors
"Invalid CSV format"
- Ensure proper CSV structure with headers
- Check for unescaped quotes or commas
- Verify UTF-8 encoding

"File too large"
- Split dataset into multiple files (max 10MB each)
- Remove unnecessary columns
- Use JSONL for more efficient storage
Processing Issues
"Timeout on claim processing"
- Extremely long claims may time out
- Break very long posts into multiple claims
- Consider using a more powerful model

"Unexpectedly low METEOR scores"
- Check if references are too different from your normalization style
- Verify references are actual normalizations, not original claims
- Consider if your prompting strategy needs adjustment
Next Steps
- Learn about [Claim Normalization Strategies](/guides/prompting-strategies)
- Explore [Evaluation Metrics](/concepts/evaluation-metrics) in detail
- Try Interactive Chat for single-claim testing
- Review [API Documentation](/api/overview) for programmatic access