What is METEOR?
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic evaluation metric originally designed for machine translation. In CheckThat AI, METEOR is used to measure the quality of normalized claims by comparing them against reference claims.
Research Background
Paper: “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments”
Authors: Satanjeev Banerjee and Alon Lavie
Publication: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and Summarization (2005)
Key Innovation: Unlike BLEU, which only considers precision, METEOR balances precision and recall, and incorporates stemming, synonymy, and paraphrasing.
Why METEOR for Claim Evaluation?
The CLEF-CheckThat! Lab uses METEOR as the primary evaluation metric for Task 2 (2025) because it aligns well with claim normalization requirements:
Advantages for Claims
METEOR Benefits:
- Semantic Matching: Recognizes synonyms and paraphrases
  - “automobile” ≈ “car”
  - “physician” ≈ “doctor”
- Recall-Oriented: Rewards capturing all important content
  - Doesn’t penalize adding necessary context
- Stemming: Handles morphological variations
  - “running”, “runs” → “run” (irregular forms like “ran” are not reduced by stemming)
- Word Order: Considers phrase structure via the fragmentation penalty
  - “The dog bit the man” ≠ “The man bit the dog”
- Balanced: Combines precision and recall harmonically
Comparison with Other Metrics
| Metric | Precision | Recall | Synonyms | Word Order | Best For |
|---|---|---|---|---|---|
| BLEU | ✅ | ❌ | ❌ | ✅ | Translation adequacy |
| ROUGE | ❌ | ✅ | ❌ | ❌ | Summarization coverage |
| Exact Match | ✅ | ✅ | ❌ | ✅ | Strict equality |
| METEOR | ✅ | ✅ | ✅ | ✅ | Semantic similarity |
| G-Eval | ✅ | ✅ | ✅ | ✅ | Nuanced evaluation |
METEOR Calculation
Algorithm Overview
METEOR computes an alignment between the candidate and reference text through multiple stages:
Step 1: Alignment
METEOR aligns words in stages, each stage matching only words left unmatched by earlier stages:
Stage 1: Exact Match
Stage 2: Stem Match
Stage 3: Synonym Match
Stage 4: Paraphrase Match
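Stages 1–3 can be illustrated with a self-contained sketch. Here `toy_stem` and `SYNONYMS` are simplified stand-ins for the Porter stemmer and WordNet used by real implementations, and the paraphrase stage is omitted:

```python
# Illustrative sketch of METEOR's staged alignment (stages 1-3 only).
def toy_stem(word: str) -> str:
    # Stand-in for a real stemmer (e.g. Porter): strip common suffixes.
    for suffix in ("ning", "ing", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

SYNONYMS = {"physician": "doctor", "automobile": "car"}  # toy WordNet

def align(candidate: list[str], reference: list[str]) -> list[tuple[int, int, str]]:
    """Return (candidate_index, reference_index, stage) matches."""
    stages = [
        ("exact", lambda c, r: c == r),
        ("stem", lambda c, r: toy_stem(c) == toy_stem(r)),
        ("synonym", lambda c, r: SYNONYMS.get(c, c) == SYNONYMS.get(r, r)),
    ]
    matches, cand_used, ref_used = [], set(), set()
    for stage_name, same in stages:  # later stages only see leftovers
        for i, c in enumerate(candidate):
            if i in cand_used:
                continue
            for j, r in enumerate(reference):
                if j not in ref_used and same(c, r):
                    matches.append((i, j, stage_name))
                    cand_used.add(i)
                    ref_used.add(j)
                    break
    return matches
```

For example, `align("the physician was running".split(), "the doctor runs".split())` matches “the” exactly, “running” ↔ “runs” by stem, and “physician” ↔ “doctor” by synonym, leaving “was” unaligned.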
Step 2: Precision and Recall
After alignment, unigram precision (matched words over candidate length) and recall (matched words over reference length) are computed.
Step 3: F-mean
METEOR combines them with a harmonic mean that weights recall nine times higher than precision.
Step 4: Fragmentation Penalty
Discontinuous alignments are penalized: matched words are grouped into contiguous chunks, and the more chunks, the larger the penalty.
Score Interpretation
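Putting the four steps together, the score from the original 2005 paper (recall weight 9, penalty 0.5 · (chunks/matches)³) can be sketched as:

```python
# Sketch of the METEOR scoring formula (Banerjee & Lavie, 2005).
# m: matched unigrams; w_t: candidate length; w_r: reference length;
# chunks: number of contiguous matched spans in the alignment.
def meteor_score(m: int, w_t: int, w_r: int, chunks: int) -> float:
    if m == 0:
        return 0.0
    precision = m / w_t
    recall = m / w_r
    # Harmonic mean weighting recall 9x more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: more chunks -> larger penalty (max 0.5).
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

A perfectly aligned six-word claim in one chunk scores near 1.0, while the same matches scattered into six chunks lose half the F-mean to the penalty.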
METEOR scores range from 0.0 to 1.0:
Score Ranges
METEOR Score Interpretation:
- 0.90-1.00: Near-perfect match (rare in claim normalization)
- 0.80-0.89: Excellent semantic similarity
- 0.70-0.79: Good match with minor differences
- 0.60-0.69: Moderate similarity, key content preserved
- 0.50-0.59: Fair match, some important content differs
- 0.40-0.49: Poor match, significant differences
- 0.00-0.39: Very poor match, mostly unrelated
CLEF-CheckThat! Benchmarks
Competition baselines and winning systems:
| System | Average METEOR | Performance |
|---|---|---|
| Human Reference | 1.000 | Perfect match |
| Winning System (2024) | ~0.750 | State-of-the-art |
| Strong Baseline | ~0.650 | Competitive |
| Simple Baseline | ~0.500 | Extractive summary |
| Random Sentence | ~0.200 | Poor |
Usage in CheckThat AI
Implementation
METEOR is used in the CLI evaluation tool.
Dataset Evaluation
When evaluating against the CLEF-CheckThat! development set, METEOR gives the score the competition will rank systems on.
When to Use METEOR vs G-Eval
METEOR:
- ✅ Competition evaluation (official metric)
- ✅ Benchmarking against datasets
- ✅ Fast, deterministic scoring
- ✅ Free (no API costs)
- ❌ Requires reference claims
- ❌ Surface-level matching only
G-Eval:
- ✅ Development and refinement
- ✅ No reference needed
- ✅ Nuanced quality assessment
- ✅ Explanatory feedback
- ❌ Slow (LLM inference)
- ❌ API costs
- ❌ Non-deterministic
Recommended:
- G-Eval during iterative refinement
- METEOR for final evaluation against competition dataset
METEOR Limitations
Challenges for Claim Normalization
1. Surface-Level Matching
METEOR doesn’t understand deep semantics: a candidate that negates the reference (“Vaccines cause autism” vs. “Vaccines do not cause autism”) still shares most unigrams and can score deceptively high.
2. Multiple Valid Normalizations
One post may have several correct normalizations, but METEOR only rewards similarity to the single provided reference.
3. Length Bias
METEOR can favor longer or shorter claims: recall rewards covering the reference, while extra words lower precision.
4. Synonym Coverage
WordNet doesn’t cover all domain-specific terms, so pairs like “COVID-19” and “coronavirus” may not be matched as synonyms.
Improving METEOR Scores
Strategies for Higher Scores
Optimization Techniques:
- Match reference length: Aim for similar word count
- Preserve key terms: Keep named entities and numbers exact
- Use common synonyms: WordNet-recognized alternatives
- Maintain word order: Follow reference structure when possible
- Avoid extra content: Don’t add information not in reference
- Keep verb forms consistent: stemming maps “running”/“runs” to “run”, but irregular forms like “ran” may not match
Example Optimization
Original Post (illustrative): “BREAKING: they don’t want you to know that the vax changes your DNA!!!”
Reference Claim: “The COVID-19 vaccine alters human DNA”
Optimized Normalization: “The COVID-19 vaccine alters human DNA” (matches the reference’s key terms, length, and word order)
METEOR in Competition Context
CLEF-CheckThat! Task 2 Evaluation
The official evaluation process:
- Submit predictions for test set (unlabeled posts)
- Organizers compute METEOR against hidden reference claims
- Rankings determined by average METEOR across test set
- Human evaluation for top systems (secondary metric)
Submission Format
Scoring Script
Combining METEOR with G-Eval
Hybrid Evaluation Strategy
Use both metrics throughout development:
Development Workflow
Technical Implementation
NLTK Integration
CheckThat AI uses NLTK’s METEOR implementation:
Performance Optimization
References
Academic Literature
- Original METEOR Paper: Banerjee & Lavie, “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments” (ACL Workshop, 2005)
- METEOR Extensions: Denkowski & Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language” (WMT 2014)
- CLEF-CheckThat!: Annual task descriptions and evaluation reports
Implementation Resources
- NLTK Documentation: nltk.translate.meteor_score
- WordNet: Princeton WordNet
- ACL Anthology: Original paper