
What is METEOR?

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic evaluation metric originally designed for machine translation. In CheckThat AI, METEOR is used to measure the quality of normalized claims by comparing them against reference claims.

Research Background

Paper: “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments”
Authors: Satanjeev Banerjee and Alon Lavie
Publication: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and Summarization (2005)
Key Innovation: Unlike BLEU, which only considers precision, METEOR balances precision and recall, and incorporates stemming, synonymy, and paraphrasing.

Why METEOR for Claim Evaluation?

The CLEF-CheckThat! Lab uses METEOR as the primary evaluation metric for Task 2 (2025) because it aligns well with claim normalization requirements:

Advantages for Claims

METEOR Benefits:
  1. Semantic Matching: Recognizes synonyms and paraphrases
    • “automobile” ≈ “car”
    • “physician” ≈ “doctor”
  2. Recall-Oriented: Rewards capturing all important content
    • Doesn’t penalize adding necessary context
  3. Stemming: Handles morphological variations
    • “running”, “runs”, “ran” → “run”
  4. Word Order: Considers phrase structure
    • “The dog bit the man” ≠ “The man bit the dog”
  5. Balanced: Combines precision and recall harmonically

Comparison with Other Metrics

| Metric      | Precision | Recall | Synonyms | Word Order   | Best For               |
|-------------|-----------|--------|----------|--------------|------------------------|
| BLEU        | ✓         | ✗      | ✗        | ✓ (n-grams)  | Translation adequacy   |
| ROUGE       | ✗         | ✓      | ✗        | ✓ (ROUGE-L)  | Summarization coverage |
| Exact Match | ✓         | ✓      | ✗        | ✓ (strict)   | Strict equality        |
| METEOR      | ✓         | ✓      | ✓        | ✓ (penalty)  | Semantic similarity    |
| G-Eval      | n/a       | n/a    | ✓        | ✓            | Nuanced evaluation     |

METEOR Calculation

Algorithm Overview

METEOR computes alignment between candidate and reference text through multiple stages:
from nltk.translate.meteor_score import meteor_score

# NLTK expects pre-tokenized input: a list of reference token lists,
# plus one candidate token list
reference = ["The FDA approved COVID-19 vaccine for children aged 5 to 11".split()]
candidate = "FDA approves COVID vaccine for kids ages 5-11".split()

score = meteor_score(reference, candidate)
print(f"METEOR Score: {score:.4f}")  # Range: 0.0 to 1.0

Step 1: Alignment

METEOR creates alignment between words in stages:

Stage 1: Exact Match

Reference: The FDA approved COVID-19 vaccine for children aged 5 to 11
Candidate: FDA approves COVID vaccine for kids ages 5-11
Exact:     FDA, vaccine, for (whole-token matches only; "COVID" ≠ "COVID-19")

Stage 2: Stem Match

Reference: approved
Candidate: approves
Stem:      approv* → MATCH

Stage 3: Synonym Match

Reference: children
Candidate: kids
WordNet:   children ≈ kids → MATCH

Stage 4: Paraphrase Match

Reference: aged 5 to 11
Candidate: ages 5-11
Paraphrase: Similar phrasing → MATCH
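The four stages can be sketched as a toy matcher. This is an illustration only: the hand-rolled suffix stripper and the tiny `synonyms` dict stand in for the Porter stemmer and WordNet lookup that real METEOR implementations use, and real METEOR also has a paraphrase-table stage not shown here.

```python
def toy_align(reference, candidate):
    """Greedily align candidate tokens to reference tokens in stages:
    exact match first, then stem match, then synonym match."""
    synonyms = {"kids": "children", "car": "automobile"}  # stand-in for WordNet

    def stem(word):
        # Crude suffix stripping in place of the Porter stemmer
        for suffix in ("es", "ed", "ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    ref = [w.lower() for w in reference]
    cand = [w.lower() for w in candidate]
    unmatched_ref = set(range(len(ref)))
    alignment = {}  # candidate index -> reference index

    for stage in ("exact", "stem", "synonym"):
        for ci, cw in enumerate(cand):
            if ci in alignment:
                continue
            for ri in sorted(unmatched_ref):
                rw = ref[ri]
                hit = (
                    (stage == "exact" and cw == rw)
                    or (stage == "stem" and stem(cw) == stem(rw))
                    or (stage == "synonym" and synonyms.get(cw) == rw)
                )
                if hit:
                    alignment[ci] = ri
                    unmatched_ref.discard(ri)
                    break
    return alignment

ref = "The FDA approved COVID-19 vaccine for children".split()
cand = "FDA approves COVID-19 vaccine for kids".split()
matches = toy_align(ref, cand)
# Every candidate token finds a partner: four exact matches,
# one stem match ("approves" ~ "approved"), one synonym match ("kids" ~ "children")
```

Later stages only see tokens the earlier stages left unmatched, which is why the staged ordering matters.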

Step 2: Precision and Recall

After alignment:
# Count matched words
matches = count_aligned_words()

# Precision: What portion of candidate words matched?
precision = matches / len(candidate_words)

# Recall: What portion of reference words matched?
recall = matches / len(reference_words)
Example:
Reference: "The FDA approved COVID-19 vaccine for children aged 5 to 11" (11 words)
Candidate: "FDA approves COVID vaccine for kids ages 5-11" (8 words)
Matches: 8 candidate tokens aligned, covering 9 reference tokens
(the paraphrase "5-11" covers both "5" and "11")

Precision = 8 / 8 = 1.00 (all candidate words matched)
Recall = 9 / 11 ≈ 0.82 (missed "The" and "to")
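With illustrative counts (6 matches, an 8-word candidate, a 10-word reference; chosen just to make the division visible), the arithmetic is simply:

```python
def precision_recall(matches, candidate_len, reference_len):
    """METEOR-style unigram precision and recall from a match count."""
    precision = matches / candidate_len  # portion of candidate words matched
    recall = matches / reference_len     # portion of reference words matched
    return precision, recall

p, r = precision_recall(6, 8, 10)
# p = 0.75, r = 0.6
```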

Step 3: F-mean

METEOR uses a harmonic mean with recall weighted more heavily:
# METEOR weights recall 9:1 over precision
alpha = 0.9  # NLTK's default

# Parameterized harmonic mean
if precision > 0 and recall > 0:
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
else:
    f_mean = 0
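Plugging the worked example's precision (1.00) and recall (9/11 ≈ 0.82) into this parameterized harmonic mean shows how strongly recall dominates:

```python
def meteor_fmean(precision, recall, alpha=0.9):
    """Parameterized harmonic mean; alpha = 0.9 gives recall 9x the
    weight of precision, as in NLTK's implementation."""
    if precision == 0 or recall == 0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)

f = meteor_fmean(1.0, 9 / 11)
# f ≈ 0.833: much closer to the recall (0.82) than to the precision (1.00)
```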

Step 4: Fragmentation Penalty

Penalize discontinuous alignments:
Good alignment (mostly contiguous):
Reference: The cat sat on the mat
Candidate: cat sat on mat
           ^^^^^^^^^^ ^^^ (2 chunks: "cat sat on" + "mat")

Bad alignment (fragmented):
Reference: The cat sat on the mat  
Candidate: mat on sat cat
           ^^^ ^^ ^^^ ^^^ (4 chunks)
# Count alignment chunks
chunks = count_contiguous_alignment_chunks()

# Fragmentation penalty: gamma = 0.5, beta = 3 (NLTK defaults)
fragmentation = chunks / matches
penalty = 0.5 * (fragmentation ** 3)

# Final score (renamed so it doesn't shadow nltk's meteor_score function)
final_score = f_mean * (1 - penalty)
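The chunk counting itself is simple: walk the aligned candidate tokens in order and start a new chunk whenever their reference positions stop being consecutive. A toy version, using the fragmented "mat on sat cat" alignment above (its tokens map to reference positions 5, 3, 2, 1):

```python
def count_chunks(ref_positions):
    """Count contiguous runs among the reference positions of the
    aligned candidate tokens, taken in candidate order."""
    if not ref_positions:
        return 0
    chunks = 1
    for prev, cur in zip(ref_positions, ref_positions[1:]):
        if cur != prev + 1:  # a gap or reordering starts a new chunk
            chunks += 1
    return chunks

def meteor_final(f_mean, chunks, matches, gamma=0.5, beta=3):
    """Apply the fragmentation penalty to the F-mean."""
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

chunks = count_chunks([5, 3, 2, 1])  # "mat on sat cat": every token its own chunk
score = meteor_final(0.9, chunks, 4)
# chunks = 4, penalty = 0.5 * 1**3 = 0.5, so the F-mean is halved: score = 0.45
```

Maximal fragmentation (every match its own chunk) caps the penalty at gamma, so even a scrambled candidate keeps half of its F-mean under the default parameters.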

Score Interpretation

METEOR scores range from 0.0 to 1.0:

Score Ranges

METEOR Score Interpretation:
  • 0.90-1.00: Near-perfect match (rare in claim normalization)
  • 0.80-0.89: Excellent semantic similarity
  • 0.70-0.79: Good match with minor differences
  • 0.60-0.69: Moderate similarity, key content preserved
  • 0.50-0.59: Fair match, some important content differs
  • 0.40-0.49: Poor match, significant differences
  • 0.00-0.39: Very poor match, mostly unrelated
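For reporting, these bands are easy to encode as a small lookup. The labels are the ones above; the function name is our own:

```python
def interpret_meteor(score: float) -> str:
    """Map a METEOR score (0.0-1.0) to a qualitative band."""
    bands = [
        (0.90, "Near-perfect match"),
        (0.80, "Excellent semantic similarity"),
        (0.70, "Good match with minor differences"),
        (0.60, "Moderate similarity, key content preserved"),
        (0.50, "Fair match, some important content differs"),
        (0.40, "Poor match, significant differences"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "Very poor match, mostly unrelated"
```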

CLEF-CheckThat! Benchmarks

Competition baselines and winning systems:
| System                | Average METEOR | Performance        |
|-----------------------|----------------|--------------------|
| Human Reference       | 1.000          | Perfect match      |
| Winning System (2024) | ~0.750         | State-of-the-art   |
| Strong Baseline       | ~0.650         | Competitive        |
| Simple Baseline       | ~0.500         | Extractive summary |
| Random Sentence       | ~0.200         | Poor               |

Usage in CheckThat AI

Implementation

METEOR is used in the CLI evaluation tool:
# From /home/daytona/workspace/source/claim_norm.py:45-47

METEOR_SCORE = start_extraction(
    MODEL, 
    PROMPT_STYLE, 
    DEV_DATA[0:1], 
    REFINE_ITERATIONS, 
    CROSS_REFINE_MODEL
)

print(f"\nAverage METEOR Score: {METEOR_SCORE}")

Dataset Evaluation

When evaluating against the CLEF-CheckThat! development set:
import pandas as pd
from nltk.translate.meteor_score import meteor_score

# Load dataset with reference claims
dev_data = pd.read_csv("data/dev.csv")

meteor_scores = []
for idx, row in dev_data.iterrows():
    original_post = row['original_post']
    reference_claim = row['normalized_claim']
    
    # Generate candidate claim
    candidate_claim = model.normalize(original_post)
    
    # Calculate METEOR
    score = meteor_score(
        [reference_claim.split()],
        candidate_claim.split()
    )
    
    meteor_scores.append(score)

# Report results
average_meteor = sum(meteor_scores) / len(meteor_scores)
print(f"Average METEOR Score: {average_meteor:.4f}")

When to Use METEOR vs G-Eval

METEOR:
  • ✅ Competition evaluation (official metric)
  • ✅ Benchmarking against datasets
  • ✅ Fast, deterministic scoring
  • ✅ Free (no API costs)
  • ❌ Requires reference claims
  • ❌ Surface-level matching only
G-Eval:
  • ✅ Development and refinement
  • ✅ No reference needed
  • ✅ Nuanced quality assessment
  • ✅ Explanatory feedback
  • ❌ Slow (LLM inference)
  • ❌ API costs
  • ❌ Non-deterministic
Recommended Approach: Use both complementarily:
  • G-Eval during iterative refinement
  • METEOR for final evaluation against competition dataset

METEOR Limitations

Challenges for Claim Normalization

1. Surface-Level Matching

METEOR doesn’t understand deep semantics:
Reference: "COVID-19 vaccines are safe and effective"
Candidate: "COVID-19 vaccines pose serious health risks"

METEOR: ~0.65 (high surface similarity)
Semantic: Opposite meaning!

2. Multiple Valid Normalizations

One post may have several correct normalizations:
Post: "Elon Musk's company SpaceX launched 60 satellites yesterday"

Valid Claim 1: "SpaceX launched 60 satellites"
Valid Claim 2: "Elon Musk's company launched 60 satellites"
Valid Claim 3: "60 SpaceX satellites were launched"

METEOR only compares against one reference!

3. Length Bias

METEOR can favor longer or shorter claims:
Reference: "The FDA approved Pfizer vaccine" (5 words)

Candidate A: "FDA approved Pfizer vaccine" (4 words)
METEOR: 0.95 (high - very similar)

Candidate B: "The FDA regulatory agency approved the Pfizer 
               COVID-19 vaccine" (9 words)
METEOR: 0.72 (lower - added words penalized)
Both could be valid normalizations!

4. Synonym Coverage

WordNet doesn’t cover all domain-specific terms:
Reference: "COVID-19"
Candidate: "coronavirus"

WordNet: No synonym relationship
METEOR: 0.0 (no match)

Human judgment: Clearly related!

Improving METEOR Scores

Strategies for Higher Scores

Optimization Techniques:
  1. Match reference length: Aim for similar word count
  2. Preserve key terms: Keep named entities and numbers exact
  3. Use common synonyms: WordNet-recognized alternatives
  4. Maintain word order: Follow reference structure when possible
  5. Avoid extra content: Don’t add information not in reference
  6. Use stemmed forms: “running” vs “ran” - keep consistent
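A hypothetical pre-submission helper along these lines (all names and thresholds are illustrative; it cannot see the hidden reference, so these are heuristics, not guarantees):

```python
import re
from typing import List

def claim_sanity_check(candidate: str, reference_length_hint: int = 12) -> List[str]:
    """Flag likely METEOR-score killers in a normalized claim.

    `reference_length_hint` is an assumed typical reference length;
    tune it to your dataset.
    """
    warnings = []
    tokens = candidate.split()
    if len(tokens) > 2 * reference_length_hint:
        warnings.append("claim is much longer than typical references")
    if not re.search(r"[A-Z]", candidate):
        warnings.append("no capitalized tokens: named entities may be missing")
    if re.search(r"[#@]\w+|http", candidate):
        warnings.append("social-media artifacts (hashtags/mentions/URLs) present")
    return warnings

claim_sanity_check("WHO is investigating a new COVID-19 variant in South Africa")
# A clean claim produces no warnings
```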

Example Optimization

Original Post:
BREAKING: The World Health Organization just announced that they 
are investigating a new variant of concern that has emerged in 
South Africa!!! 😱 This could change everything! #COVID #Variant
Reference Claim (competition dataset):
"WHO is investigating a new COVID-19 variant in South Africa"
Low METEOR Candidate (0.45):
"The global health authority is examining an emerging strain 
of the coronavirus disease discovered in a southern African nation"
Issues: Different words, longer, complex phrasing
High METEOR Candidate (0.89):
"The WHO is investigating a new COVID-19 variant in South Africa"
Success: Similar words, same length, preserved key terms

METEOR in Competition Context

CLEF-CheckThat! Task 2 Evaluation

The official evaluation process:
  1. Submit predictions for test set (unlabeled posts)
  2. Organizers compute METEOR against hidden reference claims
  3. Rankings determined by average METEOR across test set
  4. Human evaluation for top systems (secondary metric)

Submission Format

post_id,normalized_claim
001,"WHO is investigating a new COVID-19 variant in South Africa"
002,"FDA approved Pfizer vaccine for children aged 5 to 11"
003,"Studies show coffee consumption linked to reduced cancer risk"
...
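Writing that file with Python's standard csv module might look like this. The IDs and claims are the illustrative rows above; confirm the exact quoting convention against the organizers' validator:

```python
import csv

# Illustrative predictions (post IDs and claims from the sample above)
rows = [
    ("001", "WHO is investigating a new COVID-19 variant in South Africa"),
    ("002", "FDA approved Pfizer vaccine for children aged 5 to 11"),
]

with open("predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # QUOTE_MINIMAL: quotes only fields that need it
    writer.writerow(("post_id", "normalized_claim"))
    writer.writerows(rows)
```

Letting the csv module handle quoting avoids broken rows when a claim itself contains a comma or a quote character.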

Scoring Script

# Simplified version of competition scoring

import pandas as pd
from nltk.translate.meteor_score import meteor_score

def evaluate_submission(predictions_file, references_file):
    """
    Evaluate competition submission using METEOR.
    """
    predictions = pd.read_csv(predictions_file)
    references = pd.read_csv(references_file)
    
    scores = []
    for (_, pred_row), (_, ref_row) in zip(predictions.iterrows(),
                                           references.iterrows()):
        pred_claim = pred_row['normalized_claim']
        ref_claim = ref_row['normalized_claim']
        
        score = meteor_score(
            [ref_claim.split()],
            pred_claim.split()
        )
        scores.append(score)
    
    return {
        'average_meteor': sum(scores) / len(scores),
        'median_meteor': sorted(scores)[len(scores)//2],
        'scores_per_post': scores
    }

results = evaluate_submission(
    'predictions.csv',
    'references.csv'
)

print(f"Average METEOR: {results['average_meteor']:.4f}")

Combining METEOR with G-Eval

Hybrid Evaluation Strategy

Use both metrics throughout development:
from typing import Any, Dict, Optional

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase
from nltk.translate.meteor_score import meteor_score

def evaluate_claim_quality(
    original_post: str,
    normalized_claim: str,
    reference_claim: Optional[str] = None
) -> Dict[str, Any]:
    """
    Comprehensive evaluation using both METEOR and G-Eval.
    """
    results = {}
    
    # G-Eval: Nuanced quality assessment
    geval_metric = GEval(
        criteria=STATIC_EVAL_SPECS.criteria,
        evaluation_steps=STATIC_EVAL_SPECS.evaluation_steps,
        model=eval_model,
        threshold=0.5
    )
    
    test_case = LLMTestCase(
        input=original_post,
        actual_output=normalized_claim
    )
    
    geval_metric.measure(test_case)
    results['geval_score'] = geval_metric.score
    results['geval_feedback'] = geval_metric.reason
    
    # METEOR: Reference-based similarity
    if reference_claim:
        meteor = meteor_score(
            [reference_claim.split()],
            normalized_claim.split()
        )
        results['meteor_score'] = meteor
    
    # Combined decision
    results['overall_quality'] = 'HIGH' if (
        results['geval_score'] >= 0.7 and 
        results.get('meteor_score', 1.0) >= 0.6
    ) else 'NEEDS_REFINEMENT'
    
    return results

Development Workflow

Technical Implementation

NLTK Integration

CheckThat AI uses NLTK’s METEOR implementation:
# Installation (shell)
pip install nltk

# Download the WordNet data required for synonym matching (shell)
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"

from nltk.translate.meteor_score import meteor_score
from nltk.translate.meteor_score import single_meteor_score  # single-reference variant

# Single reference
score = meteor_score(
    [reference.split()],  # List of references (usually 1)
    candidate.split()      # Candidate claim
)

# Multiple references (rare in competition)
scores = [
    meteor_score([ref.split()], candidate.split())
    for ref in references
]
best_score = max(scores)

Performance Optimization

# Batch processing for speed
import multiprocessing as mp

from nltk.translate.meteor_score import meteor_score

def compute_meteor(args):
    reference, candidate = args
    return meteor_score([reference.split()], candidate.split())

# Parallel computation over aligned lists `references` and `candidates`
# (guard with `if __name__ == "__main__":` on platforms that spawn workers)
with mp.Pool(processes=8) as pool:
    scores = pool.map(
        compute_meteor,
        zip(references, candidates)
    )

average_meteor = sum(scores) / len(scores)

References

Academic Literature

  • Original METEOR Paper: Banerjee & Lavie, “METEOR: An Automatic Metric for MT Evaluation” (ACL 2005)
  • METEOR Extensions: Denkowski & Lavie, “Meteor Universal” (2014)
  • CLEF-CheckThat!: Annual task descriptions and evaluation reports
