Overview
CheckThat AI implements a comprehensive fact-checking pipeline that transforms noisy social media posts into normalized claims ready for verification. The pipeline integrates multiple stages of processing, evaluation, and refinement.

Pipeline Architecture
The complete fact-checking workflow consists of six interconnected stages.

Stage 1: Input Processing
Noisy Social Media Posts
The pipeline begins with unstructured text from social media platforms.

Typical Input Characteristics:
- Informal language and slang
- Grammatical errors and typos
- Hashtags, URLs, and mentions
- Emojis and special characters
- Incomplete sentences
- Emotional or inflammatory language
- Ambiguous references (“they”, “this”, “it”)
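To make these characteristics concrete, here is a minimal cleanup sketch. The regexes are illustrative assumptions, not CheckThat AI's actual preprocessing:

```python
import re

def rough_clean(post: str) -> str:
    """Illustrative cleanup of a noisy social media post.

    Strips URLs and @mentions and keeps hashtag words; a sketch only,
    not the project's real input handling.
    """
    text = re.sub(r"https?://\S+", "", post)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop mentions
    text = re.sub(r"#(\w+)", r"\1", text)      # keep the word, drop '#'
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```

Emojis and slang would need further handling; in practice much of this noise is left for the LLM stages to absorb.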
Stage 2: Claim Detection
The system identifies verifiable assertions within the noisy text. See Claim Detection for detailed methodology.

Detection Process
- Sentence Segmentation: Split post into individual sentences
- Context Analysis: Establish relationships between sentences
- Verifiability Assessment: Identify which parts can be fact-checked
- Claim Extraction: Pull out factual assertions
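The steps above can be sketched heuristically. The real system is LLM-based; the naive sentence split and opinion-keyword filter here are purely illustrative:

```python
import re

# Words that usually signal opinion rather than a checkable fact (illustrative).
OPINION_MARKERS = {"think", "feel", "believe", "best", "worst", "should"}

def detect_claims(post: str) -> list[str]:
    # 1. Sentence segmentation (naive split; production would use a tokenizer)
    sentences = [s.strip() for s in re.split(r"[.!?]+", post) if s.strip()]
    claims = []
    for sent in sentences:
        words = {w.lower().strip(",") for w in sent.split()}
        # 2-3. Verifiability assessment: skip opinion-like sentences
        if words & OPINION_MARKERS:
            continue
        # 4. Claim extraction: keep the factual-looking assertion
        claims.append(sent)
    return claims
```

Context analysis (step 2) is where an LLM earns its keep, since cross-sentence references cannot be resolved by keyword rules.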
Stage 3: Claim Normalization
Normalization Goals
Transform detected claims into standardized form.

Normalization Objectives:
- Self-contained: Understandable without context
- Concise: Typically under 25 words
- Clear: Unambiguous language
- Verifiable: Can be fact-checked
- Faithful: Preserves original meaning
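The structural objectives above can be spot-checked mechanically; a hedged sketch (the pronoun list is an assumption, and faithfulness or verifiability would need an LLM or human judge):

```python
AMBIGUOUS_REFS = {"they", "this", "it", "he", "she", "that"}

def check_objectives(claim: str) -> dict[str, bool]:
    """Cheap structural checks for a normalized claim (illustrative only)."""
    words = claim.lower().replace(",", "").split()
    return {
        "concise": len(words) <= 25,                       # under ~25 words
        "self_contained": not (set(words) & AMBIGUOUS_REFS),  # no dangling refs
    }
```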
Multi-Model Approach
CheckThat AI supports multiple AI models for normalization:

| Provider | Models | Use Case |
|---|---|---|
| OpenAI | GPT-4o, GPT-4.1 | High-quality reasoning and context understanding |
| Anthropic | Claude 3.7 Sonnet | Strong performance on ambiguity resolution |
| Google | Gemini 2.5 Pro, Flash | Fast processing with good accuracy |
| Meta | Llama 3.3 70B | Open-source alternative, free tier available |
| xAI | Grok 3 | Alternative perspective on claim interpretation |
Prompting Strategies
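The production prompts live in /api/_utils/prompts.py; the templates below are illustrative sketches of the three strategies described in this section, not the actual prompt wording:

```python
ZERO_SHOT = (
    "Rewrite the following social media post as one self-contained, "
    "verifiable claim of at most 25 words:\n{post}"
)

FEW_SHOT = (
    "Rewrite posts as normalized claims.\n"
    "Post: ugh THEY approved it yesterday!!\n"
    "Claim: The FDA approved the drug on <date>.\n"
    # ...the production prompt uses 5 such examples...
    "Post: {post}\nClaim:"
)

CHAIN_OF_THOUGHT = (
    "Think step by step: (1) resolve ambiguous references, "
    "(2) drop opinion and emotion, (3) compress to under 25 words. "
    "Then output only the final claim.\nPost: {post}"
)
```

The few-shot example pair above is invented for illustration.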
Different approaches for different scenarios.

Zero-Shot
Direct instruction without examples.

Few-Shot
Provide examples to guide the model (5 examples in production).

Chain-of-Thought (CoT)
Step-by-step reasoning process.

Stage 4: Quality Evaluation
Automated Assessment
The system evaluates normalized claims using multiple metrics.

G-Eval Framework
LLM-based evaluation with specific criteria (see G-Eval); implemented in /api/services/refinement/refine.py:76-108.
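A hedged sketch of G-Eval-style scoring: a judge (in production, an LLM call) scores the claim per criterion and the results are averaged. The criterion names, 1-5 scale, and equal weighting are assumptions, not the implementation in refine.py:

```python
from typing import Callable

# A judge scores (criterion, claim) on a 1-5 scale; in production this
# would wrap an LLM call. Criterion names here are assumed.
Judge = Callable[[str, str], float]

CRITERIA = ["clarity", "self_containment", "faithfulness"]

def g_eval(claim: str, judge: Judge) -> float:
    """Average per-criterion judge scores, normalized to [0, 1]."""
    scores = [judge(criterion, claim) for criterion in CRITERIA]
    return sum(scores) / (len(scores) * 5.0)
```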
METEOR Score
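A simplified METEOR-style scorer, showing the unigram F-mean at the metric's core. The real metric (and the project's evaluation service) adds stemming, synonym matching, and a fragmentation penalty; this sketch omits all three:

```python
def meteor_like(candidate: str, reference: str, alpha: float = 0.9) -> float:
    """Unigram precision/recall F-mean weighted toward recall, as in METEOR.
    Simplified: exact word matches only, duplicates ignored."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = len(set(cand) & set(ref))
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```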
Computed when reference claims are available (competition datasets).

Stage 5: Refinement Loop
Iterative Improvement
If the claim doesn’t meet quality thresholds, the system enters a refinement loop.

Refinement Process (RefinementService in refine.py:46-185):
- Evaluate: Score the current claim
- Generate Feedback: Identify specific weaknesses
- Refine: Create improved version
- Re-evaluate: Check if quality improved
- Iterate: Repeat up to max_iters times (default: 3)
- Terminate: When threshold met or max iterations reached
Refinement Service Implementation
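The code excerpt that belongs here is not reproduced. A hedged sketch of the loop described above, with the evaluate/feedback/rewrite callables and the 0.8 threshold as assumptions rather than the actual RefinementService API:

```python
from typing import Callable

def refine_loop(
    claim: str,
    evaluate: Callable[[str], float],     # e.g. a G-Eval score in [0, 1]
    feedback: Callable[[str], str],       # names specific weaknesses
    rewrite: Callable[[str, str], str],   # (claim, feedback) -> new claim
    threshold: float = 0.8,               # assumed default
    max_iters: int = 3,                   # default per the docs
) -> tuple[str, float]:
    best, best_score = claim, evaluate(claim)
    for _ in range(max_iters):
        if best_score >= threshold:       # terminate: threshold met
            break
        candidate = rewrite(best, feedback(best))
        score = evaluate(candidate)
        if score > best_score:            # keep only improvements
            best, best_score = candidate, score
    return best, best_score
```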
Self-Refine vs Cross-Refine
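Operationally, the two modes below differ only in where the critique comes from; a minimal sketch with assumed function interfaces:

```python
from typing import Callable

def refine_once(rewrite: Callable[[str, str], str],
                critique: Callable[[str], str],
                claim: str) -> str:
    """Self-Refine when rewrite and critique come from the same model,
    Cross-Refine when they come from different ones (interfaces assumed)."""
    return rewrite(claim, critique(claim))
```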
Self-Refine
Same model refines its own output.

Cross-Refine
Different model refines another model’s output.

Stage 6: Source Verification
Fact-Checking Integration
Once a high-quality normalized claim is produced, it’s ready for fact-checking.

Verification Process (external to CheckThat AI):
- Source Discovery: Find relevant sources (news, scientific papers, databases)
- Evidence Extraction: Pull supporting or refuting evidence
- Credibility Assessment: Evaluate source reliability
- Verdict Determination: Conclude true/false/partially true/unverifiable
- Explanation Generation: Document reasoning and sources
Check-Worthiness Filtering
Before verification, assess whether the claim warrants fact-checking.

Typically check-worthy:
- Health/medical advice
- Election information
- Safety warnings
- Financial claims
- Policy announcements

Typically filtered out:
- Personal anecdotes
- Clearly satirical content
- Subjective preferences
- Already debunked claims
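The filtering described in this section can be sketched as a keyword gate. Both keyword sets are invented for illustration; production systems would use a trained check-worthiness classifier:

```python
CHECK_WORTHY_TOPICS = {"vaccine", "election", "recall", "tax", "policy"}
SKIP_MARKERS = {"imo", "satire", "my story", "i prefer"}

def is_check_worthy(claim: str) -> bool:
    """Keep claims touching high-stakes topics; drop opinion/satire signals."""
    text = claim.lower()
    if any(marker in text for marker in SKIP_MARKERS):
        return False
    return any(topic in text for topic in CHECK_WORTHY_TOPICS)
```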
Complete Pipeline Example
End-to-End Processing
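The worked example that belongs here appears to have been elided in this copy. A hedged end-to-end sketch with an invented input; the stage logic is stubbed and would call the corresponding services in production:

```python
def run_pipeline(post: str) -> dict:
    """Stub pipeline mirroring Stages 1-5 (illustrative only)."""
    # Stages 1-2: keep the leading assertion (stub detection logic)
    claim = post.split("!")[0].strip()
    # Stage 3: normalization resolves ambiguous references; hardcoded here,
    # an LLM call in production
    normalized = claim.replace("They", "Health regulators")
    # Stage 4: quality score would come from G-Eval; invented constant here
    score = 0.85
    # Stage 5 is skipped when the score clears the threshold;
    # Stage 6 (verification) happens outside CheckThat AI
    return {"claim": normalized, "score": score,
            "needs_refinement": score < 0.8}
```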
Input:

Integration with CheckThat AI
API Usage
Single Claim Processing
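The snippet that belongs here is not reproduced. Assuming a REST endpoint (the URL, path, and field names below are guesses, not documented API), a single-claim call might look like:

```python
import json
import urllib.request

API_PATH = "/api/normalize"   # hypothetical; check the actual route definitions

def build_request(post: str, model: str = "gpt-4o",
                  base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Assemble the POST request (field names are assumptions)."""
    payload = json.dumps({"text": post, "model": model}).encode()
    return urllib.request.Request(base_url + API_PATH, data=payload,
                                  headers={"Content-Type": "application/json"})

def normalize_claim(post: str, **kw) -> dict:
    """Send the request and parse the JSON response."""
    with urllib.request.urlopen(build_request(post, **kw)) as resp:
        return json.load(resp)
```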
Batch Processing
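A client-side batch-processing sketch (again hypothetical, not the project's API). Since each normalization is I/O-bound, a thread pool is the natural parallelism choice:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def normalize_batch(posts: list[str],
                    normalize: Callable[[str], dict],
                    max_workers: int = 8) -> list[dict]:
    """Apply a single-claim normalizer to many posts in parallel.
    'normalize' is any per-post function, e.g. one HTTP call to the service."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(normalize, posts))   # preserves input order
```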
Web Application
The CheckThat AI web interface provides:
- Interactive Chat: Real-time claim normalization with streaming
- Batch Evaluation: Upload CSV datasets for bulk processing
- Model Comparison: Test multiple models simultaneously
- Refinement Tracking: Visualize improvement across iterations
- METEOR Scoring: Automatic quality assessment
Performance Considerations
Latency
| Stage | Typical Latency | Notes |
|---|---|---|
| Detection | ~500ms | Depends on post length |
| Normalization | 2-5s | Model-dependent |
| Evaluation | 3-8s | G-Eval requires LLM call |
| Refinement (per iteration) | 5-13s | Normalization + evaluation |
| Total | 10-30s | With 0-2 refinement iterations |
Cost Optimization
Cost-Saving Strategies:
- Free Tier Models: Use Llama 3.3 70B via Together.ai (no API key required)
- Selective Refinement: Only refine claims below threshold
- Early Stopping: Terminate refinement when score plateaus
- Batch Processing: Process multiple claims in parallel
- Caching: Store normalized claims to avoid reprocessing
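The caching strategy can be as simple as memoizing on post text and model. A sketch; a production system would more likely persist the cache rather than hold it in memory:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_normalize(post: str, model: str, normalize) -> str:
    """Memoize normalized claims so identical posts are never reprocessed.
    'normalize' is any (post, model) -> claim function (assumed interface)."""
    key = hashlib.sha256(f"{model}:{post}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = normalize(post, model)
    return _cache[key]
```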
Evaluation Metrics
System Performance
The complete pipeline is evaluated using:
- METEOR Score: Measures semantic similarity to reference claims (competition metric)
- G-Eval Score: Assesses quality across multiple dimensions
- Human Evaluation: Expert judgment on claim quality
- Convergence Rate: Percentage of claims meeting threshold
- Iteration Efficiency: Average refinement iterations required
References
Implementation Files
- Refinement Service: /api/services/refinement/refine.py
- Evaluation Service: /api/services/evaluation/evaluate.py
- Prompts: /api/_utils/prompts.py
- DeepEval Integration: /api/_utils/deepeval_model.py
Related Documentation
- CLEF-CheckThat! Lab Overview
- Claim Detection Methodology
- G-Eval Framework
- METEOR Scoring
- DeepEval Integration
Academic References
- CLEF-CheckThat! Lab proceedings and datasets
- G-Eval: “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment” (Liu et al., 2023)
- METEOR: “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments” (Banerjee & Lavie, 2005)