Overview
CheckThat AI uses DeepEval’s G-Eval framework to assess the quality of normalized claims. Evaluation metrics provide objective scores (0.0-1.0) and detailed feedback to guide refinement and ensure claims meet fact-checking standards. The evaluation service is implemented in api/services/evaluation/evaluate.py and integrates with the refinement pipeline to continuously improve claim quality.

G-Eval: GPT-Based Evaluation
What is G-Eval?
G-Eval uses large language models (GPT, Claude, Gemini, etc.) as evaluators to score text against custom criteria. Unlike traditional metrics, G-Eval:

- Understands context and nuance
- Provides detailed reasoning for scores
- Adapts to domain-specific evaluation needs
- Scales from 0.0 (poor) to 1.0 (excellent)
How G-Eval Works
Creating G-Eval Metrics
From api/services/evaluation/evaluate.py:24-105
Available Evaluation Metrics
CheckThat AI provides 5 built-in G-Eval metrics:

1. Verifiability Assessment
Evaluates how easily a claim can be verified using reliable sources.

Evaluation Steps
- Check if the claim contains specific, factual assertions
- Assess whether evidence can be found to support or refute the claim
- Consider if the claim is time-sensitive or location-specific
- Determine if the claim requires expert knowledge to verify
Example Evaluation
Claim: “Gargling water can protect against coronavirus”

Score: 0.75

Reasoning: The claim is specific and testable through medical research. It can be verified by checking scientific literature on coronavirus transmission and gargling effectiveness. However, it lacks specificity about the type of coronavirus and gargling method, reducing precision.
2. Check-Worthiness Assessment
Evaluates the importance and urgency of fact-checking the claim.

Evaluation Steps
- Assess potential harm if the claim is false
- Consider the claim’s reach and influence potential
- Evaluate public interest in the claim’s veracity
- Determine if the claim could mislead vulnerable populations
Example Evaluation
Claim: “St. Austin University North Carolina says eating vaginal fluid makes you immune to cancer”

Score: 0.95

Reasoning: Extremely high check-worthiness. The claim could cause significant harm if false (cancer patients avoiding treatment), has high potential reach due to cancer’s prevalence, and could mislead vulnerable populations seeking alternative treatments. Urgent fact-checking is required.
3. Factual Consistency Assessment
Evaluates whether the claim accurately represents facts without distortion.

Evaluation Steps
- Check if the claim introduces new information not in the source
- Verify the claim doesn’t misrepresent the original context
- Ensure the claim maintains factual accuracy
- Confirm the claim doesn’t contain hallucinations
Example Evaluation
Claim: “Pakistani government appoints former army general to head medical regulatory body”

Source: “Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC”

Score: 0.88

Reasoning: High factual consistency. The claim accurately represents the source material without adding unverified information. The minor generalization (“former army general” instead of the specific name) is appropriate for normalization. No hallucinations or distortions detected.
4. Clarity Assessment
Evaluates how clear and understandable the claim is.

Evaluation Steps
- Check if the claim is written in clear, simple language
- Assess if the claim avoids ambiguous terms
- Determine if the claim is self-contained
- Evaluate if the claim is concise yet comprehensive
Example Evaluation
Claim: “Late actor and martial artist Bruce Lee playing table tennis with a set of nunchucks”

Score: 0.82

Reasoning: Good clarity with simple, descriptive language. The claim is self-contained and understandable without context. A minor reduction for the present participle “playing,” which could be clearer as “played” to indicate historical fact. Otherwise concise and comprehensive.
5. Relevance Assessment
Evaluates how relevant the claim is to current events or public discourse.

Evaluation Steps
- Assess if the claim addresses current issues
- Consider the claim’s impact on public opinion
- Evaluate the claim’s newsworthiness
- Determine if the claim affects policy or decision-making
Example Evaluation
Claim: “Drinking water at specific times can have different health benefits”

Score: 0.58

Reasoning: Moderate relevance. The claim addresses ongoing health and wellness discourse but is not tied to breaking news or urgent policy decisions. It has general public interest but low impact on critical decision-making. Somewhat evergreen content.
Metric Configuration
Custom Thresholds
Set minimum acceptable scores for each metric.

Combining Multiple Metrics
From api/services/evaluation/evaluate.py:108-156
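The combination logic from the source is not reproduced here. As a hypothetical sketch (the actual implementation in evaluate.py:108-156 may differ), per-metric scores can be checked against per-metric thresholds and rolled up into an overall verdict:

```python
# Hypothetical aggregation of multiple G-Eval metric scores against
# per-metric thresholds; names and the 0.6 default are illustrative.

def combine_scores(scores: dict[str, float],
                   thresholds: dict[str, float]) -> dict:
    """Return pass/fail per metric plus an overall verdict."""
    results = {
        name: {
            "score": score,
            "passed": score >= thresholds.get(name, 0.6),  # 0.6 default
        }
        for name, score in scores.items()
    }
    return {
        "metrics": results,
        "overall_passed": all(r["passed"] for r in results.values()),
        "average_score": sum(scores.values()) / len(scores),
    }

report = combine_scores(
    scores={"verifiability": 0.75, "check_worthiness": 0.95, "clarity": 0.82},
    thresholds={"verifiability": 0.8, "check_worthiness": 0.7},
)
# verifiability fails its 0.8 threshold, so overall_passed is False and
# the claim would be routed back to refinement
```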
Evaluation Reports
Comprehensive evaluation results are returned in structured reports.

From api/types/completions.py:30-37
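The schema from completions.py is not reproduced here; a report of this kind might be shaped roughly as follows (field names are hypothetical, not the actual types in api/types/completions.py):

```python
# Hypothetical shape of an evaluation report; the real schema lives in
# api/types/completions.py:30-37 and its field names may differ.
from dataclasses import dataclass, field

@dataclass
class MetricResult:
    name: str        # e.g. "Verifiability"
    score: float     # 0.0-1.0
    reason: str      # evaluator's detailed reasoning
    passed: bool     # score >= threshold

@dataclass
class EvaluationReport:
    claim: str
    results: list[MetricResult] = field(default_factory=list)

    @property
    def average_score(self) -> float:
        return sum(r.score for r in self.results) / len(self.results)
```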
Example Evaluation Report
Score Interpretation
Score Ranges
0.9 - 1.0: Excellent
Claim meets or exceeds all quality standards. Ready for professional fact-checking without modification.
0.7 - 0.89: Good
Claim meets most quality standards. Minor improvements may be beneficial but not required.
0.5 - 0.69: Acceptable
Claim meets minimum standards but has room for improvement. Consider refinement for critical applications.
0.0 - 0.49: Needs Improvement
Claim does not meet quality standards. Refinement strongly recommended before fact-checking.
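The bands above map directly to a small helper (illustrative only):

```python
# The score bands above, encoded as a helper function.
def interpret_score(score: float) -> str:
    if score >= 0.9:
        return "Excellent"         # ready for fact-checking as-is
    if score >= 0.7:
        return "Good"              # minor improvements optional
    if score >= 0.5:
        return "Acceptable"        # consider refinement
    return "Needs Improvement"     # refinement strongly recommended
```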
Recommended Thresholds by Use Case
Recommended thresholds are provided per use case: High-Stakes Fact-Checking, General Fact-Checking, and Claim Discovery.

High-Stakes Fact-Checking

Threshold: 0.8+

Use for:
- Health misinformation
- Political claims
- Financial advice
- Legal statements
Using Evaluation Metrics
API Request with Evaluation
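A request enabling evaluation might carry a payload along these lines (the endpoint path, model name, and field names below are illustrative assumptions, not the documented CheckThat AI API):

```python
# Hypothetical request payload for claim normalization with evaluation
# enabled; endpoint and field names are illustrative only.
import json

payload = {
    "text": "gargling warm salt water kills the virus!!!",
    "model": "gpt-4o",
    "evaluation": {
        "metrics": ["verifiability", "check_worthiness", "clarity"],
        "threshold": 0.6,
    },
}

body = json.dumps(payload)
# The HTTP call itself might look like (not executed here):
# import requests
# response = requests.post("https://api.example.com/v1/normalize",
#                          json=payload, timeout=30)
```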
Combined Evaluation + Refinement
Evaluation metrics automatically guide the refinement process.

Model Selection for Evaluation
Different models excel at different evaluation tasks:

GPT-4o
Best for: Balanced performance across all metrics

Excellent reasoning, fast inference, cost-effective
Claude Opus 4.1
Best for: Nuanced evaluation, contextual understanding

Superior at detecting subtle issues, highest quality
Gemini 2.5 Pro
Best for: Factual consistency, verifiability

Strong fact-checking capabilities, good reasoning
Grok 4
Best for: Real-time claims, current events

Access to up-to-date information, strong relevance assessment
Best Practices
Choose Appropriate Metrics
Select 2-4 metrics most relevant to your use case. Running more metrics is slower and costlier, but yields more comprehensive feedback.
Set Realistic Thresholds
Start with 0.6 and adjust based on your quality requirements. Thresholds set too high produce many false negatives, flagging acceptable claims for unnecessary refinement.
Monitor Score Distributions
Track metric scores over time to identify systemic issues in claim normalization.
Next Steps
Refinement Pipeline
Learn how evaluation metrics drive iterative claim improvement
Supported Models
Choose the best model for your evaluation needs