## What is G-Eval?
G-Eval is a framework for evaluating natural language generation (NLG) outputs using large language models (LLMs) as evaluators. Unlike traditional metrics that rely on simple text matching, G-Eval leverages the reasoning capabilities of advanced AI models to assess quality across nuanced criteria.

### Research Background
- Paper: “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment”
- Authors: Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
- Publication: 2023
- Key Innovation: Using LLMs with chain-of-thought reasoning to evaluate text quality shows significantly higher correlation with human judgments than traditional metrics like BLEU or ROUGE.
## Why G-Eval for Claim Normalization?
Traditional metrics struggle with claim evaluation:

| Metric | Limitation for Claims |
|---|---|
| BLEU | Focuses on n-gram overlap; misses semantic meaning |
| ROUGE | Recall-oriented; doesn’t assess verifiability |
| Exact Match | Too strict; ignores valid paraphrasing |
| METEOR | Better but still primarily surface-level matching |
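To see why surface matching falls short, a tiny unigram-overlap score (a stand-in for BLEU/ROUGE-style matching, not their actual algorithms) rates a faithful paraphrase poorly:

```python
# Two claims with the same meaning but different wording share few
# tokens, so n-gram-overlap metrics score the paraphrase low.

def token_overlap(a: str, b: str) -> float:
    """Crude unigram-overlap score (Jaccard similarity of word sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

reference = "The FDA approved the vaccine for children in October 2021"
paraphrase = "In October 2021 regulators cleared the shot for kids"

token_overlap(reference, reference)   # identical text scores 1.0
token_overlap(reference, paraphrase)  # same meaning, low overlap
```

An LLM-based evaluator like G-Eval recognizes the paraphrase as equivalent, while any overlap-based metric penalizes it.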
G-Eval addresses these limitations:

- Understands semantic meaning
- Evaluates nuanced criteria (verifiability, check-worthiness)
- Provides explanatory feedback
- Adapts to custom evaluation dimensions
## G-Eval Framework Components

The G-Eval framework consists of three core elements:

### 1. Evaluation Criteria
Human-readable descriptions of what to evaluate.

### 2. Evaluation Steps
Detailed instructions guiding the LLM’s assessment.

### 3. Evaluation Parameters
The inputs provided to the evaluator, such as `EXPECTED_OUTPUT` (reference claim) and `RETRIEVAL_CONTEXT` (background info).
## Implementation in CheckThat AI

### DeepEval Integration

CheckThat AI uses the DeepEval library’s G-Eval implementation; see `/home/daytona/workspace/source/api/services/refinement/refine.py:76-108`.
### Static Evaluation Specification

CheckThat AI defines standard criteria in `STATIC_EVAL_SPECS`.
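The real specs live in `/api/types/evals.py`; as a purely hypothetical sketch of the shape such a table might take (keys and criterion texts here are assumptions, mirroring the five criteria detailed below):

```python
# Hypothetical shape of a static evaluation-spec table; the real
# STATIC_EVAL_SPECS in /api/types/evals.py may differ.
STATIC_EVAL_SPECS = {
    "verifiability": "Can the claim be fact-checked without additional context?",
    "centrality": "Does the claim capture the post's main assertion and drop noise?",
    "conciseness": "Is the claim a single sentence of at most 25 words?",
    "check_worthiness": "Is the claim important enough to fact-check?",
    "factual_consistency": "Does the claim avoid adding facts absent from the source?",
}
```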
## Scoring Methodology

### Score Range

G-Eval produces scores from 0.0 to 1.0:

- 0.9-1.0: Excellent quality, ready for fact-checking
- 0.8-0.89: High quality, minor improvements possible
- 0.7-0.79: Good quality, meets threshold
- 0.6-0.69: Acceptable but could be refined
- 0.5-0.59: Below threshold, refinement needed
- 0.0-0.49: Poor quality, significant issues
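These bands can be encoded in a small helper (a sketch, not project code):

```python
def quality_band(score: float) -> str:
    """Map a G-Eval score in [0.0, 1.0] to the quality bands above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("G-Eval scores are in [0.0, 1.0]")
    if score >= 0.9: return "excellent"
    if score >= 0.8: return "high"
    if score >= 0.7: return "good"
    if score >= 0.6: return "acceptable"
    if score >= 0.5: return "below threshold"
    return "poor"

quality_band(0.95)  # "excellent"
quality_band(0.55)  # "below threshold" -> refinement needed
```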
### Threshold Configuration

CheckThat AI uses configurable thresholds.

### Feedback Generation

G-Eval provides explanatory reasoning with each score.

## Evaluation Criteria in Detail
### 1. Verifiability and Self-Containment

Question: Can this claim be fact-checked without additional context?

High Score Example:
“The FDA approved Pfizer’s COVID-19 vaccine for ages 5-11 on October 29, 2021”

✅ Specific date, organization, and subject
✅ Can be verified through FDA records
✅ No external context needed
Low Score Example:
“They approved it last year”

❌ Who is “they”?
❌ What is “it”?
❌ Which “last year”?
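G-Eval judges this with an LLM rather than rules, but a crude lexical heuristic (illustrative only) shows the kinds of signals that separate the two examples:

```python
import re

# Vague pronouns and relative time phrases that make a claim
# non-self-contained (illustrative list, not exhaustive).
VAGUE_TERMS = [r"\bthey\b", r"\bit\b", r"\bthis\b", r"\blast year\b", r"\brecently\b"]

def vagueness_flags(claim: str) -> list[str]:
    """Return the vague-term patterns found in a claim."""
    return [t for t in VAGUE_TERMS if re.search(t, claim.lower())]

vagueness_flags("They approved it last year")
# flags "they", "it", "last year"
vagueness_flags("The FDA approved Pfizer's COVID-19 vaccine on October 29, 2021")
# no flags: specific actor, subject, and date
```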
### 2. Claim Centrality and Extraction Quality

Question: Does this capture the main claim and remove noise?

### 3. Conciseness and Clarity
Question: Is the claim brief and easy to understand?

Target: ≤ 25 words, single sentence

| Claim | Word Count | Clarity |
|---|---|---|
| “The COVID-19 pandemic began in Wuhan, China in late 2019” | 11 | ✅ Clear |
| “It is widely believed by many scientists and researchers around the world that the origins of what we now call COVID-19 can be traced back to the city of Wuhan in China around the end of 2019” | 41 | ❌ Verbose |
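The target above translates into a simple check (a sketch; abbreviations like “U.S.” would fool the naive sentence test):

```python
def is_concise(claim: str, max_words: int = 25) -> bool:
    """True if the claim is a single sentence of at most max_words words."""
    # No internal periods after stripping trailing punctuation -> one sentence.
    single_sentence = claim.rstrip(".!?").count(".") == 0
    return single_sentence and len(claim.split()) <= max_words

is_concise("The COVID-19 pandemic began in Wuhan, China in late 2019")  # True
```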
### 4. Check-Worthiness Alignment

Question: Is this important enough to fact-check?

High Check-Worthiness:

- Health/medical claims
- Political statements
- Claims about public figures
- Safety information
- Financial advice
Low Check-Worthiness:

- Personal opinions
- Obvious satire
- Trivial facts
- Already well-established information
### 5. Factual Consistency

Question: Does the claim faithfully represent the source without hallucination?

Hallucination Example:

Original: “Some studies suggest coffee may reduce cancer risk”

Hallucinated Claim: “Harvard Medical School confirms coffee prevents cancer”

❌ Added false specificity (“Harvard Medical School”, “confirms”, “prevents”)
## Multi-Metric Evaluation

### Custom Evaluation Criteria

You can define domain-specific metrics.

### Combining Multiple Metrics
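One common pattern, shown here in plain Python rather than DeepEval’s API, is a weighted average of per-criterion scores (criterion names and weights are illustrative):

```python
def combined_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion G-Eval scores."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

scores = {"verifiability": 0.9, "conciseness": 0.7, "check_worthiness": 0.8}
weights = {"verifiability": 2.0, "conciseness": 1.0, "check_worthiness": 1.0}
combined_score(scores, weights)  # (0.9*2 + 0.7 + 0.8) / 4 = 0.825
```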
Evaluating claims across several dimensions gives a fuller quality picture than any single score.

## Refinement Integration
### Feedback-Driven Refinement

G-Eval feedback guides claim improvement; see `/api/services/refinement/refine.py:116-133`.
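The actual loop lives in `refine.py`; the sketch below uses hypothetical helpers (`evaluate_claim`, `refine_claim` are assumed names, not the project’s) to show the feedback-driven shape:

```python
THRESHOLD = 0.5
MAX_ITERATIONS = 3

def refine_until_threshold(claim, evaluate_claim, refine_claim):
    """Re-evaluate and rewrite a claim until it meets the G-Eval threshold."""
    for _ in range(MAX_ITERATIONS):
        score, feedback = evaluate_claim(claim)
        if score >= THRESHOLD:
            break
        claim = refine_claim(claim, feedback)  # feedback-driven rewrite
    return claim, score

# Stub evaluators for illustration: the score rises once the vague
# pronoun is replaced by a named actor.
def fake_eval(claim):
    return (0.9, "ok") if "they" not in claim.lower() else (0.3, "name the actor")

def fake_refine(claim, feedback):
    return claim.replace("They", "The FDA")

claim, score = refine_until_threshold("They approved the vaccine",
                                      fake_eval, fake_refine)
```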
### Convergence Detection

Stop refinement when improvements plateau.

## DeepEval Model Support
CheckThat AI supports multiple evaluation models.

### Supported Providers
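One way to organize provider support is a simple model registry; the provider keys and model identifier strings below are illustrative assumptions, not the project’s actual configuration:

```python
# Evaluator model per provider; identifiers are illustrative.
EVALUATOR_MODELS = {
    "openai": "gpt-4o",
    "anthropic": "claude-3-7-sonnet",
    "google": "gemini-2.5-flash",
    "xai": "grok-3",
}

def pick_model(provider: str) -> str:
    """Resolve a provider name to its evaluator model, failing loudly."""
    try:
        return EVALUATOR_MODELS[provider]
    except KeyError:
        raise ValueError(f"Unsupported provider: {provider}") from None
```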
### Model Selection
| Model | Best For | Cost |
|---|---|---|
| GPT-4o | Highest accuracy, nuanced evaluation | $$$ |
| Claude 3.7 Sonnet | Balanced accuracy and speed | $$ |
| Gemini 2.5 Flash | Fast evaluation, lower cost | $ |
| Grok 3 | Alternative perspective | $$ |
## Performance Considerations

### Latency

G-Eval evaluation requires LLM inference:

- Single evaluation: 3-8 seconds
- With refinement (3 iterations): 15-40 seconds total
- Batch processing: Use parallel requests
### Cost Optimization
Cost-Saving Strategies:
- Use smaller models for evaluation: Gemini Flash is 10x cheaper than GPT-4
- Cache evaluations: Store scores for identical inputs
- Selective refinement: Only refine claims below threshold
- Early stopping: Terminate when score improvement < 0.05
- Batch evaluation: Process multiple claims in parallel
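The early-stopping rule above (terminate when improvement < 0.05) can be sketched as:

```python
def should_stop(score_history: list[float], min_delta: float = 0.05) -> bool:
    """Stop refining once the latest improvement falls below min_delta."""
    if len(score_history) < 2:
        return False  # need at least two scores to measure improvement
    return (score_history[-1] - score_history[-2]) < min_delta

should_stop([0.42, 0.61])        # big jump -> keep refining
should_stop([0.42, 0.61, 0.63])  # only +0.02 -> stop
```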
### Reliability

G-Eval is non-deterministic: scores come from an LLM, so repeated evaluations of the same claim can vary slightly. For critical decisions, consider averaging several runs.

## Comparison with METEOR
Both metrics are used in CheckThat AI:

| Aspect | G-Eval | METEOR |
|---|---|---|
| Type | LLM-based semantic evaluation | Statistical n-gram matching |
| Requires Reference | No | Yes |
| Evaluation Depth | Nuanced, multi-dimensional | Surface-level similarity |
| Speed | Slow (LLM inference) | Fast (algorithmic) |
| Cost | API costs | Free |
| Use Case | Development, refinement | Competition scoring |
| Feedback | Explanatory reasoning | Numeric score only |
## Best Practices

### Writing Effective Criteria
Good Criteria:
- Specific and measurable
- Aligned with task goals
- Understandable to LLMs
- Focused on observable properties
Poor Criteria:

- Vague or subjective (“high quality”)
- Conflicting objectives
- Require external knowledge
- Too many dimensions at once
### Debugging Low Scores

If claims consistently score low:

- Review criteria: Are they too strict?
- Check examples: Do high-scoring examples meet your needs?
- Inspect feedback: What specific issues are flagged?
- Adjust threshold: Is 0.5 appropriate for your use case?
- Test different models: Try Claude or Gemini as evaluators
## References

### Academic Papers
- G-Eval Paper: Liu et al., “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment” (2023)
- DeepEval: Confident AI’s evaluation framework documentation
- LLM-as-Judge: “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (Zheng et al., 2023)
### Implementation Files

- Refinement Service: `/api/services/refinement/refine.py`
- Evaluation Specs: `/api/types/evals.py`
- DeepEval Wrapper: `/api/_utils/deepeval_model.py`