
Overview

CheckThat AI uses DeepEval’s G-Eval framework to assess the quality of normalized claims. Evaluation metrics provide objective scores (0.0-1.0) and detailed feedback to guide refinement and ensure claims meet fact-checking standards.
The evaluation service is implemented in api/services/evaluation/evaluate.py and integrates with the refinement pipeline to continuously improve claim quality.

G-Eval: GPT-Based Evaluation

What is G-Eval?

G-Eval uses large language models (GPT, Claude, Gemini, etc.) as evaluators to score text against custom criteria. Unlike traditional metrics, G-Eval:
  • Understands context and nuance
  • Provides detailed reasoning for scores
  • Adapts to domain-specific evaluation needs
  • Scales from 0.0 (poor) to 1.0 (excellent)

How G-Eval Works

G-Eval prompts an evaluator LLM with your criteria and a list of evaluation steps, has it judge the text step by step, and converts that judgment into a normalized score between 0.0 and 1.0.

Creating G-Eval Metrics

From api/services/evaluation/evaluate.py:24-105
from typing import Any, Dict, List

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def create_evaluation_metrics(
    metric_types: List[str], 
    deepeval_model: Any
) -> Dict[str, GEval]:
    """
    Create G-Eval metrics based on requested evaluation types.
    """
    metric_definitions = {
        "verifiability": {
            "name": "Verifiability Assessment",
            "criteria": "Evaluate how easily this claim can be verified",
            "evaluation_steps": [
                "Check if the claim contains specific, factual assertions",
                "Assess whether evidence can be found to support or refute",
                "Consider if the claim is time-sensitive or location-specific",
                "Determine if the claim requires expert knowledge"
            ]
        },
        # ... more metrics
    }
    
    metrics = {}
    for metric_type in metric_types:
        metrics[metric_type] = GEval(
            name=metric_definitions[metric_type]["name"],
            criteria=metric_definitions[metric_type]["criteria"],
            evaluation_steps=metric_definitions[metric_type]["evaluation_steps"],
            evaluation_params=[LLMTestCaseParams.INPUT, 
                              LLMTestCaseParams.ACTUAL_OUTPUT],
            model=deepeval_model,
            threshold=0.5
        )
    return metrics

Available Evaluation Metrics

CheckThat AI provides 5 built-in G-Eval metrics:

1. Verifiability Assessment

Evaluates how easily a claim can be verified using reliable sources.
  • Check if the claim contains specific, factual assertions
  • Assess whether evidence can be found to support or refute the claim
  • Consider if the claim is time-sensitive or location-specific
  • Determine if the claim requires expert knowledge to verify
Claim: “Gargling water can protect against coronavirus”
Score: 0.75
Reasoning: The claim is specific and testable through medical research. It can be verified by checking scientific literature on coronavirus transmission and gargling effectiveness. However, it lacks specificity about the type of coronavirus and gargling method, reducing precision.

2. Check-Worthiness Assessment

Evaluates the importance and urgency of fact-checking the claim.
  • Assess potential harm if the claim is false
  • Consider the claim’s reach and influence potential
  • Evaluate public interest in the claim’s veracity
  • Determine if the claim could mislead vulnerable populations
Claim: “St.Austin University North Carolina says eating vaginal fluid makes you immune to cancer”
Score: 0.95
Reasoning: Extremely high check-worthiness. The claim could cause significant harm if false (cancer patients avoiding treatment), has high potential reach due to cancer’s prevalence, and could mislead vulnerable populations seeking alternative treatments. Urgent fact-checking required.

3. Factual Consistency Assessment

Evaluates if the claim accurately represents facts without distortion.
  • Check if the claim introduces new information not in the source
  • Verify the claim doesn’t misrepresent the original context
  • Ensure the claim maintains factual accuracy
  • Confirm the claim doesn’t contain hallucinations
Claim: “Pakistani government appoints former army general to head medical regulatory body”
Source: “Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC”
Score: 0.88
Reasoning: High factual consistency. The claim accurately represents the source material without adding unverified information. Minor generalization (“former army general” vs. specific name) is appropriate for normalization. No hallucinations or distortions detected.

4. Clarity Assessment

Evaluates how clear and understandable the claim is.
  • Check if the claim is written in clear, simple language
  • Assess if the claim avoids ambiguous terms
  • Determine if the claim is self-contained
  • Evaluate if the claim is concise yet comprehensive
Claim: “Late actor and martial artist Bruce Lee playing table tennis with a set of nunchucks”
Score: 0.82
Reasoning: Good clarity with simple, descriptive language. The claim is self-contained and understandable without context. Minor reduction for present participle “playing”, which could be clearer as “played” to indicate historical fact. Otherwise concise and comprehensive.

5. Relevance Assessment

Evaluates how relevant the claim is to current events or public discourse.
  • Assess if the claim addresses current issues
  • Consider the claim’s impact on public opinion
  • Evaluate the claim’s newsworthiness
  • Determine if the claim affects policy or decision-making
Claim: “Drinking water at specific times can have different health benefits”
Score: 0.58
Reasoning: Moderate relevance. The claim addresses ongoing health and wellness discourse but is not tied to breaking news or urgent policy decisions. Has general public interest but low impact on critical decision-making. Somewhat evergreen content.

Metric Configuration

Custom Thresholds

Set minimum acceptable scores for each metric:
eval_metric = GEval(
    name="Verifiability Assessment",
    criteria="Evaluate verifiability...",
    evaluation_steps=[...],
    evaluation_params=[LLMTestCaseParams.INPUT,
                       LLMTestCaseParams.ACTUAL_OUTPUT],
    model=deepeval_model,
    threshold=0.7  # Require 0.7+ score to pass
)

# After evaluation
test_case = LLMTestCase(input=query, actual_output=claim)
eval_metric.measure(test_case)

if eval_metric.score >= eval_metric.threshold:
    print("Claim passed!")
else:
    print(f"Claim failed: {eval_metric.score} < {eval_metric.threshold}")

Combining Multiple Metrics

From api/services/evaluation/evaluate.py:108-156
def evaluate_text_with_metrics(
    text: str,
    metrics: Dict[str, GEval]
) -> Dict[str, Dict[str, Any]]:
    """
    Evaluate text using provided metrics.
    """
    results = {}
    
    for metric_name, metric in metrics.items():
        try:
            # Create test case
            test_case = LLMTestCase(
                input=f"Evaluate this text: {text}",
                actual_output=text,
            )
            
            # Measure
            metric.measure(test_case)
            
            # Store results
            results[metric_name] = {
                "score": metric.score,
                "reasoning": getattr(metric, 'reasoning', ''),
                "threshold": metric.threshold,
                "passed": metric.score >= metric.threshold
            }
        except Exception as e:
            results[metric_name] = {
                "score": 0.0,
                "error": str(e),
                "passed": False
            }
    
    return results

Evaluation Reports

Comprehensive evaluation results are returned in structured reports:
From api/types/completions.py:30-37
from typing import Any, Dict, List, Optional

from pydantic import BaseModel

class EvaluationReport(BaseModel):
    """Evaluation report for post-normalization quality audits."""
    metrics_used: List[str]
    scores: Dict[str, float]  # metric_name -> score (0.0-1.0)
    detailed_results: Dict[str, Dict[str, Any]]
    timestamp: str  # ISO format
    report_url: Optional[str]  # Cloud storage URL if saved
    model_info: Optional[Dict[str, Any]]

Example Evaluation Report

{
  "evaluation_report": {
    "metrics_used": [
      "verifiability",
      "check_worthiness",
      "factual_consistency",
      "clarity"
    ],
    "scores": {
      "verifiability": 0.85,
      "check_worthiness": 0.92,
      "factual_consistency": 0.88,
      "clarity": 0.79
    },
    "detailed_results": {
      "verifiability": {
        "score": 0.85,
        "reasoning": "Claim contains specific factual assertions that can be verified through medical research databases. Time-sensitivity is clear (coronavirus pandemic context).",
        "threshold": 0.5,
        "passed": true
      },
      "check_worthiness": {
        "score": 0.92,
        "reasoning": "High potential for harm if false - health misinformation during pandemic. Wide reach expected due to public health concern. Vulnerable populations at risk.",
        "threshold": 0.5,
        "passed": true
      }
    },
    "timestamp": "2025-03-04T14:32:18.123456",
    "model_info": {
      "model_name": "gpt-4o",
      "evaluation_model": "gpt-4o"
    }
  }
}
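Given a report like the one above, downstream code can flag the metrics that fell below their thresholds. A minimal sketch in pure Python (the `failed_metrics` helper is illustrative, not part of the CheckThat AI API; the dict mirrors the JSON structure shown):

```python
from typing import Any, Dict, List

def failed_metrics(report: Dict[str, Any]) -> List[str]:
    """Return names of metrics whose detailed result did not pass."""
    return [
        name
        for name, result in report["detailed_results"].items()
        if not result.get("passed", result["score"] >= result.get("threshold", 0.5))
    ]

report = {
    "detailed_results": {
        "verifiability": {"score": 0.85, "threshold": 0.5, "passed": True},
        "clarity": {"score": 0.42, "threshold": 0.5, "passed": False},
    }
}
print(failed_metrics(report))  # → ['clarity']
```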

Score Interpretation

Score Ranges

0.9 - 1.0: Excellent

Claim meets or exceeds all quality standards. Ready for professional fact-checking without modification.

0.7 - 0.89: Good

Claim meets most quality standards. Minor improvements may be beneficial but not required.

0.5 - 0.69: Acceptable

Claim meets minimum standards but has room for improvement. Consider refinement for critical applications.

0.0 - 0.49: Needs Improvement

Claim does not meet quality standards. Refinement strongly recommended before fact-checking.
Recommended threshold for high-stakes content: 0.8+

Use for:
  • Health misinformation
  • Political claims
  • Financial advice
  • Legal statements

This setting requires the highest-quality claims, with excellent verifiability and minimal ambiguity.
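The score bands above can be mapped directly in code. A small illustrative helper (the function name is my own, not part of the CheckThat AI API):

```python
def interpret_score(score: float) -> str:
    """Map a 0.0-1.0 evaluation score to the quality bands described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Acceptable"
    return "Needs Improvement"

print(interpret_score(0.92))  # Excellent
print(interpret_score(0.58))  # Acceptable
```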

Using Evaluation Metrics

API Request with Evaluation

import openai

client = openai.OpenAI(
    base_url="https://api.checkthat.ai/v1",
    api_key="your-checkthat-api-key"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Some health post..."}
    ],
    extra_body={
        "evaluate_claims": True,
        "evaluation_metrics": [
            "verifiability",
            "check_worthiness",
            "factual_consistency"
        ],
        "evaluation_model": "gpt-4o"
    }
)

# Access evaluation results
eval_report = response.evaluation_report
print(f"Verifiability: {eval_report.scores['verifiability']}")
print(f"Check-Worthiness: {eval_report.scores['check_worthiness']}")

Combined Evaluation + Refinement

Evaluation metrics automatically guide the refinement process:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Health claim..."}],
    extra_body={
        # Enable both evaluation and refinement
        "evaluate_claims": True,
        "refine_claims": True,
        "refine_threshold": 0.7,  # Target score
        "refine_max_iters": 3,
        # Metrics guide refinement
        "evaluation_metrics": ["verifiability", "clarity"]
    }
)

# View refinement history with scores
for entry in response.refinement_metadata.refinement_history:
    print(f"{entry.claim_type}: {entry.score} - {entry.claim}")

Model Selection for Evaluation

Different models excel at different evaluation tasks:

GPT-4o

Best for: Balanced performance across all metrics
Excellent reasoning, fast inference, cost-effective

Claude Opus 4.1

Best for: Nuanced evaluation, contextual understanding
Superior at detecting subtle issues, highest quality

Gemini 2.5 Pro

Best for: Factual consistency, verifiability
Strong fact-checking capabilities, good reasoning

Grok 4

Best for: Real-time claims, current events
Access to up-to-date information, strong relevance assessment
For production applications, use GPT-4o or Claude Sonnet 4 for the best balance of quality, speed, and cost. Reserve premium models like Claude Opus for high-stakes evaluations.

Best Practices

1. Choose Appropriate Metrics

Select 2-4 metrics most relevant to your use case. More metrics = slower/costlier but more comprehensive.

2. Set Realistic Thresholds

Start with 0.6 and adjust based on your quality requirements. Too high = many false negatives.

3. Combine with Refinement

Use evaluation metrics to guide automatic refinement for best results.

4. Monitor Score Distributions

Track metric scores over time to identify systemic issues in claim normalization.

5. Validate with Human Review

Periodically compare metric scores with human judgments to ensure alignment.
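For monitoring score distributions over time, a lightweight sketch that averages per-metric scores across a batch of evaluation reports (the helper and variable names are illustrative, not part of CheckThat AI; each input dict is a report's `scores` field):

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def score_distribution(score_batches: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric's score across many evaluation reports."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for scores in score_batches:
        for metric, score in scores.items():
            buckets[metric].append(score)
    return {metric: round(mean(vals), 3) for metric, vals in buckets.items()}

batch = [
    {"verifiability": 0.85, "clarity": 0.79},
    {"verifiability": 0.65, "clarity": 0.81},
]
print(score_distribution(batch))  # {'verifiability': 0.75, 'clarity': 0.8}
```

A persistent version of this could feed dashboards or alerts when a metric's rolling average drops, surfacing systemic normalization issues early.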

Next Steps

Refinement Pipeline

Learn how evaluation metrics drive iterative claim improvement

Supported Models

Choose the best model for your evaluation needs
