Self-Refine is CheckThat AI’s iterative improvement algorithm that automatically refines normalized claims using G-Eval feedback. The system evaluates each claim, provides constructive feedback, and generates improved versions until quality thresholds are met.

How Self-Refine Works

The algorithm implements a feedback loop between claim generation and evaluation:
  1. Initial Claim Generation: generate the first normalized claim using your chosen prompting strategy.
  2. G-Eval Assessment: evaluate the claim against quality criteria (verifiability, check-worthiness, etc.).
  3. Feedback Generation: if the score is below the threshold, generate specific improvement suggestions.
  4. Claim Refinement: apply the feedback to create an improved version of the claim.
  5. Repeat or Complete: continue until the threshold is met or the maximum number of iterations is reached.
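The steps above can be sketched as a plain Python loop. This is a minimal illustration, not the library's implementation: `evaluate_claim` and `refine_claim` are hypothetical stand-ins for the G-Eval assessment and the LLM refinement call, with a toy length-based scorer so the loop is runnable.

```python
def evaluate_claim(claim: str) -> tuple[float, str]:
    """Stand-in scorer: rewards longer, more specific claims (toy heuristic)."""
    score = min(1.0, len(claim) / 50)
    feedback = "Add missing context." if score < 1.0 else "Looks good."
    return score, feedback

def refine_claim(claim: str, feedback: str) -> str:
    """Stand-in refiner: pretends to apply the feedback to the claim."""
    return claim + " (with added context)"

def self_refine(claim: str, threshold: float = 0.7, max_iters: int = 3) -> list[dict]:
    history = []
    score, feedback = evaluate_claim(claim)              # step 2: initial assessment
    history.append({"claim": claim, "score": score, "feedback": feedback})
    for _ in range(max_iters):
        if score >= threshold:                           # step 5: early stopping
            break
        claim = refine_claim(claim, feedback)            # step 4: apply feedback
        score, feedback = evaluate_claim(claim)          # step 2: re-assess
        history.append({"claim": claim, "score": score, "feedback": feedback})
    return history

history = self_refine("Gargling eliminates coronavirus")
```

The real service records the same kind of per-iteration history (claim, score, feedback), as shown in the implementation details below.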

Implementation Details

The Self-Refine algorithm is implemented in refine.py:46-185. Here’s how it works:

RefinementService Class

class RefinementService:
    def __init__(
        self, 
        model: Union[GPTModel, GeminiModel, AnthropicModel, GrokModel], 
        threshold: float = 0.5,
        max_iters: int = 3,
        metrics: Optional[List[str]] = None,
    ):
        self.model = model  # DeepEval model for G-Eval
        self.threshold = threshold  # Minimum acceptable score
        self.max_iters = max_iters  # Maximum refinement iterations
        self.metrics = metrics  # Custom evaluation metrics

Core Algorithm

The refine_single_claim() method (lines 63-185) implements the feedback loop:
  1. Initial Evaluation:
    eval_metric = GEval(
        name="Claim Quality Assessment",
        criteria=STATIC_EVAL_SPECS.criteria,
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        model=self.model,
        threshold=self.threshold
    )
    
    test_case = LLMTestCase(
        input=original_query,
        actual_output=current_claim,
    )
    
    eval_result = evaluate([test_case], [eval_metric])
    
  2. Score Tracking:
    refinement_history.append(RefinementHistory(
        claim_type=ClaimType.ORIGINAL,
        claim=current_claim,
        score=original_score,
        feedback=original_feedback
    ))
    
  3. Iterative Refinement:
    for i in range(self.max_iters):
        refine_user_prompt = f"""
        ## Original Query
        {original_query}
        
        ## Current Response  
        {current_claim}
        
        ## Feedback
        {eval_result.test_results[0].metrics_data[0].reason}
        
        ## Task
        Refine the current response based on the feedback...
        """
        
        refined_response = client.generate_response(
            user_prompt=refine_user_prompt, 
            sys_prompt=self.refine_sys_prompt
        )
    
  4. Early Stopping:
    if score >= self.threshold:
        break
    

API Usage

Enable Self-Refine by setting refine_claims=True in your API request:
from checkthat import CheckThat

client = CheckThat(api_key="your-api-key")

response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[
        {
            "role": "user",
            "content": "Eating vaginal fluids makes you immune to cancer. Scientists at St. Austin University investigated..."
        }
    ],
    # Self-Refine configuration
    refine_claims=True,
    refine_model="gpt-5-2025-08-07",  # Model for evaluation
    refine_threshold=0.7,  # Target quality score (0.0-1.0)
    refine_max_iters=3  # Maximum refinement iterations
)

# Access refinement metadata
print(f"Final claim: {response.choices[0].message.content}")
print(f"Refinement history: {response.refinement_metadata.refinement_history}")

Configuration Options

refine_claims

boolean
default: false
Enable or disable the Self-Refine algorithm.
refine_claims=True  # Enable refinement

refine_model

string
required
Model to use for G-Eval evaluation. Must be specified when refine_claims=True.
Supported models (from _types.py):
  • gpt-5-2025-08-07 (GPT-5)
  • gpt-5-nano-2025-08-07 (GPT-5 nano)
  • o3-2025-04-16 (o3)
  • o4-mini-2025-04-16 (o4-mini)
refine_model="gpt-5-2025-08-07"  # Use GPT-5 for evaluation
Use a different model for refinement than generation to get diverse perspectives

refine_threshold

float
default: 0.5
Minimum acceptable quality score (0.0 to 1.0). Refinement stops once this score is reached.
refine_threshold=0.7  # Require 70% quality score
Recommended thresholds:
  • 0.5 - Basic quality (default)
  • 0.7 - Good quality (recommended for production)
  • 0.85 - High quality (research-grade)
  • 0.95 - Exceptional quality (may not converge)

refine_max_iters

integer
default: 3
Maximum number of refinement iterations. Prevents infinite loops.
refine_max_iters=5  # Allow up to 5 refinement attempts
Each iteration adds latency and token costs. Start with 3 iterations and adjust based on results.

G-Eval Feedback Loop

The refinement system uses G-Eval metrics defined in evals.py:25-50:

Default Evaluation Criteria

STATIC_EVAL_SPECS = StaticEvaluation(
    criteria="""Evaluate the normalized claim against the following criteria: 
    Verifiability and Self-Containment, 
    Claim Centrality and Extraction Quality,
    Conciseness and Clarity, 
    Check-Worthiness Alignment, 
    and Factual Consistency""",
    
    evaluation_steps=[
        # Verifiability and Self-Containment
        "Check if the claim contains verifiable factual assertions that can be independently checked",
        "Check if the claim is self-contained without requiring additional context from the original post",
        
        # Claim Centrality and Extraction Quality
        "Check if the normalized claim captures the central assertion from the source text",
        "Check if the claim represents the core factual assertion that requires fact-checking",
        
        # Conciseness and Clarity
        "Check if the claim is presented in a straightforward, concise manner",
        "Check if the claim is significantly shorter than source posts while preserving essential meaning",
        
        # Check-Worthiness Alignment
        "Check if the normalized claim meets check-worthiness standards for fact-verification",
        "Check if the claim has general public interest, potential for harm, and likelihood of being false",
        
        # Factual Consistency
        "Check if the normalized claim is factually consistent with the source material",
        "Check if the claim accurately reflects the original assertion without introducing new information",
    ]
)

Feedback System Prompts

Two specialized prompts guide the refinement process (from prompts.py):
feedback_sys_prompt = """
# Identity
You are a professional fact-checker and an expert in claim normalization. 

# Instructions
Your task is to provide detailed, constructive feedback on the generated response 
based on the criteria provided to ensure that the normalized claims are not only 
consistent with the original post, but are also self-contained and verifiable.

We want to iteratively improve the above generated response. To help with this, 
please score the response on the following criteria using a 0–10 scale, and provide 
a brief justification for each score:

1. **Verifiability:** To what extent does the response contain claims that can be 
   independently verified using reliable sources? (0 = not verifiable, 10 = fully verifiable)
2. **Likelihood of Being False:** How likely is it that the response contains false 
   or misleading information? (0 = very unlikely, 10 = very likely)
3. **Public Interest:** How likely is the response to be of general public interest 
   or relevance? (0 = not interesting, 10 = highly interesting)
4. **Potential Harm:** How likely is the response to be harmful, offensive, or cause 
   negative consequences? (0 = not harmful, 10 = extremely harmful)
5. **Check-Worthiness:** How important is it to fact-check this response? 
   (0 = not worth fact-checking, 10 = highly worth fact-checking)

For each criterion, provide:
- A score (0-10)
- Provide a short, precise justification in 1 sentence.

Optionally, suggest specific improvements to the response based on your evaluation.
"""
refine_sys_prompt = """
# Identity
You are a professional fact-checker and expert in claim normalization. 

# Instructions
* Your task is to refine the generated response in light of the feedback provided.
* Using the feedback provided, return a refined version of the generated response, 
  ensuring that the normalized claim is consistent with the original post, 
  self-contained, and verifiable.
* Your response must only be based on the feedback provided.
* Do not speculate, provide subjective opinions, or add any additional information 
  or explanations. 
* Only include the refined, normalized claim in your response. 
* If no meaningful refinement is necessary, re-output the original normalized claim as-is.
* If the response is not decontextualized, stand-alone, and verifiable, improve the 
  response by adding more context from the original post if needed.
"""

Before/After Examples

Example 1: Medical Misinformation

Input Post:
Corona virus before it reaches the lungs it remains in the throat for four days 
… drinking water a lot and gargling with warm water & salt or vinegar eliminates 
the virus …
Initial Extraction (Score: 0.45):
Gargling eliminates coronavirus
G-Eval Feedback:
  • Verifiability: 6/10 - Testable but lacks specificity
  • Self-Containment: 4/10 - Missing context about prevention vs. treatment
  • Check-Worthiness: 8/10 - Health misinformation, high priority
Issues:
  • Too vague (“eliminates” is absolute)
  • Missing specificity (what kind of gargling?)
  • Unclear claim scope (prevention or cure?)

Example 2: Institutional Claim

Input:
Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical 
Commission PMC Lieutenant Retired General Asif Mumtaz appointed as Chairman 
Pakistan Medical Commission PMC Lieutenant Retired General Asif Mumtaz 
appointed as Chairman Pakistan Medical Commission PMC None.
Initial Claim (Score: 0.58):
Asif Mumtaz appointed as PMC Chairman
Feedback:
  • Self-Containment: 5/10 - “PMC” acronym not explained
  • Context: 4/10 - Missing appointing authority
  • Named Entities: 6/10 - Full title not preserved

Response Structure

The API returns enhanced metadata when Self-Refine is enabled:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1709568000,
  "model": "gpt-5-2025-08-07",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Gargling water can protect against coronavirus"
      },
      "finish_reason": "stop"
    }
  ],
  "refinement_metadata": {
    "metric_used": "Claim Quality Assessment",
    "threshold": 0.7,
    "refinement_model": "gpt-5-2025-08-07",
    "refinement_history": [
      {
        "claim_type": "original",
        "claim": "Gargling eliminates coronavirus",
        "score": 0.45,
        "feedback": "Verifiability: 6/10 - Testable but lacks specificity. Self-Containment: 4/10..."
      },
      {
        "claim_type": "refined",
        "claim": "Gargling with warm salt water or vinegar can eliminate coronavirus from the throat",
        "score": 0.65,
        "feedback": "Verifiability: 7/10 - More specific but 'eliminate' is too absolute..."
      },
      {
        "claim_type": "final",
        "claim": "Gargling water can protect against coronavirus",
        "score": 0.72,
        "feedback": "Verifiability: 8/10 - Clear testable claim. Threshold met."
      }
    ]
  }
}

Accessing Refinement History

response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}],
    refine_claims=True,
    refine_model="gpt-5-2025-08-07",
    refine_threshold=0.7,
    refine_max_iters=3
)

# Access refinement metadata
metadata = response.refinement_metadata
print(f"Model: {metadata.refinement_model}")
print(f"Threshold: {metadata.threshold}")
print(f"Iterations: {len(metadata.refinement_history)}")

# Print each iteration
for i, history in enumerate(metadata.refinement_history):
    print(f"\n{history.claim_type.upper()} (Score: {history.score})")
    print(f"Claim: {history.claim}")
    print(f"Feedback: {history.feedback[:100]}...")

Performance Considerations

Token Usage

Each refinement iteration uses tokens for:
  • Original query
  • Current claim
  • Feedback text
  • System prompts
  • Refined claim
Estimated tokens per iteration:
  • Initial evaluation: ~500-800 tokens
  • Each refinement: ~600-1000 tokens
  • Total for 3 iterations: ~2000-3500 tokens
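As a back-of-envelope check on the figures above, taking the midpoint of each range (rough assumptions, not measurements):

```python
initial_eval_tokens = 650    # initial evaluation: ~500-800 tokens
per_iteration_tokens = 800   # each refinement: ~600-1000 tokens
iterations = 3

total_tokens = initial_eval_tokens + per_iteration_tokens * iterations
# 650 + 800 * 3 = 3050, inside the quoted ~2000-3500 range
```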

Latency

Self-Refine adds sequential API calls:
Total Time = Initial Generation + (Evaluation + Refinement) × Iterations
Typical latencies:
  • Initial generation: 1-3 seconds
  • Each evaluation: 2-4 seconds
  • Each refinement: 1-3 seconds
  • Total for 3 iterations: 10-25 seconds
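Plugging midpoint estimates into the formula above (assumptions, not measurements):

```python
initial_generation = 2.0   # seconds (1-3s range)
evaluation = 3.0           # seconds (2-4s range)
refinement = 2.0           # seconds (1-3s range)
iterations = 3

total_seconds = initial_generation + (evaluation + refinement) * iterations
# 2 + (3 + 2) * 3 = 17 seconds, within the 10-25s total quoted above
```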
Self-Refine is not suitable for real-time applications. Consider async processing or lower iteration counts.

Cost Optimization

Only refine claims that need it:
# Quick initial check
quick_response = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",  # Fast model
    messages=[{"role": "user", "content": post}]
)

# Only refine if claim seems complex
if needs_refinement(quick_response):
    refined = client.chat.completions.create(
        model="gpt-5-2025-08-07",
        messages=[{"role": "user", "content": post}],
        refine_claims=True,
        refine_model="gpt-5-2025-08-07",
        refine_threshold=0.7,
        refine_max_iters=2  # Reduce iterations
    )
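The snippet above calls a user-supplied `needs_refinement` helper, which is not part of the API. One entirely illustrative heuristic (all names, word lists, and thresholds here are assumptions, and it takes the extracted claim text rather than the full response object): flag claims that are long or contain hedge words, which tend to score low on self-containment.

```python
HEDGE_WORDS = {"may", "might", "could", "reportedly", "allegedly"}

def needs_refinement(claim: str, max_words: int = 20) -> bool:
    """Flag long or hedge-word-heavy claims as candidates for refinement."""
    words = [w.strip(".,!?") for w in claim.lower().split()]
    return len(words) > max_words or any(w in HEDGE_WORDS for w in words)
```

Tune the heuristic on your own data; the goal is simply to avoid paying refinement costs for claims that are already clean.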

Best Practices

Start Conservative

Begin with threshold=0.5 and max_iters=2, then increase

Monitor Convergence

Track score improvements per iteration to optimize settings

Use Different Models

Generate with one model, evaluate with another for diversity

Async Processing

Run Self-Refine in background for non-urgent claims
For high-throughput systems, consider implementing a two-tier approach: fast zero-shot for all claims, then Self-Refine for high-priority items during off-peak hours.
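A two-tier pipeline of that kind might be structured as follows. Everything here is a hypothetical sketch: the zero-shot call is stubbed out, and the priority scheme and queue handling are assumptions, not part of the CheckThat API.

```python
import queue

# Posts queued here would be processed with Self-Refine during off-peak hours.
refine_queue: "queue.PriorityQueue[tuple[int, str]]" = queue.PriorityQueue()

def handle_post(post: str, priority: int) -> str:
    """Fast tier for every post; high-priority posts also queue for refinement."""
    claim = f"[zero-shot claim for: {post}]"   # stand-in for the fast zero-shot call
    if priority == 0:                          # 0 = high priority in this sketch
        refine_queue.put((priority, post))
    return claim

claim = handle_post("Gargling eliminates coronavirus", priority=0)
handle_post("Mild weather expected tomorrow", priority=2)
```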

Troubleshooting

Problem: Scores plateau below threshold

Solutions:
  • Lower refine_threshold to 0.6 or 0.65
  • Increase refine_max_iters to 5
  • Try a more capable refine_model
  • Check if original post has enough verifiable content
# Debugging convergence issues
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}],
    refine_claims=True,
    refine_model="claude-opus-4-1-20250805",  # Try different model
    refine_threshold=0.65,  # Lower threshold
    refine_max_iters=5  # More attempts
)

# Analyze score progression
scores = [h.score for h in response.refinement_metadata.refinement_history]
print(f"Score progression: {scores}")
if scores[-1] - scores[0] < 0.1:
    print("Warning: Minimal improvement, may need different approach")
Problem: Refinement takes too long

Solutions:
  • Reduce refine_max_iters to 2
  • Use faster models like gpt-5-nano or gemini-2.5-flash
  • Implement async processing
  • Cache evaluation results for similar claims
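Caching evaluation results for identical claims can be as simple as memoizing the evaluation call. A minimal sketch, where `evaluate_claim` is a hypothetical stand-in for the expensive G-Eval call (stubbed with a toy scorer so it runs):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def evaluate_claim(claim: str) -> float:
    # An expensive LLM evaluation call would go here; stubbed for illustration.
    return round(min(1.0, len(claim) / 50), 2)

first = evaluate_claim("Gargling eliminates coronavirus")
second = evaluate_claim("Gargling eliminates coronavirus")  # served from cache
```

Note that `lru_cache` only deduplicates exact string matches; near-duplicate claims would need a normalization step or a semantic cache.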
Problem: Same input produces different refinements

Solutions:
  • Set temperature=0 for deterministic outputs
  • Use seed parameter for reproducibility
  • Increase refine_threshold to force more iterations
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}],
    temperature=0,  # Deterministic
    seed=42,  # Reproducible
    refine_claims=True,
    refine_model="gpt-5-2025-08-07",
    refine_threshold=0.75
)
Problem: Missing required parameter

Solution:
# Incorrect - missing refine_model
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}],
    refine_claims=True  # ❌ Error
)

# Correct - includes refine_model
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}],
    refine_claims=True,
    refine_model="gpt-5-2025-08-07"  # ✅ Required
)

Next Steps

Cross-Refine

Use multiple models for collaborative refinement

Custom Evaluation

Create your own G-Eval criteria
