Iteratively improve claim quality using G-Eval feedback loops
Self-Refine is CheckThat AI’s iterative improvement algorithm that automatically refines normalized claims using G-Eval feedback. The system evaluates each claim, provides constructive feedback, and generates improved versions until quality thresholds are met.
```python
for i in range(self.max_iters):
    refine_user_prompt = f"""
    ## Original Query
    {original_query}

    ## Current Response
    {current_claim}

    ## Feedback
    {eval_result.test_results[0].metrics_data[0].reason}

    ## Task
    Refine the current response based on the feedback...
    """
    refined_response = client.generate_response(
        user_prompt=refine_user_prompt,
        sys_prompt=self.refine_sys_prompt,
    )
```
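The snippet above shows only the refinement step. A minimal, runnable sketch of the surrounding loop is given below; the real implementation calls G-Eval and an LLM client, so `evaluate` and `refine` here are hypothetical stand-ins injected as callables.

```python
# Sketch of the Self-Refine control loop: evaluate, check the quality
# threshold, refine, and repeat until the threshold or max_iters is hit.
def self_refine(claim, original_query, evaluate, refine,
                threshold=0.75, max_iters=3):
    """Refine `claim` until its score clears `threshold` or iterations run out."""
    score, feedback = evaluate(claim)
    for _ in range(max_iters):
        if score >= threshold:
            break  # quality gate met; stop early
        claim = refine(original_query, claim, feedback)
        score, feedback = evaluate(claim)
    return claim, score

# Toy stand-ins that mimic the score trajectory seen in the examples below.
_scores = iter([0.45, 0.55, 0.65, 0.75])

def fake_evaluate(claim):
    return next(_scores), "make the claim more specific and less absolute"

def fake_refine(original_query, claim, feedback):
    return claim + " (refined)"

claim, score = self_refine("Gargling eliminates coronavirus",
                           "original post text", fake_evaluate, fake_refine)
```

Because the evaluation runs once before the loop, a claim that already clears the threshold is returned untouched, with zero LLM refinement calls.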
```python
STATIC_EVAL_SPECS = StaticEvaluation(
    criteria="""Evaluate the normalized claim against the following criteria:
    Verifiability and Self-Containment, Claim Centrality and Extraction Quality,
    Conciseness and Clarity, Check-Worthiness Alignment, and Factual Consistency""",
    evaluation_steps=[
        # Verifiability and Self-Containment
        "Check if the claim contains verifiable factual assertions that can be independently checked",
        "Check if the claim is self-contained without requiring additional context from the original post",
        # Claim Centrality and Extraction Quality
        "Check if the normalized claim captures the central assertion from the source text",
        "Check if the claim represents the core factual assertion that requires fact-checking",
        # Conciseness and Clarity
        "Check if the claim is presented in a straightforward, concise manner",
        "Check if the claim is significantly shorter than source posts while preserving essential meaning",
        # Check-Worthiness Alignment
        "Check if the normalized claim meets check-worthiness standards for fact-verification",
        "Check if the claim has general public interest, potential for harm, and likelihood of being false",
        # Factual Consistency
        "Check if the normalized claim is factually consistent with the source material",
        "Check if the claim accurately reflects the original assertion without introducing new information",
    ],
)
```
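In G-Eval, an LLM judge scores each evaluation step above. The sketch below swaps the judge for two crude heuristic predicates purely so the aggregation shape is visible end to end; every function name here is hypothetical, not part of the CheckThat AI API.

```python
def aggregate_score(step_scores):
    """G-Eval-style aggregation: mean of per-step scores in [0, 1]."""
    return sum(step_scores) / len(step_scores)

def is_self_contained(claim):
    # Crude proxy: a self-contained claim avoids deictic words like "this".
    return not any(w in ("this", "that", "it") for w in claim.lower().split())

def is_concise(claim, max_words=25):
    # Crude proxy for the conciseness criterion above.
    return len(claim.split()) <= max_words

claim = "Gargling with warm salt water can eliminate coronavirus from the throat"
score = aggregate_score([float(is_self_contained(claim)), float(is_concise(claim))])
```

A vague claim such as "This eliminates it" fails the self-containment check, dragging the aggregate down even though it is concise.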
Two specialized prompts guide the refinement process (from prompts.py):
Feedback System Prompt (lines 232-250)
```python
feedback_sys_prompt = """# Identity
You are a professional fact-checker and an expert in claim normalization.

# Instructions
Your task is to provide detailed, constructive feedback on the generated response based on the criteria provided to ensure that the normalized claims are not only consistent with the original post, but are also self-contained and verifiable.
We want to iteratively improve the above generated response. To help with this, please score the response on the following criteria using a 0–10 scale, and provide a brief justification for each score:
1. **Verifiability:** To what extent does the response contain claims that can be independently verified using reliable sources? (0 = not verifiable, 10 = fully verifiable)
2. **Likelihood of Being False:** How likely is it that the response contains false or misleading information? (0 = very unlikely, 10 = very likely)
3. **Public Interest:** How likely is the response to be of general public interest or relevance? (0 = not interesting, 10 = highly interesting)
4. **Potential Harm:** How likely is the response to be harmful, offensive, or cause negative consequences? (0 = not harmful, 10 = extremely harmful)
5. **Check-Worthiness:** How important is it to fact-check this response? (0 = not worth fact-checking, 10 = highly worth fact-checking)
For each criterion, provide:
- A score (0-10)
- A short, precise justification in 1 sentence.
Optionally, suggest specific improvements to the response based on your evaluation."""
```
Refinement System Prompt (lines 252-264)
```python
refine_sys_prompt = """# Identity
You are a professional fact-checker and expert in claim normalization.

# Instructions
* Your task is to refine the generated response in light of the feedback provided.
* Using the feedback provided, return a refined version of the generated response, ensuring that the normalized claim is consistent with the original post, self-contained, and verifiable.
* Your response must only be based on the feedback provided.
* Do not speculate, provide subjective opinions, or add any additional information or explanations.
* Only include the refined, normalized claim in your response.
* If no meaningful refinement is necessary, re-output the original normalized claim as-is.
* If the response is not decontextualized, stand-alone, and verifiable, improve the response by adding more context from the original post if needed."""
```
Source Post:

> Corona virus before it reaches the lungs it remains in the throat for four days … drinking water a lot and gargling with warm water & salt or vinegar eliminates the virus …
Initial Extraction (Score: 0.45):
Gargling eliminates coronavirus
G-Eval Feedback:
Verifiability: 6/10 - Testable but lacks specificity
Self-Containment: 4/10 - Missing context about prevention vs. treatment
Check-Worthiness: 8/10 - Health misinformation, high priority
Issues:
Too vague (“eliminates” is absolute)
Missing specificity (what kind of gargling?)
Unclear claim scope (prevention or cure?)
Refined Claim (Score: 0.65):
Gargling with warm salt water or vinegar can eliminate coronavirus from the throat
G-Eval Feedback:
Verifiability: 7/10 - More specific but still absolute claim
Self-Containment: 7/10 - Better context
Factual Consistency: 5/10 - “Eliminate” is too strong
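The per-criterion feedback shown above arrives as free text. A small parser like the one below, which is an illustration rather than part of any published CheckThat AI API, can lift the scores out of `Criterion: N/10 - reason` lines for logging or threshold gating.

```python
import re

# Match lines of the form "Criterion Name: N/10 - justification".
FEEDBACK_LINE = re.compile(
    r"^\s*(?P<criterion>[\w -]+?):\s*(?P<score>\d+)/10\s*-\s*(?P<reason>.+)$"
)

def parse_feedback(text):
    """Return {criterion: (score, reason)} from 'Name: N/10 - reason' lines."""
    parsed = {}
    for line in text.splitlines():
        m = FEEDBACK_LINE.match(line)
        if m:
            parsed[m.group("criterion").strip()] = (
                int(m.group("score")), m.group("reason").strip()
            )
    return parsed

feedback = (
    "Verifiability: 7/10 - More specific but still absolute claim\n"
    "Self-Containment: 7/10 - Better context\n"
    "Factual Consistency: 5/10 - 'Eliminate' is too strong"
)
scores = parse_feedback(feedback)
```

Lines that do not fit the pattern are simply skipped, so the parser tolerates preamble or improvement suggestions mixed into the feedback text.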
Source Post:

> Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC
Initial Claim (Score: 0.58):
Asif Mumtaz appointed as PMC Chairman
Feedback:
Self-Containment: 5/10 - “PMC” acronym not explained
Context: 4/10 - Missing appointing authority
Named Entities: 6/10 - Full title not preserved
Refined Claim (Score: 0.68):
Lieutenant Retired General Asif Mumtaz appointed as Chairman of Pakistan Medical Commission
Feedback:
Self-Containment: 7/10 - Better but missing who appointed him
Clarity: 6/10 - “Lieutenant Retired General” is awkward
Context: 5/10 - Appointing authority still missing
Final Claim (Score: 0.75 ✓):
Pakistani government appoints former army general to head medical regulatory body
Improvements:
✅ Simplified title (“former army general” vs. full rank)
✅ Added appointing authority (“Pakistani government”)
✅ Clarified role (“medical regulatory body”)
✅ More natural language structure
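An overall score like 0.75 aggregates the per-criterion feedback. The exact weighting is not shown in the source, so the sketch below assumes a plain mean of the 0-10 criterion scores scaled to [0, 1]; the criterion values passed in are likewise hypothetical.

```python
def overall_score(criterion_scores):
    """Assumed aggregation: mean of 0-10 criterion scores, scaled to [0, 1]."""
    return sum(criterion_scores.values()) / (10 * len(criterion_scores))

def accepted(score, threshold=0.75):
    """A claim exits the refine loop once it clears the quality threshold."""
    return score >= threshold

final = overall_score(
    {"Self-Containment": 8, "Clarity": 7, "Context": 7, "Named Entities": 8}
)
```

Under this assumption, a claim scoring 0.68 (as after the first refinement) fails the 0.75 gate and triggers another iteration.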
Start Conservative
Begin with threshold=0.5 and max_iters=2, then increase both as quality demands
Monitor Convergence
Track score improvements per iteration to optimize settings
Use Different Models
Generate with one model, evaluate with another for diversity
Async Processing
Run Self-Refine in background for non-urgent claims
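The tips above can be combined in a stopping rule: keep the score history per claim and continue only while the claim is below threshold and still improving. The function name and `min_delta` cutoff below are illustrative assumptions, not CheckThat AI API.

```python
def should_continue(score_history, threshold=0.75, min_delta=0.02):
    """Iterate only while the latest score is below threshold and improving."""
    if not score_history or score_history[-1] >= threshold:
        return False  # threshold met (or nothing evaluated yet to compare)
    if len(score_history) >= 2 and score_history[-1] - score_history[-2] < min_delta:
        return False  # plateaued: more iterations are unlikely to help
    return True
```

Against the second example's trajectory (0.58 → 0.68 → 0.75), this stops exactly when the final claim clears the threshold, and it also bails out early on claims whose scores stall below it.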
For high-throughput systems, consider implementing a two-tier approach: fast zero-shot for all claims, then Self-Refine for high-priority items during off-peak hours.
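A routing function for that two-tier approach might look like the sketch below; the priority labels and threshold are assumptions for illustration, not a published interface.

```python
def route_claim(priority, zero_shot_score, refine_threshold=0.75):
    """Fast zero-shot for all claims; queue Self-Refine for weak, high-priority ones."""
    if zero_shot_score >= refine_threshold:
        return "accept"              # zero-shot output is already good enough
    if priority == "high":
        return "queue_self_refine"   # refine later, during off-peak hours
    return "accept_with_flag"        # keep the fast result, flag low quality
```

Low-priority claims never pay the iterative-refinement cost, while high-priority misinformation still gets the full Self-Refine treatment off the hot path.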