
Overview

The Refinement Pipeline is an iterative quality improvement system that automatically enhances normalized claims through AI-powered feedback and self-correction. It uses DeepEval metrics to evaluate claims and refinement algorithms to improve them until they meet quality thresholds.
The refinement service is located in api/services/refinement/refine.py and integrates with DeepEval’s G-Eval metrics for quality assessment.

How Refinement Works

Architecture Overview

The RefinementService Class

The core refinement engine accepts configurable parameters:
From api/services/refinement/refine.py:46-62
class RefinementService:
    def __init__(
        self, 
        model: Union[GPTModel, GeminiModel, AnthropicModel, GrokModel], 
        threshold: float = 0.5,
        max_iters: int = 3,
        metrics: Optional[List[str]] = None,
    ):
        self.model = model  # DeepEval model for evaluation
        self.threshold = threshold  # Minimum quality score
        self.max_iters = max_iters  # Maximum refinement iterations
        self.metrics = metrics  # Custom evaluation metrics
- model (DeepEval Model, required): DeepEval-compatible model instance (GPT, Gemini, Anthropic, or Grok)
- threshold (float, default 0.5): Minimum quality score (0.0-1.0) required to accept a claim
- max_iters (int, default 3): Maximum number of refinement iterations before returning the best result
- metrics (GEval | None, default None): Custom G-Eval metric; if None, the default claim quality assessment is used
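
To make the defaults concrete, here is a minimal, dependency-free stand-in for the constructor. The class name and the validation check are illustrative additions, not the actual implementation in `refine.py`:

```python
from typing import List, Optional

class RefinementServiceSketch:
    """Simplified stand-in mirroring RefinementService's constructor."""

    def __init__(
        self,
        model: object,                       # any DeepEval-compatible model
        threshold: float = 0.5,              # minimum score to accept a claim
        max_iters: int = 3,                  # cap on refinement iterations
        metrics: Optional[List[str]] = None, # custom evaluation metrics
    ):
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        self.model = model
        self.threshold = threshold
        self.max_iters = max_iters
        self.metrics = metrics

service = RefinementServiceSketch(model="gpt-4o", threshold=0.7)
print(service.threshold, service.max_iters)  # 0.7 3
```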

Refinement Algorithms

Self-Refine Algorithm

The self-refine algorithm improves claims through iterative self-correction:
Step 1: Initial Evaluation

Evaluate the original claim using G-Eval metrics:
test_case = LLMTestCase(
    input=original_query,
    actual_output=current_claim,
)

eval_result = evaluate(test_cases=[test_case], metrics=[eval_metric])
original_score = eval_result.test_results[0].metrics_data[0].score
Step 2: Threshold Check

If the score meets the threshold, return the original claim:
if original_score >= self.threshold:
    return current_response, refinement_history
Step 3: Iterative Refinement

Generate feedback and refine the claim up to max_iters times:
for i in range(self.max_iters):
    refine_user_prompt = f"""
    ## Original Query
    {original_query}
    
    ## Current Response  
    {current_claim}
    
    ## Feedback
    {eval_result.test_results[0].metrics_data[0].reason}
    
    ## Task
    Refine the current response based on the feedback to 
    improve its accuracy, verifiability, and overall quality.
    """
    
    refined_response = client.generate_response(
        user_prompt=refine_user_prompt,
        sys_prompt=self.refine_sys_prompt
    )
Step 4: Re-evaluation

Evaluate the refined claim and check if it meets the threshold:
test_case = LLMTestCase(
    input=original_query,
    actual_output=refined_response,
)

eval_result = evaluate(test_cases=[test_case], metrics=[eval_metric])
score = eval_result.test_results[0].metrics_data[0].score

if score >= self.threshold:
    break  # Success!
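
Putting the four steps together, the loop can be sketched end to end with the model calls stubbed out as plain functions. The stubs below are illustrative stand-ins for the DeepEval and client calls, not the actual APIs:

```python
def self_refine(query, claim, evaluate_fn, refine_fn, threshold=0.5, max_iters=3):
    """Sketch of self-refine: evaluate, check threshold, refine, repeat."""
    history = []
    score, feedback = evaluate_fn(query, claim)
    history.append({"claim_type": "original", "claim": claim,
                    "score": score, "feedback": feedback})
    if score >= threshold:
        return claim, history  # already good enough

    for _ in range(max_iters):
        claim = refine_fn(query, claim, feedback)   # revise using feedback
        score, feedback = evaluate_fn(query, claim)  # re-evaluate
        history.append({"claim_type": "refined", "claim": claim,
                        "score": score, "feedback": feedback})
        if score >= threshold:
            break
    history.append({"claim_type": "final", "claim": claim,
                    "score": score, "feedback": feedback})
    return claim, history

# Toy stubs: each refinement pass raises the score by 0.2
def fake_eval(query, claim):
    return (3 + 2 * claim.count("[refined]")) / 10, "be more specific"

def fake_refine(query, claim, feedback):
    return claim + " [refined]"

final, history = self_refine("q", "claim", fake_eval, fake_refine, threshold=0.7)
print(history[-1]["score"])  # 0.7
```

With a 0.3 starting score and a 0.7 threshold, the loop runs two of its three allowed iterations, then records the passing claim as the final entry.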

Cross-Refine Algorithm

Cross-refine uses feedback from a different model to provide diverse perspectives:
# From api/_utils/prompts.py:217-230
feedback_prompt = """
You are provided with a generated response and a user prompt.
Your task is to provide detailed, constructive feedback based on 
the criteria provided.

Please score the response on the following criteria using a 0-10 
scale:
1. **Verifiability**
2. **Likelihood of Being False**
3. **Public Interest**
4. **Potential Harm**
5. **Check-Worthiness**

For each criterion, provide:
- A score (0-10)
- A short, precise one-sentence justification
"""

DeepEval Integration

G-Eval Metrics

Refinement uses DeepEval’s G-Eval (GPT-Evaluation) for quality assessment:
Default G-Eval Configuration
# From api/services/refinement/refine.py:76-83
eval_metric = GEval(
    name="Claim Quality Assessment",
    criteria=STATIC_EVAL_SPECS.criteria,
    evaluation_params=[LLMTestCaseParams.INPUT, 
                      LLMTestCaseParams.ACTUAL_OUTPUT],
    model=self.model,
    threshold=self.threshold
)

Static Evaluation Criteria

The default evaluation criteria from api/types/evals.py:25-50:
STATIC_EVAL_SPECS = StaticEvaluation(
    criteria=(
        "Evaluate the normalized claim against the following criteria: "
        "Verifiability and Self-Containment, Claim Centrality and "
        "Extraction Quality, Conciseness and Clarity, Check-Worthiness "
        "Alignment, and Factual Consistency"
    ),

    evaluation_steps=[
        # Verifiability and Self-Containment
        "Check if the claim contains verifiable factual assertions",
        "Check if the claim is self-contained without requiring additional context",

        # Claim Centrality and Extraction Quality
        "Check if the normalized claim captures the central assertion",
        "Check if the claim represents the core factual assertion",

        # Conciseness and Clarity
        "Check if the claim is presented in a straightforward, concise manner",
        "Check if the claim is significantly shorter than source posts",

        # Check-Worthiness Alignment
        "Check if the normalized claim meets check-worthiness standards",
        "Check if the claim has general public interest, potential for harm",

        # Factual Consistency
        "Check if the normalized claim is factually consistent with the source",
        "Check if the claim accurately reflects the original assertion",
    ]
)

Thread-Safe Execution

DeepEval creates its own event loop, which conflicts with FastAPI’s uvloop. CheckThat AI uses a thread pool executor to run evaluations safely:
Thread Pool Implementation
# From api/services/refinement/refine.py:34-44
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def _run_evaluation_in_thread(test_case: LLMTestCase, 
                              metric: BaseMetric):
    """
    Run DeepEval evaluation in a separate thread to avoid 
    uvloop conflicts.
    """
    return evaluate(test_cases=[test_case], metrics=[metric])

# Usage
future = _executor.submit(_run_evaluation_in_thread, 
                          test_case, eval_metric)
eval_result = future.result()  # Blocks until complete
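
In an async request handler, the same pattern can be awaited without blocking the server's event loop via `run_in_executor`. This is a sketch with the DeepEval call replaced by a stand-in function; the offloading pattern itself is standard `asyncio`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def blocking_evaluation(claim: str) -> float:
    # Stand-in for the DeepEval call, which spins up its own event loop
    return 0.8 if "verifiable" in claim else 0.4

async def evaluate_async(claim: str) -> float:
    loop = asyncio.get_running_loop()
    # Offload to the thread pool so the blocking call (and its private
    # event loop) never touches the server's running loop
    return await loop.run_in_executor(_executor, blocking_evaluation, claim)

score = asyncio.run(evaluate_async("a verifiable claim"))
print(score)  # 0.8
```

Awaiting the future keeps the handler responsive, whereas calling `future.result()` directly inside a coroutine would block the loop.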

Refinement History Tracking

Every refinement iteration is tracked and returned to the user:
From api/types/completions.py:39-54
class ClaimType(str, Enum):
    ORIGINAL = "original"
    REFINED = "refined"
    FINAL = "final"

class RefinementHistory(BaseModel):
    claim_type: ClaimType
    claim: Optional[str]
    score: float  # 0.0 to 1.0
    feedback: Optional[str]

class RefinementMetadata(BaseModel):
    metric_used: Optional[str]
    threshold: Optional[float]
    refinement_model: Optional[str]
    refinement_history: List[RefinementHistory]

Example Refinement History

{
  "refinement_metadata": {
    "metric_used": "Claim Quality Assessment",
    "threshold": 0.7,
    "refinement_model": "gpt-4o",
    "refinement_history": [
      {
        "claim_type": "original",
        "claim": "Drinking lots of water cures coronavirus",
        "score": 0.45,
        "feedback": "Claim is not self-contained and overstates effectiveness"
      },
      {
        "claim_type": "refined",
        "claim": "Some health sources recommend drinking water to help prevent coronavirus infection",
        "score": 0.72,
        "feedback": "Improved verifiability and reduced overgeneralization"
      },
      {
        "claim_type": "final",
        "claim": "Some health sources recommend drinking water to help prevent coronavirus infection",
        "score": 0.72,
        "feedback": "Meets quality threshold"
      }
    ]
  }
}
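
Given a response shaped like the JSON above, a client can pull out the final claim and check how much refinement improved the score. The dictionary below is an abbreviated copy of that example; the field names match the models shown earlier:

```python
metadata = {
    "metric_used": "Claim Quality Assessment",
    "threshold": 0.7,
    "refinement_history": [
        {"claim_type": "original", "claim": "original claim", "score": 0.45},
        {"claim_type": "refined", "claim": "better claim", "score": 0.72},
        {"claim_type": "final", "claim": "better claim", "score": 0.72},
    ],
}

history = metadata["refinement_history"]
final = next(e for e in history if e["claim_type"] == "final")
original = next(e for e in history if e["claim_type"] == "original")

improvement = final["score"] - original["score"]       # 0.27
met_threshold = final["score"] >= metadata["threshold"]  # True
print(f"improved by {improvement:.2f}, threshold met: {met_threshold}")
```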

Using the Refinement Pipeline

API Request with Refinement

import openai

client = openai.OpenAI(
    base_url="https://api.checkthat.ai/v1",
    api_key="your-checkthat-api-key"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Eating garlic prevents COVID-19"}
    ],
    # Refinement parameters
    extra_body={
        "refine_claims": True,
        "refine_threshold": 0.7,
        "refine_max_iters": 3,
        "refine_model": "gpt-4o"
    }
)

print(response.choices[0].message.content)
print(response.refinement_metadata.refinement_history)
- refine_claims (bool, default false): Enable the refinement pipeline
- refine_threshold (float, default 0.5): Minimum quality score to accept (0.0-1.0)
- refine_max_iters (int, default 3): Maximum refinement iterations
- refine_model (string): Model to use for refinement (any supported model)

Performance Considerations

Latency

Each refinement iteration adds ~2-5 seconds depending on the model. Plan for 6-15 seconds total with 3 iterations.

Cost

Each iteration adds two API calls (one evaluation, one refinement). Use lower thresholds or fewer iterations for cost-sensitive applications.

Quality

Higher thresholds (0.7-0.9) produce better claims but may require more iterations. Balance quality vs. speed/cost.

Model Selection

Stronger models (GPT-4, Claude Opus) produce better refinements. Faster models (GPT-4o-mini, Claude Haiku) reduce latency.
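
These trade-offs can be quantified with a quick back-of-the-envelope estimate. The per-iteration latency range comes from the figures above; the call count assumes one initial evaluation plus an evaluation and a refinement call per iteration, which is an inference from the algorithm steps rather than a documented guarantee:

```python
def estimate_refinement(max_iters: int, secs_per_iter=(2.0, 5.0)):
    """Worst-case latency range and API-call count for one refined claim."""
    lo, hi = secs_per_iter
    # 1 initial evaluation + (refinement + re-evaluation) per iteration
    api_calls = 1 + 2 * max_iters
    return api_calls, (lo * max_iters, hi * max_iters)

calls, (min_s, max_s) = estimate_refinement(max_iters=3)
print(calls, min_s, max_s)  # 7 6.0 15.0
```

With the default `max_iters=3`, the worst case is 7 extra API calls and roughly 6-15 seconds, matching the latency guidance above.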

Error Handling

The refinement pipeline gracefully handles failures:
From api/services/refinement/refine.py:172-185
except Exception as e:
    logger.warning(f"Failed to refine claim: {e}")
    # Return the original response with the error recorded in the history
    error_history = RefinementHistory(
        claim_type=ClaimType.FINAL,
        claim=current_claim,
        score=0.0,
        feedback=f"Refinement failed: {str(e)}"
    )
    refinement_history.append(error_history)
    return current_response or original_response, refinement_history
If refinement fails, the API returns the original claim with error details in the refinement_history. Your application will never receive an error response due to refinement issues.
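
Because failures surface as history entries rather than HTTP errors, clients may want to check for them explicitly. A sketch, assuming history entries are dictionaries shaped like the example above and that failures carry a score of 0.0 with a "Refinement failed" feedback message:

```python
def refinement_failed(history: list) -> bool:
    """Detect the failure signature in a refinement history."""
    final = next((e for e in history if e.get("claim_type") == "final"), None)
    if final is None:
        return False
    feedback = str(final.get("feedback", ""))
    return final.get("score") == 0.0 and feedback.startswith("Refinement failed")

ok_history = [{"claim_type": "final", "score": 0.72,
               "feedback": "Meets quality threshold"}]
bad_history = [{"claim_type": "final", "score": 0.0,
                "feedback": "Refinement failed: timeout"}]
print(refinement_failed(ok_history), refinement_failed(bad_history))  # False True
```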

Next Steps

Evaluation Metrics

Learn about G-Eval and other quality metrics used in refinement

Supported Models

Choose the best model for your refinement needs
