
Overview

The Refinement Pipeline is an iterative quality improvement system that automatically enhances normalized claims through AI-powered feedback and self-correction. It uses DeepEval metrics to evaluate claims and refinement algorithms to improve them until they meet quality thresholds.
The refinement service is located in api/services/refinement/refine.py and integrates with DeepEval’s G-Eval metrics for quality assessment.

How Refinement Works

Architecture Overview

The RefinementService Class

The core refinement engine accepts configurable parameters:
From api/services/refinement/refine.py:46-62
class RefinementService:
    def __init__(
        self, 
        model: Union[GPTModel, GeminiModel, AnthropicModel, GrokModel], 
        threshold: float = 0.5,
        max_iters: int = 3,
        metrics: Optional[List[str]] = None,
    ):
        self.model = model  # DeepEval model for evaluation
        self.threshold = threshold  # Minimum quality score
        self.max_iters = max_iters  # Maximum refinement iterations
        self.metrics = metrics  # Custom evaluation metrics
- model (DeepEval Model, required): DeepEval-compatible model instance (GPT, Gemini, Anthropic, or Grok)
- threshold (float, default 0.5): Minimum quality score (0.0-1.0) required to accept a claim
- max_iters (int, default 3): Maximum number of refinement iterations before returning the best result
- metrics (GEval | None, default None): Custom G-Eval metric; if None, the default claim quality assessment is used
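
To make the defaults concrete, here is a minimal, dependency-free stand-in for the constructor. The class name and the validation check are illustrative additions, not the actual implementation in `refine.py`:

```python
from typing import List, Optional

class RefinementServiceSketch:
    """Simplified stand-in mirroring RefinementService's constructor."""

    def __init__(
        self,
        model: object,                       # any DeepEval-compatible model
        threshold: float = 0.5,              # minimum score to accept a claim
        max_iters: int = 3,                  # cap on refinement iterations
        metrics: Optional[List[str]] = None, # custom evaluation metrics
    ):
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        self.model = model
        self.threshold = threshold
        self.max_iters = max_iters
        self.metrics = metrics

service = RefinementServiceSketch(model="gpt-4o", threshold=0.7)
print(service.threshold, service.max_iters)  # 0.7 3
```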

Refinement Algorithms

Self-Refine Algorithm

The self-refine algorithm improves claims through iterative self-correction:
Step 1: Initial Evaluation

Evaluate the original claim using G-Eval metrics:
test_case = LLMTestCase(
    input=original_query,
    actual_output=current_claim,
)

eval_result = evaluate(test_cases=[test_case], metrics=[eval_metric])
original_score = eval_result.test_results[0].metrics_data[0].score
Step 2: Threshold Check

If the score meets the threshold, return the original claim:
if original_score >= self.threshold:
    return current_response, refinement_history
Step 3: Iterative Refinement

Generate feedback and refine the claim up to max_iters times:
for i in range(self.max_iters):
    refine_user_prompt = f"""
    ## Original Query
    {original_query}
    
    ## Current Response  
    {current_claim}
    
    ## Feedback
    {eval_result.test_results[0].metrics_data[0].reason}
    
    ## Task
    Refine the current response based on the feedback to 
    improve its accuracy, verifiability, and overall quality.
    """
    
    refined_response = client.generate_response(
        user_prompt=refine_user_prompt,
        sys_prompt=self.refine_sys_prompt
    )
Step 4: Re-evaluation

Evaluate the refined claim and check if it meets the threshold:
test_case = LLMTestCase(
    input=original_query,
    actual_output=refined_response,
)

eval_result = evaluate(test_cases=[test_case], metrics=[eval_metric])
score = eval_result.test_results[0].metrics_data[0].score

if score >= self.threshold:
    break  # Success!
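
Putting the four steps together, the loop can be sketched end to end with the model calls stubbed out as plain functions. The stubs below are illustrative stand-ins for the DeepEval and client calls, not the actual APIs:

```python
def self_refine(query, claim, evaluate_fn, refine_fn, threshold=0.5, max_iters=3):
    """Sketch of self-refine: evaluate, check threshold, refine, repeat."""
    history = []
    score, feedback = evaluate_fn(query, claim)
    history.append({"claim_type": "original", "claim": claim,
                    "score": score, "feedback": feedback})
    if score >= threshold:
        return claim, history  # already good enough

    for _ in range(max_iters):
        claim = refine_fn(query, claim, feedback)   # revise using feedback
        score, feedback = evaluate_fn(query, claim)  # re-evaluate
        history.append({"claim_type": "refined", "claim": claim,
                        "score": score, "feedback": feedback})
        if score >= threshold:
            break
    history.append({"claim_type": "final", "claim": claim,
                    "score": score, "feedback": feedback})
    return claim, history

# Toy stubs: each refinement pass raises the score by 0.2
def fake_eval(query, claim):
    return (3 + 2 * claim.count("[refined]")) / 10, "be more specific"

def fake_refine(query, claim, feedback):
    return claim + " [refined]"

final, history = self_refine("q", "claim", fake_eval, fake_refine, threshold=0.7)
print(history[-1]["score"])  # 0.7
```

With a 0.3 starting score and a 0.7 threshold, the loop runs two of its three allowed iterations, then records the passing claim as the final entry.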

Cross-Refine Algorithm

Cross-refine uses feedback from a different model to provide diverse perspectives:
# From api/_utils/prompts.py:217-230
feedback_prompt = """
You are provided with a generated response and a user prompt.
Your task is to provide detailed, constructive feedback based on 
the criteria provided.

Please score the response on the following criteria using a 0-10 
scale:
1. **Verifiability**
2. **Likelihood of Being False**
3. **Public Interest**
4. **Potential Harm**
5. **Check-Worthiness**

For each criterion, provide:
- A score (0-10)
- A short, precise one-sentence justification
"""

DeepEval Integration

G-Eval Metrics

Refinement uses DeepEval’s G-Eval (GPT-Evaluation) for quality assessment:
Default G-Eval Configuration
# From api/services/refinement/refine.py:76-83
eval_metric = GEval(
    name="Claim Quality Assessment",
    criteria=STATIC_EVAL_SPECS.criteria,
    evaluation_params=[LLMTestCaseParams.INPUT, 
                      LLMTestCaseParams.ACTUAL_OUTPUT],
    model=self.model,
    threshold=self.threshold
)

Static Evaluation Criteria

The default evaluation criteria from api/types/evals.py:25-50:
STATIC_EVAL_SPECS = StaticEvaluation(
    criteria=(
        "Evaluate the normalized claim against the following criteria: "
        "Verifiability and Self-Containment, Claim Centrality and "
        "Extraction Quality, Conciseness and Clarity, Check-Worthiness "
        "Alignment, and Factual Consistency"
    ),

    evaluation_steps=[
        # Verifiability and Self-Containment
        "Check if the claim contains verifiable factual assertions",
        "Check if the claim is self-contained without requiring additional context",

        # Claim Centrality and Extraction Quality
        "Check if the normalized claim captures the central assertion",
        "Check if the claim represents the core factual assertion",

        # Conciseness and Clarity
        "Check if the claim is presented in a straightforward, concise manner",
        "Check if the claim is significantly shorter than source posts",

        # Check-Worthiness Alignment
        "Check if the normalized claim meets check-worthiness standards",
        "Check if the claim has general public interest, potential for harm",

        # Factual Consistency
        "Check if the normalized claim is factually consistent with the source",
        "Check if the claim accurately reflects the original assertion",
    ]
)

Thread-Safe Execution

DeepEval creates its own event loop, which conflicts with FastAPI’s uvloop. CheckThat AI uses a thread pool executor to run evaluations safely:
Thread Pool Implementation
# From api/services/refinement/refine.py:34-44
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def _run_evaluation_in_thread(test_case: LLMTestCase, 
                              metric: BaseMetric):
    """
    Run DeepEval evaluation in a separate thread to avoid 
    uvloop conflicts.
    """
    return evaluate(test_cases=[test_case], metrics=[metric])

# Usage
future = _executor.submit(_run_evaluation_in_thread, 
                          test_case, eval_metric)
eval_result = future.result()  # Blocks until complete
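
In an async request handler, the same pattern can be awaited without blocking the server's event loop via `run_in_executor`. This is a sketch with the DeepEval call replaced by a stand-in function; the offloading pattern itself is standard `asyncio`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def blocking_evaluation(claim: str) -> float:
    # Stand-in for the DeepEval call, which spins up its own event loop
    return 0.8 if "verifiable" in claim else 0.4

async def evaluate_async(claim: str) -> float:
    loop = asyncio.get_running_loop()
    # Offload to the thread pool so the blocking call (and its private
    # event loop) never touches the server's running loop
    return await loop.run_in_executor(_executor, blocking_evaluation, claim)

score = asyncio.run(evaluate_async("a verifiable claim"))
print(score)  # 0.8
```

Awaiting the future keeps the handler responsive, whereas calling `future.result()` directly inside a coroutine would block the loop.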

Refinement History Tracking

Every refinement iteration is tracked and returned to the user:
From api/types/completions.py:39-54
class ClaimType(str, Enum):
    ORIGINAL = "original"
    REFINED = "refined"
    FINAL = "final"

class RefinementHistory(BaseModel):
    claim_type: ClaimType
    claim: Optional[str]
    score: float  # 0.0 to 1.0
    feedback: Optional[str]

class RefinementMetadata(BaseModel):
    metric_used: Optional[str]
    threshold: Optional[float]
    refinement_model: Optional[str]
    refinement_history: List[RefinementHistory]

Example Refinement History

{
  "refinement_metadata": {
    "metric_used": "Claim Quality Assessment",
    "threshold": 0.7,
    "refinement_model": "gpt-4o",
    "refinement_history": [
      {
        "claim_type": "original",
        "claim": "Drinking lots of water cures coronavirus",
        "score": 0.45,
        "feedback": "Claim is not self-contained and overstates effectiveness"
      },
      {
        "claim_type": "refined",
        "claim": "Some health sources recommend drinking water to help prevent coronavirus infection",
        "score": 0.72,
        "feedback": "Improved verifiability and reduced overgeneralization"
      },
      {
        "claim_type": "final",
        "claim": "Some health sources recommend drinking water to help prevent coronavirus infection",
        "score": 0.72,
        "feedback": "Meets quality threshold"
      }
    ]
  }
}
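
Given a response shaped like the JSON above, a client can pull out the final claim and check how much refinement improved the score. The dictionary below is an abbreviated copy of that example; the field names match the models shown earlier:

```python
metadata = {
    "metric_used": "Claim Quality Assessment",
    "threshold": 0.7,
    "refinement_history": [
        {"claim_type": "original", "claim": "original claim", "score": 0.45},
        {"claim_type": "refined", "claim": "better claim", "score": 0.72},
        {"claim_type": "final", "claim": "better claim", "score": 0.72},
    ],
}

history = metadata["refinement_history"]
final = next(e for e in history if e["claim_type"] == "final")
original = next(e for e in history if e["claim_type"] == "original")

improvement = final["score"] - original["score"]       # 0.27
met_threshold = final["score"] >= metadata["threshold"]  # True
print(f"improved by {improvement:.2f}, threshold met: {met_threshold}")
```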

Using the Refinement Pipeline

API Request with Refinement

import openai

client = openai.OpenAI(
    base_url="https://api.checkthat.ai/v1",
    api_key="your-checkthat-api-key"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Eating garlic prevents COVID-19"}
    ],
    # Refinement parameters
    extra_body={
        "refine_claims": True,
        "refine_threshold": 0.7,
        "refine_max_iters": 3,
        "refine_model": "gpt-4o"
    }
)

print(response.choices[0].message.content)
print(response.refinement_metadata.refinement_history)
- refine_claims (bool, default false): Enable the refinement pipeline
- refine_threshold (float, default 0.5): Minimum quality score to accept (0.0-1.0)
- refine_max_iters (int, default 3): Maximum refinement iterations
- refine_model (string): Model to use for refinement (any supported model)

Performance Considerations

Latency

Each refinement iteration adds ~2-5 seconds depending on the model. Plan for 6-15 seconds total with 3 iterations.

Cost

Each iteration adds two API calls (one evaluation, one refinement). Use lower thresholds or fewer iterations for cost-sensitive applications.

Quality

Higher thresholds (0.7-0.9) produce better claims but may require more iterations. Balance quality vs. speed/cost.

Model Selection

Stronger models (GPT-4, Claude Opus) produce better refinements. Faster models (GPT-4o-mini, Claude Haiku) reduce latency.
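
These trade-offs can be quantified with a quick back-of-the-envelope estimate. The per-iteration latency range comes from the figures above; the call count assumes one initial evaluation plus an evaluation and a refinement call per iteration, which is an inference from the algorithm steps rather than a documented guarantee:

```python
def estimate_refinement(max_iters: int, secs_per_iter=(2.0, 5.0)):
    """Worst-case latency range and API-call count for one refined claim."""
    lo, hi = secs_per_iter
    # 1 initial evaluation + (refinement + re-evaluation) per iteration
    api_calls = 1 + 2 * max_iters
    return api_calls, (lo * max_iters, hi * max_iters)

calls, (min_s, max_s) = estimate_refinement(max_iters=3)
print(calls, min_s, max_s)  # 7 6.0 15.0
```

With the default `max_iters=3`, the worst case is 7 extra API calls and roughly 6-15 seconds, matching the latency guidance above.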

Error Handling

The refinement pipeline gracefully handles failures:
From api/services/refinement/refine.py:172-185
except Exception as e:
    logger.warning(f"Failed to refine claim: {e}")
    # Return the original response with the error recorded in the history
    error_history = RefinementHistory(
        claim_type=ClaimType.FINAL,
        claim=current_claim,
        score=0.0,
        feedback=f"Refinement failed: {str(e)}"
    )
    refinement_history.append(error_history)
    return current_response or original_response, refinement_history
If refinement fails, the API returns the original claim with error details in the refinement_history. Your application will never receive an error response due to refinement issues.
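
Because failures surface as history entries rather than HTTP errors, clients may want to check for them explicitly. A sketch, assuming history entries are dictionaries shaped like the example above and that failures carry a score of 0.0 with a "Refinement failed" feedback message:

```python
def refinement_failed(history: list) -> bool:
    """Detect the failure signature in a refinement history."""
    final = next((e for e in history if e.get("claim_type") == "final"), None)
    if final is None:
        return False
    feedback = str(final.get("feedback", ""))
    return final.get("score") == 0.0 and feedback.startswith("Refinement failed")

ok_history = [{"claim_type": "final", "score": 0.72,
               "feedback": "Meets quality threshold"}]
bad_history = [{"claim_type": "final", "score": 0.0,
                "feedback": "Refinement failed: timeout"}]
print(refinement_failed(ok_history), refinement_failed(bad_history))  # False True
```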

Next Steps

Evaluation Metrics

Learn about G-Eval and other quality metrics used in refinement

Supported Models

Choose the best model for your refinement needs
