Cross-Refine extends the Self-Refine algorithm by leveraging multiple AI models in the refinement process. By using different models for generation, evaluation, and refinement, you can achieve higher quality claims through diverse perspectives.

Concept Overview

Cross-Refine implements a collaborative approach:

1. Generate with Model A: create the initial normalized claim using your primary model.
2. Evaluate with Model B: use a different model for the G-Eval assessment to get an alternative perspective.
3. Refine with Model C (optional): use a third model to refine the claim based on the feedback.
4. Consensus or selection: choose the best result or build a consensus claim.
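The steps above can be sketched as a plain-Python loop. This is an illustrative outline only: `generate`, `evaluate`, and `refine` are hypothetical stand-ins for calls to Models A, B, and C, not CheckThat APIs.

```python
# Illustrative Cross-Refine loop with stubbed model calls.
# generate/evaluate/refine stand in for Models A, B, and C.

def generate(post: str) -> str:
    """Model A: extract an initial normalized claim (stubbed)."""
    return post.split(".")[0].strip()

def evaluate(claim: str) -> tuple[float, str]:
    """Model B: score the claim 0-1 and return feedback (stubbed)."""
    score = 0.9 if "may" in claim else 0.5
    return score, "Hedge absolute wording (e.g. use 'may')."

def refine(claim: str, feedback: str) -> str:
    """Model C: revise the claim using the evaluator's feedback (stubbed)."""
    return claim.replace("eliminates", "may reduce")

def cross_refine(post: str, threshold: float = 0.7, max_iters: int = 3) -> str:
    claim = generate(post)                   # Step 1: Model A
    for _ in range(max_iters):
        score, feedback = evaluate(claim)    # Step 2: Model B
        if score >= threshold:               # Step 4: accept
            break
        claim = refine(claim, feedback)      # Step 3: Model C
    return claim

print(cross_refine("Gargling eliminates the virus. More text."))
```

In the library, this loop runs internally when you pass `refine_claims=True`; the sketch only shows the control flow.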

Why Multiple Models?

Different models have unique strengths:

OpenAI (GPT-5)

  • Excellent reasoning
  • Strong factuality
  • Good at complex claims

Anthropic (Claude)

  • Safety-focused
  • Nuanced understanding
  • Detailed explanations

Google (Gemini)

  • Broad knowledge
  • Multilingual support
  • Creative rephrasing

Supported Models

CheckThat AI supports 5 model providers (from _types.py:4-17):
OPENAI_MODELS = [
    "gpt-5-2025-08-07",           # GPT-5 (most capable)
    "gpt-5-nano-2025-08-07",      # GPT-5 nano (faster)
    "o3-2025-04-16",              # o3 reasoning
    "o4-mini-2025-04-16"          # o4-mini
]
Best for:
  • Initial claim generation
  • Complex reasoning tasks
  • Chain-of-thought prompting

Basic Cross-Refine Pattern

Two-Model Approach

Generate with one model, evaluate with another:
from checkthat import CheckThat

client = CheckThat(api_key="your-api-key")

post = """Eating vaginal fluids makes you immune to cancer. 
Scientists at St. Austin University in North Carolina investigated..."""

# Generate with GPT-5, evaluate with Claude
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",              # Generation model
    messages=[{"role": "user", "content": post}],
    refine_claims=True,
    refine_model="claude-opus-4-1-20250805",  # Evaluation model (different!)
    refine_threshold=0.7,
    refine_max_iters=3
)

print(f"Final claim: {response.choices[0].message.content}")
print(f"\nRefinement by: {response.refinement_metadata.refinement_model}")
print(f"Iterations: {len(response.refinement_metadata.refinement_history)}")
Using different models for generation and evaluation reduces single-model bias and improves claim quality.

Three-Model Approach

Generate with Model A, evaluate with Model B, refine with Model C:
from checkthat import CheckThat

client = CheckThat(api_key="your-api-key")

post = "Corona virus remains in throat for 4 days before reaching lungs. Gargling eliminates it."

# Step 1: Generate with GPT-5 nano (fast)
initial = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",
    messages=[{"role": "user", "content": post}]
)
initial_claim = initial.choices[0].message.content

# Step 2: Evaluate with Claude (thorough assessment)
evaluation_prompt = f"""
Original post: {post}
Extracted claim: {initial_claim}

As a professional fact-checker, evaluate this claim on a 0-1 scale for:
1. Verifiability
2. Self-containment  
3. Check-worthiness
4. Factual consistency

Provide specific improvement suggestions.
"""

evaluation = client.chat.completions.create(
    model="claude-opus-4-1-20250805",
    messages=[{"role": "user", "content": evaluation_prompt}]
)
feedback = evaluation.choices[0].message.content

# Step 3: Refine with Gemini (creative rephrasing)
refinement_prompt = f"""
Original post: {post}
Current claim: {initial_claim}

Expert feedback:
{feedback}

Refine the claim based on this feedback. Output only the improved claim.
"""

refined = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": refinement_prompt}]
)

final_claim = refined.choices[0].message.content
print(f"Initial (GPT-5 nano): {initial_claim}")
print(f"Final (Gemini after Claude eval): {final_claim}")

Model Selection Strategies

Strategy 1: Complementary Strengths

Pair models with complementary capabilities:
Use Case: Medical misinformation, harmful content
# Generate: Fast model for initial extraction
# Evaluate: Safety-focused model
response = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",        # Fast generation
    refine_model="claude-opus-4-1-20250805",  # Safety evaluation
    refine_claims=True,
    refine_threshold=0.8  # High bar for safety
)

Strategy 2: Consensus Building

Generate multiple claims and build consensus:
import asyncio
from collections import Counter

async def cross_refine_consensus(post: str, models: list):
    """Generate claims with multiple models and find consensus."""
    client = CheckThat(api_key="your-api-key")
    
    # Generate with multiple models in parallel
    tasks = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": post}],
            refine_claims=True,
            refine_model="claude-opus-4-1-20250805",  # Same evaluator
            refine_threshold=0.7
        )
        for model in models
    ]
    
    results = await asyncio.gather(*tasks)
    
    # Extract final claims and scores
    claims = [
        (
            r.choices[0].message.content,
            r.refinement_metadata.refinement_history[-1].score,
            model
        )
        for r, model in zip(results, models)
    ]
    
    # Sort by score
    claims.sort(key=lambda x: x[1], reverse=True)
    
    print("\nCross-Refine Results:")
    for claim, score, model in claims:
        print(f"\n{model} (score: {score:.2f})")
        print(f"  {claim}")
    
    # Return highest-scoring claim
    return claims[0]

# Use multiple models
models = [
    "gpt-5-2025-08-07",
    "claude-opus-4-1-20250805",
    "gemini-2.5-pro",
    "grok-4-0709"
]

post = """Hydrate YOURSELF. Water 30 min before a meal. 
DRINK before taking a shower. Helps activate internal organs, 
digest food, lower blood pressure."""

best_claim, best_score, best_model = asyncio.run(
    cross_refine_consensus(post, models)
)

print(f"\n\nBest Claim ({best_model}, {best_score:.2f}):")
print(best_claim)
Output:
Cross-Refine Results:

gpt-5-2025-08-07 (score: 0.72)
  Drinking water at specific times can have different health benefits

claude-opus-4-1-20250805 (score: 0.71)
  Drinking water at specific times throughout the day provides various health benefits

gemini-2.5-pro (score: 0.69)
  Timing water consumption can offer multiple health advantages

grok-4-0709 (score: 0.68)
  Strategic water intake timing may support various bodily functions


Best Claim (gpt-5-2025-08-07, 0.72):
Drinking water at specific times can have different health benefits

Strategy 3: Ensemble Refinement

Use multiple models for evaluation:
from checkthat import CheckThat
import numpy as np

client = CheckThat(api_key="your-api-key")

post = "Bruce Lee playing table tennis with nunchucks in 1970"

# Generate initial claim
initial = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}]
)
initial_claim = initial.choices[0].message.content

# Evaluate with multiple models
eval_models = [
    "claude-opus-4-1-20250805",
    "gemini-2.5-pro",
    "gpt-5-2025-08-07"
]

evaluations = []
for eval_model in eval_models:
    response = client.chat.completions.create(
        model="gpt-5-2025-08-07",  # Same generator
        messages=[{"role": "user", "content": post}],
        refine_claims=True,
        refine_model=eval_model,  # Different evaluators
        refine_threshold=0.7,
        refine_max_iters=2
    )
    
    final_history = response.refinement_metadata.refinement_history[-1]
    evaluations.append({
        "model": eval_model,
        "claim": response.choices[0].message.content,
        "score": final_history.score,
        "feedback": final_history.feedback
    })

# Calculate ensemble metrics
scores = [e["score"] for e in evaluations]
avg_score = np.mean(scores)
std_score = np.std(scores)

print(f"\nEnsemble Evaluation Results:")
print(f"Average Score: {avg_score:.3f} (±{std_score:.3f})")
print(f"\nIndividual Evaluations:")

for eval_result in evaluations:
    print(f"\n{eval_result['model']}: {eval_result['score']:.3f}")
    print(f"  Claim: {eval_result['claim']}")
    print(f"  Feedback: {eval_result['feedback'][:100]}...")

# Use claim with highest score or best consensus
best = max(evaluations, key=lambda x: x["score"])
print(f"\n\nRecommended Claim ({best['model']}):")
print(best['claim'])

Real-World Examples

Example 1: Medical Misinformation

Configuration:
model="gpt-5-2025-08-07"
refine_model="gpt-5-2025-08-07"  # Same model
Result (Score: 0.68):
Gargling with warm water and salt or vinegar eliminates coronavirus from throat
Issues:
  • “Eliminates” is too absolute
  • Single model bias
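A cross-provider configuration addresses both issues. One plausible pairing (illustrative; scores will vary) applies the safety-focused evaluator from Strategy 1:

```python
model="gpt-5-2025-08-07"                   # Generation model
refine_model="claude-opus-4-1-20250805"    # Cross-provider evaluator
refine_threshold=0.8                       # High bar for medical claims
```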

Example 2: Celebrity Content

Use Case: Low-priority viral content
# Quick two-pass
response = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",        # Fast generation
    refine_model="gpt-5-nano-2025-08-07",  # Fast evaluation
    refine_claims=True,
    refine_threshold=0.6,  # Lower bar
    refine_max_iters=1  # Single refinement
)
Result (Score: 0.62, 3 seconds):
Video shows Bruce Lee playing table tennis with nunchucks

Model Compatibility

All models support Cross-Refine through the unified DeepEval interface (from deepeval_model.py:27-56):
def getEvalModel(self) -> Union[GPTModel, GeminiModel, AnthropicModel, GrokModel]:
    """Map CheckThat models to DeepEval evaluation models."""
    
    if self.api_provider == 'OPENAI':
        return GPTModel(model=self.model, _openai_api_key=self.api_key)
        
    elif self.api_provider == 'XAI':
        return GrokModel(model=self.model, api_key=self.api_key)
        
    elif self.api_provider == 'ANTHROPIC':
        return AnthropicModel(model=self.model, _anthropic_api_key=self.api_key)
        
    elif self.api_provider == 'GEMINI':
        return GeminiModel(model=self.model, api_key=self.api_key)
Supported combinations:
  • Any generation model + Any evaluation model
  • Cross-provider refinement (OpenAI → Claude → Gemini)
  • Same-provider different versions (GPT-5 → GPT-5 nano)
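Under the hood, `getEvalModel` dispatches on the provider. A minimal sketch of how model names could be routed to providers, assuming prefix-based matching (`infer_provider` and `PREFIX_TO_PROVIDER` are hypothetical helpers, not part of the library):

```python
# Illustrative provider routing by model-name prefix. The real mapping
# is internal to CheckThat; this helper is for explanation only.

PREFIX_TO_PROVIDER = {
    "gpt-": "OPENAI",
    "o3-": "OPENAI",
    "o4-": "OPENAI",
    "claude-": "ANTHROPIC",
    "gemini-": "GEMINI",
    "grok-": "XAI",
}

def infer_provider(model: str) -> str:
    """Return the provider key for a model name, or raise if unknown."""
    for prefix, provider in PREFIX_TO_PROVIDER.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"Unknown model: {model}")

# Any generation/evaluation pairing is valid, including cross-provider:
print(infer_provider("gpt-5-2025-08-07"))          # OPENAI
print(infer_provider("claude-opus-4-1-20250805"))  # ANTHROPIC
```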

Performance Comparison

Speed vs. Quality Trade-offs

Configuration:
model="gpt-5-nano-2025-08-07"              # Fast
refine_model="gpt-5-nano-2025-08-07"       # Fast
refine_threshold=0.6
refine_max_iters=1
Metrics:
  • Latency: 3-5 seconds
  • Cost: $
  • Quality Score: 0.60-0.65
  • Best for: High-volume, low-priority claims
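At the quality end of the trade-off, a premium pairing mirrors the top tier of the escalation pattern below (illustrative configuration; latency and scores depend on your workload):

```python
model="o3-2025-04-16"                      # Deep reasoning for generation
refine_model="claude-opus-4-1-20250805"    # Thorough evaluation
refine_threshold=0.85                      # High quality bar
refine_max_iters=3                         # Allow more refinement passes
```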

Best Practices

Start Simple

Begin with a single model, then add Cross-Refine if quality falls short

Match Use Case

High-stakes claims deserve premium model combinations

Monitor Costs

Multiple models multiply API costs; track ROI per combination

A/B Test

Compare model combinations on your specific data
For production systems, implement a tiered approach: fast models for initial filtering, cross-refine with premium models for high-priority claims.

Advanced Patterns

Pattern 1: Specialist Cascade

def specialist_cascade(post: str, domain: str):
    """Use specialized models based on domain."""
    client = CheckThat(api_key="your-api-key")
    
    # Domain-specific model selection
    specialist_map = {
        "medical": "claude-opus-4-1-20250805",  # Safety-focused
        "legal": "gpt-5-2025-08-07",            # Reasoning
        "multilingual": "gemini-2.5-pro",       # Language support
        "technical": "o3-2025-04-16",           # Deep reasoning
    }
    
    generation_model = specialist_map.get(domain, "gpt-5-2025-08-07")
    evaluation_model = "claude-opus-4-1-20250805"  # Always use Claude for eval
    
    return client.chat.completions.create(
        model=generation_model,
        messages=[{"role": "user", "content": post}],
        refine_claims=True,
        refine_model=evaluation_model,
        refine_threshold=0.75,
        refine_max_iters=3
    )

Pattern 2: Quality Threshold Escalation

async def escalating_refinement(post: str):
    """Escalate to better models if quality threshold not met."""
    client = CheckThat(api_key="your-api-key")
    
    model_tiers = [
        ("gpt-5-nano-2025-08-07", "gpt-5-nano-2025-08-07", 0.65),
        ("gpt-5-2025-08-07", "claude-sonnet-4-20250514", 0.75),
        ("o3-2025-04-16", "claude-opus-4-1-20250805", 0.85),
    ]
    
    for gen_model, eval_model, threshold in model_tiers:
        response = await client.chat.completions.create(
            model=gen_model,
            messages=[{"role": "user", "content": post}],
            refine_claims=True,
            refine_model=eval_model,
            refine_threshold=threshold,
            refine_max_iters=2
        )
        
        final_score = response.refinement_metadata.refinement_history[-1].score
        
        if final_score >= threshold:
            return response  # Success at this tier
    
    return response  # Return best attempt

Next Steps

Custom Evaluation

Define your own G-Eval criteria

API Reference

View complete parameter reference
