Cross-Refine extends the Self-Refine algorithm by leveraging multiple AI models in the refinement process. By using different models for generation, evaluation, and refinement, you can achieve higher quality claims through diverse perspectives.
Concept Overview
Cross-Refine implements a collaborative approach:
Generate with Model A
Create initial normalized claim using your primary model
Evaluate with Model B
Use a different model for G-Eval assessment to get alternative perspective
Refine with Model C
Optional: Use third model for refinement based on feedback
Consensus or Selection
Choose best result or create consensus claim
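The four steps above can be sketched as a plain loop. The `generate`, `evaluate`, and `refine` callables below are hypothetical stand-ins for calls to Models A, B, and C; they are not part of the CheckThat API.

```python
from typing import Callable

def cross_refine(post: str,
                 generate: Callable[[str], str],
                 evaluate: Callable[[str, str], tuple[float, str]],
                 refine: Callable[[str, str, str], str],
                 threshold: float = 0.7,
                 max_iters: int = 3) -> tuple[str, float]:
    """Generic Cross-Refine loop: Model A generates a claim, Model B
    scores it and gives feedback, Model C rewrites it until the score
    clears the threshold or iterations run out."""
    claim = generate(post)                     # Model A
    score, feedback = evaluate(post, claim)    # Model B
    for _ in range(max_iters):
        if score >= threshold:
            break
        claim = refine(post, claim, feedback)  # Model C
        score, feedback = evaluate(post, claim)
    return claim, score
```

Plugging in three different providers for the three callables gives the cross-model behavior described in the rest of this page.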
Why Multiple Models?
Different models have unique strengths:
OpenAI (GPT-5)
Excellent reasoning
Strong factuality
Good at complex claims
Anthropic (Claude)
Safety-focused
Nuanced understanding
Detailed explanations
Google (Gemini)
Broad knowledge
Multilingual support
Creative rephrasing
Supported Models
CheckThat AI supports 5 model providers (from _types.py:4-17):
OpenAI
Anthropic
Google
xAI
Together AI
OPENAI_MODELS = [
    "gpt-5-2025-08-07",       # GPT-5 (most capable)
    "gpt-5-nano-2025-08-07",  # GPT-5 nano (faster)
    "o3-2025-04-16",          # o3 reasoning
    "o4-mini-2025-04-16"      # o4-mini
]
Best for:
Initial claim generation
Complex reasoning tasks
Chain-of-thought prompting
ANTHROPIC_MODELS = [
    "claude-sonnet-4-20250514",  # Claude Sonnet 4
    "claude-opus-4-1-20250805"   # Claude Opus 4.1 (most capable)
]
Best for:
Evaluation and feedback
Safety-critical claims
Nuanced language analysis
GEMINI_MODELS = [
    "gemini-2.5-pro",   # Gemini 2.5 Pro
    "gemini-2.5-flash"  # Gemini 2.5 Flash (faster)
]
Best for:
Multilingual content
Creative refinement
Broad knowledge queries
xAI_MODELS = [
    "grok-3",       # Grok 3
    "grok-4-0709",  # Grok 4 (most capable)
    "grok-3-mini"   # Grok 3 Mini (faster)
]
Best for:
Alternative perspectives
Real-time data integration
Unconventional claims
TOGETHER_MODELS = [
    "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    "deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free"
]
Best for:
Cost-effective processing
Open-source alternatives
High-throughput scenarios
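Since cross-provider refinement depends on which provider sits behind each model name, a small lookup over the lists above can be handy. The list constants are reproduced from this page; `provider_for` and `is_cross_provider` are hypothetical helpers, not part of the SDK.

```python
OPENAI_MODELS = ["gpt-5-2025-08-07", "gpt-5-nano-2025-08-07",
                 "o3-2025-04-16", "o4-mini-2025-04-16"]
ANTHROPIC_MODELS = ["claude-sonnet-4-20250514", "claude-opus-4-1-20250805"]
GEMINI_MODELS = ["gemini-2.5-pro", "gemini-2.5-flash"]
xAI_MODELS = ["grok-3", "grok-4-0709", "grok-3-mini"]
TOGETHER_MODELS = ["meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
                   "deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free"]

PROVIDERS = {
    "OPENAI": OPENAI_MODELS,
    "ANTHROPIC": ANTHROPIC_MODELS,
    "GEMINI": GEMINI_MODELS,
    "XAI": xAI_MODELS,
    "TOGETHER": TOGETHER_MODELS,
}

def provider_for(model: str) -> str:
    """Return the provider key for a model name."""
    for provider, models in PROVIDERS.items():
        if model in models:
            return provider
    raise ValueError(f"Unknown model: {model}")

def is_cross_provider(gen_model: str, eval_model: str) -> bool:
    """True when generation and evaluation use different providers."""
    return provider_for(gen_model) != provider_for(eval_model)
```

A check like `is_cross_provider(...)` can guard configurations that are meant to avoid same-provider bias.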
Basic Cross-Refine Pattern
Two-Model Approach
Generate with one model, evaluate with another:
from checkthat import CheckThat

client = CheckThat(api_key="your-api-key")

post = """Eating vaginal fluids makes you immune to cancer.
Scientists at St. Austin University in North Carolina investigated..."""

# Generate with GPT-5, evaluate with Claude
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",                 # Generation model
    messages=[{"role": "user", "content": post}],
    refine_claims=True,
    refine_model="claude-opus-4-1-20250805",  # Evaluation model (different!)
    refine_threshold=0.7,
    refine_max_iters=3
)

print(f"Final claim: {response.choices[0].message.content}")
print(f"\nRefinement by: {response.refinement_metadata.refinement_model}")
print(f"Iterations: {len(response.refinement_metadata.refinement_history)}")
Using different models for generation and evaluation reduces model bias and improves claim quality.
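The refinement history returned with the response can be condensed into a few numbers worth logging. Per the examples on this page, each history entry exposes `.score` and `.feedback`; `summarize_refinement` is a hypothetical helper, shown here with a small stand-in record type so it can be run in isolation.

```python
from dataclasses import dataclass

# Stand-in for the history entries shown elsewhere on this page,
# which expose a per-iteration score and feedback string.
@dataclass
class HistoryEntry:
    score: float
    feedback: str

def summarize_refinement(history: list, threshold: float) -> dict:
    """Condense a refinement history into loggable metrics."""
    scores = [h.score for h in history]
    return {
        "iterations": len(history),
        "initial_score": scores[0],
        "final_score": scores[-1],
        "improvement": scores[-1] - scores[0],
        "met_threshold": scores[-1] >= threshold,
    }
```

In practice you would call it as `summarize_refinement(response.refinement_metadata.refinement_history, 0.7)`.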
Three-Model Approach
Generate with Model A, evaluate with Model B, refine with Model C:
from checkthat import CheckThat

client = CheckThat(api_key="your-api-key")

post = "Corona virus remains in throat for 4 days before reaching lungs. Gargling eliminates it."

# Step 1: Generate with GPT-5 nano (fast)
initial = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",
    messages=[{"role": "user", "content": post}]
)
initial_claim = initial.choices[0].message.content

# Step 2: Evaluate with Claude (thorough assessment)
evaluation_prompt = f"""
Original post: {post}
Extracted claim: {initial_claim}

As a professional fact-checker, evaluate this claim on a 0-1 scale for:
1. Verifiability
2. Self-containment
3. Check-worthiness
4. Factual consistency

Provide specific improvement suggestions.
"""

evaluation = client.chat.completions.create(
    model="claude-opus-4-1-20250805",
    messages=[{"role": "user", "content": evaluation_prompt}]
)
feedback = evaluation.choices[0].message.content

# Step 3: Refine with Gemini (creative rephrasing)
refinement_prompt = f"""
Original post: {post}
Current claim: {initial_claim}

Expert feedback:
{feedback}

Refine the claim based on this feedback. Output only the improved claim.
"""

refined = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": refinement_prompt}]
)
final_claim = refined.choices[0].message.content

print(f"Initial (GPT-5 nano): {initial_claim}")
print(f"Final (Gemini after Claude eval): {final_claim}")
Model Selection Strategies
Strategy 1: Complementary Strengths
Pair models with complementary capabilities:
Safety-Critical
Multilingual
Complex Reasoning
Cost-Optimized
Use Case: Medical misinformation, harmful content
# Generate: Fast model for initial extraction
# Evaluate: Safety-focused model
response = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",            # Fast generation
    messages=[{"role": "user", "content": post}],
    refine_model="claude-opus-4-1-20250805",  # Safety evaluation
    refine_claims=True,
    refine_threshold=0.8                      # High bar for safety
)
Use Case: Non-English content
# Generate: Multilingual model
# Evaluate: Strong reasoning model
response = client.chat.completions.create(
    model="gemini-2.5-pro",            # Multilingual strength
    messages=[{"role": "user", "content": post}],
    refine_model="gpt-5-2025-08-07",   # Reasoning evaluation
    refine_claims=True,
    refine_threshold=0.7
)
Use Case: Ambiguous claims requiring deep analysis
# Generate: Reasoning model with CoT
# Evaluate: Alternative reasoning model
response = client.chat.completions.create(
    model="o3-2025-04-16",                    # o3 reasoning
    messages=[{"role": "user", "content": post}],
    refine_model="claude-opus-4-1-20250805",  # Alternative reasoning
    refine_claims=True,
    refine_threshold=0.75
)
Use Case: High-volume processing
# Generate: Free/cheap model
# Evaluate: Efficient model
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",  # Free
    messages=[{"role": "user", "content": post}],
    refine_model="gpt-5-nano-2025-08-07",  # Efficient evaluation
    refine_claims=True,
    refine_threshold=0.65,
    refine_max_iters=2                     # Limit iterations
)
Strategy 2: Consensus Building
Generate multiple claims and build consensus:
import asyncio

from checkthat import CheckThat

async def cross_refine_consensus(post: str, models: list):
    """Generate claims with multiple models and find consensus."""
    client = CheckThat(api_key="your-api-key")

    # Generate with multiple models in parallel
    tasks = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": post}],
            refine_claims=True,
            refine_model="claude-opus-4-1-20250805",  # Same evaluator
            refine_threshold=0.7
        )
        for model in models
    ]
    results = await asyncio.gather(*tasks)

    # Extract final claims and scores
    claims = [
        (
            r.choices[0].message.content,
            r.refinement_metadata.refinement_history[-1].score,
            model
        )
        for r, model in zip(results, models)
    ]

    # Sort by score
    claims.sort(key=lambda x: x[1], reverse=True)

    print("\nCross-Refine Results:")
    for claim, score, model in claims:
        print(f"\n{model} (score: {score:.2f})")
        print(f"{claim}")

    # Return highest-scoring claim
    return claims[0]

# Use multiple models
models = [
    "gpt-5-2025-08-07",
    "claude-opus-4-1-20250805",
    "gemini-2.5-pro",
    "grok-4-0709"
]

post = """Hydrate YOURSELF. Water 30 min before a meal.
DRINK before taking a shower. Helps activate internal organs,
digest food, lower blood pressure."""

best_claim, best_score, best_model = asyncio.run(
    cross_refine_consensus(post, models)
)
print(f"\n\nBest Claim ({best_model}, {best_score:.2f}):")
print(best_claim)
Output:
Cross-Refine Results:
gpt-5-2025-08-07 (score: 0.72)
Drinking water at specific times can have different health benefits
claude-opus-4-1-20250805 (score: 0.71)
Drinking water at specific times throughout the day provides various health benefits
gemini-2.5-pro (score: 0.69)
Timing water consumption can offer multiple health advantages
grok-4-0709 (score: 0.68)
Strategic water intake timing may support various bodily functions
Best Claim (gpt-5-2025-08-07, 0.72):
Drinking water at specific times can have different health benefits
Strategy 3: Ensemble Refinement
Use multiple models for evaluation:
from checkthat import CheckThat
import numpy as np

client = CheckThat(api_key="your-api-key")

post = "Bruce Lee playing table tennis with nunchucks in 1970"

# Generate initial claim
initial = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": post}]
)
initial_claim = initial.choices[0].message.content

# Evaluate with multiple models
eval_models = [
    "claude-opus-4-1-20250805",
    "gemini-2.5-pro",
    "gpt-5-2025-08-07"
]

evaluations = []
for eval_model in eval_models:
    response = client.chat.completions.create(
        model="gpt-5-2025-08-07",  # Same generator
        messages=[{"role": "user", "content": post}],
        refine_claims=True,
        refine_model=eval_model,   # Different evaluators
        refine_threshold=0.7,
        refine_max_iters=2
    )
    final_history = response.refinement_metadata.refinement_history[-1]
    evaluations.append({
        "model": eval_model,
        "claim": response.choices[0].message.content,
        "score": final_history.score,
        "feedback": final_history.feedback
    })

# Calculate ensemble metrics
scores = [e["score"] for e in evaluations]
avg_score = np.mean(scores)
std_score = np.std(scores)

print("\nEnsemble Evaluation Results:")
print(f"Average Score: {avg_score:.3f} (±{std_score:.3f})")
print("\nIndividual Evaluations:")
for eval_result in evaluations:
    print(f"\n{eval_result['model']}: {eval_result['score']:.3f}")
    print(f"  Claim: {eval_result['claim']}")
    print(f"  Feedback: {eval_result['feedback'][:100]}...")

# Use claim with highest score or best consensus
best = max(evaluations, key=lambda x: x["score"])
print(f"\n\nRecommended Claim ({best['model']}):")
print(best['claim'])
Real-World Examples
Example 1: Health Misinformation
GPT-5 Only
GPT-5 + Claude
Consensus (4 Models)
Configuration:
model = "gpt-5-2025-08-07"
refine_model = "gpt-5-2025-08-07"  # Same model
Result (Score: 0.68): Gargling with warm water and salt or vinegar eliminates coronavirus from throat
Issues:
“Eliminates” is too absolute
Single-model bias
Configuration:
model = "gpt-5-2025-08-07"                 # Generation
refine_model = "claude-opus-4-1-20250805"  # Evaluation
Result (Score: 0.75): Gargling water can protect against coronavirus
Improvements:
✅ Claude’s safety focus caught the overstated claim
✅ More accurate representation
✅ Better hedge (“can protect” vs. “eliminates”)
Models:
GPT-5: “Gargling can protect against coronavirus”
Claude: “Gargling water may help prevent coronavirus”
Gemini: “Throat gargling offers coronavirus protection”
Grok: “Gargling suggested as coronavirus preventive measure”
Consensus (Score: 0.78): Gargling water can protect against coronavirus
Why Best:
✅ Highest average score across evaluators
✅ Most similar to other models (consensus)
✅ Balanced confidence level
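The "most similar to other models" criterion can be made concrete with a simple word-overlap score. Jaccard similarity over lowercased words is an illustrative choice here, not the metric CheckThat uses internally.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two claims."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def consensus_claim(claims: list[str]) -> str:
    """Pick the claim with the highest average similarity to the others."""
    def avg_sim(i: int) -> float:
        others = [c for j, c in enumerate(claims) if j != i]
        return sum(jaccard(claims[i], o) for o in others) / len(others)
    best = max(range(len(claims)), key=avg_sim)
    return claims[best]
```

Applied to the four model outputs above, this would favor a phrasing that shares the most wording with its peers, which is one way to break ties between similar evaluator scores.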
Example 2: Celebrity Content
Fast Processing
Quality Processing
Use Case: Low-priority viral content
# Quick two-pass
response = client.chat.completions.create(
    model="gpt-5-nano-2025-08-07",         # Fast generation
    messages=[{"role": "user", "content": post}],
    refine_model="gpt-5-nano-2025-08-07",  # Fast evaluation
    refine_claims=True,
    refine_threshold=0.6,                  # Lower bar
    refine_max_iters=1                     # Single refinement
)
Result (Score: 0.62, 3 seconds): Video shows Bruce Lee playing table tennis with nunchucks
Use Case: Verification required
# Cross-model refinement
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",        # Strong reasoning
    messages=[{"role": "user", "content": post}],
    refine_model="gemini-2.5-pro",   # Creative refinement
    refine_claims=True,
    refine_threshold=0.7,
    refine_max_iters=3
)
Result (Score: 0.74, 15 seconds): Late actor and martial artist Bruce Lee playing table tennis with a set of nunchucks
Improvements:
✅ Added context (“Late actor and martial artist”)
✅ More precise language (“set of nunchucks”)
✅ Self-contained claim
Model Compatibility
All models support Cross-Refine through the unified DeepEval interface (from deepeval_model.py:27-56):
def getEvalModel(self) -> Union[GPTModel, GeminiModel, AnthropicModel, GrokModel]:
    """Map CheckThat models to DeepEval evaluation models."""
    if self.api_provider == 'OPENAI':
        return GPTModel(model=self.model, _openai_api_key=self.api_key)
    elif self.api_provider == 'XAI':
        return GrokModel(model=self.model, api_key=self.api_key)
    elif self.api_provider == 'ANTHROPIC':
        return AnthropicModel(model=self.model, _anthropic_api_key=self.api_key)
    elif self.api_provider == 'GEMINI':
        return GeminiModel(model=self.model, api_key=self.api_key)
Supported combinations:
Any generation model + Any evaluation model
Cross-provider refinement (OpenAI → Claude → Gemini)
Same-provider different versions (GPT-5 → GPT-5 nano)
Speed vs. Quality Trade-offs
Speed Priority
Balanced
Quality Priority
Consensus
Configuration:
model = "gpt-5-nano-2025-08-07"        # Fast
refine_model = "gpt-5-nano-2025-08-07" # Fast
refine_threshold = 0.6
refine_max_iters = 1
Metrics:
Latency: 3-5 seconds
Cost: $
Quality Score: 0.60-0.65
Best for: High-volume, low-priority claims
Configuration:
model = "gpt-5-2025-08-07"                # Quality
refine_model = "claude-sonnet-4-20250514" # Different provider
refine_threshold = 0.7
refine_max_iters = 2
Metrics:
Latency: 8-12 seconds
Cost: $$
Quality Score: 0.70-0.75
Best for: Production fact-checking
Configuration:
model = "o3-2025-04-16"                   # Reasoning
refine_model = "claude-opus-4-1-20250805" # Premium evaluation
refine_threshold = 0.8
refine_max_iters = 4
Metrics:
Latency: 20-35 seconds
Cost: $$$$
Quality Score: 0.80-0.88
Best for: Legal, medical, high-stakes claims
Configuration:
models = ["gpt-5-2025-08-07", "claude-opus-4-1-20250805",
          "gemini-2.5-pro", "grok-4-0709"]
refine_model = "gpt-5-2025-08-07"  # Same evaluator
refine_threshold = 0.75
refine_max_iters = 2
Metrics:
Latency: 12-18 seconds (parallel)
Cost: $$$$
Quality Score: 0.75-0.82 (consensus)
Best for: Contentious claims, research
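The first three trade-off tiers can be encoded so callers pick one by the stakes of the claim. The values mirror the tabs above (the multi-model Consensus tier needs a different shape and is omitted); the `Tier` record and `tier_for` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    model: str
    refine_model: str
    refine_threshold: float
    refine_max_iters: int

# Mirrors the Speed/Balanced/Quality tabs above (latency and cost omitted).
TIERS = {
    "speed": Tier("gpt-5-nano-2025-08-07", "gpt-5-nano-2025-08-07", 0.6, 1),
    "balanced": Tier("gpt-5-2025-08-07", "claude-sonnet-4-20250514", 0.7, 2),
    "quality": Tier("o3-2025-04-16", "claude-opus-4-1-20250805", 0.8, 4),
}

def tier_for(stakes: str) -> Tier:
    """Map a stakes label (low/high/anything else) onto a tier."""
    name = {"low": "speed", "high": "quality"}.get(stakes, "balanced")
    return TIERS[name]
```

A caller can then spread a tier into the request, e.g. passing `tier_for("high").model` and `tier_for("high").refine_model` to `client.chat.completions.create`.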
Best Practices
Start Simple: Begin with a single model, then add Cross-Refine if needed
Match Use Case: High-stakes claims deserve premium model combinations
Monitor Costs: Multiple models multiply costs; track ROI
A/B Test: Compare model combinations on your specific data
For production systems, implement a tiered approach: fast models for initial filtering, cross-refine with premium models for high-priority claims.
Advanced Patterns
Pattern 1: Specialist Cascade
from checkthat import CheckThat

def specialist_cascade(post: str, domain: str):
    """Use specialized models based on domain."""
    client = CheckThat(api_key="your-api-key")

    # Domain-specific model selection
    specialist_map = {
        "medical": "claude-opus-4-1-20250805",  # Safety-focused
        "legal": "gpt-5-2025-08-07",            # Reasoning
        "multilingual": "gemini-2.5-pro",       # Language support
        "technical": "o3-2025-04-16",           # Deep reasoning
    }

    generation_model = specialist_map.get(domain, "gpt-5-2025-08-07")
    evaluation_model = "claude-opus-4-1-20250805"  # Always use Claude for eval

    return client.chat.completions.create(
        model=generation_model,
        messages=[{"role": "user", "content": post}],
        refine_claims=True,
        refine_model=evaluation_model,
        refine_threshold=0.75,
        refine_max_iters=3
    )
Pattern 2: Quality Threshold Escalation
from checkthat import CheckThat

async def escalating_refinement(post: str):
    """Escalate to better models if the quality threshold is not met."""
    client = CheckThat(api_key="your-api-key")

    model_tiers = [
        ("gpt-5-nano-2025-08-07", "gpt-5-nano-2025-08-07", 0.65),
        ("gpt-5-2025-08-07", "claude-sonnet-4-20250514", 0.75),
        ("o3-2025-04-16", "claude-opus-4-1-20250805", 0.85),
    ]

    for gen_model, eval_model, threshold in model_tiers:
        response = await client.chat.completions.create(
            model=gen_model,
            messages=[{"role": "user", "content": post}],
            refine_claims=True,
            refine_model=eval_model,
            refine_threshold=threshold,
            refine_max_iters=2
        )
        final_score = response.refinement_metadata.refinement_history[-1].score
        if final_score >= threshold:
            return response  # Success at this tier

    return response  # Return best attempt
Next Steps
Custom Evaluation Define your own G-Eval criteria
API Reference View complete parameter reference