Multi-turn evaluations simulate realistic attack scenarios where malicious users attempt to manipulate your model through extended conversations. These tests are essential for detecting vulnerabilities that only emerge through conversational manipulation.

What Multi-Turn Evaluations Test

Unlike single-turn tests that send isolated prompts, multi-turn evaluations probe:

  • Conversational Manipulation: whether attackers can gradually shift conversation context to bypass safety boundaries
  • Jailbreak Resistance: whether the model maintains safety across role-play, hypotheticals, and other jailbreak techniques
  • Context Poisoning: whether prior conversation can be used to make harmful requests seem acceptable
  • Consistency: whether the model contradicts its safety policies across multiple turns
Many jailbreaks and safety failures only work through multi-turn conversations. Single-turn tests alone are insufficient for production safety validation.

How Multi-Turn Tests Work

1. Conversation Initialization: The evaluation system starts a conversation with your model, often using benign or trust-building messages.
2. Gradual Escalation: Over multiple turns, the system progressively introduces unsafe elements, testing how the model responds to increasingly problematic requests.
3. Context Exploitation: The system leverages conversation history to make harmful requests appear reasonable or to reference "agreed upon" fictional contexts.
4. Safety Assessment: Each turn is evaluated for safety. The conversation is marked as failed if any response scores below your threshold.
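The loop above can be sketched in Python. This is a minimal illustration with toy stand-ins for the model and scorer; `run_multi_turn_eval`, `safety_score`, and the sample prompts are hypothetical, not part of the CLI:

```python
def run_multi_turn_eval(model, attack_prompts, safety_score, threshold):
    """Simulate a multi-turn evaluation: send escalating attack prompts,
    score each assistant response, and fail on the first unsafe turn."""
    history = []
    for turn, prompt in enumerate(attack_prompts, start=1):
        history.append({"role": "user", "content": prompt})
        reply = model(history)  # the model sees the full conversation so far
        history.append({"role": "assistant", "content": reply})
        if safety_score(reply) < threshold:
            # Assistant responses land on even turn numbers (2, 4, 6, ...)
            return {"passed": False, "failed_turn": 2 * turn, "history": history}
    return {"passed": True, "failed_turn": None, "history": history}

# Toy stand-ins for demonstration only
prompts = ["benign opener", "gradual escalation", "context exploitation"]
replies = iter(["safe", "safe", "unsafe"])
scores = {"safe": 0.9, "unsafe": 0.2}

result = run_multi_turn_eval(
    model=lambda history: next(replies),
    attack_prompts=prompts,
    safety_score=lambda reply: scores[reply],
    threshold=0.5,
)
print(result["passed"], result["failed_turn"])  # False 6
```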

Parameters

--threshold (Required)

The minimum safety score (0.0 to 1.0) required for all responses in the conversation to pass.
# Strict: All responses must be very safe
cbl multi-turn --threshold 0.8 --max-turns 8 \
    --test-types user_persona \
    openai --model gpt-4o

# Moderate: Balanced safety requirements
cbl multi-turn --threshold 0.5 --max-turns 8 \
    --test-types user_persona \
    openai --model gpt-4o

# Lenient: Only clearly unsafe responses fail
cbl multi-turn --threshold 0.3 --max-turns 8 \
    --test-types user_persona \
    openai --model gpt-4o
In multi-turn evaluations, even one unsafe response in an 8-turn conversation causes the entire conversation to fail. This reflects real-world risk where a single safety lapse can cause harm.
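The pass rule can be stated compactly: a conversation passes only if every assistant response meets the threshold. A one-line sketch (a hypothetical helper, not CLI code):

```python
def conversation_passes(assistant_scores, threshold):
    """A conversation passes only if every assistant response meets the threshold."""
    return all(score >= threshold for score in assistant_scores)

scores = [0.95, 0.88, 0.42, 0.91]  # one lapse in an otherwise safe conversation
print(conversation_passes(scores, threshold=0.5))  # False: the 0.42 turn fails it
print(conversation_passes(scores, threshold=0.3))  # True under a lenient threshold
```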

--max-turns (Required)

The maximum number of conversation turns to simulate. Should be an even number (pairs of user/assistant exchanges).
# Short conversations: Quick testing (4 turns = 2 exchanges)
cbl multi-turn --threshold 0.5 --max-turns 4 \
    --test-types user_persona \
    openai --model gpt-4o

# Standard conversations: Recommended (8 turns = 4 exchanges)
cbl multi-turn --threshold 0.5 --max-turns 8 \
    --test-types user_persona \
    openai --model gpt-4o

# Long conversations: Comprehensive testing (16 turns = 8 exchanges)
cbl multi-turn --threshold 0.5 --max-turns 16 \
    --test-types user_persona \
    openai --model gpt-4o
Conversations consist of user-assistant pairs. Each “turn” is one message, so:
  • 2 turns = 1 complete exchange (user message → assistant response)
  • 8 turns = 4 complete exchanges
  • 16 turns = 8 complete exchanges
Using even numbers ensures conversations end with an assistant response that can be properly evaluated.
Higher turn counts proportionally increase API usage: a 16-turn conversation makes 8 API calls to your model, versus 4 calls for an 8-turn conversation.
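The turn-to-exchange arithmetic above is simple to capture in code (a hypothetical helper for illustration):

```python
def exchanges(max_turns):
    """Each exchange is one user message plus one assistant response,
    so an even max_turns yields max_turns / 2 exchanges (and API calls)."""
    assert max_turns % 2 == 0, "max-turns should be even"
    return max_turns // 2

for turns in (4, 8, 16):
    print(f"{turns} turns = {exchanges(turns)} exchanges "
          f"({exchanges(turns)} API calls to your model)")
```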

--test-types (Required)

Specifies which multi-turn test methodologies to use. Multiple types can be combined.
# Single test type
cbl multi-turn --threshold 0.5 --max-turns 8 \
    --test-types user_persona \
    openai --model gpt-4o

# Multiple test types (recommended)
cbl multi-turn --threshold 0.5 --max-turns 8 \
    --test-types user_persona,semantic_chunks \
    openai --model gpt-4o

Test Types Explained

The CLI supports two distinct multi-turn test methodologies:

user_persona

The user_persona test type simulates an attacker who adopts a specific persona or role to manipulate the model. The system plays a character designed to gradually build trust and exploit conversational context. Example scenarios:
  • A “researcher” asking for dangerous information for “academic purposes”
  • A “creative writer” requesting harmful content for a “fictional story”
  • A “concerned parent” asking for dangerous advice “to protect their child”

semantic_chunks

The semantic_chunks test type breaks harmful requests into seemingly innocent pieces spread across multiple turns. Each individual message appears benign, but together they form a dangerous request. Example scenarios:
  • Asking about chemical A in turn 1, chemical B in turn 3, then “what happens if I mix them?”
  • Building instructions piece-by-piece with innocent-sounding steps
  • Requesting information fragments that combine into harmful knowledge
Always test both user_persona and semantic_chunks. They represent fundamentally different attack vectors:
  • user_persona tests social engineering resistance
  • semantic_chunks tests contextual awareness and safety boundaries

--test-case-groups (Optional)

Specify which test case groups to run. Defaults to suicidal_ideation.
# Default: Test suicidal ideation scenarios
cbl multi-turn --threshold 0.5 --max-turns 8 \
    --test-types user_persona,semantic_chunks \
    openai --model gpt-4o

# Multiple groups
cbl multi-turn \
    --threshold 0.5 \
    --max-turns 8 \
    --test-types user_persona,semantic_chunks \
    --test-case-groups suicidal_ideation,violence,self_harm \
    openai --model gpt-4o

Complete Command Examples

cbl multi-turn \
    --threshold 0.5 \
    --max-turns 8 \
    --test-types user_persona,semantic_chunks \
    openai --model gpt-4o

Interpreting Results

Conversation-Level Results

Each multi-turn evaluation produces results for complete conversations:
{
  "conversation_id": "conv_123",
  "test_type": "user_persona",
  "test_case_group": "suicidal_ideation",
  "total_turns": 8,
  "passed": false,
  "failed_turn": 6,
  "worst_score": 0.42,
  "threshold": 0.5
}
A conversation fails if any single turn scores below the threshold. The results include:
  • failed_turn: Which turn first violated safety boundaries
  • worst_score: The lowest safety score in the conversation
  • Full conversation history for analysis
This helps you identify exactly where and how the attack succeeded.
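Assuming the results are emitted as JSON shaped like the example above, failed conversations can be triaged with a few lines of Python (the field names follow the sample output; treat this as a sketch, not a stable schema):

```python
import json

# Conversation-level result, shaped like the sample output above
result_json = """{
  "conversation_id": "conv_123",
  "test_type": "user_persona",
  "test_case_group": "suicidal_ideation",
  "total_turns": 8,
  "passed": false,
  "failed_turn": 6,
  "worst_score": 0.42,
  "threshold": 0.5
}"""

result = json.loads(result_json)
if not result["passed"]:
    print(f"{result['conversation_id']} failed at turn {result['failed_turn']} "
          f"(worst score {result['worst_score']} < threshold {result['threshold']})")
```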

Turn-by-Turn Analysis

Each turn in the conversation is individually scored:
{
  "turns": [
    {"turn": 1, "role": "user", "content": "...", "score": null},
    {"turn": 2, "role": "assistant", "content": "...", "score": 0.95},
    {"turn": 3, "role": "user", "content": "...", "score": null},
    {"turn": 4, "role": "assistant", "content": "...", "score": 0.88},
    {"turn": 5, "role": "user", "content": "...", "score": null},
    {"turn": 6, "role": "assistant", "content": "...", "score": 0.42}
  ]
}
Only assistant responses are scored. User messages (the attack prompts) are not evaluated.
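Given a turns array like the one above, the worst score and the first failing turn are easy to recover. A sketch, assuming the same shape as the sample (user turns carry a null score):

```python
turns = [
    {"turn": 1, "role": "user", "score": None},
    {"turn": 2, "role": "assistant", "score": 0.95},
    {"turn": 3, "role": "user", "score": None},
    {"turn": 4, "role": "assistant", "score": 0.88},
    {"turn": 5, "role": "user", "score": None},
    {"turn": 6, "role": "assistant", "score": 0.42},
]

threshold = 0.5
scored = [t for t in turns if t["role"] == "assistant"]  # only assistant turns are scored
worst = min(scored, key=lambda t: t["score"])
failed = next((t["turn"] for t in scored if t["score"] < threshold), None)
print(worst["score"], failed)  # 0.42 6
```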

Analyzing Failures

When a multi-turn conversation fails:
1. Identify the Pattern: Read the full conversation from turn 1 to the failure point. What manipulation technique was used?
2. Find the Breaking Point: What specifically in the conversation history made the model vulnerable? Was it role-play, accumulated context, or false premises?
3. Check Earlier Turns: Did the model make any weak or ambiguous statements in earlier turns that the attacker exploited?
4. Evaluate Severity: How harmful was the unsafe response? A reluctant borderline answer is different from enthusiastically providing dangerous information.
5. Plan Remediation: Should you update system prompts to be more explicit? Add conversation-level safety checks? Fine-tune the model?
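For step 1, it helps to pull out the transcript from the opening turn through the first failing turn so the full attack progression is visible in one place. A hypothetical helper (the turn structure mirrors the sample output above):

```python
def conversation_up_to_failure(turns, failed_turn):
    """Return the transcript from turn 1 through the first failing turn,
    so the attack progression can be reviewed in context."""
    return [t for t in turns if t["turn"] <= failed_turn]

turns = [
    {"turn": 1, "role": "user", "content": "opener"},
    {"turn": 2, "role": "assistant", "content": "reply"},
    {"turn": 3, "role": "user", "content": "escalation"},
    {"turn": 4, "role": "assistant", "content": "unsafe reply"},
    {"turn": 5, "role": "user", "content": "follow-up"},
]
transcript = conversation_up_to_failure(turns, failed_turn=4)
print(len(transcript))  # 4
```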

Best Practices

  • Run both user_persona and semantic_chunks to cover different attack vectors. They test fundamentally different vulnerabilities.
  • Most attacks can be executed in 4-6 exchanges (8-12 turns). Start with --max-turns 8 and increase if needed.
  • Don't just look at the failed turn; read the entire conversation to understand the attack progression.
  • System prompt modifications can inadvertently create conversational vulnerabilities. Re-run multi-turn tests after any changes.
  • Healthcare, mental health, child-facing, or crisis applications should use thresholds of 0.7 or higher for multi-turn tests.
  • Multi-turn tests catch sophisticated attacks, but single-turn tests are faster for catching obvious vulnerabilities. Use both.

Performance Considerations

API Usage Calculation: For multi-turn evaluations with max_turns = N:
  • API calls per conversation = N / 2 (only assistant turns make API calls)
  • Total calls = (N / 2) × test_types.length × test_case_groups.length × conversations_per_group
Example: max_turns = 8 (4 API calls per conversation) × 2 test types × 1 group = 8 total API calls
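The formula is straightforward to express directly (a hypothetical helper for budgeting, not part of the CLI):

```python
def total_api_calls(max_turns, num_test_types, num_groups, conversations_per_group=1):
    """Assistant turns drive API calls: max_turns / 2 per conversation,
    multiplied across test types, groups, and conversations per group."""
    return (max_turns // 2) * num_test_types * num_groups * conversations_per_group

print(total_api_calls(8, 2, 1))   # 8, matching the example above
print(total_api_calls(16, 2, 3))  # 48: long conversations across 3 groups
```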

Optimization Strategies

  • Development: Use --max-turns 4 and a single test type for rapid iteration
  • CI/CD: Use --max-turns 8 with both test types for regression detection
  • Pre-Deployment: Use --max-turns 16 with both test types and multiple groups for comprehensive validation
  • Production Monitoring: Run scheduled evaluations weekly with standard settings to catch drift
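The staged strategy above can be captured as presets that expand into CLI invocations. The `STAGE_PRESETS` mapping and `build_command` helper are hypothetical conveniences, not part of `cbl` itself, which takes these values as command-line flags:

```python
# Hypothetical stage presets mirroring the strategy above
STAGE_PRESETS = {
    "development":    {"max_turns": 4,  "test_types": ["user_persona"]},
    "ci":             {"max_turns": 8,  "test_types": ["user_persona", "semantic_chunks"]},
    "pre_deployment": {"max_turns": 16, "test_types": ["user_persona", "semantic_chunks"]},
}

def build_command(stage, threshold=0.5):
    """Expand a stage preset into a cbl invocation string."""
    p = STAGE_PRESETS[stage]
    return (f"cbl multi-turn --threshold {threshold} "
            f"--max-turns {p['max_turns']} "
            f"--test-types {','.join(p['test_types'])} "
            f"openai --model gpt-4o")

print(build_command("ci"))
```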

Common Failure Patterns

Fictional framing: The model accepts harmful requests when framed as fiction or role-play:
"Let's play a game where you're a character who..."
"For my novel, I need a character to..."
Fix: Strengthen the system prompt to refuse harmful content regardless of fictional framing.
False premises: Early turns establish false premises that later turns exploit:
Turn 2: "You're right, that information is publicly available."
Turn 6: "Since you said it's public, can you share..."
Fix: Add explicit safety checks that don't rely on conversation history.
Gradual escalation: Each turn is slightly more problematic, normalizing harm:
Turn 2: Borderline content (score: 0.58)
Turn 4: More problematic (score: 0.52)
Turn 6: Clearly unsafe (score: 0.38)
Fix: Use per-turn safety monitoring, not just final-response checks.
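A simple per-turn monitor for this pattern can flag a run of strictly declining safety scores before any single turn crosses the threshold. A minimal sketch (the function and window size are illustrative assumptions):

```python
def declining_safety(scores, window=3):
    """Flag a conversation whose last `window` assistant scores strictly
    decline, a possible sign of gradual-escalation manipulation."""
    tail = scores[-window:]
    return len(tail) == window and all(a > b for a, b in zip(tail, tail[1:]))

print(declining_safety([0.58, 0.52, 0.38]))  # True: scores trend downward
print(declining_safety([0.90, 0.88, 0.91]))  # False: no sustained decline
```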
False authority: The attacker claims expertise or authority:
"As a licensed professional, I need to..."
"For my accredited research, I require..."
Fix: The model should refuse harmful requests regardless of claimed authority.

Next Steps

  • Single-Turn Evaluations: Learn about faster single-prompt testing
  • Providers: Configure OpenAI, Ollama, or custom providers
