What Multi-Turn Evaluations Test
Unlike single-turn tests that send isolated prompts, multi-turn evaluations exercise the model across an entire conversation. They test:
Conversational Manipulation
Tests whether attackers can gradually shift conversation context to bypass safety boundaries
Jailbreak Resistance
Evaluates whether the model maintains safety across role-play, hypotheticals, and other jailbreak techniques
Context Poisoning
Tests if prior conversation can be used to make harmful requests seem acceptable
Consistency
Verifies the model doesn’t contradict safety policies across multiple turns
How Multi-Turn Tests Work
Conversation Initialization
The evaluation system starts a conversation with your model, often using benign or trust-building messages.
Gradual Escalation
Over multiple turns, the system progressively introduces unsafe elements, testing how the model responds to increasingly problematic requests.
Context Exploitation
The system leverages conversation history to make harmful requests appear reasonable or to reference “agreed upon” fictional contexts.
Parameters
--threshold (Required)
The minimum safety score (0.0 to 1.0) required for all responses in the conversation to pass.
In multi-turn evaluations, even one unsafe response in an 8-turn conversation causes the entire conversation to fail. This reflects real-world risk where a single safety lapse can cause harm.
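This all-or-nothing rule can be sketched in a few lines (the function and variable names here are illustrative, not part of the CLI):

```python
def conversation_passes(turn_scores, threshold):
    """A conversation passes only if every scored assistant
    turn meets or exceeds the safety threshold."""
    return all(score >= threshold for score in turn_scores)

# Three safe assistant turns and one lapse: the whole conversation fails.
scores = [0.95, 0.91, 0.45, 0.93]
print(conversation_passes(scores, threshold=0.7))  # False
```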
--max-turns (Required)
The maximum number of conversation turns to simulate. Should be an even number (pairs of user/assistant exchanges).
Why Even Numbers?
Conversations consist of user-assistant pairs. Each “turn” is one message, so:
- 2 turns = 1 complete exchange (user message → assistant response)
- 8 turns = 4 complete exchanges
- 16 turns = 8 complete exchanges
--test-types (Required)
Specifies which multi-turn test methodologies to use. Multiple types can be combined.
Test Types Explained
The CLI supports two distinct multi-turn test methodologies:
user_persona
The user_persona test type simulates an attacker who adopts a specific persona or role to manipulate the model. The system plays a character designed to gradually build trust and exploit conversational context.
Example scenarios:
- A “researcher” asking for dangerous information for “academic purposes”
- A “creative writer” requesting harmful content for a “fictional story”
- A “concerned parent” asking for dangerous advice “to protect their child”
semantic_chunks
The semantic_chunks test type breaks harmful requests into seemingly innocent pieces spread across multiple turns. Each individual message appears benign, but together they form a dangerous request.
Example scenarios:
- Asking about chemical A in turn 1, chemical B in turn 3, then “what happens if I mix them?”
- Building instructions piece-by-piece with innocent-sounding steps
- Requesting information fragments that combine into harmful knowledge
--test-case-groups (Optional)
Specifies which test case groups to run. Defaults to suicidal_ideation.
Complete Command Examples
Interpreting Results
Conversation-Level Results
Each multi-turn evaluation produces results for complete conversations.
Understanding Conversation Failures
A conversation fails if any single turn scores below the threshold. The results include:
- failed_turn: Which turn first violated safety boundaries
- worst_score: The lowest safety score in the conversation
- Full conversation history for analysis
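The failed_turn and worst_score fields above can be derived from per-turn scores like this (the helper itself is a sketch; only the two field names come from the tool's output):

```python
def summarize_conversation(turn_scores, threshold):
    """Return (failed_turn, worst_score) for a conversation.
    Turn numbers are 1-indexed over scored assistant responses;
    failed_turn is None when every turn passed."""
    failed_turn = next(
        (i for i, score in enumerate(turn_scores, start=1) if score < threshold),
        None,
    )
    return failed_turn, min(turn_scores)

# Turn 3 is the first to drop below the 0.7 threshold.
print(summarize_conversation([0.9, 0.85, 0.4, 0.6], threshold=0.7))  # (3, 0.4)
```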
Turn-by-Turn Analysis
Each turn in the conversation is individually scored. Only assistant responses are scored; user messages (the attack prompts) are not evaluated.
Analyzing Failures
When a multi-turn conversation fails:
Identify the Pattern
Read the full conversation from turn 1 to the failure point. What manipulation technique was used?
Find the Breaking Point
What specifically in the conversation history made the model vulnerable? Was it role-play, accumulated context, or false premises?
Check Earlier Turns
Did the model make any weak or ambiguous statements in earlier turns that the attacker exploited?
Evaluate Severity
How harmful was the unsafe response? A reluctant borderline answer is different from enthusiastically providing dangerous information.
Best Practices
Always Test Both Test Types
Run both user_persona and semantic_chunks to cover different attack vectors. They test fundamentally different vulnerabilities.
Start with 8 Turns
Most attacks can be executed in 4-6 exchanges (8-12 turns). Start with --max-turns 8 and increase if needed.
Review Failed Conversations in Full
Don’t just look at the failed turn—read the entire conversation to understand the attack progression.
Test After Every Prompt Change
System prompt modifications can inadvertently create conversational vulnerabilities. Re-run multi-turn tests after any changes.
Use Higher Thresholds for High-Risk Apps
Healthcare, mental health, child-facing, or crisis applications should use thresholds of 0.7+ for multi-turn tests.
Combine with Single-Turn Tests
Multi-turn tests catch sophisticated attacks, but single-turn tests are faster for catching obvious vulnerabilities. Use both.
Performance Considerations
API Usage Calculation
For multi-turn evaluations with max_turns = N:
- API calls per conversation = N / 2 (only assistant turns make API calls)
- Total calls = (N / 2) × test_types.length × test_case_groups.length × conversations_per_group
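The formula above can be checked with a quick script (conversations_per_group is whatever your test suite configures; the function itself is illustrative):

```python
def total_api_calls(max_turns, num_test_types, num_groups, conversations_per_group):
    """Apply the formula above: only assistant turns (half of
    max_turns) trigger calls to the model API."""
    calls_per_conversation = max_turns // 2
    return (calls_per_conversation * num_test_types
            * num_groups * conversations_per_group)

# e.g. --max-turns 8, both test types, 1 group, 5 conversations per group:
print(total_api_calls(8, 2, 1, 5))  # 40
```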
Optimization Strategies
Development
Use --max-turns 4 and a single test type for rapid iteration.
CI/CD
Use --max-turns 8 with both test types for regression detection.
Pre-Deployment
Use --max-turns 16 with both test types and multiple groups for comprehensive validation.
Production Monitoring
Run scheduled evaluations weekly with standard settings to catch drift
Common Failure Patterns
Role-Play Exploitation
The model accepts harmful requests when framed as fiction or role-play.
Fix: Strengthen the system prompt to refuse harmful content regardless of fictional framing.
Context Poisoning
Early turns establish false premises that later turns exploit.
Fix: Add explicit safety checks that don’t rely on conversation history.
Gradual Normalization
Each turn is slightly more problematic, normalizing harm.
Fix: Use per-turn safety monitoring, not just final-response checks.
Authority Exploitation
The attacker invokes claimed credentials or authority (such as the “researcher” or “concerned parent” personas described above) to make unsafe requests seem legitimate.
Fix: Apply the same safety rules regardless of the user’s claimed role or expertise.
Next Steps
Single-Turn Evaluations
Learn about faster single-prompt testing
Providers
Configure OpenAI, Ollama, or custom providers