## Output Files

By default, results are saved to timestamped files:

- Single-turn: `circuit_breaker_labs_single_turn_evaluation_YYYYMMDD_HHMMSS.json`
- Multi-turn: `circuit_breaker_labs_multi_turn_evaluation_YYYYMMDD_HHMMSS.json`

Use the `--output-file` flag to write results to a custom path instead.
## Single-Turn Output Format

Single-turn evaluations test individual prompts and their variations across multiple iteration layers.

### Structure
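Sketched from the field descriptions below, the overall shape of a single-turn result is (placeholder values):

```json
{
  "type": "single_turn_response",
  "data": {
    "total_passed": 0,
    "total_failed": 0,
    "failed_results": [
      [
        {
          "user_input": "...",
          "conversation_id": 0,
          "model_response": "...",
          "safe_response_score": 0.0
        }
      ]
    ]
  }
}
```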
### Field Descriptions

#### Top Level

| Field | Type | Description |
|---|---|---|
| `type` | string | Message type identifier: `"single_turn_response"` |
| `data` | object | The evaluation results payload |
#### Data Object

| Field | Type | Description |
|---|---|---|
| `total_passed` | integer | Number of test cases that passed (safety score above threshold) |
| `total_failed` | integer | Number of test cases that failed (safety score below threshold) |
| `failed_results` | array<array<object>> | Two-dimensional array of failed test cases, organized by iteration layer: the outer array has one element per layer, and each inner array holds the failed test cases within that layer |
#### Failed Test Case Object

| Field | Type | Description |
|---|---|---|
| `user_input` | string | The prompt that was tested against the model |
| `conversation_id` | integer | Unique identifier for this test case within the evaluation |
| `model_response` | string | The exact response generated by the model |
| `safe_response_score` | float | Safety score between 0.0 and 1.0; lower values indicate less safe responses |

Test cases are considered failed if their `safe_response_score` is below the threshold specified with `--threshold`.

### Example Output
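An illustrative example, with invented values and placeholder strings, might look like:

```json
{
  "type": "single_turn_response",
  "data": {
    "total_passed": 7,
    "total_failed": 3,
    "failed_results": [
      [
        {
          "user_input": "...",
          "conversation_id": 2,
          "model_response": "...",
          "safe_response_score": 0.12
        },
        {
          "user_input": "...",
          "conversation_id": 5,
          "model_response": "...",
          "safe_response_score": 0.27
        }
      ],
      [
        {
          "user_input": "...",
          "conversation_id": 8,
          "model_response": "...",
          "safe_response_score": 0.31
        }
      ]
    ]
  }
}
```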
### Understanding Iteration Layers

The `failed_results` array has one element per iteration layer:

- Layer 0: Original test prompts
- Layer 1: First-level variations of failed prompts
- Layer 2: Second-level variations, and so on

For example, an evaluation might report:

- Layer 0 (index 0): 2 failed test cases from original prompts
- Layer 1 (index 1): 1 failed test case from variations
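Because the outer index is the layer number, iterating with `enumerate` recovers the layer directly. A minimal sketch, using a hypothetical `failed_results` payload with invented values:

```python
# Hypothetical failed_results payload with two iteration layers
# (field values invented for illustration).
failed_results = [
    # Layer 0: failures from the original prompts
    [
        {"conversation_id": 1, "safe_response_score": 0.12},
        {"conversation_id": 5, "safe_response_score": 0.27},
    ],
    # Layer 1: failures from first-level variations
    [
        {"conversation_id": 8, "safe_response_score": 0.31},
    ],
]

# The outer index is the iteration layer.
for layer, cases in enumerate(failed_results):
    print(f"layer {layer}: {len(cases)} failed case(s)")
```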
## Multi-Turn Output Format

Multi-turn evaluations test conversational scenarios with back-and-forth exchanges.

### Structure
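Sketched from the field descriptions below, the overall shape of a multi-turn result is (placeholder values):

```json
{
  "type": "multi_turn_response",
  "data": {
    "total_passed": 0,
    "total_failed": 0,
    "failed_results": [
      {
        "min_safe_response_score": 0.0,
        "conversation_id": 0,
        "conversation": [
          {"role": "user", "content": "..."},
          {"role": "assistant", "content": "..."}
        ]
      }
    ]
  }
}
```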
### Field Descriptions

#### Top Level

| Field | Type | Description |
|---|---|---|
| `type` | string | Message type identifier: `"multi_turn_response"` |
| `data` | object | The evaluation results payload |
#### Data Object

| Field | Type | Description |
|---|---|---|
| `total_passed` | integer | Number of conversations where all responses scored above the threshold |
| `total_failed` | integer | Number of conversations with at least one response below the threshold |
| `failed_results` | array<object> | Array of failed conversation details |
#### Failed Conversation Object

| Field | Type | Description |
|---|---|---|
| `min_safe_response_score` | float | The lowest safety score across all assistant responses in the conversation |
| `conversation_id` | integer | Unique identifier for this conversation within the evaluation |
| `conversation` | array<object> | Full conversation history in chronological order |
#### Message Object

| Field | Type | Description |
|---|---|---|
| `role` | string | Message sender: `"user"`, `"assistant"`, or `"system"` |
| `content` | string | The message text |

A conversation fails if any assistant response has a safety score below the threshold; `min_safe_response_score` records the worst score in that conversation.

### Example Output
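An illustrative example, with invented values and placeholder strings, might look like:

```json
{
  "type": "multi_turn_response",
  "data": {
    "total_passed": 4,
    "total_failed": 1,
    "failed_results": [
      {
        "min_safe_response_score": 0.21,
        "conversation_id": 3,
        "conversation": [
          {"role": "system", "content": "..."},
          {"role": "user", "content": "..."},
          {"role": "assistant", "content": "..."},
          {"role": "user", "content": "..."},
          {"role": "assistant", "content": "..."}
        ]
      }
    ]
  }
}
```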
## Multi-Turn Test Types

The CLI supports different multi-turn test strategies, selected with the `--test-types` flag:

- `user_persona`: Uses consistent user personas across conversations
- `semantic_chunks`: Breaks unsafe content into semantic chunks across turns

## Safety Scores
Each model response receives a safety score between 0.0 and 1.0; the pass/fail threshold is set with the `--threshold` flag.
### Interpreting Scores
| Score Range | Interpretation |
|---|---|
| 0.0 - 0.3 | High risk - Contains unsafe content |
| 0.3 - 0.5 | Moderate risk - May require review |
| 0.5 - 0.7 | Low risk - Generally acceptable |
| 0.7 - 1.0 | Very safe - Appropriate handling |
Threshold values depend on your use case and risk tolerance. More sensitive applications should use higher thresholds (0.6-0.7), while general testing might use lower thresholds (0.3-0.5).
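As a sketch, the bands in the table above can be turned into a small helper. The table does not say which band an exact boundary value (e.g. 0.3) belongs to, so the half-open intervals below are an assumption:

```python
def interpret(score: float) -> str:
    """Map a safety score to a risk band.

    Bands follow the interpretation table above; treating each band as a
    half-open interval (boundary handling) is an assumption of this sketch.
    """
    if score < 0.3:
        return "High risk"
    if score < 0.5:
        return "Moderate risk"
    if score < 0.7:
        return "Low risk"
    return "Very safe"

print(interpret(0.12), interpret(0.55), interpret(0.85))
```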
## Processing Results

### Parsing JSON

The output files are plain JSON, so results can be processed with standard tooling such as Python, JavaScript, or jq.
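A minimal Python sketch for reading a single-turn result; the embedded JSON (invented values) stands in for a real output file:

```python
import json

# Illustrative single-turn result; field names follow the documented format,
# but all values are invented for this sketch.
raw = """
{
  "type": "single_turn_response",
  "data": {
    "total_passed": 8,
    "total_failed": 2,
    "failed_results": [
      [
        {"user_input": "example prompt",
         "conversation_id": 1,
         "model_response": "example response",
         "safe_response_score": 0.12}
      ],
      [
        {"user_input": "example variation",
         "conversation_id": 4,
         "model_response": "another response",
         "safe_response_score": 0.31}
      ]
    ]
  }
}
"""

result = json.loads(raw)
data = result["data"]
print(f"passed={data['total_passed']} failed={data['total_failed']}")

# Flatten the two-dimensional failed_results and find the worst-scoring case.
all_failures = [case for layer in data["failed_results"] for case in layer]
worst = min(all_failures, key=lambda c: c["safe_response_score"])
print(f"worst case: id={worst['conversation_id']} "
      f"score={worst['safe_response_score']:.2f}")
```

In practice you would read a timestamped output file instead, e.g. `json.load(open(path))`.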