Overview
The `single-turn` command runs evaluations in which the model responds to individual prompts without conversational context. Each test case is evaluated independently.
Command Syntax
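The syntax block for this section appears to have been lost. A sketch of the general shape, assuming the CLI binary is invoked as `cbl` (the binary name is not given on this page; only the flags documented below are confirmed):

```shell
# General shape of a single-turn invocation (binary name `cbl` is an assumption)
cbl single-turn \
  --threshold <0.0-1.0> \
  --variations <n> \
  --maximum-iteration-layers <n> \
  [--test-case-groups group1,group2] \
  <openai|ollama|custom> [provider options]
```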
Required Options
`--threshold`
Safety score threshold for evaluation. Responses with scores below this threshold fail the evaluation.
- Range: `0.0` to `1.0`
- Example: `--threshold 0.5`
`--variations`
Number of variations to generate per unsafe test case. Higher values provide more comprehensive testing but increase evaluation time.
- Example: `--variations 2`
- Typical range: 1-5
`--maximum-iteration-layers`
Maximum number of iteration layers for tests. Controls the depth of test generation and variation.
- Example: `--maximum-iteration-layers 2`
- Typical range: 1-3
Optional Options
`--test-case-groups`
Comma-separated list of test case groups to run in the evaluation.
- Format: `--test-case-groups group1,group2,group3`
- Default: `suicidal_ideation`
- Example: `--test-case-groups suicidal_ideation,custom_group`

The default test case group is `suicidal_ideation`. You can specify multiple groups separated by commas, or provide custom group names.

Provider Subcommands
After specifying the `single-turn` options, you must choose a provider:

openai
Use OpenAI or OpenAI-compatible APIs.

OpenAI API key. Can also be set via the `OPENAI_API_KEY` environment variable.

OpenAI model name.
- Examples: `gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`
- Or a custom fine-tune ID: `ft:gpt-4o-mini:...`
- `--base-url` - Custom API endpoint (default: `https://api.openai.com/v1`, env: `OPENAI_BASE_URL`)
- `--org-id` - OpenAI organization ID (env: `OPENAI_ORG_ID`)
- `--temperature` - Sampling temperature between 0 and 2
- `--top-p` - Nucleus sampling parameter
- `--max-completion-tokens` - Maximum tokens to generate
- `--n` - Number of completions to generate
- `--frequency-penalty` - Penalty for token frequency (-2.0 to 2.0)
- `--presence-penalty` - Penalty for token presence (-2.0 to 2.0)
- `--logprobs` - Return log probabilities
- `--top-logprobs` - Number of most likely tokens to return (0-20)
- `--stop` - Stop sequences (comma-separated, up to 4)
- `--logit-bias` - Modify token likelihoods (format: `token_id:bias`)
- `--store` - Store the output
- `--service-tier` - Processing type (`auto`, `default`, `flex`, `scale`, `priority`)
- `--reasoning-effort` - Reasoning effort (`minimal`, `low`, `medium`, `high`, `xhigh`)
ollama
Use locally-hosted Ollama models.

Ollama model name (e.g., `llama2`, `mistral`, `codellama`).

- `--base-url` - Ollama server URL (default: `http://localhost:11434`, env: `OLLAMA_BASE_URL`)
- `--logprobs` - Return log probabilities
- `--mirostat` - Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0)
- `--mirostat-eta` - Mirostat learning rate (default: 0.1)
- `--mirostat-tau` - Mirostat tau parameter (default: 5.0)
- `--num-ctx` - Context window size (default: 2048)
- `--num-gpu` - Number of layers to send to the GPU
- `--num-gqa` - Number of GQA groups
- `--num-predict` - Max tokens to predict (default: 128; -1=infinite, -2=fill context)
- `--num-thread` - Number of threads for computation
- `--repeat-last-n` - Look-back window for repetition prevention (default: 64; 0=disabled, -1=num_ctx)
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--seed` - Random seed (default: 0)
- `--stop` - Stop sequences (can be specified multiple times)
- `--temperature` - Sampling temperature (default: 0.8)
- `--tfs-z` - Tail-free sampling (default: 1)
- `--top-k` - Top-k sampling (default: 40)
- `--top-p` - Top-p sampling (default: 0.9)
custom
Use custom endpoints with Rhai scripting.

Endpoint URL to POST requests to.

Path to the Rhai script file that translates between the CBL protocol and your custom API. See `examples/providers/` for script examples.
Complete Examples
Basic OpenAI Evaluation
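The command for this example did not survive extraction. A minimal sketch, assuming the binary is invoked as `cbl` and the model is passed via a `--model` flag (both assumptions; only the flags documented above are confirmed):

```shell
# Hypothetical invocation: binary name `cbl` and `--model` flag are assumptions.
export OPENAI_API_KEY="sk-..."
cbl single-turn \
  --threshold 0.5 \
  --variations 2 \
  --maximum-iteration-layers 2 \
  openai --model gpt-4o
```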
Comprehensive Evaluation with Custom Output
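The command for this example is missing. A sketch of a broader run, assuming a `cbl` binary and hypothetical `--output` and `--model` flags (none of the three are documented on this page):

```shell
# Hypothetical invocation: `cbl`, `--output`, and `--model` are assumptions.
cbl single-turn \
  --threshold 0.4 \
  --variations 5 \
  --maximum-iteration-layers 3 \
  --test-case-groups suicidal_ideation,custom_group \
  --output results.json \
  openai --model gpt-4o --temperature 0.2
```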
OpenAI Fine-Tune Evaluation
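The command for this example is missing. A sketch, assuming a `cbl` binary and a `--model` flag (both assumptions); the fine-tune ID is the truncated placeholder from the options above:

```shell
# Hypothetical invocation: `cbl` and `--model` are assumptions.
cbl single-turn \
  --threshold 0.5 \
  --variations 2 \
  --maximum-iteration-layers 2 \
  openai --model ft:gpt-4o-mini:...
```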
Ollama Local Model
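The command for this example is missing. A sketch using the documented `ollama` provider options, assuming a `cbl` binary and a `--model` flag (both assumptions):

```shell
# Hypothetical invocation: `cbl` and `--model` are assumptions.
cbl single-turn \
  --threshold 0.5 \
  --variations 2 \
  --maximum-iteration-layers 2 \
  ollama --model llama2 --base-url http://localhost:11434 --temperature 0.8
```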
Custom Provider with Script
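The command for this example is missing. A sketch for the `custom` provider, assuming a `cbl` binary and hypothetical `--endpoint` and `--script` flags for the endpoint URL and Rhai script path described above (all three names are assumptions):

```shell
# Hypothetical invocation: `cbl`, `--endpoint`, and `--script` are assumptions.
cbl single-turn \
  --threshold 0.5 \
  --variations 2 \
  --maximum-iteration-layers 2 \
  custom --endpoint https://my-api.example.com/chat \
         --script examples/providers/my_provider.rhai
```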
Debug Mode with Log Output
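The command for this example is missing. The Global Options page covers logging configuration; a sketch assuming a `cbl` binary and hypothetical global `--log-level` and `--log-file` flags (all assumptions; consult the Global Options documentation for the real flag names):

```shell
# Hypothetical invocation: `cbl`, `--log-level`, `--log-file`, and `--model` are assumptions.
cbl --log-level debug --log-file eval.log \
  single-turn \
  --threshold 0.5 \
  --variations 2 \
  --maximum-iteration-layers 2 \
  openai --model gpt-4o
```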
Understanding the Output
Evaluation results are automatically saved with a timestamp and include:
- Overall evaluation score
- Individual test case results
- Safety scores for each response
- Pass/fail status based on threshold
- Test case variations and iterations
Results are saved in JSON format and can be analyzed programmatically or viewed in the CLI output.
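The actual schema is not shown on this page; a sketch of what a results file might contain, with field names derived from the bullet list above (all illustrative, not the real schema):

```json
{
  "overall_score": 0.82,
  "threshold": 0.5,
  "test_cases": [
    {
      "group": "suicidal_ideation",
      "variation": 1,
      "iteration_layer": 1,
      "safety_score": 0.91,
      "passed": true
    }
  ]
}
```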
Tips and Best Practices
Threshold Selection
- `0.3-0.4` - Strict safety requirements
- `0.5-0.6` - Balanced safety evaluation (recommended starting point)
- `0.7-0.8` - Lenient evaluation for exploratory testing
Variations and Iterations
- Variations: Controls breadth of testing (more variations = more diverse test cases)
- Iteration Layers: Controls depth of testing (more layers = more refined test generation)
Related Commands
- Multi-Turn - Run conversational evaluations
- Global Options - Configure API keys and logging