The Evals API allows you to create, run, and manage evaluations to test model performance against specific criteria using various data sources and grading methods.
Create Evaluation
Define an evaluation with testing criteria and data source configuration:
eval = client.evals.create(
  name: "Customer support quality evaluation",
  data_source_config: {
    type: "stored_completions",
    completion_ids: ["comp_abc123", "comp_def456"]
  },
  testing_criteria: [
    {
      type: "label_model",
      name: "response_quality",
      instructions: "Rate the helpfulness of the response",
      model: "gpt-4"
    }
  ]
)

puts eval.id
# => "eval_abc123"
name - Descriptive name for the evaluation.
data_source_config - Configuration for the data source. Options:
stored_completions - Use existing completion IDs
custom - Upload custom evaluation data
logs - Use logged completions
testing_criteria - List of graders applied to each response. Supported types:
label_model - Model-based classification grader
score_model - Model-based scoring grader
string_check - Exact string matching
text_similarity - Semantic similarity comparison
python - Custom Python grading logic
metadata - Key-value pairs for attaching metadata (up to 16 pairs).
Retrieve Evaluation
Get an evaluation by ID:
eval = client.evals.retrieve("eval_abc123")

puts eval.name
puts eval.testing_criteria.size
Update Evaluation
Modify evaluation properties:
updated = client.evals.update(
  "eval_abc123",
  name: "Updated evaluation name",
  metadata: { version: "2.0" }
)
The ID of the evaluation to update (first positional argument).
name - New name for the evaluation.
metadata - Updated metadata key-value pairs.
List Evaluations
Retrieve all evaluations for your project:
evals = client.evals.list(
  limit: 20,
  order: "desc",
  order_by: "created_at"
)

evals.auto_paging_each do |eval|
  puts "#{eval.name}: #{eval.id}"
end
limit - Number of evaluations to retrieve (default: 20).
order - Sort order: asc or desc (default: desc).
order_by - Field to sort by: created_at or updated_at.
Delete Evaluation
Permanently delete an evaluation:
result = client.evals.delete("eval_abc123")

puts result.deleted
# => true
Run Evaluation
Execute an evaluation run with a specific model and parameters:
run = client.evals.runs.create(
  "eval_abc123",
  model: "gpt-4",
  parameters: {
    temperature: 0.7,
    max_tokens: 500
  }
)

puts run.id
# => "evalrun_xyz789"
puts run.status
# => "in_progress"
Retrieve Run Results
Check run status and get results:
run = client.evals.runs.retrieve("eval_abc123", "evalrun_xyz789")

if run.status == "completed"
  puts "Success rate: #{run.metrics[:success_rate]}"
  puts "Average score: #{run.metrics[:avg_score]}"
end
List Run Output Items
Get detailed results for each evaluated item:
outputs = client.evals.runs.output_items.list(
  "eval_abc123",
  "evalrun_xyz789"
)

outputs.each do |item|
  puts "Input: #{item.input}"
  puts "Output: #{item.output}"
  puts "Score: #{item.score}"
end
Cancel Run
Stop a running evaluation:
cancelled = client.evals.runs.cancel("eval_abc123", "evalrun_xyz789")

puts cancelled.status
# => "cancelled"
Testing Criteria Examples
Label Model Grader
Score Model Grader
String Check
Text Similarity
testing_criteria: [
  {
    type: "label_model",
    name: "sentiment",
    instructions: "Classify the sentiment as positive, negative, or neutral",
    labels: ["positive", "negative", "neutral"],
    model: "gpt-4"
  }
]
testing_criteria: [
  {
    type: "score_model",
    name: "quality",
    instructions: "Rate the response quality from 1-10",
    min_score: 1,
    max_score: 10,
    model: "gpt-4"
  }
]
testing_criteria: [
  {
    type: "string_check",
    name: "contains_keyword",
    expected_string: "refund policy",
    match_type: "contains"
  }
]
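The string_check grader passes or fails on a direct string comparison. A local sketch of that pass/fail shape, using the expected_string and match_type names from the criteria above (the hosted grader performs the actual comparison; this only illustrates the semantics):

```ruby
# Local sketch of string_check semantics: "exact" requires equality,
# "contains" requires a substring match.
def string_check(output, expected_string:, match_type:)
  case match_type
  when "exact"    then output == expected_string
  when "contains" then output.include?(expected_string)
  else raise ArgumentError, "unknown match_type: #{match_type}"
  end
end

puts string_check("Our refund policy lasts 30 days",
                  expected_string: "refund policy",
                  match_type: "contains")
# => true
```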
testing_criteria: [
  {
    type: "text_similarity",
    name: "answer_accuracy",
    reference_text: "Expected answer text",
    threshold: 0.8
  }
]
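The text_similarity grader scores the output against reference_text and passes when the score clears threshold. The similarity measure itself is the platform's; as a rough local intuition for how a 0-to-1 score meets a threshold, a word-overlap (Jaccard) score behaves the same way (this is NOT the measure the hosted grader uses):

```ruby
require "set"

# Jaccard similarity over lowercased word sets: |A ∩ B| / |A ∪ B|.
def jaccard(a, b)
  wa = a.downcase.split.to_set
  wb = b.downcase.split.to_set
  return 1.0 if wa.empty? && wb.empty?
  (wa & wb).size.to_f / (wa | wb).size
end

score = jaccard("Expected answer text", "expected answer text here")
puts score          # 0.75
puts score >= 0.8   # threshold check, as in the criteria above
```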
Data Source Types
Use existing completion IDs from your API usage:

data_source_config: {
  type: "stored_completions",
  completion_ids: ["comp_1", "comp_2", "comp_3"]
}
Upload a JSONL file with custom test cases:

data_source_config: {
  type: "custom",
  file_id: "file_abc123"
}

JSONL format:

{"input": "test question 1", "expected": "answer 1"}
{"input": "test question 2", "expected": "answer 2"}
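A file in this format can be generated with the standard json library, one JSON object per line (the filename here is illustrative; the file is then uploaded via the Files API to obtain the file_id):

```ruby
require "json"

# Write one JSON object per line -- the JSONL shape shown above.
cases = [
  { input: "test question 1", expected: "answer 1" },
  { input: "test question 2", expected: "answer 2" }
]

File.open("test_cases.jsonl", "w") do |f|
  cases.each { |c| f.puts(c.to_json) }
end
```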
Evaluate recent logged completions:

data_source_config: {
  type: "logs",
  start_time: Time.now - 86400, # last 24 hours
  end_time: Time.now,
  filters: { model: "gpt-4" }
}
Complete Example
require "openai"

client = OpenAI::Client.new

# Create evaluation
eval = client.evals.create(
  name: "Support response quality",
  data_source_config: {
    type: "custom",
    file_id: "file_test_cases"
  },
  testing_criteria: [
    {
      type: "score_model",
      name: "helpfulness",
      instructions: "Rate how helpful this response is from 1-5",
      min_score: 1,
      max_score: 5,
      model: "gpt-4"
    }
  ],
  metadata: { team: "customer-support" }
)

# Run evaluation
run = client.evals.runs.create(
  eval.id,
  model: "gpt-4o",
  parameters: { temperature: 0.7 }
)

# Poll until the run reaches a terminal status
loop do
  run = client.evals.runs.retrieve(eval.id, run.id)
  break if %w[completed failed cancelled].include?(run.status)
  sleep 5
end

# Get results
puts "Average helpfulness: #{run.metrics[:avg_score]}"

outputs = client.evals.runs.output_items.list(eval.id, run.id)
outputs.each do |item|
  puts "Score: #{item.score} - #{item.output}"
end
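For longer jobs you may also want to bound how long you poll. A sketch of a reusable polling helper with a timeout, exercised here with a stub standing in for the real client (TERMINAL_STATUSES, wait_for_run, the Run struct, and the status sequence are all illustrative, not SDK APIs):

```ruby
# Poll until the run reaches a terminal status or the timeout expires.
# `fetch` is any callable returning an object with a `status` string.
TERMINAL_STATUSES = %w[completed failed cancelled].freeze

def wait_for_run(fetch, timeout: 300, interval: 5)
  deadline = Time.now + timeout
  loop do
    run = fetch.call
    return run if TERMINAL_STATUSES.include?(run.status)
    raise "timed out waiting for eval run" if Time.now > deadline
    sleep interval
  end
end

# Stub that reports "in_progress" twice, then "completed".
Run = Struct.new(:status)
statuses = %w[in_progress in_progress completed]
stub = -> { Run.new(statuses.shift) }

puts wait_for_run(stub, interval: 0).status
# => "completed"
```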
Related resources:
Fine-tuning - Train custom models
Batches - Batch API requests
Files - Upload evaluation data
Evals Guide - Comprehensive evals guide