The Evals API allows you to create, run, and manage evaluations to test model performance against specific criteria using various data sources and grading methods.

Create Evaluation

Define an evaluation with testing criteria and data source configuration:
evaluation = client.evals.create(
  name: "Customer support quality evaluation",
  data_source_config: {
    type: "stored_completions",
    completion_ids: ["comp_abc123", "comp_def456"]
  },
  testing_criteria: [
    {
      type: "label_model",
      name: "response_quality",
      instructions: "Rate the helpfulness of the response",
      model: "gpt-4"
    }
  ]
)

puts evaluation.id
# => "eval_abc123"
name (string)
Descriptive name for the evaluation.

data_source_config (object, required)
Configuration for the data source. Options:
  • stored_completions - Use existing completion IDs
  • custom - Upload custom evaluation data
  • logs - Use logged completions

testing_criteria (array, required)
List of graders to evaluate responses. Supported types:
  • label_model - Model-based classification grader
  • score_model - Model-based scoring grader
  • string_check - Exact string matching
  • text_similarity - Semantic similarity comparison
  • python - Custom Python grading logic

metadata (object)
Key-value pairs for attaching metadata (up to 16 pairs).

Retrieve Evaluation

Get an evaluation by ID:
evaluation = client.evals.retrieve("eval_abc123")

puts evaluation.name
puts evaluation.testing_criteria.size

Update Evaluation

Modify evaluation properties:
updated = client.evals.update(
  "eval_abc123",
  name: "Updated evaluation name",
  metadata: { version: "2.0" }
)
eval_id (string, required)
The ID of the evaluation to update.

name (string)
New name for the evaluation.

metadata (object)
Updated metadata key-value pairs.

List Evaluations

Retrieve all evaluations for your project:
evals = client.evals.list(
  limit: 20,
  order: "desc",
  order_by: "created_at"
)

evals.auto_paging_each do |evaluation|
  puts "#{evaluation.name}: #{evaluation.id}"
end
limit (integer)
Number of evaluations to retrieve (default: 20).

order (string)
Sort order: asc or desc (default: desc).

order_by (string)
Field to sort by: created_at or updated_at.

after (string)
Cursor for pagination.
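When you need manual control instead of auto_paging_each, the after cursor can be threaded through successive list calls. A minimal sketch of the loop logic, using a stand-in for client.evals so it runs self-contained (the page shape with a .data array is an assumption of this sketch, not confirmed by this reference):

```ruby
# Stand-ins for the API client, so the pagination loop can be shown end to end.
Page = Struct.new(:data)
Item = Struct.new(:id, :name)

class FakeEvals
  def initialize(items)
    @items = items
  end

  # Mimics a cursor-based list endpoint: returns up to `limit` items
  # strictly after the item whose id matches `after`.
  def list(limit:, after: nil)
    start = after.nil? ? 0 : @items.index { |i| i.id == after } + 1
    Page.new(@items[start, limit] || [])
  end
end

evals_api = FakeEvals.new((1..45).map { |n| Item.new("eval_#{n}", "Eval #{n}") })

collected = []
cursor = nil
loop do
  page = evals_api.list(limit: 20, after: cursor)
  collected.concat(page.data)
  break if page.data.size < 20   # short page means we reached the end
  cursor = page.data.last.id     # resume after the last item we saw
end

puts collected.size
# => 45
```

With the real client, evals_api would simply be client.evals and the loop body stays the same.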

Delete Evaluation

Permanently delete an evaluation:
result = client.evals.delete("eval_abc123")

puts result.deleted
# => true

Run Evaluation

Execute an evaluation run with specific model parameters:
run = client.evals.runs.create(
  "eval_abc123",
  model: "gpt-4",
  parameters: {
    temperature: 0.7,
    max_tokens: 500
  }
)

puts run.id
# => "evalrun_xyz789"
puts run.status
# => "in_progress"

Retrieve Run Results

Check run status and get results:
run = client.evals.runs.retrieve("eval_abc123", "evalrun_xyz789")

if run.status == "completed"
  puts "Success rate: #{run.metrics[:success_rate]}"
  puts "Average score: #{run.metrics[:avg_score]}"
end
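A run may also stop without completing. Besides completed and in_progress, the Cancel Run section below shows a cancelled status; a failed status is assumed here. A minimal branching sketch over the status string:

```ruby
# Hypothetical status handling; status names other than "completed",
# "in_progress", and "cancelled" are assumptions of this sketch.
run_status = "failed"  # e.g. the value returned by run.status

message =
  case run_status
  when "completed"             then "run finished"
  when "queued", "in_progress" then "still running"
  when "failed", "cancelled"   then "run stopped: #{run_status}"
  else                              "unknown status: #{run_status}"
  end

puts message
# => run stopped: failed
```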

List Run Output Items

Get detailed results for each evaluated item:
outputs = client.evals.runs.output_items.list(
  "eval_abc123",
  "evalrun_xyz789"
)

outputs.each do |item|
  puts "Input: #{item.input}"
  puts "Output: #{item.output}"
  puts "Score: #{item.score}"
end
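Output-item scores can be aggregated client-side. A small sketch over plain hashes; in practice each item would come from output_items.list as above, and the score field is taken from the example:

```ruby
# Stand-in items; real ones would come from client.evals.runs.output_items.list.
items = [
  { input: "q1", output: "a1", score: 0.9 },
  { input: "q2", output: "a2", score: 0.6 },
  { input: "q3", output: "a3", score: 0.8 }
]

scores  = items.map { |item| item[:score] }
average = scores.sum / scores.size.to_f
passing = items.count { |item| item[:score] >= 0.7 }  # arbitrary pass threshold

puts format("avg=%.2f passing=%d/%d", average, passing, items.size)
# => avg=0.77 passing=2/3
```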

Cancel Run

Stop a running evaluation:
cancelled = client.evals.runs.cancel("eval_abc123", "evalrun_xyz789")

puts cancelled.status
# => "cancelled"

Testing Criteria Examples

testing_criteria: [
  {
    type: "label_model",
    name: "sentiment",
    instructions: "Classify the sentiment as positive, negative, or neutral",
    labels: ["positive", "negative", "neutral"],
    model: "gpt-4"
  }
]
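Other grader types from the list above follow a similar shape. A hedged sketch of string_check and text_similarity criteria; the field names input, reference, operation, and pass_threshold, and the {{...}} templating, are assumptions of this sketch rather than fields confirmed by this reference:

```ruby
# Hypothetical grader configurations; field names are assumptions.
testing_criteria = [
  {
    type: "string_check",               # exact string matching
    name: "exact_answer",
    input: "{{sample.output_text}}",    # assumed template for the model output
    reference: "{{item.expected}}",     # assumed template for the expected value
    operation: "eq"
  },
  {
    type: "text_similarity",            # semantic similarity comparison
    name: "semantic_match",
    input: "{{sample.output_text}}",
    reference: "{{item.expected}}",
    pass_threshold: 0.8
  }
]
```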

Data Source Types

Use existing completion IDs from your API usage:
data_source_config: {
  type: "stored_completions",
  completion_ids: ["comp_1", "comp_2", "comp_3"]
}
Upload a JSONL file with custom test cases:
data_source_config: {
  type: "custom",
  file_id: "file_abc123"
}
JSONL format:
{"input": "test question 1", "expected": "answer 1"}
{"input": "test question 2", "expected": "answer 2"}
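A JSONL file in this format can be generated with the Ruby standard library. A sketch; uploading the resulting file to obtain the file_id is a separate Files API step, not shown here:

```ruby
require "json"
require "tempfile"

test_cases = [
  { input: "test question 1", expected: "answer 1" },
  { input: "test question 2", expected: "answer 2" }
]

# Write one JSON object per line (JSONL).
file = Tempfile.new(["eval_cases", ".jsonl"])
test_cases.each { |row| file.puts(JSON.generate(row)) }
file.flush

lines = File.readlines(file.path, chomp: true)
puts lines.first
# => {"input":"test question 1","expected":"answer 1"}
```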
Evaluate recent logged completions:
data_source_config: {
  type: "logs",
  start_time: Time.now - 86400,  # Last 24 hours
  end_time: Time.now,
  filters: { model: "gpt-4" }
}

Complete Example

require "openai"

client = OpenAI::Client.new

# Create evaluation
evaluation = client.evals.create(
  name: "Support response quality",
  data_source_config: {
    type: "custom",
    file_id: "file_test_cases"
  },
  testing_criteria: [
    {
      type: "score_model",
      name: "helpfulness",
      instructions: "Rate how helpful this response is from 1-5",
      min_score: 1,
      max_score: 5,
      model: "gpt-4"
    }
  ],
  metadata: { team: "customer-support" }
)

# Run evaluation
run = client.evals.runs.create(
  evaluation.id,
  model: "gpt-4o",
  parameters: { temperature: 0.7 }
)

# Poll until the run reaches a terminal state
loop do
  run = client.evals.runs.retrieve(evaluation.id, run.id)
  break if %w[completed failed cancelled].include?(run.status)
  sleep 5
end

# Get results
puts "Average helpfulness: #{run.metrics[:avg_score]}" if run.status == "completed"

outputs = client.evals.runs.output_items.list(evaluation.id, run.id)
outputs.each do |item|
  puts "Score: #{item.score} - #{item.output}"
end

Related

• Fine-tuning - Train custom models
• Batches - Batch API requests
• Files - Upload evaluation data
• Evals Guide - Comprehensive evals guide
