The Evals API allows you to create, run, and manage evaluations to test model performance against specific criteria using various data sources and grading methods.
Create Evaluation
Define an evaluation with testing criteria and data source configuration:
eval = client.evals.create(
  name: "Customer support quality evaluation",
  data_source_config: {
    type: "stored_completions",
    completion_ids: ["comp_abc123", "comp_def456"]
  },
  testing_criteria: [
    {
      type: "label_model",
      name: "response_quality",
      instructions: "Rate the helpfulness of the response",
      model: "gpt-4"
    }
  ]
)

puts eval.id
# => "eval_abc123"
name - Descriptive name for the evaluation.
data_source_config - Configuration for the data source. Options:
stored_completions - Use existing completion IDs
custom - Upload custom evaluation data
logs - Use logged completions
testing_criteria - List of graders applied to each response. Supported types:
label_model - Model-based classification grader
score_model - Model-based scoring grader
string_check - Exact string matching
text_similarity - Semantic similarity comparison
python - Custom Python grading logic
metadata - Key-value pairs for attaching metadata (up to 16 pairs).
Retrieve Evaluation
Get an evaluation by ID:
eval = client.evals.retrieve("eval_abc123")

puts eval.name
puts eval.testing_criteria.size
Update Evaluation
Modify evaluation properties:
updated = client.evals.update(
  "eval_abc123",
  name: "Updated evaluation name",
  metadata: { version: "2.0" }
)
The ID of the evaluation to update (first positional argument).
name - New name for the evaluation.
metadata - Updated metadata key-value pairs.
List Evaluations
Retrieve all evaluations for your project:
evals = client.evals.list(
  limit: 20,
  order: "desc",
  order_by: "created_at"
)

evals.auto_paging_each do |eval|
  puts "#{eval.name}: #{eval.id}"
end
limit - Number of evaluations to retrieve (default: 20).
order - Sort order: asc or desc (default: desc).
order_by - Field to sort by: created_at or updated_at.
Delete Evaluation
Permanently delete an evaluation:
result = client.evals.delete("eval_abc123")

puts result.deleted
# => true
Run Evaluation
Execute an evaluation run with a specific model and parameters:
run = client.evals.runs.create(
  "eval_abc123",
  model: "gpt-4",
  parameters: {
    temperature: 0.7,
    max_tokens: 500
  }
)

puts run.id
# => "evalrun_xyz789"
puts run.status
# => "in_progress"
Retrieve Run Results
Check run status and get results:
run = client.evals.runs.retrieve("eval_abc123", "evalrun_xyz789")

if run.status == "completed"
  puts "Success rate: #{run.metrics[:success_rate]}"
  puts "Average score: #{run.metrics[:avg_score]}"
end
List Run Output Items
Get detailed results for each evaluated item:
outputs = client.evals.runs.output_items.list(
  "eval_abc123",
  "evalrun_xyz789"
)

outputs.each do |item|
  puts "Input: #{item.input}"
  puts "Output: #{item.output}"
  puts "Score: #{item.score}"
end
Cancel Run
Stop a running evaluation:
cancelled = client.evals.runs.cancel("eval_abc123", "evalrun_xyz789")

puts cancelled.status
# => "cancelled"
Testing Criteria Examples
Label Model Grader
Score Model Grader
String Check
Text Similarity
testing_criteria: [
  {
    type: "label_model",
    name: "sentiment",
    instructions: "Classify the sentiment as positive, negative, or neutral",
    labels: ["positive", "negative", "neutral"],
    model: "gpt-4"
  }
]
testing_criteria: [
  {
    type: "score_model",
    name: "quality",
    instructions: "Rate the response quality from 1-10",
    min_score: 1,
    max_score: 10,
    model: "gpt-4"
  }
]
testing_criteria: [
  {
    type: "string_check",
    name: "contains_keyword",
    expected_string: "refund policy",
    match_type: "contains"
  }
]
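The string_check grader passes or fails on a direct string comparison. A local sketch of that pass/fail shape, using the expected_string and match_type names from the criteria above (the hosted grader performs the actual comparison; this only illustrates the semantics):

```ruby
# Local sketch of string_check semantics: "exact" requires equality,
# "contains" requires a substring match.
def string_check(output, expected_string:, match_type:)
  case match_type
  when "exact"    then output == expected_string
  when "contains" then output.include?(expected_string)
  else raise ArgumentError, "unknown match_type: #{match_type}"
  end
end

puts string_check("Our refund policy lasts 30 days",
                  expected_string: "refund policy",
                  match_type: "contains")
# => true
```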
testing_criteria: [
  {
    type: "text_similarity",
    name: "answer_accuracy",
    reference_text: "Expected answer text",
    threshold: 0.8
  }
]
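The text_similarity grader scores the output against reference_text and passes when the score clears threshold. The similarity measure itself is the platform's; as a rough local intuition for how a 0-to-1 score meets a threshold, a word-overlap (Jaccard) score behaves the same way (this is NOT the measure the hosted grader uses):

```ruby
require "set"

# Jaccard similarity over lowercased word sets: |A ∩ B| / |A ∪ B|.
def jaccard(a, b)
  wa = a.downcase.split.to_set
  wb = b.downcase.split.to_set
  return 1.0 if wa.empty? && wb.empty?
  (wa & wb).size.to_f / (wa | wb).size
end

score = jaccard("Expected answer text", "expected answer text here")
puts score          # 0.75
puts score >= 0.8   # threshold check, as in the criteria above
```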
Data Source Types
Use existing completion IDs from your API usage:

data_source_config: {
  type: "stored_completions",
  completion_ids: ["comp_1", "comp_2", "comp_3"]
}
Upload a JSONL file with custom test cases:

data_source_config: {
  type: "custom",
  file_id: "file_abc123"
}

JSONL format:

{"input": "test question 1", "expected": "answer 1"}
{"input": "test question 2", "expected": "answer 2"}
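A file in this format can be generated with the standard json library, one JSON object per line (the filename here is illustrative; the file is then uploaded via the Files API to obtain the file_id):

```ruby
require "json"

# Write one JSON object per line -- the JSONL shape shown above.
cases = [
  { input: "test question 1", expected: "answer 1" },
  { input: "test question 2", expected: "answer 2" }
]

File.open("test_cases.jsonl", "w") do |f|
  cases.each { |c| f.puts(c.to_json) }
end
```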
Evaluate recent logged completions:

data_source_config: {
  type: "logs",
  start_time: Time.now - 86400, # last 24 hours
  end_time: Time.now,
  filters: { model: "gpt-4" }
}
Complete Example
require "openai"

client = OpenAI::Client.new

# Create evaluation
eval = client.evals.create(
  name: "Support response quality",
  data_source_config: {
    type: "custom",
    file_id: "file_test_cases"
  },
  testing_criteria: [
    {
      type: "score_model",
      name: "helpfulness",
      instructions: "Rate how helpful this response is from 1-5",
      min_score: 1,
      max_score: 5,
      model: "gpt-4"
    }
  ],
  metadata: { team: "customer-support" }
)

# Run evaluation
run = client.evals.runs.create(
  eval.id,
  model: "gpt-4o",
  parameters: { temperature: 0.7 }
)

# Poll until the run reaches a terminal status
loop do
  run = client.evals.runs.retrieve(eval.id, run.id)
  break if %w[completed failed cancelled].include?(run.status)
  sleep 5
end

# Get results
puts "Average helpfulness: #{run.metrics[:avg_score]}"

outputs = client.evals.runs.output_items.list(eval.id, run.id)
outputs.each do |item|
  puts "Score: #{item.score} - #{item.output}"
end
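For longer jobs you may also want to bound how long you poll. A sketch of a reusable polling helper with a timeout, exercised here with a stub standing in for the real client (TERMINAL_STATUSES, wait_for_run, the Run struct, and the status sequence are all illustrative, not SDK APIs):

```ruby
# Poll until the run reaches a terminal status or the timeout expires.
# `fetch` is any callable returning an object with a `status` string.
TERMINAL_STATUSES = %w[completed failed cancelled].freeze

def wait_for_run(fetch, timeout: 300, interval: 5)
  deadline = Time.now + timeout
  loop do
    run = fetch.call
    return run if TERMINAL_STATUSES.include?(run.status)
    raise "timed out waiting for eval run" if Time.now > deadline
    sleep interval
  end
end

# Stub that reports "in_progress" twice, then "completed".
Run = Struct.new(:status)
statuses = %w[in_progress in_progress completed]
stub = -> { Run.new(statuses.shift) }

puts wait_for_run(stub, interval: 0).status
# => "completed"
```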
Related resources:
Fine-tuning - Train custom models
Batches - Batch API requests
Files - Upload evaluation data
Evals Guide - Comprehensive evals guide