
Overview

The prime eval run command executes rollouts against model APIs and reports aggregate metrics. It supports single-environment evaluations or multi-environment benchmark suites via TOML config files.

Usage

prime eval run <env_id_or_config> [OPTIONS]

Arguments

env_id_or_config
string
required
Either:
  • Environment ID: gsm8k, primeintellect/math-python
  • TOML config path: configs/eval/benchmark.toml (for multi-environment evals)

Model Configuration

--model
string
default:"openai/gpt-4.1-mini"
Model name or endpoint alias from the registry.
Aliases: -m
--api-base-url
string
default:"https://api.pinference.ai/api/v1"
API base URL. Overrides endpoint registry.
Aliases: -b
--api-key-var
string
default:"PRIME_API_KEY"
Environment variable containing API key.
Aliases: -k
--api-client-type
string
default:"openai_chat_completions"
Client type: openai_chat_completions, openai_completions, openai_chat_completions_token, or anthropic_messages.
--endpoints-path
string
default:"./configs/endpoints.toml"
Path to TOML endpoint registry.
Aliases: -e
--provider
string
Provider shorthand (prime, openai, anthropic, openrouter, deepseek, minimax, glm, local, vllm).
Aliases: -p
--header
string
Extra HTTP header (Name: Value). Repeatable.
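Each --header occurrence contributes one "Name: Value" string. A minimal sketch of how such strings can be collected into a header mapping (illustrative only, not the CLI's actual parsing code):

```python
def parse_headers(raw_headers: list[str]) -> dict[str, str]:
    """Split each "Name: Value" string on the first colon."""
    headers = {}
    for raw in raw_headers:
        name, sep, value = raw.partition(":")
        if not sep:
            raise ValueError(f"invalid header (expected 'Name: Value'): {raw!r}")
        headers[name.strip()] = value.strip()
    return headers

# Repeatable flag: each --header occurrence appends one string.
print(parse_headers(["X-Org-Id: 123", "X-Trace: on"]))
```

Splitting on the first colon only means values such as "Authorization: Bearer a:b" keep their embedded colons intact.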

Sampling Parameters

--max-tokens
integer
Maximum tokens to generate.
Aliases: -t
--temperature
float
Sampling temperature.
Aliases: -T
--sampling-args
json
Additional sampling parameters as JSON object.
Aliases: -S
Example: -S '{"top_p": 0.9, "frequency_penalty": 0.5}'

Environment Configuration

--env-args
json
default:"{}"
Arguments passed to load_environment() as JSON.
Aliases: -a
Example: -a '{"difficulty": "hard"}'
--extra-env-kwargs
json
default:"{}"
Arguments passed directly to environment constructor.
Aliases: -x
Example: -x '{"max_turns": 20}'
--env-dir-path
string
default:"./environments"
Base path for environment outputs.

Evaluation Scope

--num-examples
integer
default:"5"
Number of dataset examples to evaluate.
Aliases: -n
--rollouts-per-example
integer
default:"3"
Rollouts per example (for pass@k metrics).
Aliases: -r

Concurrency

--max-concurrent
integer
default:"32"
Maximum concurrent requests (both generation and scoring).
Aliases: -c
--max-concurrent-generation
integer
Concurrent generation requests (defaults to --max-concurrent).
--max-concurrent-scoring
integer
Concurrent scoring requests (defaults to --max-concurrent).
--no-interleave-scoring
flag
Disable interleaved scoring (score all rollouts after generation completes).
Aliases: -N
--independent-scoring
flag
Score each rollout individually instead of by group.
Aliases: -i
--max-retries
integer
default:"0"
Retries per rollout on transient infrastructure errors.
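The split between --max-concurrent-generation and --max-concurrent-scoring can be pictured as two independent semaphores, one per stage: a rollout acquires the generation slot, releases it, then acquires a scoring slot, so scoring is interleaved with generation rather than deferred. A toy sketch of that structure (an assumed model of the behavior, not the CLI's actual implementation):

```python
import asyncio

async def run_rollout(i, gen_sem, score_sem):
    async with gen_sem:          # bounded by --max-concurrent-generation
        await asyncio.sleep(0)   # stand-in for the model API call
    async with score_sem:        # bounded by --max-concurrent-scoring
        await asyncio.sleep(0)   # stand-in for scoring the completion
    return i

async def main(n=30, max_gen=4, max_score=2):
    gen_sem = asyncio.Semaphore(max_gen)
    score_sem = asyncio.Semaphore(max_score)
    # Interleaved scoring: each rollout is scored as soon as it finishes
    # generating; --no-interleave-scoring would instead score everything
    # after all generation completes.
    return await asyncio.gather(
        *(run_rollout(i, gen_sem, score_sem) for i in range(n))
    )

results = asyncio.run(main())
print(len(results))  # 30
```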

Output and Display

--verbose
flag
Enable debug logging.
Aliases: -v
--tui
flag
Use alternate screen mode (TUI) for live display.
Aliases: -u
--debug
flag
Disable Rich display; use normal logging and tqdm progress.
Aliases: -d
--save-results
flag
Save results to disk in ./outputs/evals/ or ./environments/*/outputs/evals/.
Aliases: -s
--state-columns
string
Extra state columns to save (comma-separated).
Aliases: -C
Example: -C "judge_response,parsed_answer"
--resume
string
Resume from a previous run. Optionally provide a path; if omitted, auto-detects the latest incomplete run.
Aliases: -R
--save-to-hf-hub
flag
Push results to Hugging Face Hub.
Aliases: -H
--hf-hub-dataset-name
string
Dataset name for HF Hub upload.
Aliases: -D
--heartbeat-url
string
Heartbeat URL for uptime monitoring.
--disable-env-server
flag
Do not start environment servers for OpenEnv environments.

Examples

Basic Evaluation

prime eval run gsm8k -m gpt-4.1-mini -n 10
Output:
╭──────────────── gsm8k | openai/gpt-4.1-mini ────────────────╮
│ Examples: 10 | Rollouts/ex: 3 | Total: 30 rollouts         │
│                                                              │
│ Progress  ████████████████████████████████  30/30  100%     │
│                                                              │
│ Avg reward:    0.867                                         │
│ Avg metrics:   accuracy=0.867                                │
│ Runtime:       12.3s                                         │
╰──────────────────────────────────────────────────────────────╯

With Custom Sampling

prime eval run math-python \
  -m gpt-4.1-mini \
  -n 50 -r 5 \
  -t 1024 -T 0.7 \
  -S '{"top_p": 0.95}'

With Environment Arguments

prime eval run my-env \
  -m gpt-4.1-mini \
  -a '{"difficulty": "hard", "split": "test"}'

Save and Resume

# Start evaluation with checkpointing
prime eval run gsm8k -n 1000 -s

# If interrupted, resume from checkpoint
prime eval run gsm8k -n 1000 -s --resume

# Resume from specific path
prime eval run gsm8k -n 1000 -s \
  --resume ./environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/abc12345

Using Anthropic API

prime eval run gsm8k \
  -m claude-sonnet-4 \
  --api-client-type anthropic_messages \
  --api-base-url https://api.anthropic.com \
  --api-key-var ANTHROPIC_API_KEY
Or using the provider shorthand:
prime eval run gsm8k -m claude-sonnet-4 -p anthropic

Multi-Environment Benchmark

prime eval run configs/eval/benchmark.toml
configs/eval/benchmark.toml:
model = "openai/gpt-4.1-mini"
num_examples = 100

[[eval]]
env_id = "gsm8k"

[[eval]]
env_id = "math-python"
rollouts_per_example = 5

[[eval]]
env_id = "alphabet-sort"
num_examples = 50

High Concurrency

prime eval run gsm8k \
  -n 1000 -r 10 \
  -c 128 \
  --max-concurrent-generation 64 \
  --max-concurrent-scoring 64

Debug Mode

prime eval run gsm8k -n 5 --debug --verbose

Configuration Files

Endpoint Registry

Define model endpoints in configs/endpoints.toml:
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

# Multiple replicas for load balancing
[[endpoint]]
endpoint_id = "my-model"
model = "my-model"
url = "https://api1.example.com/v1"
key = "API_KEY"

[[endpoint]]
endpoint_id = "my-model"
model = "my-model"
url = "https://api2.example.com/v1"
key = "API_KEY"
Then reference by endpoint ID:
prime eval run gsm8k -m qwen3-235b-i

Multi-Environment Config

# Global defaults
model = "openai/gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
num_examples = 100  # overrides global

[eval.env_args]
difficulty = "hard"

[[eval]]
env_id = "math-python"
# uses global defaults

[[eval]]
env_id = "alphabet-sort"
endpoint_id = "qwen3-235b-i"
num_examples = 25

Results Output

With --save-results, outputs are saved to:
./outputs/evals/{env_id}--{model}/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Config and aggregate metrics
Or per-environment:
./environments/{env_id}/outputs/evals/{env_id}--{model}/{run_id}/
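Judging from the resume example path earlier (gsm8k--openai--gpt-4.1-mini), slashes in the model name appear to be replaced when building the {env_id}--{model} directory name. A sketch of that path construction (inferred from the example; the exact sanitization rule is an assumption):

```python
env_id, model, run_id = "gsm8k", "openai/gpt-4.1-mini", "abc12345"

# "/" in the model name cannot appear in a directory name, so it is
# replaced, yielding e.g. gsm8k--openai--gpt-4.1-mini.
run_dir = f"./outputs/evals/{env_id}--{model.replace('/', '--')}/{run_id}/"
print(run_dir)  # ./outputs/evals/gsm8k--openai--gpt-4.1-mini/abc12345/
```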

results.jsonl Format

Each line contains one rollout:
{
  "prompt": [{"role": "user", "content": "What is 2+2?"}],
  "completion": [{"role": "assistant", "content": "4"}],
  "reward": 1.0,
  "metrics": {"accuracy": 1.0},
  "task": {"question": "What is 2+2?", "answer": "4"},
  "info": {},
  "timing": {"generation_ms": 123, "scoring_ms": 45}
}
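Because each line is a self-contained JSON rollout, the aggregate numbers shown in the run display can be recomputed directly from results.jsonl. A minimal sketch using the field names documented above (illustrative; not part of the CLI):

```python
import json

def summarize(lines):
    """Recompute avg reward and avg metrics from results.jsonl lines."""
    rollouts = [json.loads(line) for line in lines if line.strip()]
    n = len(rollouts)
    avg_reward = sum(r["reward"] for r in rollouts) / n
    metric_keys = rollouts[0]["metrics"].keys()
    avg_metrics = {k: sum(r["metrics"][k] for r in rollouts) / n
                   for k in metric_keys}
    return {"avg_reward": avg_reward, "avg_metrics": avg_metrics}

sample = [
    '{"reward": 1.0, "metrics": {"accuracy": 1.0}}',
    '{"reward": 0.0, "metrics": {"accuracy": 0.0}}',
]
print(summarize(sample))  # {'avg_reward': 0.5, 'avg_metrics': {'accuracy': 0.5}}
```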

metadata.json Format

{
  "env_id": "gsm8k",
  "model": "openai/gpt-4.1-mini",
  "num_examples": 10,
  "rollouts_per_example": 3,
  "avg_reward": 0.867,
  "avg_metrics": {"accuracy": 0.867},
  "time_ms": 12345,
  "sampling_args": {"max_tokens": 512, "temperature": 0.7},
  "env_args": {},
  "date": "2026-03-03",
  "time": "14:23:45"
}

Configuration Precedence

CLI Mode

  1. CLI flags
  2. Environment defaults (from pyproject.toml)
  3. Built-in defaults

TOML Config Mode

  1. Per-eval settings ([[eval]] sections)
  2. Global settings (top of config file)
  3. Environment defaults (from pyproject.toml)
  4. Built-in defaults
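The TOML-config-mode precedence above behaves like a lookup chain: per-eval settings shadow global settings, which shadow environment defaults, which shadow built-ins. A sketch of that resolution with ChainMap (illustrative values; the CLI's real merge logic may differ):

```python
from collections import ChainMap

builtin_defaults = {"num_examples": 5, "rollouts_per_example": 3,
                    "model": "openai/gpt-4.1-mini"}
env_defaults     = {"num_examples": 100, "rollouts_per_example": 5}  # pyproject.toml
global_settings  = {"model": "openai/gpt-4.1-mini", "num_examples": 50}
per_eval         = {"num_examples": 25}  # one [[eval]] section

# Earlier maps shadow later ones, matching the precedence list above.
resolved = ChainMap(per_eval, global_settings, env_defaults, builtin_defaults)
print(resolved["num_examples"])          # 25 (per-eval wins)
print(resolved["rollouts_per_example"])  # 5  (falls through to env defaults)
```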

Environment Defaults

Environments can specify defaults in pyproject.toml:
[tool.verifiers.eval]
num_examples = 100
rollouts_per_example = 5
These are used when higher-priority sources don’t specify a value.

Resuming Evaluations

Long evaluations can be resumed:
# Start with checkpointing
prime eval run gsm8k -n 1000 -s

# Interrupt with Ctrl+C

# Resume automatically (finds latest incomplete run)
prime eval run gsm8k -n 1000 -s --resume
Resume requirements:
  • Same env_id, model, and rollouts_per_example
  • num_examples must be >= original target
  • Results directory must contain valid results.jsonl and metadata.json
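The compatibility requirements above can be checked against the saved metadata.json. A minimal sketch of that validation (illustrative; the CLI's actual checks may cover more fields):

```python
def can_resume(metadata: dict, env_id: str, model: str,
               rollouts_per_example: int, num_examples: int) -> bool:
    """Apply the documented resume requirements to a saved metadata.json."""
    return (
        metadata["env_id"] == env_id
        and metadata["model"] == model
        and metadata["rollouts_per_example"] == rollouts_per_example
        # num_examples may grow, but never shrink below the original target.
        and num_examples >= metadata["num_examples"]
    )

saved = {"env_id": "gsm8k", "model": "openai/gpt-4.1-mini",
         "rollouts_per_example": 3, "num_examples": 1000}
print(can_resume(saved, "gsm8k", "openai/gpt-4.1-mini", 3, 1000))  # True
print(can_resume(saved, "gsm8k", "openai/gpt-4.1-mini", 3, 500))   # False
```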

Troubleshooting

Environment Not Found

The environment is not installed.
Solution:
prime env install gsm8k

API Key Not Set

Error: Environment variable 'PRIME_API_KEY' not set
Solution:
export PRIME_API_KEY="your-api-key"

Rate Limit Errors

Reduce concurrency:
prime eval run gsm8k -n 100 -c 8
Or add retries:
prime eval run gsm8k -n 100 --max-retries 3

Out of Memory

For large evaluations, enable checkpointing and reduce batch size:
prime eval run gsm8k -n 1000 -s -c 16
