The prime eval run command executes rollouts against any supported model provider and reports aggregate metrics. Use it to test environments during development, benchmark models, and validate training progress.

Quick Start

1. Install your environment

   prime env install my-env

2. Run evaluation

   prime eval run my-env -m gpt-4.1-mini -n 10

   This runs 10 examples with 3 rollouts each (default) for a total of 30 rollouts.

3. View results

   Expected output:
Loading environment: my-env
Dataset: 10 examples, 3 rollouts per example = 30 total rollouts
Model: gpt-4.1-mini

Progress: ████████████████████ 30/30 (100%)

Results:
  Reward: 0.73 ± 0.15
  correct_answer: 0.73 ± 0.15
  num_turns: 1.0 ± 0.0
  response_length: 142.3 ± 45.2

Basic Usage

The basic command structure:
prime eval run <env-id> [options]

Essential Options

Flag                     Short   Description                  Default
--model                  -m      Model to evaluate            gpt-4.1-mini
--num-examples           -n      Number of dataset examples   5
--rollouts-per-example   -r      Rollouts per example         3
--save-results           -s      Save results to disk         false
--verbose                -v      Enable debug logging         false

Examples

prime eval run gsm8k -m gpt-4.1-mini -n 20

Model Configuration

Using Model Aliases

Define model endpoints in configs/endpoints.toml:
configs/endpoints.toml
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "claude-sonnet"
model = "claude-sonnet-4-5-20250929"
url = "https://api.anthropic.com"
key = "ANTHROPIC_API_KEY"
api_client_type = "anthropic_messages"
Then use the alias directly:
prime eval run my-env -m qwen3-235b-i

Direct Configuration

Override configuration without using aliases:
prime eval run my-env \
  -m custom-model \
  -b https://my-api.example.com/v1 \
  -k MY_API_KEY

Sampling Parameters

Control model generation:
# Temperature and max tokens
prime eval run my-env -m gpt-4.1-mini -T 0.7 -t 1024

# Additional parameters via JSON
prime eval run my-env -m gpt-4.1-mini \
  -S '{"temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.1}'

Environment Configuration

Passing Arguments to load_environment()

Use --env-args to pass arguments:
my_env.py
def load_environment(
    difficulty: str = "easy",
    num_examples: int = -1,
    seed: int = 42,
):
    # Your environment implementation
    ...
prime eval run my-env \
  -a '{"difficulty": "hard", "num_examples": 100, "seed": 123}'
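Functionally, the `-a` JSON string is parsed and its keys are forwarded as keyword arguments to `load_environment()`. A minimal sketch of that mapping (the stub below just mirrors the example signature above, not the real loader):

```python
import json

# Stub mirroring the example load_environment() signature above
def load_environment(difficulty: str = "easy", num_examples: int = -1, seed: int = 42):
    return {"difficulty": difficulty, "num_examples": num_examples, "seed": seed}

# Roughly what `-a '{"difficulty": "hard", ...}'` does:
env_args = json.loads('{"difficulty": "hard", "num_examples": 100, "seed": 123}')
env = load_environment(**env_args)
```

Keys in the JSON must match parameter names in your `load_environment()` signature; unknown keys raise a `TypeError`.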

Overriding Environment Constructor Args

Use --extra-env-kwargs to pass arguments directly to the constructor:
# Override max_turns
prime eval run my-env -x '{"max_turns": 20}'

# Override multiple parameters
prime eval run my-env \
  -x '{"max_turns": 15, "system_prompt": "Custom prompt"}'

Evaluation Scope

Examples and Rollouts

Control the evaluation size:
# 50 examples × 5 rollouts = 250 total rollouts
prime eval run my-env -n 50 -r 5
Why multiple rollouts per example?
  • Measure variance in model responses
  • Compute pass@k metrics
  • Enable advantage-based RL training
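With results saved via -s, the per-example spread can be inspected directly. A minimal sketch, assuming the results.jsonl fields (`example_id`, `reward`) shown elsewhere on this page:

```python
import json
from collections import defaultdict
from statistics import mean, pstdev

def per_example_stats(path):
    """Group rollout rewards by example_id; return {example_id: (mean, std)}."""
    by_example = defaultdict(list)
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            by_example[r["example_id"]].append(r["reward"])
    return {ex: (mean(rs), pstdev(rs)) for ex, rs in by_example.items()}
```

High per-example variance suggests the task is sensitive to sampling noise; near-zero variance across all examples suggests extra rollouts add little signal.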

Concurrency

Control parallel execution:
# 64 concurrent requests
prime eval run my-env -m gpt-4.1-mini -n 100 -c 64

# Separate generation and scoring concurrency
prime eval run my-env -n 100 \
  --max-concurrent-generation 32 \
  --max-concurrent-scoring 64

Saving and Resuming

Saving Results

Enable checkpointing with -s:
prime eval run my-env -m gpt-4.1-mini -n 100 -s
Results saved to:
./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Configuration and metrics

Resuming Evaluations

1. Start a long evaluation

   prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s

2. If interrupted, resume from checkpoint

   # Auto-detect latest incomplete run
   prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s --resume

   # Or specify exact path
   prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s \
     --resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
When resuming:
  • Existing completed rollouts are loaded
  • Only incomplete rollouts are executed
  • Results are appended to the existing checkpoint
  • If all rollouts complete, returns immediately
Configuration compatibility: When resuming, use the same configuration (model, env-args, rollouts-per-example). Mismatches can lead to undefined behavior.

Saving Custom State Columns

Save environment-specific state fields:
prime eval run my-env -s -C "judge_response,parsed_answer,reasoning_trace"
Default columns: query, completion, expected_answer, reward, error.

Access custom columns in results.jsonl:
{
  "example_id": 0,
  "reward": 0.8,
  "completion": [...],
  "judge_response": "Correct",
  "parsed_answer": "42",
  "reasoning_trace": "First I..."
}

Multi-Environment Evaluation

Evaluate multiple environments with a single command using TOML configs.

Basic Multi-Env Config

configs/eval/my-benchmark.toml
# Global defaults
model = "gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
num_examples = 100  # Override global

[[eval]]
env_id = "math-python"

[[eval]]
env_id = "reverse-text"
rollouts_per_example = 5  # Override global
Run all evaluations:
prime eval run configs/eval/my-benchmark.toml

Per-Environment Configuration

configs/eval/detailed.toml
model = "gpt-4.1-mini"

[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"

[[eval]]
env_id = "wiki-search"
num_examples = 30

[eval.env_args]
max_turns = 15
judge_model = "gpt-4.1-mini"

Using Endpoint Registry

configs/eval/multi-model.toml
endpoints_path = "./configs/endpoints.toml"

[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
num_examples = 100
This runs the same environment with different models for comparison.
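Once the runs finish with -s, their metadata.json files can be compared side by side. A sketch assuming the output layout and the `reward`/`reward_std` metadata keys shown in the Analyzing Results section (run directory paths will vary with your run IDs):

```python
import json
from pathlib import Path

def compare_runs(run_dirs):
    """Collect (name, reward, reward_std) per run directory, sorted best-first."""
    rows = []
    for d in run_dirs:
        meta = json.loads((Path(d) / "metadata.json").read_text())
        rows.append((Path(d).name, meta["reward"], meta["reward_std"]))
    rows.sort(key=lambda r: -r[1])
    for name, reward, std in rows:
        print(f"{name}: {reward:.3f} ± {std:.3f}")
    return rows
```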

Configuration Precedence

When using CLI only:
  1. CLI arguments (highest priority)
  2. Environment defaults from pyproject.toml
  3. Built-in defaults (lowest priority)
When using TOML config:
  1. Per-eval settings in [[eval]] sections (highest priority)
  2. Global settings at top of TOML
  3. Environment defaults from pyproject.toml
  4. Built-in defaults (lowest priority)
When using a TOML config file, all CLI arguments are ignored.
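The precedence rules above amount to a layered dictionary merge in which later layers win. A minimal illustration (the layer contents here are made-up examples, not the actual resolver):

```python
def resolve_config(*layers):
    """Merge config layers in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Lowest to highest priority, as in the TOML-config case above
builtin = {"num_examples": 5, "rollouts_per_example": 3, "model": "gpt-4.1-mini"}
global_toml = {"num_examples": 50}
per_eval = {"rollouts_per_example": 5}

config = resolve_config(builtin, global_toml, per_eval)
# num_examples from the global TOML, rollouts_per_example from [[eval]],
# model falls through to the built-in default
```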

Output and Display

Standard Output

Default display shows progress and summary:
Loading environment: my-env
Dataset: 50 examples, 3 rollouts per example = 150 total rollouts
Model: gpt-4.1-mini

Progress: ████████████████████ 150/150 (100%)

Results:
  Reward: 0.82 ± 0.11
  correct_answer: 0.82 ± 0.11
  num_turns: 6.3 ± 2.1
  total_tool_calls: 8.7 ± 3.2
  search_pages_calls: 2.1 ± 1.0
  read_section_calls: 6.6 ± 2.8

Verbose Mode

Enable detailed logging:
prime eval run my-env -m gpt-4.1-mini -n 2 -v
Shows:
  • Model requests and responses
  • Tool calls and results
  • Reward function execution
  • State updates
  • Timing information

TUI Mode

Use alternate screen mode for cleaner display:
prime eval run my-env -m gpt-4.1-mini -n 100 -u

Debug Mode

Disable Rich display, use standard logging:
prime eval run my-env -m gpt-4.1-mini -n 10 -d
Useful for:
  • CI/CD environments
  • Piping output to files
  • Debugging display issues

Advanced Features

Independent Rollout Scoring

By default, rollouts are scored in groups (all rollouts for the same example together). This enables group-based reward functions and pass@k metrics. Score rollouts independently instead:
prime eval run my-env -m gpt-4.1-mini -n 10 -i
Use when:
  • You don’t need group-based metrics
  • You want faster completion (scoring happens as rollouts finish)
  • Your reward functions don’t use plural arguments (completions, prompts, etc.)

Disable Interleaved Scoring

By default, scoring happens as rollouts complete. Disable to score all at once:
prime eval run my-env -m gpt-4.1-mini -n 10 -N

Automatic Retries

Enable retries for transient infrastructure errors:
prime eval run my-env -m gpt-4.1-mini -n 100 --max-retries 3
Retries with exponential backoff on:
  • Sandbox timeouts
  • API failures
  • Network errors
  • Other vf.InfraError instances
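The retry behavior can be pictured as a standard exponential-backoff loop. A simplified sketch, not the actual implementation (`TimeoutError`/`ConnectionError` stand in for `vf.InfraError`):

```python
import random
import time

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn, retrying transient errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # retries exhausted; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Note that only infrastructure errors are retried; a low reward or a wrong answer is a valid result, not a failure.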

Heartbeat Monitoring

Send heartbeat pings during evaluation:
prime eval run my-env -m gpt-4.1-mini -n 1000 \
  --heartbeat-url https://status.example.com/ping/abc123
Useful for monitoring long-running evaluations.

Push to Hugging Face Hub

Publish results to HF Hub:
prime eval run my-env -m gpt-4.1-mini -n 100 -s \
  --save-to-hf-hub \
  --hf-hub-dataset-name my-org/my-env-results

Analyzing Results

After saving results with -s, analyze the output files.

Results Structure

outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Configuration and metrics

Reading Results

import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)
    print(f"Reward: {metadata['reward']} ± {metadata['reward_std']}")

# Load individual rollouts
rollouts = []
with open("results.jsonl") as f:
    for line in f:
        rollouts.append(json.loads(line))

# Analyze
for rollout in rollouts:
    print(f"Example {rollout['example_id']}: reward={rollout['reward']}")
    print(f"Completion: {rollout['completion'][-1]['content'][:100]}...")

Computing Metrics

import numpy as np

rewards = [r["reward"] for r in rollouts]
print(f"Mean reward: {np.mean(rewards):.3f}")
print(f"Median reward: {np.median(rewards):.3f}")
print(f"Min/Max: {np.min(rewards):.3f} / {np.max(rewards):.3f}")

# Pass@k: fraction of examples where at least one of the first k rollouts succeeds
from collections import defaultdict

def pass_at_k(rollouts, k):
    by_example = defaultdict(list)
    for r in rollouts:
        by_example[r["example_id"]].append(r["reward"])

    successes = sum(max(rs[:k]) > 0.5 for rs in by_example.values())
    return successes / len(by_example)

print(f"Pass@1: {pass_at_k(rollouts, 1):.3f}")
print(f"Pass@3: {pass_at_k(rollouts, 3):.3f}")
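The estimator above only looks at the first k rollouts per example. When you have n ≥ k rollouts available, the commonly used unbiased estimator 1 - C(n-c, k)/C(n, k), where c is the number of successful rollouts, makes use of all of them:

```python
from math import comb

def unbiased_pass_at_k(n, c, k):
    """Unbiased pass@k for one example: n rollouts, c successes."""
    if n - c < k:
        return 1.0  # every size-k subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 rollouts, 2 successes -> pass@1 = 1 - 3/5 = 0.4
print(f"{unbiased_pass_at_k(5, 2, 1):.3f}")
```

Average this quantity over examples to get the dataset-level pass@k.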

Common Workflows

Quick Environment Test

# Fast sanity check during development
prime eval run my-env -m gpt-4.1-mini -n 2 -r 1 -v

Benchmark Evaluation

# Full evaluation with saves
prime eval run my-env -m gpt-4.1-mini -n 100 -r 5 -s

Model Comparison

Create a config:
configs/eval/compare.toml
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
Run:
prime eval run configs/eval/compare.toml

Pre-Training Baseline

# Establish baseline before training
prime eval run my-env -m my-model -n 200 -r 5 -s

Post-Training Validation

# Evaluate checkpoints
for checkpoint in checkpoint_*.pt; do
    prime eval run my-env -m "$checkpoint" -n 100 -s
done

Troubleshooting

Slow Evaluation

  • Increase concurrency: -c 64
  • Reduce rollouts per example: -r 1
  • Use faster model for initial testing
  • Check network latency to API

Out of Memory

  • Reduce concurrency: -c 16
  • Use smaller models
  • Enable independent scoring: -i

API Rate Limits

  • Reduce concurrency: -c 8
  • Use endpoint registry with multiple replicas
  • Add retries: --max-retries 3

Inconsistent Results

  • Increase rollouts per example: -r 10
  • Check temperature setting: -T 0.0 for deterministic
  • Ensure environment is deterministic (check random seeds)

Next Steps

  • Training: Use evaluation results to guide RL training → Training Guide
  • Environment improvements: Analyze failed rollouts to improve your environment
  • Model selection: Compare models to choose the best starting point for training
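For failure analysis, a quick way to surface the worst rollouts from a saved run; a sketch assuming the results.jsonl fields (`example_id`, `reward`) shown above:

```python
import json

def worst_rollouts(path, threshold=0.5, limit=5):
    """Return the lowest-reward rollouts from a saved results.jsonl."""
    with open(path) as f:
        rollouts = [json.loads(line) for line in f]
    failed = [r for r in rollouts if r["reward"] < threshold]
    return sorted(failed, key=lambda r: r["reward"])[:limit]
```

Reading the full completions of these rollouts often reveals parsing gaps or reward-function edge cases faster than aggregate metrics do.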
