The prime eval run command executes rollouts against any supported model provider and reports aggregate metrics. Use it to test environments during development, benchmark models, and validate training progress.

Quick Start

1. Install your environment

   prime env install my-env

2. Run evaluation

   prime eval run my-env -m gpt-4.1-mini -n 10

   This runs 10 examples with 3 rollouts each (default) for a total of 30 rollouts.

3. View results

   Expected output:
Loading environment: my-env
Dataset: 10 examples, 3 rollouts per example = 30 total rollouts
Model: gpt-4.1-mini

Progress: ████████████████████ 30/30 (100%)

Results:
  Reward: 0.73 ± 0.15
  correct_answer: 0.73 ± 0.15
  num_turns: 1.0 ± 0.0
  response_length: 142.3 ± 45.2

Basic Usage

The basic command structure:
prime eval run <env-id> [options]

Essential Options

Flag                     Short   Description                  Default
--model                  -m      Model to evaluate            gpt-4.1-mini
--num-examples           -n      Number of dataset examples   5
--rollouts-per-example   -r      Rollouts per example         3
--save-results           -s      Save results to disk         false
--verbose                -v      Enable debug logging         false

Examples

prime eval run gsm8k -m gpt-4.1-mini -n 20

Model Configuration

Using Model Aliases

Define model endpoints in configs/endpoints.toml:
configs/endpoints.toml
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "claude-sonnet"
model = "claude-sonnet-4-5-20250929"
url = "https://api.anthropic.com"
key = "ANTHROPIC_API_KEY"
api_client_type = "anthropic_messages"
Then use the alias directly:
prime eval run my-env -m qwen3-235b-i

Direct Configuration

Override configuration without using aliases:
prime eval run my-env \
  -m custom-model \
  -b https://my-api.example.com/v1 \
  -k MY_API_KEY

Sampling Parameters

Control model generation:
# Temperature and max tokens
prime eval run my-env -m gpt-4.1-mini -T 0.7 -t 1024

# Additional parameters via JSON
prime eval run my-env -m gpt-4.1-mini \
  -S '{"temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.1}'

Environment Configuration

Passing Arguments to load_environment()

Use --env-args to pass arguments:
my_env.py
def load_environment(
    difficulty: str = "easy",
    num_examples: int = -1,
    seed: int = 42,
):
    # Your environment implementation
    ...
prime eval run my-env \
  -a '{"difficulty": "hard", "num_examples": 100, "seed": 123}'
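Functionally, the `-a` JSON string is parsed and its keys are forwarded as keyword arguments to `load_environment()`. A minimal sketch of that mapping (the stub below just mirrors the example signature above, not the real loader):

```python
import json

# Stub mirroring the example load_environment() signature above
def load_environment(difficulty: str = "easy", num_examples: int = -1, seed: int = 42):
    return {"difficulty": difficulty, "num_examples": num_examples, "seed": seed}

# Roughly what `-a '{"difficulty": "hard", ...}'` does:
env_args = json.loads('{"difficulty": "hard", "num_examples": 100, "seed": 123}')
env = load_environment(**env_args)
```

Keys in the JSON must match parameter names in your `load_environment()` signature; unknown keys raise a `TypeError`.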

Overriding Environment Constructor Args

Use --extra-env-kwargs to pass arguments directly to the constructor:
# Override max_turns
prime eval run my-env -x '{"max_turns": 20}'

# Override multiple parameters
prime eval run my-env \
  -x '{"max_turns": 15, "system_prompt": "Custom prompt"}'

Evaluation Scope

Examples and Rollouts

Control the evaluation size:
# 50 examples × 5 rollouts = 250 total rollouts
prime eval run my-env -n 50 -r 5
Why multiple rollouts per example?
  • Measure variance in model responses
  • Compute pass@k metrics
  • Enable advantage-based RL training
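With results saved via -s, the per-example spread can be inspected directly. A minimal sketch, assuming the results.jsonl fields (`example_id`, `reward`) shown elsewhere on this page:

```python
import json
from collections import defaultdict
from statistics import mean, pstdev

def per_example_stats(path):
    """Group rollout rewards by example_id; return {example_id: (mean, std)}."""
    by_example = defaultdict(list)
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            by_example[r["example_id"]].append(r["reward"])
    return {ex: (mean(rs), pstdev(rs)) for ex, rs in by_example.items()}
```

High per-example variance suggests the task is sensitive to sampling noise; near-zero variance across all examples suggests extra rollouts add little signal.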

Concurrency

Control parallel execution:
# 64 concurrent requests
prime eval run my-env -m gpt-4.1-mini -n 100 -c 64

# Separate generation and scoring concurrency
prime eval run my-env -n 100 \
  --max-concurrent-generation 32 \
  --max-concurrent-scoring 64

Saving and Resuming

Saving Results

Enable checkpointing with -s:
prime eval run my-env -m gpt-4.1-mini -n 100 -s
Results saved to:
./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Configuration and metrics

Resuming Evaluations

1. Start a long evaluation

   prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s

2. If interrupted, resume from checkpoint

   # Auto-detect latest incomplete run
   prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s --resume

   # Or specify exact path
   prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s \
     --resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
When resuming:
  • Existing completed rollouts are loaded
  • Only incomplete rollouts are executed
  • Results are appended to the existing checkpoint
  • If all rollouts complete, returns immediately
Configuration compatibility: When resuming, use the same configuration (model, env-args, rollouts-per-example). Mismatches can lead to undefined behavior.

Saving Custom State Columns

Save environment-specific state fields:
prime eval run my-env -s -C "judge_response,parsed_answer,reasoning_trace"
Default columns: query, completion, expected_answer, reward, error.

Access custom columns in results.jsonl:
{
  "example_id": 0,
  "reward": 0.8,
  "completion": [...],
  "judge_response": "Correct",
  "parsed_answer": "42",
  "reasoning_trace": "First I..."
}

Multi-Environment Evaluation

Evaluate multiple environments with a single command using TOML configs.

Basic Multi-Env Config

configs/eval/my-benchmark.toml
# Global defaults
model = "gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
num_examples = 100  # Override global

[[eval]]
env_id = "math-python"

[[eval]]
env_id = "reverse-text"
rollouts_per_example = 5  # Override global
Run all evaluations:
prime eval run configs/eval/my-benchmark.toml

Per-Environment Configuration

configs/eval/detailed.toml
model = "gpt-4.1-mini"

[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"

[[eval]]
env_id = "wiki-search"
num_examples = 30

[eval.env_args]
max_turns = 15
judge_model = "gpt-4.1-mini"

Using Endpoint Registry

configs/eval/multi-model.toml
endpoints_path = "./configs/endpoints.toml"

[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
num_examples = 100
This runs the same environment with different models for comparison.
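Once the runs finish with -s, their metadata.json files can be compared side by side. A sketch assuming the output layout and the `reward`/`reward_std` metadata keys shown in the Analyzing Results section (run directory paths will vary with your run IDs):

```python
import json
from pathlib import Path

def compare_runs(run_dirs):
    """Collect (name, reward, reward_std) per run directory, sorted best-first."""
    rows = []
    for d in run_dirs:
        meta = json.loads((Path(d) / "metadata.json").read_text())
        rows.append((Path(d).name, meta["reward"], meta["reward_std"]))
    rows.sort(key=lambda r: -r[1])
    for name, reward, std in rows:
        print(f"{name}: {reward:.3f} ± {std:.3f}")
    return rows
```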

Configuration Precedence

When using CLI only:
  1. CLI arguments (highest priority)
  2. Environment defaults from pyproject.toml
  3. Built-in defaults (lowest priority)
When using TOML config:
  1. Per-eval settings in [[eval]] sections (highest priority)
  2. Global settings at top of TOML
  3. Environment defaults from pyproject.toml
  4. Built-in defaults (lowest priority)
When using a TOML config file, all CLI arguments are ignored.
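The precedence rules above amount to a layered dictionary merge in which later layers win. A minimal illustration (the layer contents here are made-up examples, not the actual resolver):

```python
def resolve_config(*layers):
    """Merge config layers in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Lowest to highest priority, as in the TOML-config case above
builtin = {"num_examples": 5, "rollouts_per_example": 3, "model": "gpt-4.1-mini"}
global_toml = {"num_examples": 50}
per_eval = {"rollouts_per_example": 5}

config = resolve_config(builtin, global_toml, per_eval)
# num_examples from the global TOML, rollouts_per_example from [[eval]],
# model falls through to the built-in default
```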

Output and Display

Standard Output

Default display shows progress and summary:
Loading environment: my-env
Dataset: 50 examples, 3 rollouts per example = 150 total rollouts
Model: gpt-4.1-mini

Progress: ████████████████████ 150/150 (100%)

Results:
  Reward: 0.82 ± 0.11
  correct_answer: 0.82 ± 0.11
  num_turns: 6.3 ± 2.1
  total_tool_calls: 8.7 ± 3.2
  search_pages_calls: 2.1 ± 1.0
  read_section_calls: 6.6 ± 2.8

Verbose Mode

Enable detailed logging:
prime eval run my-env -m gpt-4.1-mini -n 2 -v
Shows:
  • Model requests and responses
  • Tool calls and results
  • Reward function execution
  • State updates
  • Timing information

TUI Mode

Use alternate screen mode for cleaner display:
prime eval run my-env -m gpt-4.1-mini -n 100 -u

Debug Mode

Disable Rich display, use standard logging:
prime eval run my-env -m gpt-4.1-mini -n 10 -d
Useful for:
  • CI/CD environments
  • Piping output to files
  • Debugging display issues

Advanced Features

Independent Rollout Scoring

By default, rollouts are scored in groups (all rollouts for the same example together). This enables group-based reward functions and pass@k metrics. Score rollouts independently instead:
prime eval run my-env -m gpt-4.1-mini -n 10 -i
Use when:
  • You don’t need group-based metrics
  • You want faster completion (scoring happens as rollouts finish)
  • Your reward functions don’t use plural arguments (completions, prompts, etc.)

Disable Interleaved Scoring

By default, scoring happens as rollouts complete. Disable to score all at once:
prime eval run my-env -m gpt-4.1-mini -n 10 -N

Automatic Retries

Enable retries for transient infrastructure errors:
prime eval run my-env -m gpt-4.1-mini -n 100 --max-retries 3
Retries with exponential backoff on:
  • Sandbox timeouts
  • API failures
  • Network errors
  • Other vf.InfraError instances
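The retry behavior can be pictured as a standard exponential-backoff loop. A simplified sketch, not the actual implementation (`TimeoutError`/`ConnectionError` stand in for `vf.InfraError`):

```python
import random
import time

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn, retrying transient errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # retries exhausted; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Note that only infrastructure errors are retried; a low reward or a wrong answer is a valid result, not a failure.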

Heartbeat Monitoring

Send heartbeat pings during evaluation:
prime eval run my-env -m gpt-4.1-mini -n 1000 \
  --heartbeat-url https://status.example.com/ping/abc123
Useful for monitoring long-running evaluations.

Push to Hugging Face Hub

Publish results to HF Hub:
prime eval run my-env -m gpt-4.1-mini -n 100 -s \
  --save-to-hf-hub \
  --hf-hub-dataset-name my-org/my-env-results

Analyzing Results

After saving results with -s, analyze the output files.

Results Structure

outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Configuration and metrics

Reading Results

import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)
    print(f"Reward: {metadata['reward']} ± {metadata['reward_std']}")

# Load individual rollouts
rollouts = []
with open("results.jsonl") as f:
    for line in f:
        rollouts.append(json.loads(line))

# Analyze
for rollout in rollouts:
    print(f"Example {rollout['example_id']}: reward={rollout['reward']}")
    print(f"Completion: {rollout['completion'][-1]['content'][:100]}...")

Computing Metrics

import numpy as np

rewards = [r["reward"] for r in rollouts]
print(f"Mean reward: {np.mean(rewards):.3f}")
print(f"Median reward: {np.median(rewards):.3f}")
print(f"Min/Max: {np.min(rewards):.3f} / {np.max(rewards):.3f}")

# Pass@k: fraction of examples where at least one of the first k rollouts succeeds
from collections import defaultdict

def pass_at_k(rollouts, k):
    by_example = defaultdict(list)
    for r in rollouts:
        by_example[r["example_id"]].append(r["reward"])

    successes = sum(max(rs[:k]) > 0.5 for rs in by_example.values())
    return successes / len(by_example)

print(f"Pass@1: {pass_at_k(rollouts, 1):.3f}")
print(f"Pass@3: {pass_at_k(rollouts, 3):.3f}")
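The estimator above only looks at the first k rollouts per example. When you have n ≥ k rollouts available, the commonly used unbiased estimator 1 - C(n-c, k)/C(n, k), where c is the number of successful rollouts, makes use of all of them:

```python
from math import comb

def unbiased_pass_at_k(n, c, k):
    """Unbiased pass@k for one example: n rollouts, c successes."""
    if n - c < k:
        return 1.0  # every size-k subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 rollouts, 2 successes -> pass@1 = 1 - 3/5 = 0.4
print(f"{unbiased_pass_at_k(5, 2, 1):.3f}")
```

Average this quantity over examples to get the dataset-level pass@k.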

Common Workflows

Quick Environment Test

# Fast sanity check during development
prime eval run my-env -m gpt-4.1-mini -n 2 -r 1 -v

Benchmark Evaluation

# Full evaluation with saves
prime eval run my-env -m gpt-4.1-mini -n 100 -r 5 -s

Model Comparison

Create a config:
configs/eval/compare.toml
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
Run:
prime eval run configs/eval/compare.toml

Pre-Training Baseline

# Establish baseline before training
prime eval run my-env -m my-model -n 200 -r 5 -s

Post-Training Validation

# Evaluate checkpoints
for checkpoint in checkpoint_*.pt; do
    prime eval run my-env -m "$checkpoint" -n 100 -s
done

Troubleshooting

Slow Evaluation

  • Increase concurrency: -c 64
  • Reduce rollouts per example: -r 1
  • Use faster model for initial testing
  • Check network latency to API

Out of Memory

  • Reduce concurrency: -c 16
  • Use smaller models
  • Enable independent scoring: -i

API Rate Limits

  • Reduce concurrency: -c 8
  • Use endpoint registry with multiple replicas
  • Add retries: --max-retries 3

Inconsistent Results

  • Increase rollouts per example: -r 10
  • Check temperature setting: -T 0.0 for deterministic
  • Ensure environment is deterministic (check random seeds)

Next Steps

  • Training: Use evaluation results to guide RL training → Training Guide
  • Environment improvements: Analyze failed rollouts to improve your environment
  • Model selection: Compare models to choose the best starting point for training
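For failure analysis, a quick way to surface the worst rollouts from a saved run; a sketch assuming the results.jsonl fields (`example_id`, `reward`) shown above:

```python
import json

def worst_rollouts(path, threshold=0.5, limit=5):
    """Return the lowest-reward rollouts from a saved results.jsonl."""
    with open(path) as f:
        rollouts = [json.loads(line) for line in f]
    failed = [r for r in rollouts if r["reward"] < threshold]
    return sorted(failed, key=lambda r: r["reward"])[:limit]
```

Reading the full completions of these rollouts often reveals parsing gaps or reward-function edge cases faster than aggregate metrics do.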
