The prime eval run command executes rollouts against any supported model provider and reports aggregate metrics. Use it to test environments during development, benchmark models, and validate training progress.
Quick Start
prime eval run my-env -m gpt-4.1-mini -n 10
This runs 10 examples with 3 rollouts each (default) for a total of 30 rollouts.
Loading environment: my-env
Dataset: 10 examples, 3 rollouts per example = 30 total rollouts
Model: gpt-4.1-mini
Progress: ████████████████████ 30/30 (100%)
Results:
Reward: 0.73 ± 0.15
correct_answer: 0.73 ± 0.15
num_turns: 1.0 ± 0.0
response_length: 142.3 ± 45.2
Basic Usage
The basic command structure:
prime eval run <env-id> [options]
Essential Options
| Flag | Short | Description | Default |
| --- | --- | --- | --- |
| --model | -m | Model to evaluate | gpt-4.1-mini |
| --num-examples | -n | Number of dataset examples | 5 |
| --rollouts-per-example | -r | Rollouts per example | 3 |
| --save-results | -s | Save results to disk | false |
| --verbose | -v | Enable debug logging | false |
Examples
Basic Eval
prime eval run gsm8k -m gpt-4.1-mini -n 20
Model Configuration
Using Model Aliases
Define model endpoints in configs/endpoints.toml:
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "claude-sonnet"
model = "claude-sonnet-4-5-20250929"
url = "https://api.anthropic.com"
key = "ANTHROPIC_API_KEY"
api_client_type = "anthropic_messages"
Then use the alias directly:
prime eval run my-env -m qwen3-235b-i
Direct Configuration
Override configuration without using aliases:
prime eval run my-env \
-m custom-model \
-b https://my-api.example.com/v1 \
-k MY_API_KEY
Sampling Parameters
Control model generation:
# Temperature and max tokens
prime eval run my-env -m gpt-4.1-mini -T 0.7 -t 1024
# Additional parameters via JSON
prime eval run my-env -m gpt-4.1-mini \
-S '{"temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.1}'
Environment Configuration
Passing Arguments to load_environment()
Use --env-args to pass arguments:
def load_environment(
    difficulty: str = "easy",
    num_examples: int = -1,
    seed: int = 42,
):
    # Your environment implementation
    ...
prime eval run my-env \
-a '{"difficulty": "hard", "num_examples": 100, "seed": 123}'
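Conceptually, the CLI parses the `-a` JSON string and splats the resulting dict into `load_environment()` as keyword arguments. A minimal sketch (the stand-in `load_environment` below just echoes its config for illustration; the real entry point builds your environment):

```python
import json

def load_environment(difficulty: str = "easy", num_examples: int = -1, seed: int = 42):
    # Stand-in for your environment's entry point; returns its config for illustration.
    return {"difficulty": difficulty, "num_examples": num_examples, "seed": seed}

# The -a JSON string is parsed and passed as keyword arguments.
env_args = json.loads('{"difficulty": "hard", "num_examples": 100, "seed": 123}')
env = load_environment(**env_args)
```

Because the JSON keys must match the function's parameter names, a typo in `-a` surfaces as a `TypeError` at load time.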
Overriding Environment Constructor Args
Use --extra-env-kwargs to pass arguments directly to the constructor:
# Override max_turns
prime eval run my-env -x '{"max_turns": 20}'
# Override multiple parameters
prime eval run my-env \
-x '{"max_turns": 15, "system_prompt": "Custom prompt"}'
Evaluation Scope
Examples and Rollouts
Control the evaluation size:
# 50 examples × 5 rollouts = 250 total rollouts
prime eval run my-env -n 50 -r 5
Why multiple rollouts per example?
Measure variance in model responses
Compute pass@k metrics
Enable advantage-based RL training
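To make the last point concrete, here is an illustrative computation of group-relative advantages over per-example reward groups (toy numbers, not `prime` output):

```python
# Toy rewards: 3 rollouts for each of 2 examples (illustrative only).
groups = {0: [1.0, 0.0, 1.0], 1: [0.0, 0.0, 1.0]}

advantages = {}
for ex_id, rewards in groups.items():
    mean = sum(rewards) / len(rewards)
    # Group-relative advantage: reward minus the example's mean reward,
    # the quantity advantage-based RL methods train on.
    advantages[ex_id] = [round(r - mean, 3) for r in rewards]
```

With a single rollout per example, every advantage would be zero, which is why advantage-based training needs `r` greater than 1.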
Concurrency
Control parallel execution:
# 64 concurrent requests
prime eval run my-env -m gpt-4.1-mini -n 100 -c 64
# Separate generation and scoring concurrency
prime eval run my-env -n 100 \
--max-concurrent-generation 32 \
--max-concurrent-scoring 64
Saving and Resuming
Saving Results
Enable checkpointing with -s:
prime eval run my-env -m gpt-4.1-mini -n 100 -s
Results saved to:
./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl # One rollout per line
└── metadata.json # Configuration and metrics
Resuming Evaluations
prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s
If interrupted, resume from the checkpoint:
# Auto-detect latest incomplete run
prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s --resume
# Or specify exact path
prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s \
--resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
When resuming:
Existing completed rollouts are loaded
Only incomplete rollouts are executed
Results are appended to the existing checkpoint
If all rollouts complete, returns immediately
Configuration compatibility: When resuming, use the same configuration (model, env-args, rollouts-per-example). Mismatches can lead to undefined behavior.
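Conceptually, resuming amounts to reading back which rollouts are already in results.jsonl and skipping them. A minimal sketch (the actual bookkeeping in `prime` may differ; the `rollout_id` field name is an assumption):

```python
import json
import os

def completed_ids(checkpoint_dir):
    # Collect (example_id, rollout index) pairs already saved in the checkpoint.
    path = os.path.join(checkpoint_dir, "results.jsonl")
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                r = json.loads(line)
                done.add((r["example_id"], r.get("rollout_id", 0)))
    return done
```

Any (example, rollout) pair not in this set is still pending and gets executed on resume.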
Saving Custom State Columns
Save environment-specific state fields:
prime eval run my-env -s -C "judge_response,parsed_answer,reasoning_trace"
Default columns: query, completion, expected_answer, reward, error
Access in results.jsonl:
{
  "example_id": 0,
  "reward": 0.8,
  "completion": [...],
  "judge_response": "Correct",
  "parsed_answer": "42",
  "reasoning_trace": "First I..."
}
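Once a custom column is saved, it is just another JSON field per line. For instance, a small helper to tally the values of one column across all rollouts (illustrative; `column_counts` is not part of the CLI):

```python
import json
from collections import Counter

def column_counts(path, column):
    # Tally the values of one saved state column across all rollouts.
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line).get(column, "missing")] += 1
    return counts
```

For example, `column_counts("results.jsonl", "judge_response")` shows how often the judge returned each verdict.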
Multi-Environment Evaluation
Evaluate multiple environments with a single command using TOML configs.
Basic Multi-Env Config
configs/eval/my-benchmark.toml
# Global defaults
model = "gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 3
[[eval]]
env_id = "gsm8k"
num_examples = 100  # Override global

[[eval]]
env_id = "math-python"

[[eval]]
env_id = "reverse-text"
rollouts_per_example = 5  # Override global
Run all evaluations:
prime eval run configs/eval/my-benchmark.toml
Per-Environment Configuration
configs/eval/detailed.toml
model = "gpt-4.1-mini"
[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"

[[eval]]
env_id = "wiki-search"
num_examples = 30

[eval.env_args]
max_turns = 15
judge_model = "gpt-4.1-mini"
Using Endpoint Registry
configs/eval/multi-model.toml
endpoints_path = "./configs/endpoints.toml"
[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
num_examples = 100
This runs the same environment with different models for comparison.
Configuration Precedence
When using CLI only :
CLI arguments (highest priority)
Environment defaults from pyproject.toml
Built-in defaults (lowest priority)
When using TOML config :
Per-eval settings in [[eval]] sections (highest priority)
Global settings at top of TOML
Environment defaults from pyproject.toml
Built-in defaults (lowest priority)
When using a TOML config file, all CLI arguments are ignored.
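The layered lookup can be modeled as dict merges from lowest to highest priority (illustrative values; the real resolver may differ in detail):

```python
builtin = {"num_examples": 5, "rollouts_per_example": 3}   # built-in defaults
env_defaults = {"rollouts_per_example": 4}                 # from pyproject.toml
toml_global = {"num_examples": 50}                         # top of TOML config
per_eval = {"num_examples": 100}                           # [[eval]] section

# Later dicts win: built-ins < env defaults < TOML globals < per-eval settings.
resolved = {**builtin, **env_defaults, **toml_global, **per_eval}
```

Here `resolved` ends up with `num_examples = 100` (per-eval) and `rollouts_per_example = 4` (environment default, since nothing higher overrides it).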
Output and Display
Standard Output
Default display shows progress and summary:
Loading environment: my-env
Dataset: 50 examples, 3 rollouts per example = 150 total rollouts
Model: gpt-4.1-mini
Progress: ████████████████████ 150/150 (100%)
Results:
Reward: 0.82 ± 0.11
correct_answer: 0.82 ± 0.11
num_turns: 6.3 ± 2.1
total_tool_calls: 8.7 ± 3.2
search_pages_calls: 2.1 ± 1.0
read_section_calls: 6.6 ± 2.8
Verbose Mode
Enable detailed logging:
prime eval run my-env -m gpt-4.1-mini -n 2 -v
Shows:
Model requests and responses
Tool calls and results
Reward function execution
State updates
Timing information
TUI Mode
Use alternate screen mode for cleaner display:
prime eval run my-env -m gpt-4.1-mini -n 100 -u
Debug Mode
Disable Rich display, use standard logging:
prime eval run my-env -m gpt-4.1-mini -n 10 -d
Useful for:
CI/CD environments
Piping output to files
Debugging display issues
Advanced Features
Independent Rollout Scoring
By default, rollouts are scored in groups (all rollouts for the same example together). This enables group-based reward functions and pass@k metrics.
Score rollouts independently instead:
prime eval run my-env -m gpt-4.1-mini -n 10 -i
Use when:
You don’t need group-based metrics
You want faster completion (scoring happens as rollouts finish)
Your reward functions don’t use plural arguments (completions, prompts, etc.)
Disable Interleaved Scoring
By default, scoring happens as rollouts complete. Disable to score all at once:
prime eval run my-env -m gpt-4.1-mini -n 10 -N
Automatic Retries
Enable retries for transient infrastructure errors:
prime eval run my-env -m gpt-4.1-mini -n 100 --max-retries 3
Retries with exponential backoff on:
Sandbox timeouts
API failures
Network errors
Other vf.InfraError instances
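The retry policy amounts to catching transient errors and sleeping an exponentially growing delay between attempts. A minimal sketch (using `ConnectionError` as a stand-in for `vf.InfraError`; `with_retries` is illustrative, not a `prime` API):

```python
import random
import time

def with_retries(fn, max_retries=3, base_delay=1.0):
    # Retry transient failures with exponential backoff plus a little jitter.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ConnectionError:  # stand-in for vf.InfraError
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)
```

Jitter spreads retries out so many concurrent rollouts do not hammer the API in lockstep after a shared failure.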
Heartbeat Monitoring
Send heartbeat pings during evaluation:
prime eval run my-env -m gpt-4.1-mini -n 1000 \
--heartbeat-url https://status.example.com/ping/abc123
Useful for monitoring long-running evaluations.
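A heartbeat is just a periodic ping fired while the run is alive. A minimal sketch of the pattern (illustrative; `start_heartbeat` is not a `prime` API, and `send` might wrap `urllib.request.urlopen` on the heartbeat URL):

```python
import threading

def start_heartbeat(send, interval=60.0):
    # Call send() every `interval` seconds until the returned event is set.
    stop = threading.Event()

    def loop():
        while not stop.wait(interval):
            send()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

If the monitoring service stops receiving pings before the run finishes, the evaluation has likely died.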
Push to Hugging Face Hub
Publish results to HF Hub:
prime eval run my-env -m gpt-4.1-mini -n 100 -s \
--save-to-hf-hub \
--hf-hub-dataset-name my-org/my-env-results
Analyzing Results
After saving results with -s, analyze the output files.
Results Structure
outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl # One rollout per line
└── metadata.json # Configuration and metrics
Reading Results
import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)
print(f"Reward: {metadata['reward']} ± {metadata['reward_std']}")

# Load individual rollouts (one JSON object per line)
rollouts = []
with open("results.jsonl") as f:
    for line in f:
        rollouts.append(json.loads(line))

# Analyze
for rollout in rollouts:
    print(f"Example {rollout['example_id']}: reward={rollout['reward']}")
    print(f"Completion: {rollout['completion'][-1]['content'][:100]}...")
Computing Metrics
import numpy as np

rewards = [r["reward"] for r in rollouts]
print(f"Mean reward: {np.mean(rewards):.3f}")
print(f"Median reward: {np.median(rewards):.3f}")
print(f"Min/Max: {np.min(rewards):.3f} / {np.max(rewards):.3f}")

# Pass@k: fraction of examples with at least one success among the first k rollouts
def pass_at_k(rollouts, k):
    examples = {}
    for r in rollouts:
        examples.setdefault(r["example_id"], []).append(r["reward"])
    successes = sum(max(rewards[:k]) > 0.5 for rewards in examples.values())
    return successes / len(examples)

print(f"Pass@1: {pass_at_k(rollouts, 1):.3f}")
print(f"Pass@3: {pass_at_k(rollouts, 3):.3f}")
Common Workflows
Quick Environment Test
# Fast sanity check during development
prime eval run my-env -m gpt-4.1-mini -n 2 -r 1 -v
Benchmark Evaluation
# Full evaluation with saves
prime eval run my-env -m gpt-4.1-mini -n 100 -r 5 -s
Model Comparison
Create a config:
configs/eval/compare.toml
num_examples = 50
rollouts_per_example = 3
[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
Run:
prime eval run configs/eval/compare.toml
Pre-Training Baseline
# Establish baseline before training
prime eval run my-env -m my-model -n 200 -r 5 -s
Post-Training Validation
# Evaluate checkpoints
for checkpoint in checkpoint_*.pt; do
  prime eval run my-env -m "$checkpoint" -n 100 -s
done
Troubleshooting
Slow Evaluation
Increase concurrency: -c 64
Reduce rollouts per example: -r 1
Use faster model for initial testing
Check network latency to API
Out of Memory
Reduce concurrency: -c 16
Use smaller models
Enable independent scoring: -i
API Rate Limits
Reduce concurrency: -c 8
Use endpoint registry with multiple replicas
Add retries: --max-retries 3
Inconsistent Results
Increase rollouts per example: -r 10
Lower the temperature: -T 0.0 for near-deterministic output
Ensure environment is deterministic (check random seeds)
Next Steps
Training : Use evaluation results to guide RL training → Training Guide
Environment improvements : Analyze failed rollouts to improve your environment
Model selection : Compare models to choose the best starting point for training