Overview
The `prime eval run` command executes rollouts against model APIs and reports aggregate metrics. It supports single-environment evaluations or multi-environment benchmark suites via TOML config files.
Usage
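The general invocation shape, reconstructed from the arguments described below (the positional argument is either an environment ID or a TOML config path):

```bash
prime eval run <ENV_ID | CONFIG.toml> [OPTIONS]
```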
Arguments
Either:
- Environment ID: `gsm8k`, `primeintellect/math-python`
- TOML config path: `configs/eval/benchmark.toml` (for multi-environment evals)
Model Configuration
- `-m` — Model name or endpoint alias from the registry.
- `-b` — API base URL. Overrides endpoint registry.
- `-k` — Environment variable containing the API key.
- Client type: `openai_chat_completions`, `openai_completions`, `openai_chat_completions_token`, or `anthropic_messages`.
- `-e` — Path to TOML endpoint registry.
- `-p` — Provider shorthand (`prime`, `openai`, `anthropic`, `openrouter`, `deepseek`, `minimax`, `glm`, `local`, `vllm`).
- Extra HTTP header (`Name: Value`). Repeatable.

Sampling Parameters
- `-t` — Maximum tokens to generate.
- `-T` — Sampling temperature.
- `-S` — Additional sampling parameters as a JSON object. Example: `-S '{"top_p": 0.9, "frequency_penalty": 0.5}'`

Environment Configuration
- `-a` — Arguments passed to `load_environment()` as JSON. Example: `-a '{"difficulty": "hard"}'`
- `-x` — Arguments passed directly to the environment constructor. Example: `-x '{"max_turns": 20}'`
- Base path for environment outputs.
Evaluation Scope
- `-n` — Number of dataset examples to evaluate.
- `-r` — Rollouts per example (for pass@k metrics).

Concurrency
- `-c` — Maximum concurrent requests (both generation and scoring).
- Concurrent generation requests (defaults to `--max-concurrent`).
- Concurrent scoring requests (defaults to `--max-concurrent`).
- `-N` — Disable interleaved scoring (score all rollouts after generation completes).
- `-i` — Score each rollout individually instead of by group.
- Retries per rollout on transient infrastructure errors.
Output and Display
- `-v` — Enable debug logging.
- `-u` — Use alternate screen mode (TUI) for live display.
- `-d` — Disable Rich display; use normal logging and tqdm progress.
- `-s` — Save results to disk in `./outputs/evals/` or `./environments/*/outputs/evals/`.
- `-C` — Extra state columns to save (comma-separated). Example: `-C "judge_response,parsed_answer"`
- `-R` — Resume from a previous run. Optionally provide a path; if omitted, auto-detects the latest incomplete run.
- `-H` — Push results to Hugging Face Hub.
- `-D` — Dataset name for HF Hub upload.
- Heartbeat URL for uptime monitoring.
- Do not start environment servers for OpenEnv environments.
Examples
Basic Evaluation
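A minimal single-environment run, combining the `-m` (model) and `-n` (number of examples) flags documented above; the model name is illustrative:

```bash
prime eval run gsm8k -m gpt-4.1-mini -n 20
```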
With Custom Sampling
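A sketch using the sampling flags above (`-t` max tokens, `-T` temperature, `-S` extra sampling args); values are illustrative:

```bash
prime eval run gsm8k -m gpt-4.1-mini -t 1024 -T 0.7 -S '{"top_p": 0.9}'
```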
With Environment Arguments
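Passing JSON arguments to `load_environment()` via `-a`, as documented above; the environment and argument values are illustrative:

```bash
prime eval run primeintellect/math-python -m gpt-4.1-mini -a '{"difficulty": "hard"}'
```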
Save and Resume
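A sketch using `-s` to persist results and `-R` to resume (see Resuming Evaluations); the flag combination is illustrative:

```bash
# First run: save results to disk
prime eval run gsm8k -m gpt-4.1-mini -n 500 -s

# If interrupted, resume the latest incomplete run
prime eval run gsm8k -m gpt-4.1-mini -n 500 -s -R
```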
Using Anthropic API
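Combining the `-p` provider shorthand and `-k` API-key variable documented above; the model name is illustrative:

```bash
prime eval run gsm8k -m claude-sonnet-4-20250514 -p anthropic -k ANTHROPIC_API_KEY
```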
Multi-Environment Benchmark
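Pointing the positional argument at a TOML config instead of an environment ID, as described under Arguments:

```bash
prime eval run configs/eval/benchmark.toml
```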
High Concurrency
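Raising `-c` (maximum concurrent requests) for a large run; values are illustrative:

```bash
prime eval run gsm8k -m gpt-4.1-mini -n 1000 -c 64
```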
Debug Mode
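Combining `-v` (debug logging) with `-d` (plain logging and tqdm instead of the Rich display):

```bash
prime eval run gsm8k -m gpt-4.1-mini -n 5 -v -d
```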
Configuration Files
Endpoint Registry
Define model endpoints in `configs/endpoints.toml`:
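The registry's exact schema is defined by the tool; the sketch below is an assumption built from the flags documented above (an endpoint alias mapping to a model name, base URL, API-key environment variable, and client type — all key names illustrative):

```toml
# Key names are illustrative, not the tool's actual schema
[my-endpoint]
model = "gpt-4.1-mini"
api_base_url = "https://api.example.com/v1"
api_key_var = "MY_API_KEY"
client_type = "openai_chat_completions"
```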
Multi-Environment Config
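Per the Configuration Precedence section, global settings sit at the top of the file and each environment gets its own `[[eval]]` section. Key names here are illustrative, apart from `env_id`, `num_examples`, and `rollouts_per_example`, which appear elsewhere on this page:

```toml
# Global settings apply to every eval below
model = "gpt-4.1-mini"

[[eval]]
env_id = "gsm8k"
num_examples = 100

[[eval]]
env_id = "primeintellect/math-python"
num_examples = 50
rollouts_per_example = 4
```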
Results Output
With `--save-results`, outputs are saved to:
`results.jsonl` Format
Each line contains one rollout.

`metadata.json` Format
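An illustrative sketch only (the real field set is written by the tool and may differ); the keys shown are those the Resuming Evaluations section says are checked on resume:

```json
{
  "env_id": "gsm8k",
  "model": "gpt-4.1-mini",
  "num_examples": 100,
  "rollouts_per_example": 1
}
```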
Configuration Precedence
CLI Mode
1. CLI flags
2. Environment defaults (from `pyproject.toml`)
3. Built-in defaults
TOML Config Mode
1. Per-eval settings (`[[eval]]` sections)
2. Global settings (top of config file)
3. Environment defaults (from `pyproject.toml`)
4. Built-in defaults
Environment Defaults
Environments can specify defaults in `pyproject.toml`:
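A hypothetical sketch of the idea only; the actual table name and supported keys are defined by the tool:

```toml
# Table name illustrative -- consult the tool's docs for the real one
[tool.prime.eval]
num_examples = 100
max_tokens = 1024
temperature = 0.7
```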
Resuming Evaluations
Long evaluations can be resumed:
- Same `env_id`, `model`, and `rollouts_per_example`
- `num_examples` must be >= the original target
- Results directory must contain valid `results.jsonl` and `metadata.json`
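Given the conditions above, a resume invocation reuses the `-R` flag documented earlier, with or without an explicit path (flag values illustrative):

```bash
# Auto-detect the latest incomplete run
prime eval run gsm8k -m gpt-4.1-mini -n 500 -s -R

# Or point at a specific results directory
prime eval run gsm8k -m gpt-4.1-mini -n 500 -s -R outputs/evals/<previous-run>
```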