
Overview

The prime eval tui command launches an interactive terminal user interface for browsing evaluation results. It automatically discovers and displays all evaluation runs from your workspace.

Usage

prime eval tui [OPTIONS]

Options

None. The TUI auto-discovers results from the standard locations listed under Discovery.

What It Shows

The TUI provides a hierarchical browser:
  1. Environment selection - All environments with completed evaluations
  2. Model selection - All models evaluated for that environment
  3. Run selection - All evaluation runs for that environment + model combo
  4. Rollout viewer - Individual prompts, completions, and metrics

Discovery

Results are discovered from:
  • ./outputs/evals/ - Global output directory
  • ./environments/*/outputs/evals/ - Per-environment output directories
Each run must have both:
  • results.jsonl - Rollout data
  • metadata.json - Evaluation metadata
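The discovery rule above can be approximated from the shell. This is a sketch of the scan, not the TUI's actual scanner: a directory counts as a run only when metadata.json sits next to a results.jsonl.

```shell
# Build the expected layout in a scratch workspace:
workspace=$(mktemp -d)
run="$workspace/outputs/evals/gsm8k--openai--gpt-4.1-mini/abc123"
mkdir -p "$run"
touch "$run/results.jsonl" "$run/metadata.json"

# A decoy missing metadata.json, which discovery must skip:
decoy="$workspace/outputs/evals/gsm8k--openai--gpt-4.1-mini/zzz999"
mkdir -p "$decoy"
touch "$decoy/results.jsonl"

# List every directory holding both required files:
runs=$(find "$workspace/outputs/evals" -name metadata.json |
       while read -r meta; do
         d=$(dirname "$meta")
         [ -f "$d/results.jsonl" ] && echo "$d"
       done)
echo "$runs"
```

Only the abc123 directory is printed; zzz999 lacks metadata.json and is ignored, matching the "each run must have both" rule.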

Environment Selection Screen

┌─ Select Environment ─────────────────────────────────┐
│ gsm8k - Models: 2, Runs: 5                          │
│ math-python - Models: 1, Runs: 3                    │
│ alphabet-sort - Models: 2, Runs: 4                  │
└──────────────────────────────────────────────────────┘

q: Quit  Enter: Select
Keys:
  • ↑/↓ or j/k - Navigate
  • Enter - Select environment
  • q - Quit

Model Selection Screen

┌─ Environment: gsm8k ─────────────────────────────────┐
│ Select Model                                         │
│ openai/gpt-4.1-mini - Runs: 3                       │
│ anthropic/claude-sonnet-4 - Runs: 2                 │
└──────────────────────────────────────────────────────┘

q: Quit  b: Back  Enter: Select
Keys:
  • ↑/↓ - Navigate
  • Enter - Select model
  • b or Backspace - Go back
  • q - Quit

Run Selection Screen

┌─ Environment: gsm8k ─────────────────────────────────┐
│ Model: openai/gpt-4.1-mini                          │
│ Select Run                                           │
│ abc123 - 2026-03-03 14:23 | Reward: 0.867           │
│ def456 - 2026-03-02 10:15 | Reward: 0.823           │
│ ghi789 - 2026-03-01 16:47 | Reward: 0.891           │
└──────────────────────────────────────────────────────┘

┌─ Run Details ────────────────────────────────────────┐
│ Run ID: abc123                                       │
│ Environment: gsm8k                                   │
│ Model: openai/gpt-4.1-mini                          │
│ Avg reward: 0.867                                    │
│ Runtime: 2m 34.5s                                    │
└──────────────────────────────────────────────────────┘

q: Quit  b: Back  Enter: Select
Keys:
  • ↑/↓ - Navigate runs
  • Enter - View rollout details
  • b or Backspace - Go back
  • q - Quit

Rollout Viewer

The main screen shows the prompt and completion side by side, with run metadata above and metrics below:
┌─ Metadata ───────────────────────────────────────────┐
│ Environment: gsm8k      Record: 1/30                 │
│ Model: openai/gpt-4.1-mini                          │
│ Run ID: abc123          Examples: 10                 │
│ Date: 2026-03-03 14:23  Rollouts/ex: 3              │
└──────────────────────────────────────────────────────┘

┌─ Prompt ─────────────┬─ Completion ──────────────────┐
│ user: What is 2+2?   │ assistant: The answer is 4.   │
│                      │                                │
│                      │ tool call: calculate          │
│                      │ {"expression": "2+2"}         │
│                      │                                │
│                      │ tool result: 4                │
│                      │                                │
│                      │ assistant: The result is 4.   │
└──────────────────────┴────────────────────────────────┘

┌─ Details ────────────────────────────────────────────┐
│ Reward: 1.000                                        │
│ Answer: 4                                            │
│ Info: {"calculation": "2+2=4"}                      │
└──────────────────────────────────────────────────────┘

q: Quit  b: Back  ←/→: Prev/Next  s: Search  c: Copy
Keys:
  • ←/→ or h/l - Navigate between rollouts
  • s - Search prompt/completion text
  • c - Enter copy mode
  • b or Backspace - Go back to run list
  • q - Quit
  • d - Toggle dark/light theme

Search Mode

Press s to search within prompts and completions:
┌─ Search (regex, case-insensitive) ───────────────────┐
│ [calculate               ]                           │
└──────────────────────────────────────────────────────┘

┌─ Prompt results (0) ────┬─ Completion results (2) ───┐
│                          │   245 | tool call: calculate│
│                          │   312 | assistant: The calcu│
└──────────────────────────┴────────────────────────────┘

Esc: Close  Enter: Select  ←/→: Switch column
Keys:
  • Type to search (regex supported)
  • ↑/↓ - Navigate results
  • ←/→ - Switch between prompt and completion results
  • Enter - Jump to selected match
  • Esc - Close search
Matches are highlighted for 3 seconds after selection.

Copy Mode

Press c to enter copy mode:
┌─ Copy Mode ──────────────────────────────────────────┐
│ Tab: switch columns                                  │
│ Highlight text with mouse drag or Shift+Arrow       │
│ Esc: close                                           │
└──────────────────────────────────────────────────────┘

┌─ Prompt ─────────────┬─ Completion ──────────────────┐
│ user: What is 2+2?   │ assistant: The answer is 4.   │
│                      │                                │
│ [Text is selectable] │ [Text is selectable]          │
└──────────────────────┴────────────────────────────────┘

q: Quit  Tab: Next column  c: Copy  Esc: Close
Keys:
  • Tab - Switch between prompt and completion
  • Mouse drag or Shift+Arrow - Select text
  • c - Copy selected text to clipboard
  • Esc - Exit copy mode
  • q - Quit

Display Features

Message Formatting

Messages are formatted with role-based styling:
  • user messages - Standard text
  • assistant messages - assistant: prefix in bold
  • tool calls - tool call: prefix with function name and arguments
  • tool results - tool result: prefix in dimmed style
  • errors - Red text with error: prefix

Metrics Display

The details panel shows:
  • Reward - Scalar reward from rubric (formatted to 3 decimals)
  • Answer - Ground truth answer from task
  • Info - Additional environment-specific data (formatted as JSON)
  • Task - Full task data if available
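Because results.jsonl stores one rollout per line, the same figures can be pulled out with standard tools. The sketch below computes the 3-decimal average reward shown in the run details; the `reward` field name is assumed from the display, and for real files `jq` is more robust than this sed extraction.

```shell
# A tiny results.jsonl stand-in, one JSON record per line:
tmp=$(mktemp)
printf '%s\n' \
  '{"reward": 1.0, "answer": "4"}' \
  '{"reward": 0.5, "answer": "7"}' > "$tmp"

# Extract each reward and average to 3 decimals, like the "Avg reward" line:
avg=$(sed -n 's/.*"reward": \([0-9.]*\).*/\1/p' "$tmp" |
      awk '{ s += $1; n++ } END { printf "%.3f", s / n }')
echo "$avg"   # 0.750
```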

Lazy Loading

Results are loaded lazily for performance:
  • File handles opened on-demand
  • Lines read as needed
  • Metadata count used when available
  • Caching for already-read records
This allows the TUI to handle evaluations with thousands of rollouts efficiently.
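The same idea can be seen from the shell: a single record can be fetched from a large JSONL file without reading the rest. This is an illustrative sketch, not the TUI's actual loader.

```shell
# Simulate a large evaluation: 10,000 one-line JSON records.
tmp=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 10000; i++) printf "{\"id\": %d}\n", i }' > "$tmp"

# Fetch only record 42; `q` makes sed stop after that line,
# so the remaining 9,958 lines are never read.
record=$(sed -n '42{p;q;}' "$tmp")
echo "$record"   # {"id": 42}
```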

Themes

Toggle between dark and light themes with d:
  • black-warm (default) - Dark theme with warm accent colors
  • white-warm - Light theme with matching warm tones

Examples

Basic Usage

# Run an evaluation
prime eval run gsm8k -m gpt-4.1-mini -n 10 -s

# Launch TUI
prime eval tui

View Specific Results

The TUI automatically finds all results, so just launch it:
prime eval tui
Navigate to your environment → model → run.

Search for Patterns

  1. Launch TUI and navigate to a run
  2. Press s to open search
  3. Type a regex pattern (e.g., error|failed)
  4. Navigate results with arrow keys
  5. Press Enter to jump to a match

Copy Completions

  1. Navigate to a rollout
  2. Press c to enter copy mode
  3. Tab to completion column
  4. Select text with mouse or Shift+Arrow
  5. Press c to copy to clipboard

File Locations

Results are saved by prime eval run --save-results to:
./outputs/evals/
└── gsm8k--openai--gpt-4.1-mini/
    └── abc123/
        ├── results.jsonl
        └── metadata.json
Or per-environment:
./environments/gsm8k/outputs/evals/
└── gsm8k--openai--gpt-4.1-mini/
    └── abc123/
        ├── results.jsonl
        └── metadata.json
The TUI scans both locations.

Performance

The TUI is optimized for large evaluations:
  • Lazy file reading - Only loads visible data
  • Incremental parsing - Reads JSONL line-by-line
  • Metadata caching - Avoids re-parsing metadata files
  • Efficient rendering - Textual’s virtual DOM
Evaluations with 10,000+ rollouts are handled smoothly.

Troubleshooting

No Evaluations Found

┌─ Select Environment ─────────────────────────────────┐
│ No completed evals found                            │
└──────────────────────────────────────────────────────┘
Solution: Run an evaluation with --save-results:
prime eval run gsm8k -m gpt-4.1-mini -n 5 -s
prime eval tui

Corrupted Results

If results.jsonl is malformed, the affected rollout is shown as {}. Solution: inspect the file manually:
jq . outputs/evals/gsm8k--openai--gpt-4.1-mini/abc123/results.jsonl
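To narrow down which line is broken, a crude per-line check works even without jq (a real parse, e.g. `jq empty`, is more reliable; the file below is a synthetic example):

```shell
# Three records, the second one truncated mid-object:
tmp=$(mktemp)
printf '%s\n' '{"reward": 1.0}' '{"reward": 0.5' '{"reward": 0.9}' > "$tmp"

# Crude syntax check: flag lines that don't look like closed JSON objects.
bad=$(grep -vn '^{.*}$' "$tmp" | cut -d: -f1)
echo "bad line(s): $bad"   # bad line(s): 2
```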

Terminal Size

If the TUI appears cramped, resize your terminal:
resize -s 40 160  # 40 rows, 160 columns
prime eval tui
Recommended minimum: 24 rows × 80 columns

Search Not Working

Search uses regex with case-insensitive matching. Test your pattern:
echo "test string" | grep -iE 'pattern'
If the pattern is invalid, an error appears below the search box.

Keyboard Reference

Global

  • q - Quit application
  • d - Toggle dark/light theme
  • ↑/↓ or j/k - Move selection
  • Enter - Select item
  • b or Backspace - Go back one screen

Rollout Viewer

  • ←/→ or h/l - Previous/next rollout
  • s - Open search
  • c - Enter copy mode
  • b or Backspace - Return to run list

Search Mode

  • Type - Enter search pattern
  • ↑/↓ - Navigate results
  • ←/→ - Switch prompt/completion
  • Enter - Jump to selected match
  • Esc - Close search

Copy Mode

  • Tab or Shift+Tab - Switch column
  • Mouse drag - Select text
  • Shift+Arrow - Select text (keyboard)
  • c - Copy to clipboard
  • Esc - Exit copy mode

Tips

  • Use search (s) to quickly find errors or specific patterns
  • Copy mode (c) allows extracting full completions for analysis
  • Results persist across runs - view historical evaluations anytime
  • The TUI works great with tmux/screen for remote evaluation monitoring
  • Use --state-columns when running evals to save additional fields visible in the TUI
