The `docker agent eval` command runs your agent against a set of saved sessions and scores the results. Use it to catch regressions, validate behavior, and measure agent quality.
Running evaluations
Flags
| Flag | Default | Description |
|---|---|---|
| `-c, --concurrency <n>` | Number of CPUs | Number of concurrent evaluation runs |
| `--judge-model <ref>` | `anthropic/claude-opus-4-5-20251101` | Model used for relevance checking |
| `--output <dir>` | `<eval-dir>/results` | Directory for results and logs |
| `--only <pattern>` | (all) | Only run evals whose filenames match this pattern (repeatable) |
| `--base-image <image>` | (built-in) | Custom base Docker image for running evaluations |
| `--keep-containers` | `false` | Keep containers after evaluation (skip `--rm`) |
| `-e, --env <KEY[=VALUE]>` | — | Environment variables to pass to containers (repeatable) |
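Combining these flags, a typical invocation might look like the sketch below. The positional `./evals` directory argument, the `login` filename pattern, and the `OPENAI_API_KEY` variable name are illustrative assumptions, not documented syntax:

```shell
# Run only evals whose filenames match "login", four at a time,
# passing an API key through to the evaluation containers.
docker agent eval \
  --only login \
  --concurrency 4 \
  -e OPENAI_API_KEY \
  --output ./evals/results \
  ./evals
```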
How evaluations work
Each evaluation is a JSON file containing a recorded session — the conversation history from a previous run, including user messages, tool calls, and the agent's response. When you run `docker agent eval`, Docker Agent:
1. **Builds a Docker image** containing the agent binary. Custom images can be specified with `--base-image`, or per-eval with the `image` field.
2. **Runs the agent in a container.** Executes the agent in an isolated Docker container with the user messages from the eval, using `--exec --yolo --json` mode.
3. **Scores the results.** Compares the new response against the recorded session using tool trajectory scoring, response size, and LLM-as-a-judge relevance checks.
Eval file format
Eval files are session JSON files — the same format produced by `/eval` in the TUI or exported from the API. The simplest way to create one is to record a session in the TUI and use `/eval` to export it.
A session file contains the conversation history. Optionally, add an `evals` block to specify scoring criteria:
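For example, a session file with scoring criteria might look like the sketch below. The shape of the `messages` array is a placeholder (the real conversation history is whatever `/eval` exports); the `evals` fields follow the table in the next section, and the relevance statements and `working_dir` name are illustrative:

```json
{
  "messages": [
    { "role": "user", "content": "Summarize the README" },
    { "role": "assistant", "content": "The README describes..." }
  ],
  "evals": {
    "relevance": [
      "The response mentions the project's purpose",
      "The response is written in plain English"
    ],
    "size": "S",
    "working_dir": "readme-project",
    "setup": "git init"
  }
}
```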
Scoring criteria
The `evals` block in a session file supports these fields:
| Field | Type | Description |
|---|---|---|
| `relevance` | `[]string` | Statements that should be true about the response. Evaluated by the judge model. |
| `size` | `string` | Expected response size: `S`, `M`, `L`, or `XL`. |
| `working_dir` | `string` | Subdirectory under `evals/working_dirs/` to use as the working directory. |
| `setup` | `string` | Shell script to run in the container before the agent runs. |
| `image` | `string` | Custom Docker image for this eval (overrides `--base-image`). |
Tool trajectory scoring
If the recorded session contains tool calls, the eval framework extracts the expected tool call sequence and computes an F1 score against what the agent actually called. A score of `1.0` means the agent used the same tools in the same order.
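To make the metric concrete, here is a minimal sketch of an F1 score over tool-call sequences. Counting matches with a longest common subsequence (so order contributes to the score) is an assumption about how matching works, not the framework's documented algorithm:

```python
def tool_trajectory_f1(expected: list[str], actual: list[str]) -> float:
    """F1 between an expected and an actual tool-call sequence.

    Matches are counted via longest common subsequence, so both the
    set of tools and their relative order affect the score. The real
    eval framework's matching rule may differ.
    """
    if not expected and not actual:
        return 1.0
    if not expected or not actual:
        return 0.0

    # LCS length via dynamic programming.
    m, n = len(expected), len(actual)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if expected[i] == actual[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    matches = dp[m][n]

    precision = matches / len(actual)
    recall = matches / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Identical trajectory scores 1.0; reordered or missing calls score lower.
print(tool_trajectory_f1(["read_file", "edit_file", "run_tests"],
                         ["read_file", "edit_file", "run_tests"]))  # 1.0
```

Under this scheme, calling the right tools in the wrong order still earns partial credit, while extra calls lower precision and missing calls lower recall.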
Relevance scoring
For each string in `relevance`, the judge model evaluates whether the agent's response satisfies that criterion. Results are reported as `N/M` (e.g., `2/3` means 2 out of 3 criteria passed).
Response size
Expected size categories:

| Size | Description |
|---|---|
| `S` | Short (a few sentences) |
| `M` | Medium (a paragraph or two) |
| `L` | Long (several paragraphs) |
| `XL` | Very long (extensive output) |
Setting up evaluations
1. **Record a session.** Run the agent, perform a task, then use `/eval` in the TUI to export the session. Save the JSON file to the `evals/` directory. Alternatively, use `docker agent run --exec --json` to capture output and save it as a session file.
2. **Add scoring criteria (optional).** Edit the session JSON to add an `evals` block with relevance criteria and/or expected size.
Example output
Output files
Results are saved to the output directory (default: `<eval-dir>/results`):
| File | Description |
|---|---|
| `<run-name>.log` | Full debug log for the evaluation run |
| `<run-name>.db` | SQLite database with full session data |
| `<run-name>.json` | JSON summary of all results |
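The JSON summaries lend themselves to scripted post-processing, for example to compare runs over time. A minimal sketch that loads every summary without assuming anything about its schema (the `load_summaries` helper and its return shape are this sketch's own invention):

```python
import json
from pathlib import Path

def load_summaries(results_dir: str) -> dict:
    """Load every <run-name>.json summary in a results directory.

    Returns a mapping of run name to parsed JSON. The summary schema
    is not assumed here; callers inspect the parsed data themselves.
    """
    summaries = {}
    for path in Path(results_dir).glob("*.json"):
        summaries[path.stem] = json.loads(path.read_text())
    return summaries
```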