The docker agent eval command runs your agent against a set of saved sessions and scores the results. Use it to catch regressions, validate behavior, and measure agent quality.

Running evaluations

```shell
# Run all evals in ./evals/ against agent.yaml
docker agent eval agent.yaml

# Specify a custom eval directory
docker agent eval agent.yaml ./my-evals

# Run with more concurrency
docker agent eval agent.yaml -c 8

# Only run evals matching a pattern
docker agent eval agent.yaml --only "auth"

# Keep containers for debugging
docker agent eval agent.yaml --keep-containers
```

Flags

| Flag | Default | Description |
|---|---|---|
| `-c, --concurrency <n>` | Number of CPUs | Number of concurrent evaluation runs |
| `--judge-model <ref>` | `anthropic/claude-opus-4-5-20251101` | Model used for relevance checking |
| `--output <dir>` | `<eval-dir>/results` | Directory for results and logs |
| `--only <pattern>` | (all) | Only run evals whose filenames match this pattern (repeatable) |
| `--base-image <image>` | (built-in) | Custom base Docker image for running evaluations |
| `--keep-containers` | `false` | Keep containers after evaluation (skip `--rm`) |
| `-e, --env <KEY[=VALUE]>` | | Environment variables to pass to containers (repeatable) |

How evaluations work

Each evaluation is a JSON file containing a recorded session — the conversation history from a previous run including user messages, tool calls, and the agent’s response. When you run docker agent eval, Docker Agent:
1. **Loads eval files**: Reads all `.json` files from the eval directory (or only those matching `--only` patterns).

2. **Builds a Docker image**: Builds an image containing the agent binary. Custom images can be specified with `--base-image`, or per-eval with the `image` field.

3. **Runs the agent in a container**: Executes the agent in an isolated Docker container with the user messages from the eval, using `--exec --yolo --json` mode.

4. **Scores the results**: Compares the new response against the recorded session using tool trajectory scoring, response size, and LLM-as-a-judge relevance checks.

5. **Saves results**: Writes results to the output directory: a log file, a SQLite session database, and a JSON summary.

Eval file format

Eval files are session JSON files, the same format produced by `/eval` in the TUI or exported from the API. The simplest way to create one is to record a session in the TUI and export it with `/eval`. A session file contains the conversation history; optionally, add an `evals` block to specify scoring criteria:
```json
{
  "id": "41b179a2-ed19-4ae2-a45d-95775aaa90f7",
  "title": "Count files in local folder",
  "messages": [
    {
      "message": {
        "agentFilename": "./agent.yaml",
        "message": {
          "role": "user",
          "content": "How many files in the local folder?"
        }
      }
    },
    {
      "message": {
        "agentName": "root",
        "message": {
          "role": "assistant",
          "content": "",
          "tool_calls": [{
            "function": { "name": "list_directory", "arguments": "{\"path\":\"./ \"}" }
          }]
        }
      }
    }
  ],
  "evals": {
    "relevance": [
      "The response gives a specific number of files",
      "The response lists the file names"
    ],
    "size": "S"
  }
}
```
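Before running a batch of evals, it can help to check that each file at least has the shape shown above. The sketch below is a minimal, illustrative loader based only on the fields in the example; the actual `docker agent eval` loader may enforce a stricter schema.

```python
import json

def load_eval_session(path):
    """Load a session JSON file and pull out the scoring criteria.

    Illustrative only: field names follow the example session above,
    not a published schema.
    """
    with open(path) as f:
        session = json.load(f)

    if "messages" not in session:
        raise ValueError(f"{path}: missing 'messages' array")

    # The optional evals block carries scoring criteria.
    evals = session.get("evals", {})
    criteria = evals.get("relevance", [])
    size = evals.get("size")  # "S", "M", "L", or "XL" if present
    return session, criteria, size
```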

Scoring criteria

The evals block in a session file supports these fields:
| Field | Type | Description |
|---|---|---|
| `relevance` | `[]string` | Statements that should be true about the response. Evaluated by the judge model. |
| `size` | `string` | Expected response size: `S`, `M`, `L`, or `XL`. |
| `working_dir` | `string` | Subdirectory under `evals/working_dirs/` to use as the working directory. |
| `setup` | `string` | Shell script to run in the container before the agent runs. |
| `image` | `string` | Custom Docker image for this eval (overrides `--base-image`). |

Tool trajectory scoring

If the recorded session contains tool calls, the eval framework extracts the expected tool call sequence and computes an F1 score against what the agent actually called. A score of 1.0 means the agent used the same tools in the same order.
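To make the metric concrete, here is one way an F1 score over tool calls can be computed. This is a sketch, not the framework's actual implementation: it scores the overlap of call names and ignores ordering, whereas the real scorer also accounts for call order.

```python
from collections import Counter

def tool_trajectory_f1(expected, actual):
    """Illustrative F1 over tool-call name lists.

    Precision: fraction of actual calls that were expected.
    Recall: fraction of expected calls that were made.
    """
    if not expected and not actual:
        return 1.0
    # Multiset intersection counts each tool name as many times
    # as it appears in both trajectories.
    overlap = sum((Counter(expected) & Counter(actual)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)
```

For example, an agent that called `list_directory` when `list_directory` was expected scores 1.0; an agent that matched one of two expected calls and added one unexpected call scores 0.5.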

Relevance scoring

For each string in relevance, the judge model evaluates whether the agent’s response satisfies that criterion. Results are reported as N/M (e.g., 2/3 means 2 out of 3 criteria passed).
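The aggregation itself is simple: count the criteria the judge marked as satisfied. A minimal sketch of the reported `N/M` format:

```python
def relevance_score(verdicts):
    """Format per-criterion pass/fail verdicts from the judge model
    as the N/M string shown in eval output."""
    return f"{sum(verdicts)}/{len(verdicts)}"
```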

Response size

Expected size categories:
| Size | Description |
|---|---|
| `S` | Short (a few sentences) |
| `M` | Medium (a paragraph or two) |
| `L` | Long (several paragraphs) |
| `XL` | Very long (extensive output) |
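A size check can be pictured as bucketing the response by length. The word-count thresholds below are illustrative guesses only; the framework's actual cutoffs are not documented here.

```python
def classify_size(text):
    """Bucket a response into S/M/L/XL by word count.

    The cutoffs are assumptions for illustration, not the
    eval framework's real thresholds.
    """
    words = len(text.split())
    if words <= 60:    # a few sentences
        return "S"
    if words <= 200:   # a paragraph or two
        return "M"
    if words <= 600:   # several paragraphs
        return "L"
    return "XL"
```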

Setting up evaluations

1. **Create an agent config**

   Write a standard agent YAML file:

   ```yaml
   # agent.yaml
   agents:
     root:
       model: openai/gpt-4o
       instruction: You know how to read and list files.
       toolsets:
         - type: filesystem
   ```

2. **Create an evals directory**

   ```shell
   mkdir evals
   ```

3. **Record a session**

   Run the agent, perform a task, then use `/eval` in the TUI to export the session. Save the JSON file to the `evals/` directory. Alternatively, use `docker agent run --exec --json` to capture the output and save it as a session file.

4. **Add scoring criteria (optional)**

   Edit the session JSON to add an `evals` block with `relevance` criteria and/or an expected `size`.

5. **Run the evaluation**

   ```shell
   docker agent eval agent.yaml ./evals
   ```

Example output

```text
Evaluation run: golden-wolf-42
Loading evaluation sessions...
Running 2 evaluations with concurrency 8

  ✓ Count files in local folder  (tool calls: 1.0, relevance: 2/2)
  ✓ Check README.md content      (tool calls: 1.0, relevance: 1/1)

Total: 2/2 passed  Cost: $0.005  Duration: 4.2s

Sessions DB: ./evals/results/golden-wolf-42.db
Sessions JSON: ./evals/results/golden-wolf-42.json
Log: ./evals/results/golden-wolf-42.log
```

Output files

Results are saved to the output directory (default: <eval-dir>/results):
| File | Description |
|---|---|
| `<run-name>.log` | Full debug log for the evaluation run |
| `<run-name>.db` | SQLite database with full session data |
| `<run-name>.json` | JSON summary of all results |

Use `--keep-containers` when a test fails unexpectedly. The container stays around so you can inspect files, environment variables, and logs inside.
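The session database can be explored with any SQLite client. Its schema is not documented here, so the sketch below just lists the tables the CLI wrote; from there you can query whichever tables exist.

```python
import sqlite3

def inspect_results_db(path):
    """List the tables in a results database, e.g.
    ./evals/results/golden-wolf-42.db (schema not documented here)."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        con.close()
```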
