Overview

The sam eval command runs evaluation test suites to measure and validate the performance of your AI agents. This is useful for regression testing, quality assurance, and continuous improvement of agent responses.

Syntax

sam eval <PATH> [OPTIONS]

Arguments

PATH
string
required
Path to the evaluation test suite configuration file (YAML format).
Example: path/to/evaluation_suite.yaml

Options

-v, --verbose
flag
Enable verbose output to see detailed evaluation progress and results.
-h, --help
flag
Show help message and exit.

Description

The evaluation command:
  1. Loads the test suite configuration from the specified YAML file
  2. Runs each test case against your agent mesh
  3. Compares actual responses against expected outputs
  4. Generates evaluation metrics and reports
The command automatically sets the logging configuration path to configs/logging_config.yaml if it exists in your project root.
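If that file is present, it is used to configure logging. A minimal example, assuming the standard Python `logging.config.dictConfig` YAML schema (the schema the framework actually expects may differ):

```yaml
# configs/logging_config.yaml — illustrative sketch only
version: 1
disable_existing_loggers: false
formatters:
  default:
    format: "%(asctime)s %(levelname)s %(name)s: %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    formatter: default
root:
  level: INFO
  handlers: [console]
```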

Test Suite Configuration

An evaluation test suite YAML file defines test cases with expected inputs and outputs. Example structure:
test_suite.yaml
name: "Agent Evaluation Suite"
description: "Test suite for validating agent responses"

test_cases:
  - name: "Weather query test"
    input: "What's the weather in San Francisco?"
    expected_keywords:
      - "temperature"
      - "San Francisco"
    
  - name: "Data analysis test"
    input: "Analyze the sales data in quarterly_report.csv"
    expected_keywords:
      - "sales"
      - "analysis"
      - "quarterly"
The exact schema for test suite configuration files is defined in the evaluation module. Refer to evaluation/run.py for the complete specification.
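As an illustration of what the `expected_keywords` check implies, a case-insensitive keyword match could be sketched as follows. This is a hypothetical sketch, not the framework's actual implementation; the real logic lives in evaluation/run.py.

```python
# Hypothetical keyword check mirroring the expected_keywords field above.
# The real evaluation logic in evaluation/run.py may score differently.

def check_keywords(response: str, expected_keywords: list[str]) -> bool:
    """Pass only if every expected keyword appears in the response,
    ignoring case."""
    lowered = response.lower()
    return all(kw.lower() in lowered for kw in expected_keywords)

result = check_keywords(
    "The temperature in San Francisco is 18°C.",
    ["temperature", "San Francisco"],
)
print(result)  # True
```

A test case passes this check only when all of its keywords are present, so missing even one keyword fails the case.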

Examples

Run basic evaluation

sam eval tests/agent_eval.yaml
Output:
Starting evaluation with test_suite_config: /path/to/tests/agent_eval.yaml
Running test case 1/5: Weather query test
✓ Passed
Running test case 2/5: Data analysis test
✓ Passed
...
Evaluation completed successfully.

Run with verbose output

sam eval tests/agent_eval.yaml --verbose
This shows detailed information about:
  • Each test case execution
  • Agent responses
  • Scoring metrics
  • Performance timing

Organize evaluation suites

# Functional tests
sam eval tests/functional/core_features.yaml

# Regression tests
sam eval tests/regression/v2_compatibility.yaml

# Performance tests
sam eval tests/performance/response_time.yaml

Evaluation Metrics

The evaluation framework can measure:
  • Response accuracy: Keyword matching, semantic similarity
  • Response time: Latency and throughput
  • Tool usage: Correct tool selection and execution
  • Error handling: Graceful degradation
  • Consistency: Similar inputs producing similar outputs
Specific metrics available depend on your test suite configuration and the evaluation framework setup.
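To make the aggregation concrete, per-case pass/fail and latency results could be rolled up into summary metrics like this. The class and field names here are illustrative assumptions, not the framework's API:

```python
# Hypothetical aggregation of per-test results into summary metrics.
# TestResult and its fields are illustrative, not part of the sam CLI.

from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool
    latency_ms: float

def summarize(results: list[TestResult]) -> dict:
    """Compute pass-rate accuracy and mean latency across test cases."""
    total = len(results)
    if total == 0:
        return {"accuracy": 0.0, "avg_latency_ms": 0.0}
    return {
        "accuracy": sum(r.passed for r in results) / total,
        "avg_latency_ms": sum(r.latency_ms for r in results) / total,
    }

summary = summarize([
    TestResult("Weather query test", True, 420.0),
    TestResult("Data analysis test", False, 910.0),
])
print(summary)  # {'accuracy': 0.5, 'avg_latency_ms': 665.0}
```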

Implementation

The command delegates to evaluation.run.main() which orchestrates:
  1. Test suite loading and validation
  2. Agent mesh interaction
  3. Response evaluation
  4. Results aggregation and reporting
Source: cli/commands/eval_cmd.py, evaluation/run.py

Troubleshooting

Error: File path does not exist
Solution: Verify the path to your test suite YAML file is correct. Use absolute paths or paths relative to your current working directory.
Error: An error occurred during evaluation: <error message>
Solution: Check that:
  • Your agent mesh is properly configured
  • Required services (broker, LLM) are accessible
  • The test suite YAML is valid
Run again with the --verbose flag for detailed error information.
Error: Cannot import from evaluation.run
Solution: Ensure the evaluation framework and all of its dependencies are installed in your environment.

Best Practices

Version control

Keep test suites in version control alongside your agent configurations to track changes over time.

CI/CD integration

Run evaluations in your CI/CD pipeline to catch regressions before deployment.
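For example, a CI job could run a regression suite on every pull request. The workflow below is a hypothetical GitHub Actions sketch; the install step (`pip install .`) and suite path are assumptions you should adapt to your project:

```yaml
# Illustrative GitHub Actions workflow — adjust install and paths.
name: agent-evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install .
      - run: sam eval tests/regression/v2_compatibility.yaml
```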

Incremental testing

Start with basic test cases and gradually add more complex scenarios as your agents evolve.

Baseline metrics

Establish baseline performance metrics for your agents and monitor for degradation.
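One way to act on a baseline is to fail a run when a key metric drops more than a tolerance below it. This is a hypothetical sketch; the metric keys and tolerance are illustrative, not part of the sam CLI:

```python
# Hypothetical baseline comparison: flag a regression when accuracy
# falls more than `tolerance` below the stored baseline value.

def regressed(current: dict, baseline: dict, tolerance: float = 0.05) -> bool:
    """True if current accuracy dropped more than `tolerance` below baseline."""
    return current["accuracy"] < baseline["accuracy"] - tolerance

baseline = {"accuracy": 0.90}
current = {"accuracy": 0.82}
print(regressed(current, baseline))  # True (0.82 < 0.85)
```

In a CI context, returning a non-zero exit code when this check fires would block the deployment.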
