AutoGenBench (agbench) is a tool for repeatedly running a set of pre-defined AutoGen tasks in a setting with tightly-controlled initial conditions. With each run, AutoGenBench starts from a blank slate, requiring agents to work out what code needs to be written and what libraries or dependencies to install to solve tasks.
AutoGenBench works with all AutoGen 0.1.* and 0.2.* versions.

Key Features

Reproducible Testing

Run agents in fresh Docker containers for consistent, isolated testing

Comprehensive Logging

Detailed logs of agent behavior, code execution, and task results

Built-in Benchmarks

Pre-configured benchmarks like HumanEval, GAIA, and AssistantBench

Metrics Analysis

Built-in tools to tabulate and analyze benchmark results

Requirements

AutoGenBench requires Docker (Desktop or Engine) and will not run in GitHub Codespaces unless you opt for native execution (strongly discouraged).

Docker Installation

Install Docker Desktop from https://www.docker.com/products/docker-desktop/

WSL Setup (Windows)

If you’re working in WSL:
1. Install Docker Desktop

Download and install Docker Desktop. A restart is required after installation.
2. Enable WSL Integration

Open Docker Desktop → Settings → Resources → WSL Integration, and enable integration with your Ubuntu distribution.
3. Clone and configure AutoGen

git clone git@github.com:microsoft/autogen.git
export AUTOGEN_REPO_BASE=<path_to_autogen>
The AUTOGEN_REPO_BASE environment variable enables Docker containers to use the correct version of agents.

Installation

Install AutoGenBench from the source repository:
pip install -e autogen/python/packages/agbench

API Key Configuration

AutoGenBench requires API keys for LLM access. The most convenient option for running multiple benchmarks is to export your OAI_CONFIG_LIST as an environment variable:
export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
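The OAI_CONFIG_LIST itself is a JSON list of model configurations, in AutoGen's standard format. A minimal example (values are placeholders):

```json
[
    {
        "model": "gpt-4",
        "api_key": "YOUR_API_KEY"
    }
]
```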

Additional API Keys

Some benchmark scenarios require additional keys (e.g., Bing Search API). Add them to an ENV.json file in your working folder:
ENV.json
{
    "BING_API_KEY": "xxxyyyzzz"
}

Quick Start

Here’s a typical workflow for running the HumanEval benchmark:
1. Navigate to the benchmark directory

cd autogen/python/packages/agbench/benchmarks/HumanEval
2. Create an ENV.json configuration

For Azure OpenAI:
ENV.json
{
    "CHAT_COMPLETION_KWARGS_JSON": "{}",
    "CHAT_COMPLETION_PROVIDER": "azure"
}
For OpenAI:
ENV.json
{
  "CHAT_COMPLETION_PROVIDER": "openai",
  "CHAT_COMPLETION_KWARGS_JSON": "{\"api_key\": \"YOUR_API_KEY\", \"model\": \"gpt-4\"}"
}
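Note that the value of CHAT_COMPLETION_KWARGS_JSON is itself a JSON string, so its inner quotes must be escaped. If writing the escapes by hand is error-prone, one option is to generate ENV.json programmatically; a minimal Python sketch (keys and values as in the example above):

```python
import json

# Inner kwargs are a normal dict; json.dumps produces the escaped JSON string.
kwargs = {"api_key": "YOUR_API_KEY", "model": "gpt-4"}

env = {
    "CHAT_COMPLETION_PROVIDER": "openai",
    # Serialize the kwargs dict into the nested JSON string agbench expects.
    "CHAT_COMPLETION_KWARGS_JSON": json.dumps(kwargs),
}

# Write ENV.json in the current working folder.
with open("ENV.json", "w") as f:
    json.dump(env, f, indent=2)
```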
3. Initialize tasks

python Scripts/init_tasks.py
This will download HumanEval and create task files in the Tasks/ directory.
4. Run the benchmark

agbench run Tasks/human_eval_MagenticOne.jsonl
You’ll see raw logs showing the agents in action.
5. Tabulate results

In a new terminal, view the summary:
agbench tabulate Results/human_eval_MagenticOne

Command Reference

agbench run

Run benchmark scenarios with controlled initial conditions.
agbench run [OPTIONS] scenario
scenario
string
required
The JSONL scenario file to run. If a directory is specified, all JSONL scenarios in the directory are run.
--config, -c
string
default:"OAI_CONFIG_LIST"
Environment variable name or path to the OAI_CONFIG_LIST
--repeat, -r
number
default:"1"
Number of repetitions to run for each scenario
--subsample, -s
string
default:"1.0"
Run on a subsample of tasks:
  • Decimal (e.g., 0.7): Run on 70% of tasks
  • Integer (e.g., 7): Run exactly 7 tasks from each file
--model, -m
string
Filter config_list to include only models matching the provided name
--docker-image, -d
string
default:"agbench:default"
Docker image to use when running scenarios. Cannot be used with --native.
--native
boolean
Run scenarios natively rather than in Docker
This is not advisable and should be done with great caution.
--requirements
string
Requirements file to pip install before running the scenario

Examples

# Run all tasks in a file
agbench run Tasks/human_eval_MagenticOne.jsonl

# Run each task 10 times
agbench run --repeat 10 Tasks/human_eval_MagenticOne.jsonl

# Run on 70% of tasks
agbench run --subsample 0.7 Tasks/human_eval_MagenticOne.jsonl

# Run only 5 random tasks
agbench run --subsample 5 Tasks/human_eval_MagenticOne.jsonl

# Use specific model
agbench run --model gpt-4 Tasks/human_eval_MagenticOne.jsonl

# Use custom Docker image
agbench run --docker-image my-custom-image:latest Tasks/human_eval_MagenticOne.jsonl
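The two --subsample interpretations above (a decimal selects a fraction of tasks, an integer selects an exact count) can be sketched as follows. This is an illustrative re-implementation of the documented behavior, not agbench's actual code:

```python
import random

def subsample_tasks(tasks, subsample):
    """Interpret subsample the way --subsample is documented: a value
    containing a decimal point selects that fraction of tasks; a plain
    integer selects exactly that many tasks."""
    if "." in subsample:            # e.g. "0.7" -> 70% of the tasks
        k = int(len(tasks) * float(subsample))
    else:                           # e.g. "5" -> exactly 5 tasks
        k = int(subsample)
    return random.sample(tasks, k)

tasks = [f"task_{i}" for i in range(10)]
print(len(subsample_tasks(tasks, "0.7")))  # 7 of 10 tasks
print(len(subsample_tasks(tasks, "5")))    # exactly 5 tasks
```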

agbench tabulate

Tabulate and analyze benchmark results.
agbench tabulate results_directory

Example

# View summary of results
agbench tabulate Results/human_eval_MagenticOne

agbench remove_missing

Remove missing or incomplete results from the results directory.
agbench remove_missing results_directory

Results Structure

AutoGenBench stores results in a hierarchical folder structure:
./results/[scenario]/[task_id]/[instance_id]

Example Structure

./results/default_two_agents/two_agent_stocks/0
./results/default_two_agents/two_agent_stocks/1
...
./results/default_two_agents/two_agent_stocks/9
  • scenario: The benchmark scenario being run
  • task_id: Maps to a specific prompt or set of parameters
  • instance_id: A specific attempt or run (0-9 for 10 repetitions)
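Given this layout, a small script can report how many run instances exist per task, which is handy for spotting incomplete repetitions before tabulating. A minimal sketch (the scenario path is the hypothetical one from the example above):

```python
import os
from collections import defaultdict

def count_instances(scenario_dir):
    """Count instance directories per task under scenario_dir/[task_id]/[instance_id]."""
    counts = defaultdict(int)
    for task_id in sorted(os.listdir(scenario_dir)):
        task_path = os.path.join(scenario_dir, task_id)
        if not os.path.isdir(task_path):
            continue  # skip stray files at the scenario level
        counts[task_id] = sum(
            os.path.isdir(os.path.join(task_path, inst))
            for inst in os.listdir(task_path)
        )
    return dict(counts)

# Example usage:
# print(count_instances("./results/default_two_agents"))
```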

Result Files

Each result directory contains:
  • A timestamp record of the date and time of the run, along with the version of the autogen-agentchat library installed.
  • Console logs: all console output produced by Docker when running AutoGen. Read this like you would a regular console.
  • Per-agent message logs: for each agent, a log of their message dictionaries showing the conversation flow.
  • A working directory containing all code written by AutoGen and all artifacts produced by that code.

Built-in Benchmarks

HumanEval

Code generation benchmark with programming problems

GAIA

General AI assistants benchmark for complex reasoning tasks

AssistantBench

Assistant capabilities evaluation across various domains
Each benchmark has its own README in the benchmarks/ directory with specific instructions and requirements.

Creating Custom Benchmarks

To define your own tasks or benchmarks, review the contributor’s guide for complete technical details on:
  • Task definition format (JSONL)
  • Scenario templates
  • Custom evaluation metrics
  • Benchmark contribution guidelines
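Whatever schema your tasks use, each line of a JSONL file must be a standalone JSON object, so a quick sanity check before running can save a failed benchmark. A minimal sketch (the file path is illustrative):

```python
import json

def validate_jsonl(path):
    """Parse every line of a JSONL file, raising on the first malformed one."""
    tasks = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # allow blank lines
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as e:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({e})")
    return tasks

# Example usage:
# tasks = validate_jsonl("Tasks/my_benchmark.jsonl")
# print(f"{len(tasks)} tasks OK")
```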

Best Practices

Use Docker

Always run benchmarks in Docker for consistency and safety

Multiple Runs

Use --repeat to run multiple iterations for statistical significance

Subsample First

Test with --subsample on a small set before running full benchmarks

Monitor Logs

Review console logs to understand agent behavior and failures

Troubleshooting

Docker Not Running

Error: Cannot connect to Docker daemon
Solution: Ensure Docker Desktop is running and accessible.

Missing AUTOGEN_REPO_BASE

Error: AUTOGEN_REPO_BASE environment variable not set
Solution: Export the path to your AutoGen repository:
export AUTOGEN_REPO_BASE=/path/to/autogen

API Key Issues

Error: No API key found
Solution: Set OAI_CONFIG_LIST as an environment variable or file, or set OPENAI_API_KEY.

Get Help

For detailed help on any command:
agbench --help
agbench run --help
agbench tabulate --help
agbench remove_missing --help

Resources

GitHub Repository

View source code and contribute

Contributing Guide

Learn how to create custom benchmarks
