AutoGenBench (agbench) is a tool for repeatedly running a set of pre-defined AutoGen tasks in a setting with tightly-controlled initial conditions. With each run, AutoGenBench starts from a blank slate, requiring agents to work out what code needs to be written and what libraries or dependencies to install to solve tasks.
AutoGenBench works with all AutoGen 0.1.* and 0.2.* versions.

Key Features

Reproducible Testing

Run agents in fresh Docker containers for consistent, isolated testing

Comprehensive Logging

Detailed logs of agent behavior, code execution, and task results

Built-in Benchmarks

Pre-configured benchmarks like HumanEval, GAIA, and AssistantBench

Metrics Analysis

Built-in tools to tabulate and analyze benchmark results

Requirements

AutoGenBench requires Docker (Desktop or Engine) and will not run in GitHub Codespaces unless you opt for native execution (strongly discouraged).

Docker Installation

Install Docker Desktop from https://www.docker.com/products/docker-desktop/

WSL Setup (Windows)

If you’re working in WSL:
1. Install Docker Desktop

Download and install Docker Desktop. A restart is required after installation.
2. Enable WSL Integration

Open Docker Desktop → Settings → Resources → WSL Integration, and enable integration with your Ubuntu distribution.
3. Clone and configure AutoGen

git clone git@github.com:microsoft/autogen.git
export AUTOGEN_REPO_BASE=<path_to_autogen>
The AUTOGEN_REPO_BASE environment variable enables Docker containers to use the correct version of agents.

Installation

Install AutoGenBench from the source repository:
pip install -e autogen/python/packages/agbench

API Key Configuration

AutoGenBench requires API keys for LLM access. The most convenient option for running multiple benchmarks is to export your OAI_CONFIG_LIST as an environment variable:
export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
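The OAI_CONFIG_LIST itself is a JSON list of model configurations, in AutoGen's standard format. A minimal example (values are placeholders):

```json
[
    {
        "model": "gpt-4",
        "api_key": "YOUR_API_KEY"
    }
]
```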

Additional API Keys

Some benchmark scenarios require additional keys (e.g., Bing Search API). Add them to an ENV.json file in your working folder:
ENV.json
{
    "BING_API_KEY": "xxxyyyzzz"
}

Quick Start

Here’s a typical workflow for running the HumanEval benchmark:
1. Navigate to the benchmark directory

cd autogen/python/packages/agbench/benchmarks/HumanEval
2. Create an ENV.json configuration

For Azure OpenAI:
ENV.json
{
    "CHAT_COMPLETION_KWARGS_JSON": "{}",
    "CHAT_COMPLETION_PROVIDER": "azure"
}
For OpenAI:
ENV.json
{
  "CHAT_COMPLETION_PROVIDER": "openai",
  "CHAT_COMPLETION_KWARGS_JSON": "{\"api_key\": \"YOUR_API_KEY\", \"model\": \"gpt-4\"}"
}
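Note that the value of CHAT_COMPLETION_KWARGS_JSON is itself a JSON string, so its inner quotes must be escaped. If writing the escapes by hand is error-prone, one option is to generate ENV.json programmatically; a minimal Python sketch (keys and values as in the example above):

```python
import json

# Inner kwargs are a normal dict; json.dumps produces the escaped JSON string.
kwargs = {"api_key": "YOUR_API_KEY", "model": "gpt-4"}

env = {
    "CHAT_COMPLETION_PROVIDER": "openai",
    # Serialize the kwargs dict into the nested JSON string agbench expects.
    "CHAT_COMPLETION_KWARGS_JSON": json.dumps(kwargs),
}

# Write ENV.json in the current working folder.
with open("ENV.json", "w") as f:
    json.dump(env, f, indent=2)
```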
3. Initialize tasks

python Scripts/init_tasks.py
This will download HumanEval and create task files in the Tasks/ directory.
4. Run the benchmark

agbench run Tasks/human_eval_MagenticOne.jsonl
You’ll see raw logs showing the agents in action.
5. Tabulate results

In a new terminal, view the summary:
agbench tabulate Results/human_eval_MagenticOne

Command Reference

agbench run

Run benchmark scenarios with controlled initial conditions.
agbench run [OPTIONS] scenario
scenario
string
required
The JSONL scenario file to run. If a directory is specified, all JSONL scenarios in the directory are run.
--config, -c
string
default:"OAI_CONFIG_LIST"
Environment variable name or path to the OAI_CONFIG_LIST
--repeat, -r
number
default:"1"
Number of repetitions to run for each scenario
--subsample, -s
string
default:"1.0"
Run on a subsample of tasks:
  • Decimal (e.g., 0.7): Run on 70% of tasks
  • Integer (e.g., 7): Run exactly 7 tasks from each file
--model, -m
string
Filter config_list to include only models matching the provided name
--docker-image, -d
string
default:"agbench:default"
Docker image to use when running scenarios. Cannot be used with --native.
--native
boolean
Run scenarios natively rather than in Docker
This is not advisable and should be done with great caution.
--requirements
string
Requirements file to pip install before running the scenario

Examples

# Run all tasks in a file
agbench run Tasks/human_eval_MagenticOne.jsonl

# Run each task 10 times
agbench run --repeat 10 Tasks/human_eval_MagenticOne.jsonl

# Run on 70% of tasks
agbench run --subsample 0.7 Tasks/human_eval_MagenticOne.jsonl

# Run only 5 random tasks
agbench run --subsample 5 Tasks/human_eval_MagenticOne.jsonl

# Use specific model
agbench run --model gpt-4 Tasks/human_eval_MagenticOne.jsonl

# Use custom Docker image
agbench run --docker-image my-custom-image:latest Tasks/human_eval_MagenticOne.jsonl
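The two --subsample interpretations above (a decimal selects a fraction of tasks, an integer selects an exact count) can be sketched as follows. This is an illustrative re-implementation of the documented behavior, not agbench's actual code:

```python
import random

def subsample_tasks(tasks, subsample):
    """Interpret subsample the way --subsample is documented: a value
    containing a decimal point selects that fraction of tasks; a plain
    integer selects exactly that many tasks."""
    if "." in subsample:            # e.g. "0.7" -> 70% of the tasks
        k = int(len(tasks) * float(subsample))
    else:                           # e.g. "5" -> exactly 5 tasks
        k = int(subsample)
    return random.sample(tasks, k)

tasks = [f"task_{i}" for i in range(10)]
print(len(subsample_tasks(tasks, "0.7")))  # 7 of 10 tasks
print(len(subsample_tasks(tasks, "5")))    # exactly 5 tasks
```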

agbench tabulate

Tabulate and analyze benchmark results.
agbench tabulate results_directory

Example

# View summary of results
agbench tabulate Results/human_eval_MagenticOne

agbench remove_missing

Remove missing or incomplete results from the results directory.
agbench remove_missing results_directory

Results Structure

AutoGenBench stores results in a hierarchical folder structure:
./results/[scenario]/[task_id]/[instance_id]

Example Structure

./results/default_two_agents/two_agent_stocks/0
./results/default_two_agents/two_agent_stocks/1
...
./results/default_two_agents/two_agent_stocks/9
  • scenario: The benchmark scenario being run
  • task_id: Maps to a specific prompt or set of parameters
  • instance_id: A specific attempt or run (0-9 for 10 repetitions)
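Given this layout, a small script can report how many run instances exist per task, which is handy for spotting incomplete repetitions before tabulating. A minimal sketch (the scenario path is the hypothetical one from the example above):

```python
import os
from collections import defaultdict

def count_instances(scenario_dir):
    """Count instance directories per task under scenario_dir/[task_id]/[instance_id]."""
    counts = defaultdict(int)
    for task_id in sorted(os.listdir(scenario_dir)):
        task_path = os.path.join(scenario_dir, task_id)
        if not os.path.isdir(task_path):
            continue  # skip stray files at the scenario level
        counts[task_id] = sum(
            os.path.isdir(os.path.join(task_path, inst))
            for inst in os.listdir(task_path)
        )
    return dict(counts)

# Example usage:
# print(count_instances("./results/default_two_agents"))
```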

Result Files

Each result directory contains:
  • A timestamp record of the date and time of the run, along with the version of the autogen-agentchat library installed.
  • Console logs: all console output produced by Docker when running AutoGen. Read this like you would a regular console.
  • Per-agent message logs: for each agent, a log of their message dictionaries showing the conversation flow.
  • A working directory containing all code written by AutoGen and all artifacts produced by that code.

Built-in Benchmarks

HumanEval

Code generation benchmark with programming problems

GAIA

General AI assistants benchmark for complex reasoning tasks

AssistantBench

Assistant capabilities evaluation across various domains
Each benchmark has its own README in the benchmarks/ directory with specific instructions and requirements.

Creating Custom Benchmarks

To define your own tasks or benchmarks, review the contributor’s guide for complete technical details on:
  • Task definition format (JSONL)
  • Scenario templates
  • Custom evaluation metrics
  • Benchmark contribution guidelines
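Whatever schema your tasks use, each line of a JSONL file must be a standalone JSON object, so a quick sanity check before running can save a failed benchmark. A minimal sketch (the file path is illustrative):

```python
import json

def validate_jsonl(path):
    """Parse every line of a JSONL file, raising on the first malformed one."""
    tasks = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # allow blank lines
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as e:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({e})")
    return tasks

# Example usage:
# tasks = validate_jsonl("Tasks/my_benchmark.jsonl")
# print(f"{len(tasks)} tasks OK")
```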

Best Practices

Use Docker

Always run benchmarks in Docker for consistency and safety

Multiple Runs

Use --repeat to run multiple iterations for statistical significance

Subsample First

Test with --subsample on a small set before running full benchmarks

Monitor Logs

Review console logs to understand agent behavior and failures

Troubleshooting

Docker Not Running

Error: Cannot connect to Docker daemon
Solution: Ensure Docker Desktop is running and accessible.

Missing AUTOGEN_REPO_BASE

Error: AUTOGEN_REPO_BASE environment variable not set
Solution: Export the path to your AutoGen repository:
export AUTOGEN_REPO_BASE=/path/to/autogen

API Key Issues

Error: No API key found
Solution: Set OAI_CONFIG_LIST as an environment variable or file, or set OPENAI_API_KEY.

Get Help

For detailed help on any command:
agbench --help
agbench run --help
agbench tabulate --help
agbench remove_missing --help

Resources

GitHub Repository

View source code and contribute

Contributing Guide

Learn how to create custom benchmarks
