AutoGenBench works with all AutoGen 0.1.* and 0.2.* versions.
Key Features
Reproducible Testing
Run agents in fresh Docker containers for consistent, isolated testing
Comprehensive Logging
Detailed logs of agent behavior, code execution, and task results
Built-in Benchmarks
Pre-configured benchmarks like HumanEval, GAIA, and AssistantBench
Metrics Analysis
Built-in tools to tabulate and analyze benchmark results
Requirements
Docker Installation
Install Docker Desktop from https://www.docker.com/products/docker-desktop/
WSL Setup (Windows)
If you’re working in WSL:
Enable WSL Integration
Open Docker Desktop → Settings → Resources → WSL Integration
Enable integration with your Ubuntu distribution.
Installation
Install AutoGenBench from the source repository (for example, by running pip install -e . from the agbench package directory of a local autogen checkout; the exact path depends on the repository layout).
API Key Configuration
AutoGenBench requires API keys for LLM access. Configure them using one of these methods:
- Environment Variable
- Config File
- OpenAI API Key
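For example, using environment variables (a sketch; the single-model config-list entry and the key values are placeholders):

```shell
# Method 1: put a config list JSON directly in the OAI_CONFIG_LIST environment variable
export OAI_CONFIG_LIST='[{"model": "gpt-4", "api_key": "sk-PLACEHOLDER"}]'

# Method 3: or set an OpenAI key directly
export OPENAI_API_KEY="sk-PLACEHOLDER"
```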
Additional API Keys
Some benchmark scenarios require additional keys (e.g., Bing Search API). Add them to an ENV.json file in your working folder:
ENV.json
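A minimal sketch of such a file; the BING_API_KEY name follows the Bing example above, and the value is a placeholder:

```json
{
    "BING_API_KEY": "your-key-here"
}
```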
Quick Start
A typical workflow for running the HumanEval benchmark is to run the scenarios with agbench run and then tabulate the results with agbench tabulate, as detailed in the command reference below.
Command Reference
agbench run
Run benchmark scenarios with controlled initial conditions.
- Scenario: The JSONL scenario file to run. If a directory is specified, all JSONL scenarios in the directory are run.
- Config: Environment variable name or path to the OAI_CONFIG_LIST
- --repeat: Number of repetitions to run for each scenario
- --subsample: Run on a subsample of tasks:
  - Decimal (e.g., 0.7): Run on 70% of tasks
  - Integer (e.g., 7): Run exactly 7 tasks from each file
- Model filter: Filter config_list to include only models matching the provided name
- Docker image: Docker image to use when running scenarios. Cannot be used with --native.
- --native: Run scenarios natively rather than in Docker
- Requirements: Requirements file to pip install before running the scenario
Examples
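A few hedged invocations; the HumanEval scenario paths and file names are illustrative, not taken from the repository:

```shell
# Run one scenario file with default settings
agbench run HumanEval/Tasks/human_eval_two_agents.jsonl

# Run all scenarios in a directory, three repetitions each
agbench run --repeat 3 HumanEval/Tasks/

# Smoke-test on 10% of tasks, natively instead of in Docker
agbench run --subsample 0.1 --native HumanEval/Tasks/human_eval_two_agents.jsonl
```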
agbench tabulate
Tabulate and analyze benchmark results.
Example
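A hedged example; the results directory name mirrors the structure described under Results Structure and is illustrative:

```shell
agbench tabulate Results/human_eval_two_agents
```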
agbench remove_missing
Remove missing or incomplete results from the results directory.
Results Structure
AutoGenBench stores results in a hierarchical folder structure:
Example Structure
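An illustrative layout (all directory names are examples):

```text
Results/
└── human_eval_two_agents/        # scenario
    └── HumanEval_107/            # task_id
        ├── 0/                    # instance_id, first repetition
        ├── 1/
        └── ...
```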
- scenario: The benchmark scenario being run
- task_id: Maps to a specific prompt or set of parameters
- instance_id: A specific attempt or run (0-9 for 10 repetitions)
Result Files
Each result directory contains:
timestamp.txt
Records the date and time of the run, along with the version of the autogen-agentchat library installed.
console_log.txt
All console output produced by Docker when running AutoGen. Read this like you would a regular console.
[agent]_messages.json
For each agent, a log of their message dictionaries showing the conversation flow.
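The exact schema depends on the agents involved; a hedged sketch of what one of these files might contain (roles and contents are illustrative):

```json
[
  {"role": "user", "content": "Write a function that returns the nth Fibonacci number."},
  {"role": "assistant", "content": "Here is a Python implementation..."}
]
```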
./coding directory
Contains all code written by AutoGen and all artifacts produced by that code.
Built-in Benchmarks
HumanEval
Code generation benchmark with programming problems
GAIA
General AI assistants benchmark for complex reasoning tasks
AssistantBench
Assistant capabilities evaluation across various domains
Each benchmark has its own README in the benchmarks/ directory with specific instructions and requirements.
Creating Custom Benchmarks
To define your own tasks or benchmarks, review the contributor’s guide for complete technical details on:
- Task definition format (JSONL)
- Scenario templates
- Custom evaluation metrics
- Benchmark contribution guidelines
Best Practices
Use Docker
Always run benchmarks in Docker for consistency and safety
Multiple Runs
Use --repeat to run multiple iterations for statistical significance
Subsample First
Test with --subsample on a small set before running full benchmarks
Monitor Logs
Review console logs to understand agent behavior and failures
Troubleshooting
Docker Not Running
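If runs fail immediately, first confirm that the Docker daemon is reachable (start Docker Desktop if it is not):

```shell
docker info
```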
Missing AUTOGEN_REPO_BASE
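If agbench complains that AUTOGEN_REPO_BASE is unset, exporting it is a reasonable fix; the assumption here is that it should point at a local autogen checkout, and the path below is a placeholder:

```shell
export AUTOGEN_REPO_BASE=/path/to/autogen
```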
API Key Issues
Ensure the OAI_CONFIG_LIST environment variable or file is set, or use OPENAI_API_KEY.
Get Help
For detailed help on any command, pass --help (e.g., agbench --help or agbench run --help).
Resources
GitHub Repository
View source code and contribute
Contributing Guide
Learn how to create custom benchmarks