A benchmarking tool built with the CAMEL framework that compares the performance of various AI models from different providers. Measures and visualizes model speed in tokens per second.

Features

  • Performance benchmarking across multiple models
  • Visual comparison with matplotlib
  • Support for OpenAI and Nebius models
  • Tokens per second metrics
  • Easy provider integration

Prerequisites

  • Python 3.10 or later
  • The uv package manager
  • OpenAI and Nebius API keys

Installation

1. Clone the repository

git clone https://github.com/Arindam200/awesome-ai-apps.git
cd starter_ai_agents/camel_ai_starter
2. Create virtual environment

# Create virtual environment with uv
uv venv

# Activate the virtual environment
source .venv/bin/activate
3. Install dependencies

uv sync
4. Configure environment

Create a .env file with your API keys:
OPENAI_API_KEY="your-openai-api-key"
NEBIUS_API_KEY="your-nebius-api-key"

Implementation

Model Configuration

Set up multiple models for comparison:
agent.py
import time
import matplotlib.pyplot as plt
from camel.agents import ChatAgent
from camel.configs import NebiusConfig, ChatGPTConfig
from camel.messages import BaseMessage
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from dotenv import load_dotenv

load_dotenv()

def create_models():
    model_configs = [
        # OpenAI Models
        (ModelPlatformType.OPENAI, ModelType.GPT_4O_MINI, 
         ChatGPTConfig(temperature=0.0, max_tokens=2000), 
         "OpenAI GPT-4O Mini"),
        (ModelPlatformType.OPENAI, ModelType.GPT_4O, 
         ChatGPTConfig(temperature=0.0, max_tokens=2000), 
         "OpenAI GPT-4O"),
        
        # Nebius Models
        (ModelPlatformType.NEBIUS, "moonshotai/Kimi-K2-Instruct", 
         NebiusConfig(temperature=0.0, max_tokens=2000), 
         "Nebius Kimi-K2-Instruct"),
        (ModelPlatformType.NEBIUS, "Qwen/Qwen3-Coder-480B-A35B-Instruct", 
         NebiusConfig(temperature=0.0, max_tokens=2000), 
         "Nebius Qwen3-Coder-480B-A35B-Instruct"),
        (ModelPlatformType.NEBIUS, "zai-org/GLM-4.5-Air", 
         NebiusConfig(temperature=0.0, max_tokens=2000), 
         "Nebius GLM-4.5-Air")
    ]

    models = [
        (ModelFactory.create(
            model_platform=platform, 
            model_type=model_type, 
            model_config_dict=config.as_dict(), 
            url="https://api.tokenfactory.nebius.com/v1" if platform == ModelPlatformType.NEBIUS else None
        ), name)
        for platform, model_type, config, name in model_configs
    ]
    return models

Message Setup

agent.py
def create_messages():
    sys_msg = BaseMessage.make_assistant_message(
        role_name="Assistant", 
        content="You are a helpful assistant."
    )
    user_msg = BaseMessage.make_user_message(
        role_name="User", 
        content="Tell me a long story."
    )
    return sys_msg, user_msg

Agent Initialization

agent.py
def initialize_agents(models, sys_msg):
    return [
        (ChatAgent(system_message=sys_msg, model=model), name) 
        for model, name in models
    ]

Performance Measurement

agent.py
def measure_response_time(agent, message):
    start_time = time.time()
    response = agent.step(message)
    end_time = time.time()
    tokens_per_second = response.info['usage']["completion_tokens"] / (end_time - start_time)
    return tokens_per_second

Visualization

agent.py
def plot_results(model_names, tokens_per_sec):
    plt.figure(figsize=(10, 6))
    plt.barh(model_names, tokens_per_sec, color='skyblue')
    plt.xlabel("Tokens per Second")
    plt.title("Model Speed Comparison: Tokens per Second")
    plt.gca().invert_yaxis()
    plt.show()

Main Execution

agent.py
models = create_models()
sys_msg, user_msg = create_messages()
agents = initialize_agents(models, sys_msg)

# Measure response times
model_names = []
tokens_per_sec = []

for agent, model_name in agents:
    model_names.append(model_name)
    tokens_per_sec.append(measure_response_time(agent, user_msg))

# Visualize results
plot_results(model_names, tokens_per_sec)

Usage

Run the benchmarking tool:
uv run agent.py
The script will:
  1. Initialize multiple AI models
  2. Send the same test prompt to each
  3. Measure response time and token generation speed
  4. Generate a horizontal bar chart comparing performance

Technical Details

CAMEL Framework Components

  • ChatAgent: Agent class for model interaction
  • ModelFactory: Creates model instances for different providers
  • Configs: Provider-specific configuration classes
  • BaseMessage: Message structure for agent communication
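
A minimal sketch of how these pieces fit together, reusing the same patterns as agent.py above (the model choice and prompt here are purely illustrative):

from dotenv import load_dotenv
from camel.agents import ChatAgent
from camel.configs import ChatGPTConfig
from camel.messages import BaseMessage
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

load_dotenv()

# ModelFactory turns a platform + model type + config into a model instance
model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI,
    model_type=ModelType.GPT_4O_MINI,
    model_config_dict=ChatGPTConfig(temperature=0.0, max_tokens=100).as_dict(),
)

# BaseMessage wraps the system prompt and the user prompt
sys_msg = BaseMessage.make_assistant_message(
    role_name="Assistant", content="You are a helpful assistant."
)
user_msg = BaseMessage.make_user_message(role_name="User", content="Say hello.")

# ChatAgent drives the conversation; step() returns the reply plus usage info
agent = ChatAgent(system_message=sys_msg, model=model)
response = agent.step(user_msg)
print(response.msgs[0].content)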

Supported Models

OpenAI
  • GPT-4O Mini
  • GPT-4O
Nebius
  • Kimi-K2-Instruct
  • Qwen3-Coder-480B-A35B-Instruct
  • GLM-4.5-Air

Extending the Benchmark

Add More Models

model_configs = [
    # Existing models...
    
    # Add new model
    (ModelPlatformType.NEBIUS, "meta-llama/Meta-Llama-3.1-70B-Instruct", 
     NebiusConfig(temperature=0.0, max_tokens=2000), 
     "Nebius Llama 3.1 70B"),
]

Customize Benchmark Tests

def run_benchmark_suite(agent):
    """Run multiple benchmark tests"""
    tests = [
        "Write a short story",
        "Explain quantum computing",
        "Generate code for a sorting algorithm",
    ]
    
    results = []
    for test in tests:
        message = BaseMessage.make_user_message(role_name="User", content=test)
        tokens_per_sec = measure_response_time(agent, message)
        results.append(tokens_per_sec)
    
    return sum(results) / len(results)  # Average

Add More Metrics

def measure_detailed_performance(agent, message):
    start_time = time.time()
    response = agent.step(message)
    end_time = time.time()
    
    elapsed_time = end_time - start_time
    usage = response.info['usage']
    
    return {
        'tokens_per_second': usage["completion_tokens"] / elapsed_time,
        'total_time': elapsed_time,
        'completion_tokens': usage["completion_tokens"],
        'prompt_tokens': usage["prompt_tokens"],
        'total_tokens': usage["total_tokens"],
    }

Save Results

import json

def save_results(model_names, metrics, filename='benchmark_results.json'):
    results = {
        model: metric 
        for model, metric in zip(model_names, metrics)
    }
    
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
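
For example, detailed metrics can be gathered for every model and written out in one call (a sketch that reuses agents, user_msg, and model_names from the main execution above):

# Collect detailed metrics for each model, then persist them to JSON
detailed_metrics = [measure_detailed_performance(agent, user_msg) for agent, _ in agents]
save_results(model_names, detailed_metrics)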

Best Practices

  • Use same prompt for all models
  • Set consistent parameters (temperature, max_tokens)
  • Run multiple iterations for accuracy (see the sketch after this list)
  • Account for network variability
  • Use temperature=0.0 for reproducibility
  • Set reasonable max_tokens limits
  • Consider cost vs. performance
  • Test with realistic prompts
  • Consider both speed and quality
  • Account for model size differences
  • Test with various prompt types
  • Monitor API rate limits
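
To reduce the impact of network variability, each model can be measured several times and the runs averaged (a sketch built on measure_response_time above; the run count of 3 is arbitrary):

def measure_average_speed(agent, message, runs=3):
    """Average tokens per second over several runs to smooth out network jitter."""
    samples = [measure_response_time(agent, message) for _ in range(runs)]
    return sum(samples) / len(samples)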

Visualization Customization

def plot_results_advanced(model_names, tokens_per_sec):
    plt.figure(figsize=(12, 8))
    
    # Create color map
    colors = ['#FF6B6B' if 'OpenAI' in name else '#4ECDC4' 
              for name in model_names]
    
    bars = plt.barh(model_names, tokens_per_sec, color=colors)
    
    # Add value labels
    for bar in bars:
        width = bar.get_width()
        plt.text(width, bar.get_y() + bar.get_height()/2,
                f'{width:.2f}',
                ha='left', va='center', fontsize=10)
    
    plt.xlabel("Tokens per Second", fontsize=12)
    plt.title("AI Model Performance Comparison", fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.savefig('benchmark_results.png', dpi=300)
    plt.show()

Environment Variables

VariableDescriptionRequired
OPENAI_API_KEYOpenAI API keyYes (for OpenAI models)
NEBIUS_API_KEYNebius API keyYes (for Nebius models)
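
Because a missing key only surfaces as a provider error at request time, it can help to fail fast before building any models (a small sketch; it only checks that the variables are set, not that the keys are valid):

import os
from dotenv import load_dotenv

load_dotenv()

# Abort early if either key is missing from the environment / .env file
missing = [key for key in ("OPENAI_API_KEY", "NEBIUS_API_KEY") if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")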

Next Steps

  • Advanced Benchmarking: More comprehensive model comparison
  • Cost Analysis: Compare cost vs. performance (see the sketch below)
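
As a starting point for cost analysis, per-request cost can be estimated from the token counts that measure_detailed_performance already captures. The prices below are hypothetical placeholders, not real rates; substitute your providers' current pricing:

# Placeholder prices in USD per 1M tokens -- replace with real provider rates
PRICING = {
    "OpenAI GPT-4O Mini": {"prompt": 0.15, "completion": 0.60},
    # add one entry per benchmarked model
}

def estimate_cost(model_name, metrics):
    """Estimate the cost of one benchmark request from its token usage."""
    price = PRICING[model_name]
    prompt_cost = metrics["prompt_tokens"] / 1_000_000 * price["prompt"]
    completion_cost = metrics["completion_tokens"] / 1_000_000 * price["completion"]
    return prompt_cost + completion_cost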
