A benchmarking tool built with the CAMEL framework that compares the performance of various AI models from different providers. Measures and visualizes model speed in tokens per second.

Features

  • Performance benchmarking across multiple models
  • Visual comparison with matplotlib
  • Support for OpenAI and Nebius models
  • Tokens per second metrics
  • Easy provider integration

Prerequisites

  • Python 3.10 or later
  • The uv package manager
  • OpenAI and Nebius API keys

Installation

1. Clone the repository

git clone https://github.com/Arindam200/awesome-ai-apps.git
cd starter_ai_agents/camel_ai_starter
2. Create virtual environment

# Create virtual environment with uv
uv venv

# Activate the virtual environment
source .venv/bin/activate
3. Install dependencies

uv sync
4. Configure environment

Create a .env file with your API keys:
OPENAI_API_KEY="your-openai-api-key"
NEBIUS_API_KEY="your-nebius-api-key"

Implementation

Model Configuration

Set up multiple models for comparison:
agent.py
import time
import matplotlib.pyplot as plt
from camel.agents import ChatAgent
from camel.configs import NebiusConfig, ChatGPTConfig
from camel.messages import BaseMessage
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from dotenv import load_dotenv

load_dotenv()

def create_models():
    model_configs = [
        # OpenAI Models
        (ModelPlatformType.OPENAI, ModelType.GPT_4O_MINI, 
         ChatGPTConfig(temperature=0.0, max_tokens=2000), 
         "OpenAI GPT-4O Mini"),
        (ModelPlatformType.OPENAI, ModelType.GPT_4O, 
         ChatGPTConfig(temperature=0.0, max_tokens=2000), 
         "OpenAI GPT-4O"),
        
        # Nebius Models
        (ModelPlatformType.NEBIUS, "moonshotai/Kimi-K2-Instruct", 
         NebiusConfig(temperature=0.0, max_tokens=2000), 
         "Nebius Kimi-K2-Instruct"),
        (ModelPlatformType.NEBIUS, "Qwen/Qwen3-Coder-480B-A35B-Instruct", 
         NebiusConfig(temperature=0.0, max_tokens=2000), 
         "Nebius Qwen3-Coder-480B-A35B-Instruct"),
        (ModelPlatformType.NEBIUS, "zai-org/GLM-4.5-Air", 
         NebiusConfig(temperature=0.0, max_tokens=2000), 
         "Nebius GLM-4.5-Air")
    ]

    models = [
        (ModelFactory.create(
            model_platform=platform, 
            model_type=model_type, 
            model_config_dict=config.as_dict(), 
            url="https://api.tokenfactory.nebius.com/v1" if platform == ModelPlatformType.NEBIUS else None
        ), name)
        for platform, model_type, config, name in model_configs
    ]
    return models

Message Setup

agent.py
def create_messages():
    sys_msg = BaseMessage.make_assistant_message(
        role_name="Assistant", 
        content="You are a helpful assistant."
    )
    user_msg = BaseMessage.make_user_message(
        role_name="User", 
        content="Tell me a long story."
    )
    return sys_msg, user_msg

Agent Initialization

agent.py
def initialize_agents(models, sys_msg):
    return [
        (ChatAgent(system_message=sys_msg, model=model), name) 
        for model, name in models
    ]

Performance Measurement

agent.py
def measure_response_time(agent, message):
    start_time = time.time()
    response = agent.step(message)
    end_time = time.time()
    tokens_per_second = response.info['usage']["completion_tokens"] / (end_time - start_time)
    return tokens_per_second

Visualization

agent.py
def plot_results(model_names, tokens_per_sec):
    plt.figure(figsize=(10, 6))
    plt.barh(model_names, tokens_per_sec, color='skyblue')
    plt.xlabel("Tokens per Second")
    plt.title("Model Speed Comparison: Tokens per Second")
    plt.gca().invert_yaxis()
    plt.show()

Main Execution

agent.py
models = create_models()
sys_msg, user_msg = create_messages()
agents = initialize_agents(models, sys_msg)

# Measure response times
model_names = []
tokens_per_sec = []

for agent, model_name in agents:
    model_names.append(model_name)
    tokens_per_sec.append(measure_response_time(agent, user_msg))

# Visualize results
plot_results(model_names, tokens_per_sec)

Usage

Run the benchmarking tool:
uv run agent.py
The script will:
  1. Initialize multiple AI models
  2. Send the same test prompt to each
  3. Measure response time and token generation speed
  4. Generate a horizontal bar chart comparing performance

Technical Details

CAMEL Framework Components

  • ChatAgent: Agent class for model interaction
  • ModelFactory: Creates model instances for different providers
  • Configs: Provider-specific configuration classes
  • BaseMessage: Message structure for agent communication
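
A minimal sketch of how these pieces fit together, reusing the same patterns as agent.py above (the model choice and prompt here are purely illustrative):

from dotenv import load_dotenv
from camel.agents import ChatAgent
from camel.configs import ChatGPTConfig
from camel.messages import BaseMessage
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

load_dotenv()

# ModelFactory turns a platform + model type + config into a model instance
model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI,
    model_type=ModelType.GPT_4O_MINI,
    model_config_dict=ChatGPTConfig(temperature=0.0, max_tokens=100).as_dict(),
)

# BaseMessage wraps the system prompt and the user prompt
sys_msg = BaseMessage.make_assistant_message(
    role_name="Assistant", content="You are a helpful assistant."
)
user_msg = BaseMessage.make_user_message(role_name="User", content="Say hello.")

# ChatAgent drives the conversation; step() returns the reply plus usage info
agent = ChatAgent(system_message=sys_msg, model=model)
response = agent.step(user_msg)
print(response.msgs[0].content)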

Supported Models

OpenAI
  • GPT-4O Mini
  • GPT-4O
Nebius
  • Kimi-K2-Instruct
  • Qwen3-Coder-480B-A35B-Instruct
  • GLM-4.5-Air

Extending the Benchmark

Add More Models

model_configs = [
    # Existing models...
    
    # Add new model
    (ModelPlatformType.NEBIUS, "meta-llama/Meta-Llama-3.1-70B-Instruct", 
     NebiusConfig(temperature=0.0, max_tokens=2000), 
     "Nebius Llama 3.1 70B"),
]

Customize Benchmark Tests

def run_benchmark_suite(agent):
    """Run multiple benchmark tests"""
    tests = [
        "Write a short story",
        "Explain quantum computing",
        "Generate code for a sorting algorithm",
    ]
    
    results = []
    for test in tests:
        message = BaseMessage.make_user_message(role_name="User", content=test)
        tokens_per_sec = measure_response_time(agent, message)
        results.append(tokens_per_sec)
    
    return sum(results) / len(results)  # Average

Add More Metrics

def measure_detailed_performance(agent, message):
    start_time = time.time()
    response = agent.step(message)
    end_time = time.time()
    
    elapsed_time = end_time - start_time
    usage = response.info['usage']
    
    return {
        'tokens_per_second': usage["completion_tokens"] / elapsed_time,
        'total_time': elapsed_time,
        'completion_tokens': usage["completion_tokens"],
        'prompt_tokens': usage["prompt_tokens"],
        'total_tokens': usage["total_tokens"],
    }

Save Results

import json

def save_results(model_names, metrics, filename='benchmark_results.json'):
    results = {
        model: metric 
        for model, metric in zip(model_names, metrics)
    }
    
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
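
For example, detailed metrics can be gathered for every model and written out in one call (a sketch that reuses agents, user_msg, and model_names from the main execution above):

# Collect detailed metrics for each model, then persist them to JSON
detailed_metrics = [measure_detailed_performance(agent, user_msg) for agent, _ in agents]
save_results(model_names, detailed_metrics)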

Best Practices

  • Use same prompt for all models
  • Set consistent parameters (temperature, max_tokens)
  • Run multiple iterations for accuracy (see the sketch after this list)
  • Account for network variability
  • Use temperature=0.0 for reproducibility
  • Set reasonable max_tokens limits
  • Consider cost vs. performance
  • Test with realistic prompts
  • Consider both speed and quality
  • Account for model size differences
  • Test with various prompt types
  • Monitor API rate limits
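
To reduce the impact of network variability, each model can be measured several times and the runs averaged (a sketch built on measure_response_time above; the run count of 3 is arbitrary):

def measure_average_speed(agent, message, runs=3):
    """Average tokens per second over several runs to smooth out network jitter."""
    samples = [measure_response_time(agent, message) for _ in range(runs)]
    return sum(samples) / len(samples)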

Visualization Customization

def plot_results_advanced(model_names, tokens_per_sec):
    plt.figure(figsize=(12, 8))
    
    # Create color map
    colors = ['#FF6B6B' if 'OpenAI' in name else '#4ECDC4' 
              for name in model_names]
    
    bars = plt.barh(model_names, tokens_per_sec, color=colors)
    
    # Add value labels
    for bar in bars:
        width = bar.get_width()
        plt.text(width, bar.get_y() + bar.get_height()/2,
                f'{width:.2f}',
                ha='left', va='center', fontsize=10)
    
    plt.xlabel("Tokens per Second", fontsize=12)
    plt.title("AI Model Performance Comparison", fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.savefig('benchmark_results.png', dpi=300)
    plt.show()

Environment Variables

VariableDescriptionRequired
OPENAI_API_KEYOpenAI API keyYes (for OpenAI models)
NEBIUS_API_KEYNebius API keyYes (for Nebius models)
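
Because a missing key only surfaces as a provider error at request time, it can help to fail fast before building any models (a small sketch; it only checks that the variables are set, not that the keys are valid):

import os
from dotenv import load_dotenv

load_dotenv()

# Abort early if either key is missing from the environment / .env file
missing = [key for key in ("OPENAI_API_KEY", "NEBIUS_API_KEY") if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")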

Next Steps

  • Advanced Benchmarking: More comprehensive model comparison
  • Cost Analysis: Compare cost vs. performance (see the sketch below)
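
As a starting point for cost analysis, per-request cost can be estimated from the token counts that measure_detailed_performance already captures. The prices below are hypothetical placeholders, not real rates; substitute your providers' current pricing:

# Placeholder prices in USD per 1M tokens -- replace with real provider rates
PRICING = {
    "OpenAI GPT-4O Mini": {"prompt": 0.15, "completion": 0.60},
    # add one entry per benchmarked model
}

def estimate_cost(model_name, metrics):
    """Estimate the cost of one benchmark request from its token usage."""
    price = PRICING[model_name]
    prompt_cost = metrics["prompt_tokens"] / 1_000_000 * price["prompt"]
    completion_cost = metrics["completion_tokens"] / 1_000_000 * price["completion"]
    return prompt_cost + completion_cost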
