Features
- Performance benchmarking across multiple models
- Visual comparison with matplotlib
- Support for OpenAI and Nebius models
- Tokens per second metrics
- Easy provider integration
Prerequisites
- Python 3.11 or higher
- uv - Fast Python package installer
- OpenAI API key
- Nebius API key
Installation
Implementation
Model Configuration
Set up multiple models for comparison.
Message Setup
Agent Initialization
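A minimal sketch of this step, assuming the `camel-ai` package is installed and `OPENAI_API_KEY` is set. The helper name `build_agent`, the system prompt text, and the default `max_tokens` value are illustrative choices, not taken from `agent.py`:

```python
def build_agent(temperature: float = 0.0, max_tokens: int = 512):
    """Create a ChatAgent backed by GPT-4o Mini (illustrative helper)."""
    # Imported inside the function so this module can be loaded
    # even when camel-ai is not installed.
    from camel.agents import ChatAgent
    from camel.configs import ChatGPTConfig
    from camel.messages import BaseMessage
    from camel.models import ModelFactory
    from camel.types import ModelPlatformType, ModelType

    # ModelFactory builds a provider-specific model from a platform,
    # a model type, and a config dictionary.
    model = ModelFactory.create(
        model_platform=ModelPlatformType.OPENAI,
        model_type=ModelType.GPT_4O_MINI,
        model_config_dict=ChatGPTConfig(
            temperature=temperature,
            max_tokens=max_tokens,
        ).as_dict(),
    )

    # The system message sets the agent's role for every benchmark prompt.
    system_message = BaseMessage.make_assistant_message(
        role_name="Assistant",
        content="You are a helpful assistant.",
    )
    return ChatAgent(system_message=system_message, model=model)
```

A Nebius-hosted model would presumably follow the same pattern with the Nebius platform type and its provider-specific config class.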
Performance Measurement
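One way to compute the tokens-per-second metric is to time a single call and divide the completion token count by the elapsed time. The sketch below is provider-agnostic: `generate` is a hypothetical adapter that returns the reply text and the number of completion tokens, and `fake_model` is a stub standing in for a real API call. With CAMEL, the adapter would typically wrap `agent.step(...)` and read the completion token count from the usage information on the response.

```python
import time
from typing import Callable, Dict, Tuple


def measure_tokens_per_second(
    generate: Callable[[str], Tuple[str, int]],
    prompt: str,
) -> Dict[str, float]:
    """Time one call to `generate` and derive tokens per second."""
    start = time.perf_counter()
    _text, completion_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very coarse clocks.
    tps = completion_tokens / elapsed if elapsed > 0 else 0.0
    return {
        "seconds": elapsed,
        "completion_tokens": completion_tokens,
        "tokens_per_second": tps,
    }


def fake_model(prompt: str) -> Tuple[str, int]:
    """Stub that simulates a 50 ms response of 20 tokens."""
    time.sleep(0.05)
    return "stub reply", 20


result = measure_tokens_per_second(fake_model, "Explain recursion.")
print(f"{result['tokens_per_second']:.1f} tokens/s")
```

Note that this measures end-to-end wall-clock time, so network latency is included in the figure.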
Visualization
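A horizontal bar chart of this kind could be drawn roughly as follows. This is a sketch, not the code from `agent.py`: the function names, the output path, and the figure sizing are assumptions, and it requires matplotlib 3.4 or later for `bar_label`.

```python
from typing import Dict, List, Tuple


def sorted_for_chart(results: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order models slowest to fastest.

    barh draws the first item at the bottom, so the fastest model
    ends up at the top of the chart.
    """
    return sorted(results.items(), key=lambda item: item[1])


def plot_tokens_per_second(results: Dict[str, float],
                           path: str = "benchmark.png") -> str:
    """Save a horizontal bar chart of tokens per second per model."""
    # Imported here so sorted_for_chart works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    ordered = sorted_for_chart(results)
    names = [name for name, _ in ordered]
    speeds = [speed for _, speed in ordered]

    fig, ax = plt.subplots(figsize=(8, 0.6 * len(names) + 1.5))
    bars = ax.barh(names, speeds)
    ax.bar_label(bars, fmt="%.1f", padding=3)  # print the value next to each bar
    ax.set_xlabel("Tokens per second")
    ax.set_title("Model throughput")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_tokens_per_second({"model-a": 42.0, "model-b": 87.5})` with placeholder values like these writes `benchmark.png` to the working directory.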
Main Execution
Usage
Run the benchmarking tool. It will:
- Initialize multiple AI models
- Send the same test prompt to each
- Measure response time and token generation speed
- Generate a horizontal bar chart comparing performance
Technical Details
CAMEL Framework Components
ChatAgent
Agent class for model interaction
ModelFactory
Creates model instances for different providers
Configs
Provider-specific configuration classes
BaseMessage
Message structure for agent communication
Supported Models
OpenAI
- GPT-4o Mini
- GPT-4o
Nebius
- Kimi-K2-Instruct
- Qwen3-Coder-480B-A35B-Instruct
- GLM-4.5-Air
Extending the Benchmark
Add More Models
Customize Benchmark Tests
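One possible way to customize the tests is to group prompts by category and iterate over them, so each model sees the same set. The categories and prompt texts below are illustrative examples only.

```python
from typing import Dict, Iterator, List, Tuple

# Illustrative prompt suite; replace with prompts that match your workload.
PROMPT_SUITE: Dict[str, List[str]] = {
    "short_answer": ["What is the capital of France?"],
    "reasoning": ["Explain why quicksort is O(n log n) on average."],
    "code": ["Write a Python function that reverses a linked list."],
}


def iter_prompts(suite: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (category, prompt) pairs in a stable order."""
    for category, prompts in suite.items():
        for prompt in prompts:
            yield category, prompt


for category, prompt in iter_prompts(PROMPT_SUITE):
    print(f"[{category}] {prompt}")
```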
Add More Metrics
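Beyond tokens per second, per-run latencies and token counts can be summarized with the standard library. The metric names in this sketch are assumptions, and the input numbers are made-up values used only to show the arithmetic.

```python
import statistics
from typing import Dict, List


def summarize(latencies: List[float], completion_tokens: List[int]) -> Dict[str, float]:
    """Aggregate per-run latencies (seconds) and token counts."""
    total_time = sum(latencies)
    total_tokens = sum(completion_tokens)
    return {
        "mean_latency_s": statistics.mean(latencies),
        "median_latency_s": statistics.median(latencies),
        # Sample standard deviation needs at least two runs.
        "stdev_latency_s": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
        "max_latency_s": max(latencies),
        "tokens_per_second": total_tokens / total_time if total_time > 0 else 0.0,
        "seconds_per_token": total_time / total_tokens if total_tokens else float("inf"),
    }


# Three hypothetical runs of 100 tokens each.
summary = summarize([1.0, 2.0, 3.0], [100, 100, 100])
print(summary)
```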
Save Results
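Results could be persisted as JSON along with the prompt and a timestamp, which makes later runs comparable. The file layout and the `save_results` helper are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Union


def save_results(
    results: Dict[str, float],
    prompt: str,
    path: Union[str, Path],
) -> Path:
    """Write tokens-per-second results to a JSON file and return its path."""
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "tokens_per_second": results,
    }
    path = Path(path)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

For example, `save_results(results, prompt, "benchmark_results.json")` writes the file next to the script.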
Best Practices
Fair Comparison
- Use same prompt for all models
- Set consistent parameters (temperature, max_tokens)
- Run multiple iterations for accuracy
- Account for network variability
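Running several iterations and reporting the median is one simple way to reduce the effect of network jitter. The sketch below also discards a warm-up call, on the assumption that the first request may pay connection setup costs. `median_seconds` and its defaults are illustrative.

```python
import statistics
import time
from typing import Callable, List


def median_seconds(
    call: Callable[[], object],
    iterations: int = 5,
    warmup: int = 1,
) -> float:
    """Return the median wall-clock time of `call` over several runs."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        call()
    timings: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    # The median is less sensitive to a single slow request than the mean.
    return statistics.median(timings)


print(f"{median_seconds(lambda: time.sleep(0.01), iterations=3):.3f} s")
```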
Model Configuration
- Use temperature=0.0 for reproducibility
- Set reasonable max_tokens limits
- Consider cost vs. performance
- Test with realistic prompts
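To keep parameters consistent, one option is to define them once and reuse the same values for every model. In this sketch `SharedParams` is a hypothetical container, and `max_tokens=512` is an arbitrary example limit.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SharedParams:
    """Generation parameters applied identically to every model."""
    temperature: float = 0.0  # deterministic-leaning output for reproducibility
    max_tokens: int = 512     # example cap; tune to your prompts and budget


MODEL_NAMES = [
    "GPT-4o Mini",
    "GPT-4o",
    "Kimi-K2-Instruct",
    "Qwen3-Coder-480B-A35B-Instruct",
    "GLM-4.5-Air",
]

PARAMS = SharedParams()
# Each model gets its own copy of the same settings.
configs = {name: asdict(PARAMS) for name in MODEL_NAMES}
print(configs["GPT-4o"])
```

Freezing the dataclass prevents the settings from being changed by accident between models during a run.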
Result Interpretation
- Consider both speed and quality
- Account for model size differences
- Test with various prompt types
- Monitor API rate limits
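When a provider throttles requests, a common approach is to retry with exponential backoff. The helper below is a generic sketch: it catches every `Exception` for brevity, whereas real code should catch only the provider's rate-limit error. The `sleep` parameter exists so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call `call`, retrying up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Time spent waiting between retries should be excluded from the measured latency, otherwise throttled models will appear slower than they are.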
Visualization Customization
Environment Variables
| Variable | Description | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key | Yes (for OpenAI models) |
| NEBIUS_API_KEY | Nebius API key | Yes (for Nebius models) |
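Since each key is only required for its own provider, a benchmark script could check which keys are set and skip providers that are not configured. The variable names come from the table above; the helper `available_providers` is hypothetical.

```python
import os
from typing import List, Mapping

# Provider name -> environment variable holding its API key.
REQUIRED_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Nebius": "NEBIUS_API_KEY",
}


def available_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the providers whose API key is set to a non-empty value."""
    return [
        provider
        for provider, variable in REQUIRED_KEYS.items()
        if env.get(variable)
    ]


print("Configured providers:", available_providers())
```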
Next Steps
Advanced Benchmarking
More comprehensive model comparison
Cost Analysis
Compare cost vs. performance