Codenames AI Benchmark
Evaluate LLM performance through competitive Codenames gameplay. Test 50+ models across OpenAI, Anthropic, Google Gemini, xAI, DeepSeek, and OpenRouter with BAML-powered type-safe agents.
Key Features
BAML-Powered Agents
Universal LLM agents with type-safe structured outputs and automatic validation
Multi-Provider Support
50+ models from OpenAI, Anthropic, Google, xAI, DeepSeek, and OpenRouter
Comprehensive Benchmarking
Test hint-giver and guesser model combinations with configurable game settings and result analysis
Advanced Analytics
ELO ratings, win rates, efficiency metrics, momentum tracking, and error patterns
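The ELO ratings above follow the standard expected-score update rule; a minimal sketch (the function name and K-factor default are illustrative, not the benchmark's actual API):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard ELO update: score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1500-rated pairing beats a 1600-rated pairing
print(elo_update(1500.0, 1600.0, 1.0))  # -> (~1520.5, ~1579.5)
```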
Quick Start
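A minimal sketch of running one benchmarked game, assuming a Python entry point; the module, class, and parameter names below are illustrative placeholders rather than the project's actual API (see the Quickstart Guide for the real commands):

```python
# Illustrative only: assumes a hypothetical Python API for the benchmark.
from codenames_benchmark import CodenamesGame  # hypothetical import

game = CodenamesGame(
    hint_giver_model="gpt-4o",           # model playing the hint giver (spymaster)
    guesser_model="claude-3-5-sonnet",   # model playing the guesser
    seed=42,                             # fixed seed for a reproducible board
)

result = game.play()
print(result.winner, result.turns)
```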
Why Codenames?
Codenames is an ideal benchmark for evaluating language model capabilities:
- Semantic Understanding — Models must understand word relationships and associations
- Strategic Reasoning — Balancing risk vs. reward when giving hints
- Communication — Coordinating between two models (hint giver and guesser), as sketched below
- Error Handling — Dealing with ambiguous hints and uncertain guesses
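To make the hint giver/guesser coordination concrete: each turn, the hint giver produces a one-word clue plus a count, and the guesser may make up to count + 1 guesses, stopping early on a miss. A minimal sketch of that turn loop, with hypothetical agent interfaces (give_clue and guess are illustrative method names, not the project's actual API):

```python
from dataclasses import dataclass

@dataclass
class Clue:
    word: str   # single-word hint; must not appear on the board
    count: int  # number of team words the hint is meant to cover

def play_turn(hint_giver, guesser, board: list[str], team_words: set[str]) -> list[str]:
    """One team's turn: the hint giver produces a clue, then the guesser
    may guess up to clue.count + 1 words, stopping on the first miss."""
    clue: Clue = hint_giver.give_clue(board, team_words)   # hypothetical agent method
    guesses: list[str] = []
    for _ in range(clue.count + 1):
        guess = guesser.guess(board, clue, guesses)        # hypothetical agent method
        guesses.append(guess)
        if guess not in team_words:                        # a wrong guess ends the turn
            break
    return guesses
```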
Get Started
Quickstart Guide
Get up and running in 5 minutes
Game Mechanics
Learn how the game works
Model Selection
Choose the right models for testing
API Reference
Explore the complete API