Codenames AI Benchmark

Evaluate LLM performance through competitive Codenames gameplay. Test 50+ models across OpenAI, Anthropic, Google Gemini, xAI, DeepSeek, and OpenRouter with BAML-powered type-safe agents.

Key Features

BAML-Powered Agents

Universal LLM agents with type-safe structured outputs and automatic validation
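The repo's actual BAML schemas aren't shown here, but "type-safe structured outputs with automatic validation" can be illustrated with a minimal Python sketch: the raw LLM text is parsed into a typed object and rejected before the guesser ever sees an illegal clue. The `Clue` class and its fields are hypothetical, not the project's real types.

```python
from dataclasses import dataclass

@dataclass
class Clue:
    """Structured clue a hint-giver agent must return (hypothetical schema)."""
    word: str   # one-word hint; must not appear on the board
    count: int  # how many board words the hint targets

    def validate(self, board_words: set[str]) -> None:
        # Reject malformed or illegal clues instead of passing them downstream.
        if not self.word.isalpha():
            raise ValueError("hint must be a single word")
        if self.word.upper() in board_words:
            raise ValueError("hint may not be a word on the board")
        if self.count < 1:
            raise ValueError("count must be at least 1")

# A raw model reply like "OCEAN, 2" is parsed into the typed object,
# then validated against the current board.
clue = Clue(word="OCEAN", count=2)
clue.validate({"BEACH", "WAVE", "DESK"})  # passes: OCEAN is not on the board
```

BAML generates this kind of typed client automatically from a schema, so every provider's output goes through the same parse-then-validate path.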

Multi-Provider Support

50+ models from OpenAI, Anthropic, Google, xAI, DeepSeek, and OpenRouter

Comprehensive Benchmarking

Test model combinations with configurable game settings and analysis

Advanced Analytics

ELO ratings, win rates, efficiency metrics, momentum tracking, and error patterns
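ELO ratings are the standard pairwise-comparison metric: after each game, the winner takes points from the loser in proportion to how surprising the result was. A sketch of the standard update rule follows; the benchmark's actual K-factor and starting rating are assumptions here (1500 and K=32 are conventional defaults).

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a is 1.0 for a win by A, 0.5 draw, 0.0 loss."""
    # Expected score for A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1500; model A wins one game.
a, b = elo_update(1500.0, 1500.0, 1.0)  # → (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves more points than an expected result, so ratings converge toward true relative strength over many games.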

Quick Start

1. Install Dependencies

Clone the repository and install the Python dependencies:

git clone https://github.com/DeweyMarco/code-names-benchmark.git
cd code-names-benchmark
pip install -r requirements.txt
2. Configure API Keys

Set up your API keys for the providers you want to test:

cp .env.example .env
# Edit .env with your API keys
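You only need keys for the providers you plan to test. A small sketch of checking which providers are configured, assuming the conventional environment-variable names (consult `.env.example` for the exact names this repo expects):

```python
import os

# Typical provider key names; these are assumptions, not the repo's definitive list.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google Gemini": "GOOGLE_API_KEY",
    "xAI": "XAI_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}

def configured_providers() -> list[str]:
    """Return the providers whose API key is set in the environment."""
    return [name for name, var in PROVIDER_KEYS.items() if os.getenv(var)]

print(configured_providers())
```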
3. Run Your First Game

Test with free OpenRouter models or your own:

python demo_simple_game.py

Why Codenames?

Codenames is an ideal benchmark for evaluating language model capabilities:
  • Semantic Understanding — Models must understand word relationships and associations
  • Strategic Reasoning — Balancing risk vs. reward when giving hints
  • Communication — Coordinating between two models (hint giver and guesser)
  • Error Handling — Dealing with ambiguous hints and uncertain guesses
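The coordination challenge in the last two bullets can be sketched as a single turn's control flow: the hint giver emits a (word, count) clue, and the guesser keeps guessing until it misses a team word or uses up its count + 1 guesses (the standard Codenames bonus guess). The stub agents below stand in for the LLM-backed ones; all names and the scripted guesses are illustrative, not the repo's API.

```python
def play_turn(give_hint, guess_word, board, team_words, max_guesses):
    """One turn: hint giver sends (word, count); guesser guesses until it
    misses a team word or exhausts count + 1 guesses."""
    hint, count = give_hint(board, team_words)
    revealed = []
    for _ in range(min(max_guesses, count + 1)):
        remaining = [w for w in board if w not in revealed]
        g = guess_word(hint, remaining)
        revealed.append(g)
        if g not in team_words:   # a wrong guess ends the turn immediately
            break
    return hint, revealed

# Stub agents standing in for LLM calls:
def give_hint(board, team_words):
    return "WATER", 2                     # hypothetical clue targeting 2 words

def guess_word(hint, remaining):
    order = ["WAVE", "BEACH", "DESK"]     # scripted guesses for the sketch
    return next(w for w in order if w in remaining)

hint, revealed = play_turn(
    give_hint, guess_word,
    board=["WAVE", "BEACH", "DESK", "PIANO"],
    team_words={"WAVE", "BEACH"},
    max_guesses=3,
)
# revealed → ["WAVE", "BEACH", "DESK"]: two team words, then the turn
# ends on the risky bonus guess.
```

Notice that the guesser never sees `team_words` directly; everything it knows is compressed into the hint, which is exactly what makes the game a communication benchmark.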

Get Started

Quickstart Guide

Get up and running in 5 minutes

Game Mechanics

Learn how the game works

Model Selection

Choose the right models for testing

API Reference

Explore the complete API
