Codenames AI Benchmark

Evaluate LLM performance through competitive Codenames gameplay. Test 50+ models across OpenAI, Anthropic, Google Gemini, xAI, DeepSeek, and OpenRouter with BAML-powered type-safe agents.

Key Features

BAML-Powered Agents

Universal LLM agents with type-safe structured outputs and automatic validation
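The repo's actual BAML schemas aren't shown here, but "type-safe structured outputs with automatic validation" can be illustrated with a minimal Python sketch: the raw LLM text is parsed into a typed object and rejected before the guesser ever sees an illegal clue. The `Clue` class and its fields are hypothetical, not the project's real types.

```python
from dataclasses import dataclass

@dataclass
class Clue:
    """Structured clue a hint-giver agent must return (hypothetical schema)."""
    word: str   # one-word hint; must not appear on the board
    count: int  # how many board words the hint targets

    def validate(self, board_words: set[str]) -> None:
        # Reject malformed or illegal clues instead of passing them downstream.
        if not self.word.isalpha():
            raise ValueError("hint must be a single word")
        if self.word.upper() in board_words:
            raise ValueError("hint may not be a word on the board")
        if self.count < 1:
            raise ValueError("count must be at least 1")

# A raw model reply like "OCEAN, 2" is parsed into the typed object,
# then validated against the current board.
clue = Clue(word="OCEAN", count=2)
clue.validate({"BEACH", "WAVE", "DESK"})  # passes: OCEAN is not on the board
```

BAML generates this kind of typed client automatically from a schema, so every provider's output goes through the same parse-then-validate path.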

Multi-Provider Support

50+ models from OpenAI, Anthropic, Google, xAI, DeepSeek, and OpenRouter

Comprehensive Benchmarking

Test model combinations with configurable game settings and analysis

Advanced Analytics

ELO ratings, win rates, efficiency metrics, momentum tracking, and error patterns
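ELO ratings are the standard pairwise-comparison metric: after each game, the winner takes points from the loser in proportion to how surprising the result was. A sketch of the standard update rule follows; the benchmark's actual K-factor and starting rating are assumptions here (1500 and K=32 are conventional defaults).

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a is 1.0 for a win by A, 0.5 draw, 0.0 loss."""
    # Expected score for A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1500; model A wins one game.
a, b = elo_update(1500.0, 1500.0, 1.0)  # → (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves more points than an expected result, so ratings converge toward true relative strength over many games.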

Quick Start

1. Install Dependencies

Clone the repository and install the Python dependencies:

git clone https://github.com/DeweyMarco/code-names-benchmark.git
cd code-names-benchmark
pip install -r requirements.txt
2. Configure API Keys

Set up your API keys for the providers you want to test:

cp .env.example .env
# Edit .env with your API keys
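You only need keys for the providers you plan to test. A small sketch of checking which providers are configured, assuming the conventional environment-variable names (consult `.env.example` for the exact names this repo expects):

```python
import os

# Typical provider key names; these are assumptions, not the repo's definitive list.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google Gemini": "GOOGLE_API_KEY",
    "xAI": "XAI_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}

def configured_providers() -> list[str]:
    """Return the providers whose API key is set in the environment."""
    return [name for name, var in PROVIDER_KEYS.items() if os.getenv(var)]

print(configured_providers())
```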
3. Run Your First Game

Test with free OpenRouter models or your own:

python demo_simple_game.py

Why Codenames?

Codenames is an ideal benchmark for evaluating language model capabilities:
  • Semantic Understanding — Models must understand word relationships and associations
  • Strategic Reasoning — Balancing risk vs. reward when giving hints
  • Communication — Coordinating between two models (hint giver and guesser)
  • Error Handling — Dealing with ambiguous hints and uncertain guesses
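The coordination challenge in the last two bullets can be sketched as a single turn's control flow: the hint giver emits a (word, count) clue, and the guesser keeps guessing until it misses a team word or uses up its count + 1 guesses (the standard Codenames bonus guess). The stub agents below stand in for the LLM-backed ones; all names and the scripted guesses are illustrative, not the repo's API.

```python
def play_turn(give_hint, guess_word, board, team_words, max_guesses):
    """One turn: hint giver sends (word, count); guesser guesses until it
    misses a team word or exhausts count + 1 guesses."""
    hint, count = give_hint(board, team_words)
    revealed = []
    for _ in range(min(max_guesses, count + 1)):
        remaining = [w for w in board if w not in revealed]
        g = guess_word(hint, remaining)
        revealed.append(g)
        if g not in team_words:   # a wrong guess ends the turn immediately
            break
    return hint, revealed

# Stub agents standing in for LLM calls:
def give_hint(board, team_words):
    return "WATER", 2                     # hypothetical clue targeting 2 words

def guess_word(hint, remaining):
    order = ["WAVE", "BEACH", "DESK"]     # scripted guesses for the sketch
    return next(w for w in order if w in remaining)

hint, revealed = play_turn(
    give_hint, guess_word,
    board=["WAVE", "BEACH", "DESK", "PIANO"],
    team_words={"WAVE", "BEACH"},
    max_guesses=3,
)
# revealed → ["WAVE", "BEACH", "DESK"]: two team words, then the turn
# ends on the risky bonus guess.
```

Notice that the guesser never sees `team_words` directly; everything it knows is compressed into the hint, which is exactly what makes the game a communication benchmark.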

Get Started

Quickstart Guide

Get up and running in 5 minutes

Game Mechanics

Learn how the game works

Model Selection

Choose the right models for testing

API Reference

Explore the complete API
