Overview
The CORE metric is a standardized benchmark for evaluating language models, introduced in the DataComp-LM (DCLM) paper. It measures model performance across multiple in-context learning (ICL) tasks and produces a single, centered accuracy score.
GPT-2 Baseline: 0.256525
What is CORE?
CORE evaluates models on a suite of tasks covering:
- Reading comprehension
- Common sense reasoning
- World knowledge
- Question answering
The centered score is scaled so that:
- 0.0 = Random guessing
- 1.0 = Perfect performance
How CORE is Calculated
1. Evaluate Individual Tasks
For each task, the model is evaluated using few-shot prompting. Each task specifies:
- task_type: Type of evaluation (multiple_choice, schema, language_modeling)
- num_fewshot: Number of examples to include in the prompt
- continuation_delimiter: Separator between context and answer (usually a space)
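These fields might be grouped into a per-task configuration like the sketch below. The field names come from the list above; the task itself and its values are hypothetical, and nanochat's actual schema may differ.

```python
# Hypothetical per-task configuration using the fields described above.
example_task = {
    "task_type": "multiple_choice",  # one of: multiple_choice, schema, language_modeling
    "num_fewshot": 10,               # few-shot examples prepended to each prompt
    "continuation_delimiter": " ",   # separator between context and answer
}
```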
2. Center Against Random Baseline
Each task's raw accuracy is centered against its random baseline:

centered = (accuracy - random_baseline) / (1 - random_baseline)

where random_baseline is the expected accuracy from random guessing (e.g., 25% for 4-choice multiple choice).
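A minimal sketch of this step; the linear rescaling follows directly from the definition (chance level maps to 0.0, perfect accuracy to 1.0):

```python
def center(accuracy: float, random_baseline: float) -> float:
    """Rescale raw accuracy so chance level maps to 0.0 and perfect to 1.0."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)

# A 4-choice task has a 25% random baseline:
center(0.25, 0.25)  # -> 0.0 (random guessing)
center(1.00, 0.25)  # -> 1.0 (perfect)
center(0.70, 0.25)  # -> ~0.6
```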
3. Average Centered Results
The CORE metric is the mean of all centered results across tasks.
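Sketched in code (the task names and numbers here are made up for illustration):

```python
def core_score(results):
    """CORE = mean of per-task centered accuracies.
    `results` maps task name -> (raw_accuracy, random_baseline)."""
    centered = [(acc - b) / (1.0 - b) for acc, b in results.values()]
    return sum(centered) / len(centered)

core_score({
    "task_a": (0.625, 0.25),  # centered: 0.5
    "task_b": (0.750, 0.50),  # centered: 0.5
})  # -> 0.5
```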
Task Types
CORE supports three evaluation paradigms:
Multiple Choice
The model chooses between several answer options. All prompts share the same context but have different continuations.
Schema
The model evaluates different contexts leading to the same continuation.
Language Modeling
The model predicts a continuation given a context.
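The three paradigms differ only in which part of the prompt varies across candidates. A sketch of the rendering logic (the example format and function are illustrative assumptions, not nanochat's actual API):

```python
def render(task_type, example, delimiter=" "):
    """Return the candidate (context, continuation) pairs the model scores.
    Illustrative only; nanochat's actual rendering may differ."""
    if task_type == "multiple_choice":
        # one shared context, one candidate continuation per option
        return [(example["context"], delimiter + opt) for opt in example["options"]]
    if task_type == "schema":
        # one candidate context per option, all sharing the same continuation
        return [(ctx, delimiter + example["continuation"]) for ctx in example["contexts"]]
    if task_type == "language_modeling":
        # a single context/continuation pair to predict
        return [(example["context"], delimiter + example["continuation"])]
    raise ValueError(f"unknown task_type: {task_type}")
```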
Evaluation Process
Per-Example Evaluation
Each example is evaluated individually:
- Sample few-shot examples (deterministically seeded)
- Render prompts based on task type
- Tokenize and batch sequences
- Forward through the model
- Compute losses and predictions
- Determine correctness based on task type
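For choice-style tasks, the scoring at the end of this loop might look like the following sketch. The `logprob_fn` interface is a placeholder standing in for the model forward pass, not nanochat's real API:

```python
def pick_best(logprob_fn, candidates):
    """Choose the candidate whose continuation has the lowest mean negative
    log-likelihood. `candidates` is a list of (prompt_tokens, continuation_tokens);
    `logprob_fn(tokens, i)` gives log p(tokens[i] | tokens[:i]) (interface assumed)."""
    losses = []
    for prompt, cont in candidates:
        tokens = prompt + cont
        # score only the continuation positions, not the prompt
        nll = [-logprob_fn(tokens, i) for i in range(len(prompt), len(tokens))]
        losses.append(sum(nll) / len(nll))
    return min(range(len(losses)), key=losses.__getitem__)
```

For multiple-choice and schema tasks the argmin is compared against the gold index; language-modeling tasks typically require the greedy prediction to match the continuation instead.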
Distributed Evaluation
CORE evaluation supports multi-GPU parallelism:
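One common pattern (a sketch, not necessarily nanochat's exact scheme) is to give each rank a deterministic strided slice of the examples and reduce the per-rank correctness counts at the end:

```python
def shard(examples, rank, world_size):
    """Give each rank a deterministic, strided slice of the examples."""
    return examples[rank::world_size]

# With 8 GPUs, rank 0 evaluates examples 0, 8, 16, ... and rank 1 gets 1, 9, 17, ...
# After the loop, per-rank correct/total counts are all-reduced into a global accuracy.
```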
Running CORE Evaluation
Evaluate a nanochat Model
Evaluate a HuggingFace Model
Quick Evaluation (Single GPU)
For faster approximate results:
Implementation Details
Tokenization
Prompts are tokenized with BOS prepended:
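As a sketch (the tokenizer interface and `bos_id` parameter are assumptions for illustration):

```python
def tokenize_prompt(tokenizer, text, bos_id):
    """Encode a prompt with the BOS token prepended, so every sequence
    starts from a consistent document boundary (interface assumed)."""
    return [bos_id] + tokenizer.encode(text)
```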
Sequence Truncation
For models with a maximum sequence length (e.g., GPT-2 with 1024 tokens):
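Prompts longer than the context window must be truncated. One sketch, assuming tokens are dropped from the left so that the continuation being scored survives (the exact policy in nanochat may differ):

```python
def truncate(tokens, max_len=1024):
    """Keep the most recent tokens when a sequence exceeds the model's context.
    Dropping from the left preserves the continuation at the end (policy assumed)."""
    return tokens if len(tokens) <= max_len else tokens[-max_len:]
```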
Loss Calculation
Cross-entropy loss is computed autoregressively:
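In outline, the model's output at position i scores the token at position i+1. A framework-free sketch over explicit probability tables (real implementations operate on logits, e.g., with PyTorch's cross-entropy):

```python
import math

def autoregressive_nll(probs, tokens):
    """Mean negative log-likelihood of tokens[1:], where probs[i] is the
    model's distribution over the next token after seeing tokens[:i+1]."""
    nll = [-math.log(probs[i][tokens[i + 1]]) for i in range(len(tokens) - 1)]
    return sum(nll) / len(nll)
```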
Output Format
Results are saved to CSV:
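For illustration, a writer with hypothetical column names (the actual columns in nanochat's CSV may differ):

```python
import csv

def save_results(path, rows):
    """Write per-task results; `rows` is a list of (task, raw_accuracy, centered)
    tuples. Column names here are illustrative assumptions."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["task", "accuracy", "centered"])
        writer.writerows(rows)
```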
Reference
Based on the DataComp-LM paper: https://arxiv.org/abs/2406.11794
Implementation: nanochat/core_eval.py:evaluate_task