Overview
The nanochat.core_eval module provides functions for evaluating language models on the CORE benchmark, as described in the DCLM paper.
evaluate_core
Evaluate a model on a CORE benchmark task. See evaluate_task below, the main entry point for CORE evaluation.
Core Functions
evaluate_task
Evaluate one task across many examples with distributed support.
Parameters
- Language model to evaluate
- Tokenizer for encoding prompts
- List of task examples; the format depends on the task_type (multiple choice, schema, or language modeling)
- Device to run evaluation on
- Task metadata containing:
  - task_type (str): One of 'multiple_choice', 'schema', or 'language_modeling'
  - num_fewshot (int): Number of few-shot examples to include
  - continuation_delimiter (str): Delimiter between context and continuation (e.g., ' ' or '\n')
Returns
Mean accuracy across all examples (0.0 to 1.0)
evaluate_example
Evaluate a single example.
Parameters
Same as evaluate_task, plus:
- Index of the example to evaluate in data
Returns
Whether the model's prediction was correct
Task Types
Multiple Choice
The model chooses among multiple options based on which has the lowest average loss:
- Render all choices with the query prefix
- Forward each choice through the model
- Select the choice with the lowest average loss on the continuation tokens
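The selection step above can be sketched as follows; the per-token loss values are made up for illustration, whereas in the real module they come from the model's forward pass:

```python
# Multiple-choice selection sketch: pick the choice whose continuation
# tokens have the lowest average cross-entropy loss.

def pick_choice(per_choice_losses):
    """per_choice_losses: one list of per-token losses per choice."""
    means = [sum(losses) / len(losses) for losses in per_choice_losses]
    return min(range(len(means)), key=means.__getitem__)

losses = [
    [2.1, 3.0, 2.5],  # choice 0, mean ~2.53
    [1.2, 0.9, 1.5],  # choice 1, mean 1.2 (lowest)
    [4.0, 3.8, 2.2],  # choice 2, mean ~3.33
]
print(pick_choice(losses))  # -> 1
```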
Schema
The model selects the correct context that leads to a given continuation:
- Render all contexts with the same continuation
- Forward each option through the model
- Select the context with the lowest average loss on the continuation tokens
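A minimal rendering sketch for schema tasks (the example text is illustrative): every candidate context is paired with the same continuation, and scoring then proceeds as in multiple choice.

```python
# Schema rendering sketch: all candidate contexts share one continuation;
# the context with the lowest average loss on the continuation tokens wins.
contexts = [
    "The trophy didn't fit in the suitcase because the trophy",
    "The trophy didn't fit in the suitcase because the suitcase",
]
continuation = " was too big."
prompts = [ctx + continuation for ctx in contexts]
print(len(prompts))  # -> 2
```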
Language Modeling
The model must correctly predict all tokens in the continuation:
- Render context + continuation
- Forward through the model
- Check whether the argmax predictions match all continuation tokens
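The all-tokens-correct check can be sketched like this (the token ids are made up, and the two lists are assumed to be the same length):

```python
# Language-modeling check sketch: the example is correct only if the argmax
# prediction at every continuation position equals the target token id.

def lm_example_correct(argmax_preds, continuation_tokens):
    return all(p == t for p, t in zip(argmax_preds, continuation_tokens))

print(lm_example_correct([5, 7, 9], [5, 7, 9]))  # -> True
print(lm_example_correct([5, 7, 8], [5, 7, 9]))  # -> False
```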
Prompt Rendering
render_prompts_mc
Render prompts for multiple choice questions.
render_prompts_schema
Render prompts for schema questions.
render_prompts_lm
Render prompts for language modeling tasks. Returns [prompt_without_continuation, prompt_with_continuation].
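A sketch of the pair of renderings returned for language modeling (the strings are illustrative): rendering the prompt both without and with the continuation lets the continuation's token span be located by comparing the two tokenizations.

```python
# render_prompts_lm-style output sketch: one prompt without the continuation
# and one with it appended.
context = "The capital of France is"
continuation = " Paris"
prompts = [context, context + continuation]
print(prompts)
```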
Utility Functions
find_common_length
Find the length of the common prefix or suffix across token sequences.
- token_sequences: List of tokenized sequences
- direction: 'left' for prefix, 'right' for suffix
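A possible implementation sketch, walking positions from the left (prefix) or right (suffix) until the sequences disagree:

```python
# find_common_length sketch: count how many positions, scanned from the
# chosen end, hold the same token in every sequence.

def find_common_length(token_sequences, direction="left"):
    assert direction in ("left", "right")
    shortest = min(len(seq) for seq in token_sequences)
    count = 0
    for i in range(shortest):
        idx = i if direction == "left" else -(i + 1)
        tok = token_sequences[0][idx]
        if all(seq[idx] == tok for seq in token_sequences):
            count += 1
        else:
            break
    return count

print(find_common_length([[1, 2, 3], [1, 2, 4]], direction="left"))   # -> 2
print(find_common_length([[9, 5, 6], [7, 5, 6]], direction="right"))  # -> 2
```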
stack_sequences
Stack token sequences with padding.
forward_model
Forward the model and compute losses and predictions.
- losses: (B, T) tensor of cross-entropy losses
- predictions: (B, T) tensor of argmax predictions
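The padding step can be sketched in plain Python (the real module works on tensors and, per the Notes, uses the BOS token id as the pad value):

```python
# stack_sequences sketch: right-pad every sequence to the batch's maximum
# length so the batch can be stacked into a single (B, T) array.

def stack_sequences(seqs, pad_token_id):
    max_len = max(len(s) for s in seqs)
    return [s + [pad_token_id] * (max_len - len(s)) for s in seqs]

batch = stack_sequences([[1, 2, 3], [4, 5]], pad_token_id=0)
print(batch)  # -> [[1, 2, 3], [4, 5, 0]]
```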
Example Usage
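A minimal sketch of assembling the task metadata described in the Parameters section above; the commented call is a hypothetical use of the entry point, not a verbatim signature:

```python
# task_meta fields as documented above.
task_meta = {
    "task_type": "multiple_choice",   # or "schema", "language_modeling"
    "num_fewshot": 5,
    "continuation_delimiter": " ",
}
# Hypothetical call (argument names are assumptions):
# mean_accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
print(task_meta["task_type"])
```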
Distributed Evaluation
When running with torchrun, the evaluation automatically distributes examples across ranks.
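A sketch of rank-strided sharding; in a real run the rank and world size would come from torch.distributed, and per-rank results would then be reduced (e.g., an all_reduce of correct counts):

```python
# Distribution sketch: each rank evaluates every world_size-th example.

def shard_examples(examples, rank, world_size):
    return examples[rank::world_size]

print(shard_examples(list(range(10)), rank=1, world_size=4))  # -> [1, 5, 9]
```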
Notes
- Few-shot examples are sampled randomly with seed 1234 + idx for reproducibility
- Models with a max_seq_len attribute will have prompts truncated to that length
- Sequences are truncated from the left to preserve the continuation tokens
- Uses the BOS token as the pad token during batching