The `lm_datasets` module provides utilities for loading and preprocessing causal language modeling datasets from Hugging Face, including WikiText-2, TinyStories, OpenWebText, and Wikipedia.
load_causal_lm_dataset

Load and tokenize a single dataset for causal language modeling.

Parameters
- Configuration object specifying dataset parameters
- Tokenizer with `pad_token_id` defined for processing text

Returns
Returns a tokenized `datasets.Dataset` with the following columns:
- `input_ids`: Tokenized input sequences
- `attention_mask`: Attention masks (1 for real tokens, 0 for padding)
- `labels`: Target labels for causal LM (padding positions set to -100 so they are ignored in loss computation)

The returned dataset has `set_format(type="torch")` applied, so the columns are PyTorch tensors.
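The -100 label masking described above can be illustrated with a minimal plain-Python sketch. This is not the module's actual implementation; the token ids are made up, and a pad token id of 0 is assumed:

```python
def mask_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padding positions
    (attention_mask == 0) with ignore_index so the loss skips them."""
    return [tok if mask == 1 else ignore_index
            for tok, mask in zip(input_ids, attention_mask)]

# A 6-token sequence padded to length 8 (pad token id 0 assumed):
input_ids      = [101, 523, 87, 991, 12, 102, 0, 0]
attention_mask = [1,   1,   1,  1,   1,  1,   0, 0]
labels = mask_labels(input_ids, attention_mask)
# labels == [101, 523, 87, 991, 12, 102, -100, -100]
```

PyTorch's cross-entropy loss ignores targets equal to -100 by default (`ignore_index=-100`), which is why padding positions are masked this way.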
Complexity

O(num_examples · max_length) due to tokenization work.

load_multi_dataset
Load and concatenate multiple datasets for large-scale pretraining.

Parameters
- List of dataset identifiers. Entries can be keys from `DATASET_REGISTRY` or Hugging Face dataset paths, and support the `name:N` syntax to cap individual datasets (e.g., "TinyStories:100000").
- Tokenizer to use for preprocessing
- Dataset split to load (e.g., "train", "validation")
- Maximum sequence length for tokenization
- Global sample cap applied to every dataset; a per-dataset `:N` suffix takes precedence.
Dataset name syntax

Dataset names support optional sample limits via the `:N` suffix:
- "wikitext-2-raw-v1" loads all samples
- "roneneldan/TinyStories:100000" loads at most 100,000 samples
- Per-dataset limits override the global `max_samples_per_dataset` parameter
Returns

Returns a concatenated and shuffled `datasets.Dataset` with all samples from the specified datasets. The combined dataset is shuffled with seed 42 for reproducibility.
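The `:N` capping rule can be sketched with a small parser. `parse_dataset_spec` is a hypothetical helper written for illustration, not part of the module's API; it only mirrors the precedence rule documented above:

```python
def parse_dataset_spec(spec, global_cap=None):
    """Split a dataset identifier into (name, sample_cap).

    A trailing ':N' caps that dataset at N samples and takes
    precedence over the global cap; otherwise the global cap
    (which may be None, meaning unlimited) applies."""
    name, sep, suffix = spec.rpartition(":")
    if sep and suffix.isdigit():
        return name, int(suffix)
    return spec, global_cap

print(parse_dataset_spec("roneneldan/TinyStories:100000"))
# ('roneneldan/TinyStories', 100000)
print(parse_dataset_spec("wikitext-2-raw-v1", global_cap=50000))
# ('wikitext-2-raw-v1', 50000)
```

Using `rpartition` rather than `split(":")` keeps dataset paths that legitimately contain colons intact, since only a trailing all-digit suffix is treated as a cap.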
LanguageModelingDatasetConfig

Configuration dataclass for causal language modeling datasets.

Fields
- Hugging Face dataset name (e.g., "wikitext", "roneneldan/TinyStories")
- Dataset configuration name (e.g., "wikitext-2-raw-v1" for WikiText)
- Dataset split to load ("train", "validation", or "test")
- Name of the column containing the text data
- Maximum sequence length for tokenization; must be positive
- Number of processes for parallel tokenization; None uses a single process
- Whether to stream the dataset; streaming is currently not supported and raises NotImplementedError
Validation

The config validates on initialization:
- `dataset_name` must be non-empty
- `max_length` must be positive
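A minimal stand-in shows how the documented validation behaves. Only the two checks come from the documentation above; the field names beyond `dataset_name` and `max_length`, and all defaults, are assumptions made for this sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LanguageModelingDatasetConfigSketch:
    """Simplified stand-in for LanguageModelingDatasetConfig;
    field names other than dataset_name/max_length and all
    defaults are assumed, not documented."""
    dataset_name: str
    dataset_config: Optional[str] = None   # e.g. "wikitext-2-raw-v1"
    split: str = "train"                   # assumed default
    text_column: str = "text"              # assumed default
    max_length: int = 512                  # assumed default; must be positive
    num_proc: Optional[int] = None         # None -> single process
    streaming: bool = False                # not supported; the loader raises NotImplementedError

    def __post_init__(self):
        # The two documented initialization-time checks:
        if not self.dataset_name:
            raise ValueError("dataset_name must be non-empty")
        if self.max_length <= 0:
            raise ValueError("max_length must be positive")

# Valid configs construct fine; invalid ones fail at creation time:
cfg = LanguageModelingDatasetConfigSketch("wikitext",
                                          dataset_config="wikitext-2-raw-v1")
```

Putting the checks in `__post_init__` means an invalid config can never exist, so downstream loading code does not need to re-validate.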
DATASET_REGISTRY

Pre-configured mapping of common datasets to their Hugging Face paths and configurations.

Registered datasets
- WikiText-2 raw dataset: ("wikitext", "wikitext-2-raw-v1", "text")
- WikiText-103 raw dataset: ("wikitext", "wikitext-103-raw-v1", "text")
- TinyStories dataset: ("roneneldan/TinyStories", None, "text")
- OpenWebText dataset: ("Skylion007/openwebtext", None, "text")
- Wikipedia English dump: ("wikimedia/wikipedia", "20231101.en", "text")
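In use, each registry entry unpacks into a (dataset path, config name, text column) triple. The dict below restates the documented values for illustration; the registry keys shown here are assumptions, since only the entry values, not their keys, are listed above:

```python
# Restatement of the documented registry contents; the real
# DATASET_REGISTRY lives in lm_datasets and its keys may differ.
DATASET_REGISTRY = {
    "wikitext-2":   ("wikitext", "wikitext-2-raw-v1", "text"),
    "wikitext-103": ("wikitext", "wikitext-103-raw-v1", "text"),
    "TinyStories":  ("roneneldan/TinyStories", None, "text"),
    "openwebtext":  ("Skylion007/openwebtext", None, "text"),
    "wikipedia-en": ("wikimedia/wikipedia", "20231101.en", "text"),
}

# Each entry is (HF dataset path, config name or None, text column):
path, config_name, text_column = DATASET_REGISTRY["TinyStories"]
print(path, config_name, text_column)
# roneneldan/TinyStories None text
```

A `None` config name corresponds to datasets that have no named configuration, in which case `datasets.load_dataset` is called with the path alone.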