Overview
ChemLactica provides utilities for loading and preparing datasets for both pretraining and supervised fine-tuning (SFT). The dataset system streams JSONL files during pretraining to keep memory usage low, and uses standard HuggingFace datasets for SFT.
get_dataset()
Main function for loading and preparing datasets based on training type.
Function Signature
Parameters
Type of training. Options: "pretrain" or "sft"
List of directories containing training data files. For pretrain: directories with JSONL files. For SFT: path to a HuggingFace dataset
Directory containing validation data files (JSONL format for pretrain)
List of data types corresponding to each training directory. Must match the length of training_data_dirs. Supported types are defined in DIR_DATA_TYPES
Training configuration object containing training hyperparameters
Model configuration object containing model architecture parameters and tokenizer path
Shared dictionary for tracking JSONL file reading positions across processes (pretrain only). Pass None for SFT
If True, only load the validation dataset
If True and not evaluate_only, skip loading the validation dataset (evaluation runs separately via SLURM)
Size of the shuffle buffer for assay datasets during pretraining
Returns
Dictionary containing "train" and "validation" datasets. For pretrain, returns iterable datasets. For SFT, returns a standard HuggingFace dataset dict
Pretrain Mode
When train_type="pretrain", the function:
- Validates that data types are supported
- Loads JSONL files from each training directory
- Creates iterable datasets using samples_generator
- Processes each dataset with tokenization and formatting
- Shuffles assay-type datasets with specified buffer size
- Interleaves multiple datasets if provided
- Loads the validation dataset (skipped when slurm_eval=True and evaluate_only is not set)
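The first two steps in the list above (validating data types and locating JSONL files) can be sketched as follows. This is a minimal illustration under stated assumptions, not ChemLactica's code: the helper name collect_jsonl_files, the contents of SUPPORTED_TYPES, and the *.jsonl glob pattern are all hypothetical.

```python
import glob
import os

# Hypothetical stand-in for DIR_DATA_TYPES; the real set is defined in ChemLactica.
SUPPORTED_TYPES = {"molecules", "assay_split"}


def collect_jsonl_files(training_data_dirs, dir_data_types, supported_types=SUPPORTED_TYPES):
    """Pair each training directory with its data type and its JSONL files."""
    if len(training_data_dirs) != len(dir_data_types):
        raise ValueError("dir_data_types must have one entry per training directory")

    collected = []
    for data_dir, data_type in zip(training_data_dirs, dir_data_types):
        if data_type not in supported_types:
            raise ValueError(f"unsupported data type: {data_type!r}")
        # Assumption: training files carry the .jsonl extension.
        files = sorted(glob.glob(os.path.join(data_dir, "*.jsonl")))
        collected.append({"type": data_type, "files": files})
    return collected
```

Failing early on a length mismatch or an unknown type keeps configuration errors from surfacing later, in the middle of a streaming run.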
SFT Mode
When train_type="sft", the function loads a standard HuggingFace dataset.
Data Types
Supported data types are defined in DIR_DATA_TYPES. Common types include:
- "molecules": Molecular structure data
- "assay_split": Assay data that should be shuffled
- Other domain-specific types
Assay-type datasets are shuffled during pretraining with a buffer of size shuffle_buffer_size.
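To illustrate what a shuffle buffer size controls when the data is a stream, here is a generic buffered-shuffle sketch in plain Python. It shows the general technique only; the function name buffered_shuffle and the seed parameter are hypothetical, and ChemLactica's own shuffling may be implemented differently.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=0):
    """Approximately shuffle a stream using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered item and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Stream exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger buffer mixes samples more thoroughly at the cost of holding more of them in memory. A buffer of size 1 leaves the order unchanged.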
samples_generator()
Generator function for streaming JSONL files in a distributed manner.
Function Signature
Parameters
List of JSONL file paths to read
Shared dictionary for tracking file positions across processes, enabling checkpoint resumption
Size of chunks for reading files (currently not used in line-by-line reading)
Whether to return line position information with samples
Yields
Dictionary with a "text" key containing the line content from the JSONL file
Features
Distributed Reading
The generator distributes samples across multiple processes.
Checkpoint Resumption
The generator tracks file positions in shared_jsonl_files, which allows reading to resume from a checkpoint.
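Both ideas, splitting lines across processes and recording file positions for resumption, can be sketched together in a few lines. This is an assumption-laden illustration rather than the actual samples_generator: the name read_shard, the round-robin rule (line number modulo world size), the layout of the positions dictionary, and the stripping of the trailing newline are all hypothetical choices.

```python
def read_shard(path, rank, world_size, positions):
    """Yield this process's share of lines, recording progress in `positions`."""
    # Hypothetical state layout: byte offset plus number of lines already read.
    state = dict(positions.get(path, {"position": 0, "line_number": 0}))
    with open(path, "rb") as f:
        f.seek(state["position"])  # resume where the previous run stopped
        while True:
            raw = f.readline()
            if not raw:
                break
            line_number = state["line_number"]
            # Record progress before yielding so a resumed run skips this line.
            state = {"position": f.tell(), "line_number": line_number + 1}
            positions[path] = state
            # Round-robin assignment: each rank keeps every world_size-th line.
            if line_number % world_size == rank:
                yield {"text": raw.decode("utf-8").rstrip("\n")}
```

The sketch reassigns the whole entry in positions on every line instead of mutating it in place, because in-place changes to a nested dictionary would not propagate if positions were a dictionary shared between processes.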
Sample Format
Each line is yielded as a dictionary with a "text" key holding the line content.
Usage Example
Dataset Processing
All datasets loaded by get_dataset() are processed using process_dataset from chemlactica.utils.dataset_utils, which handles:
- Tokenization with the specified tokenizer
- Sequence formatting and truncation
- Batching and padding
- Special handling for assay vs. non-assay data
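As a rough picture of what the truncation and padding steps in the list above involve, the sketch below clips already-tokenized sequences and right-pads them to a common width. It is not process_dataset: the function name truncate_and_pad, the pad_id default, and the output keys are assumptions made for illustration.

```python
def truncate_and_pad(token_id_batch, max_length, pad_id=0):
    """Clip each sequence to max_length, then right-pad to the longest one."""
    clipped = [ids[:max_length] for ids in token_id_batch]
    width = max((len(ids) for ids in clipped), default=0)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in clipped]
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in clipped]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Padding to the longest sequence in the batch, not to max_length, keeps short batches small.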
Source Reference
get_dataset(): chemlactica/get_dataset.py:10
samples_generator(): chemlactica/jsonl_dataset.py:37