Overview
The dataset configuration system provides robust dataset management with integrity validation, automatic downloading, and support for both real and synthetic data.
Dataset Specification
Datasets are defined using the DatasetSpec class (dataset_config.py:14-25):
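A minimal sketch of what such a specification might look like, written as a frozen dataclass. Only `download_base_url`, the expected feature count, the minimum-row setting, and the optional SHA256 hash are implied by this page; every other field name and all values below (including the placeholder URL and file names) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Illustrative stand-in for the real DatasetSpec; field names are assumed."""

    name: str
    train_filename: str                 # hypothetical field
    test_filename: str                  # hypothetical field
    download_base_url: str              # referenced on this page as spec.download_base_url
    expected_features: int              # 784 pixel columns for Fashion-MNIST
    n_classes: int = 10                 # labels are expected in [0, 9]
    min_rows: int = 1                   # hypothetical default
    train_sha256: Optional[str] = None  # optional integrity hash


# Example instance; the URL and file names are placeholders, not the real ones.
example_spec = DatasetSpec(
    name="fashion-mnist",
    train_filename="fashion-mnist_train.csv",
    test_filename="fashion-mnist_test.csv",
    download_base_url="https://example.invalid/fashion-mnist",
    expected_features=784,
)
```

Freezing the dataclass keeps a specification from being mutated after validation has been run against it.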
Fashion-MNIST Specification
The default dataset specification (dataset_config.py:27-32):
Dataset Modes
Synthetic Mode
Generate random data for testing (train.py:33-41):
- Features: Random normal distribution, shape (n_samples, n_features)
- Labels: Random integers in [0, n_classes)
- Reproducible with seed
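The three properties above can be sketched with NumPy's seeded generator. The function name `make_synthetic` and its signature are hypothetical; the actual code in train.py may differ.

```python
import numpy as np


def make_synthetic(n_samples, n_features, n_classes, seed=0):
    """Return (X, y) with normally distributed features and integer labels."""
    rng = np.random.default_rng(seed)  # seeded generator makes the output reproducible
    X = rng.standard_normal((n_samples, n_features)).astype(np.float32)
    y = rng.integers(0, n_classes, size=n_samples)  # upper bound is exclusive
    return X, y


# Two calls with the same seed yield identical arrays.
X_a, y_a = make_synthetic(5, 3, 10, seed=42)
X_b, y_b = make_synthetic(5, 3, 10, seed=42)
```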
Real Dataset Mode
Load Fashion-MNIST CSV data:
Loading Datasets
Basic Loading
Load and normalize a dataset (dataset_config.py:95-105):
- Parses CSV with header row
- Extracts labels from first column
- Extracts features from remaining columns
- Normalizes features by maximum value
- Returns an (X, y) tuple
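The loading steps listed above can be sketched as follows. This assumes NumPy's `loadtxt` is an acceptable parser and uses a hypothetical function name, `load_csv_dataset`; it is not the code from dataset_config.py.

```python
import io

import numpy as np


def load_csv_dataset(source):
    """Load a label-first CSV and scale features by their maximum value."""
    # skiprows=1 drops the header row; ndmin=2 keeps single-row files 2-D.
    data = np.loadtxt(source, delimiter=",", skiprows=1, dtype=np.float32, ndmin=2)
    y = data[:, 0].astype(np.int64)   # first column holds the labels
    X = data[:, 1:]                   # remaining columns hold the features
    max_val = X.max()
    if max_val > 0:                   # avoid dividing by zero on all-zero data
        X = X / max_val
    return X, y


# loadtxt also accepts file-like objects, so a tiny in-memory CSV works for a demo.
demo_csv = io.StringIO("label,pixel1,pixel2\n3,0,255\n7,128,64\n")
X_demo, y_demo = load_csv_dataset(demo_csv)
```

For Fashion-MNIST the maximum pixel value is 255, so dividing by the maximum maps features into [0, 1].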
Ensuring Dataset Readiness
Validate and optionally download dataset (dataset_config.py:129-159):
- Validates existing dataset
- If validation fails and auto_download=True, downloads dataset
- Re-validates after download
- Returns dataset dimensions
Dataset Validation
File-Level Checks
Validate dataset integrity (dataset_config.py:52-92):
Validation Checks
The validator performs:
- File existence - Dataset file must exist
- Non-empty file - File size > 0 bytes
- SHA256 hash (optional) - Verify file integrity
- CSV parsing - Load data with comma delimiter
- Shape validation - Check (*, expected_features + 1) shape
- Minimum rows - Ensure sufficient samples
- NaN detection - No missing values allowed
- Label range - Labels must be in [0, 9]
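The checks above can be sketched in order as a single function. The name `validate_csv`, its parameters, and the exact error messages (other than "Dataset hash mismatch", which appears under Common Errors) are assumptions.

```python
import hashlib
import os
import tempfile

import numpy as np


def validate_csv(path, expected_features, min_rows=1, expected_sha256=None, n_classes=10):
    """Run the listed integrity checks and return (n_rows, n_features)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Dataset file not found: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"Dataset file is empty: {path}")
    if expected_sha256 is not None:    # the hash check is optional
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        if digest != expected_sha256:
            raise ValueError("Dataset hash mismatch")
    data = np.loadtxt(path, delimiter=",", skiprows=1, ndmin=2)
    if data.shape[1] != expected_features + 1:   # one label column plus features
        raise ValueError(f"Expected {expected_features + 1} columns, got {data.shape[1]}")
    if data.shape[0] < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {data.shape[0]}")
    if np.isnan(data).any():
        raise ValueError("Dataset contains NaN values")
    labels = data[:, 0]
    if labels.min() < 0 or labels.max() > n_classes - 1:
        raise ValueError(f"Labels must be in [0, {n_classes - 1}]")
    return data.shape[0], expected_features


# Demo on a two-row file, then with a deliberately wrong hash.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    with open(good, "w") as fh:
        fh.write("label,p1,p2\n0,10,20\n9,30,40\n")
    demo_shape = validate_csv(good, expected_features=2, min_rows=2)
    try:
        validate_csv(good, expected_features=2, expected_sha256="0" * 64)
        hash_error = None
    except ValueError as exc:
        hash_error = str(exc)
```

Running the cheap file-level checks before parsing means a missing or corrupted file is reported without loading the whole CSV.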
Hash Computation
Compute SHA256 digest for any file (dataset_config.py:43-49):
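A minimal sketch of a chunked SHA256 helper using the standard library. The function name `sha256_file` and the chunk size are assumptions; reading in chunks keeps memory use flat for large CSV files.

```python
import hashlib
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: a tiny chunk size forces several reads of a three-byte file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"abc")
    demo_digest = sha256_file(demo_path, chunk_size=2)
```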
Automatic Download
Download Fashion-MNIST datasets (dataset_config.py:115-126):
- Downloads from spec.download_base_url
- Creates parent directories automatically
- Returns SHA256 hashes of downloaded files
- Times out after 120 seconds per file
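The download behaviour described above can be sketched with `urllib.request`. The function name `download_files` and its parameters are hypothetical; the 120-second timeout and the returned SHA256 hashes follow the list above. The demo uses a local `file://` URL so that it runs without network access.

```python
import hashlib
import os
import pathlib
import shutil
import tempfile
import urllib.request


def download_files(base_url, filenames, dest_dir, timeout=120):
    """Fetch each file from base_url into dest_dir and return {name: sha256}."""
    os.makedirs(dest_dir, exist_ok=True)       # create parent directories as needed
    hashes = {}
    for name in filenames:
        url = f"{base_url.rstrip('/')}/{name}"
        target = os.path.join(dest_dir, name)
        # The timeout applies to each file separately.
        with urllib.request.urlopen(url, timeout=timeout) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)
        with open(target, "rb") as fh:
            hashes[name] = hashlib.sha256(fh.read()).hexdigest()
    return hashes


# Demo: serve a file from a local directory through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "src")
    src.mkdir()
    (src / "train.csv").write_bytes(b"label,p1\n0,1\n")
    dest = os.path.join(tmp, "data", "raw")
    demo_hashes = download_files(src.as_uri(), ["train.csv"], dest)
    demo_exists = os.path.isfile(os.path.join(dest, "train.csv"))
```

The returned hashes can be recorded and later passed to the validator as the expected SHA256 values.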
Dataset Format
CSV Structure
Expected format:
- First row: header (skipped during loading)
- First column: label (0-9)
- Remaining columns: pixel values (0-255)
- Total columns: 785 (1 label + 784 features)
Normalization
Features are normalized by maximum value:
Train/Validation Split
The training script includes splitting logic (train.py:55-64):
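One common way to implement such a split is a seeded shuffle followed by slicing. This is a sketch under that assumption; the function name `train_val_split`, the `val_fraction` parameter, and the shuffling strategy are not confirmed by train.py.

```python
import numpy as np


def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle indices with a seeded generator and split off a validation set."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffled row indices
    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


# Demo: 10 samples with a 20% validation share.
X_all = np.arange(20).reshape(10, 2)
y_all = np.arange(10)
X_train, y_train, X_val, y_val = train_val_split(X_all, y_all, val_fraction=0.2, seed=1)
```

Indexing X and y with the same index arrays keeps each feature row paired with its label.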
Error Handling
Common Errors
- FileNotFoundError: set dataset_auto_prepare=True or run download script
- ValueError: Dataset hash mismatch
- dataset_min_rows
Usage in Training
The train.py script integrates dataset loading (train.py:30-52):
Related
- Running Experiments - Configure and run training
- Reproducibility - Deterministic data generation