Overview
The download_datasets.py script automates downloading Formula 1 datasets from Kaggle and packaging them into a single zip file. It is designed to run in GitHub Actions workflows with change detection capabilities.
What It Does
- Downloads the latest versions of F1 datasets from Kaggle using the kagglehub library
- Copies downloaded files to the local data/ directory
- Creates a compressed zip archive (data.zip) containing all dataset files
- Provides detailed console output for monitoring the download and archiving process
Dataset Sources
The script downloads from two primary Kaggle datasets:
- jtrotman/formula-1-race-data - Core F1 race data tables
- jtrotman/formula-1-race-events - F1 race event information
Functions
download_dataset
Downloads a Kaggle dataset and copies it to the target directory. It takes two arguments:
- Kaggle dataset path in the format user/dataset-name (e.g., 'jtrotman/formula-1-race-data')
- Directory to copy the downloaded files to
Returns
True if successful, False otherwise
Example Usage
How It Works
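As a rough illustration of the download-and-copy step, the following minimal sketch calls kagglehub.dataset_download, which returns the local path of the downloaded dataset, and then copies the files with shutil. The parameter names (dataset_path, target_dir), the optional fetch hook, and the log messages are assumptions made for this example; the actual script may differ.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def download_dataset(
    dataset_path: str,
    target_dir: Path,
    fetch: Optional[Callable[[str], str]] = None,
) -> bool:
    """Download a Kaggle dataset and copy its files into target_dir.

    `fetch` is a hypothetical hook (not part of the documented script) that
    lets the downloader be swapped out, for example in tests. When omitted,
    kagglehub.dataset_download is used.
    """
    try:
        if fetch is None:
            import kagglehub  # imported lazily so the sketch loads without it

            fetch = kagglehub.dataset_download

        print(f"Downloading {dataset_path}...")
        cache_dir = Path(fetch(dataset_path))

        target_dir.mkdir(parents=True, exist_ok=True)
        copied = 0
        for src in cache_dir.rglob("*"):
            if src.is_file():
                # Preserve the dataset's internal folder layout
                dest = target_dir / src.relative_to(cache_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
                copied += 1

        print(f"Copied {copied} file(s) to {target_dir}")
        return True
    except Exception as exc:  # report the failure instead of raising
        print(f"Failed to download {dataset_path}: {exc}")
        return False
```

With Kaggle credentials configured, a call such as download_dataset("jtrotman/formula-1-race-data", Path("data")) would be expected to return True and leave the dataset files under data/.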
create_zip_archive
Creates a compressed zip archive of all files in the source directory. It takes two arguments:
- Directory containing files to compress
- Output path for the zip file
Returns
True if successful, False otherwise
Example Usage
Implementation Details
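One plausible way to implement the archiving step is with the standard-library zipfile module. The sketch below is an assumption about how such a function could look, not the script's actual code: the parameter names (source_dir, zip_path), the choice of ZIP_DEFLATED compression, and the decision to return False for an empty directory are all illustrative.

```python
import zipfile
from pathlib import Path


def create_zip_archive(source_dir: Path, zip_path: Path) -> bool:
    """Compress every file under source_dir into zip_path."""
    try:
        # Collect the files first so the archive itself is never included
        files = sorted(p for p in source_dir.rglob("*") if p.is_file())
        if not files:
            print(f"No files found in {source_dir}")
            return False

        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
            for file_path in files:
                # Store paths relative to source_dir, not absolute paths
                arcname = file_path.relative_to(source_dir).as_posix()
                archive.write(file_path, arcname=arcname)

        print(f"Archived {len(files)} file(s) into {zip_path}")
        return True
    except (OSError, zipfile.BadZipFile) as exc:
        print(f"Failed to create {zip_path}: {exc}")
        return False
```

A call such as create_zip_archive(Path("data"), Path("data.zip")) would then produce the consolidated archive described above.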
main
Main execution function that orchestrates the download and archiving process.
Returns
0 on success, 1 on failure
Workflow
- Creates the data/ directory if it doesn’t exist
- Downloads all configured Kaggle datasets
- Validates that files were downloaded successfully
- Creates a consolidated data.zip archive
- Reports success/failure status
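The workflow steps above can be sketched as a small orchestration function. To keep the example self-contained, the download and archive steps are passed in as callables; the function name run, its parameters, and the messages are hypothetical. The sketch also assumes that a partially failed download still returns 0 as long as at least one file was downloaded and the archive was created, which the original description does not state explicitly.

```python
from pathlib import Path
from typing import Callable, Sequence


def run(
    datasets: Sequence[str],
    data_dir: Path,
    zip_path: Path,
    download: Callable[[str, Path], bool],
    archive: Callable[[Path, Path], bool],
) -> int:
    """Return 0 on success and 1 on failure, following the workflow steps."""
    # Step 1: make sure the data directory exists
    data_dir.mkdir(parents=True, exist_ok=True)

    # Step 2: try every dataset, remembering the ones that failed
    failed = [name for name in datasets if not download(name, data_dir)]
    for name in failed:
        print(f"Failed to download: {name}")

    # Step 3: partial success is tolerated, an empty data directory is not
    if not any(p.is_file() for p in data_dir.rglob("*")):
        print("No files were downloaded")
        return 1

    # Step 4: build the consolidated archive
    if not archive(data_dir, zip_path):
        print("Failed to create zip archive")
        return 1

    # Step 5: report the outcome
    print(f"Created {zip_path}")
    return 0
```

Passing the callables explicitly is only a convenience for this sketch; a script's main() would more likely call its own download and archive functions directly.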
Usage
Running the Script
Command Line Execution
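The script can be executed directly with the Python interpreter, for example by running python download_datasets.py from the directory that contains it.
Output Structure
After execution, you’ll have a data/ directory containing the downloaded dataset files and a data.zip archive of those files.
Configuration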
Kaggle Credentials
The script requires Kaggle API credentials to download datasets. Set up your credentials:
- Create a Kaggle account at kaggle.com
- Go to Account Settings → API → Create New Token
- This downloads kaggle.json with your credentials
- Place the file at ~/.kaggle/kaggle.json
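Before running the script, it can help to confirm that the credentials file is in place. The helper below is a hypothetical pre-flight check, not part of the script itself; it relies only on the fact that kaggle.json is a JSON object with "username" and "key" fields.

```python
import json
from pathlib import Path
from typing import Optional


def has_kaggle_credentials(path: Optional[Path] = None) -> bool:
    """Return True if the file holds non-empty "username" and "key" values."""
    if path is None:
        # Default location described in the setup steps
        path = Path.home() / ".kaggle" / "kaggle.json"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON
        return False
    return isinstance(data, dict) and bool(data.get("username")) and bool(data.get("key"))
```

Calling has_kaggle_credentials() with no arguments checks the default location; passing a path allows checking a file stored elsewhere.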
Adding New Datasets
To download additional datasets, modify the datasets list in the main() function:
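A minimal sketch of what the extended list might look like, assuming datasets is a plain list of user/dataset-name strings. The exact structure in the script may differ, and the last entry is a placeholder rather than a real dataset.

```python
# Hypothetical shape of the datasets list inside main()
datasets = [
    "jtrotman/formula-1-race-data",
    "jtrotman/formula-1-race-events",
    "your-username/your-dataset",  # placeholder: replace with the new dataset path
]
```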
Dependencies
From pyproject.toml:
Required Packages
- kagglehub (0.3.13) - Kaggle dataset download library
- pathlib - File path operations (Python standard library)
- zipfile - ZIP archive creation (Python standard library)
- shutil - High-level file operations (Python standard library)
GitHub Actions Integration
This script is designed to work in GitHub Actions workflows, where the 0 or 1 returned by main() can serve as the step's exit status.
Error Handling
The script handles the following failure cases:
- Missing datasets: Reports which datasets failed to download
- Empty data directory: Exits if no files were downloaded
- Zip creation failure: Reports if archive creation fails
- Partial success: Continues processing even if some datasets fail
