
Overview

The download_datasets.py script downloads Formula 1 datasets from Kaggle and packages them into a consolidated zip file. It is designed to run in GitHub Actions workflows with change detection.

What It Does

  1. Downloads the latest versions of F1 datasets from Kaggle using the kagglehub library
  2. Copies downloaded files to the local data/ directory
  3. Creates a compressed zip archive (data.zip) containing all dataset files
  4. Provides detailed console output for monitoring the download and archiving process

Dataset Sources

The script downloads from two primary Kaggle datasets:
  • jtrotman/formula-1-race-data - Core F1 race data tables
  • jtrotman/formula-1-race-events - F1 race event information

Functions

download_dataset

Downloads a Kaggle dataset and copies it to the target directory.
Parameters:
  • dataset_path (str, required) - Kaggle dataset path in the format user/dataset-name (e.g., 'jtrotman/formula-1-race-data')
  • target_dir (Path, required) - Directory to copy the downloaded files to
Returns:
  • bool - True if successful, False otherwise

Example Usage

from pathlib import Path
from download_datasets import download_dataset

data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)

success = download_dataset("jtrotman/formula-1-race-data", data_dir)
if success:
    print("Dataset downloaded successfully")

How It Works

import shutil
from pathlib import Path

import kagglehub

def download_dataset(dataset_path: str, target_dir: Path) -> bool:
    try:
        # Download using kagglehub (downloads to its local cache)
        download_path = kagglehub.dataset_download(dataset_path)
    except Exception as exc:
        print(f"Failed to download {dataset_path}: {exc}")
        return False

    # Copy files from the cache to the target directory,
    # preserving the relative directory layout
    source_path = Path(download_path)
    files_copied = 0
    for file_path in source_path.rglob("*"):
        if file_path.is_file():
            relative_path = file_path.relative_to(source_path)
            target_file = target_dir / relative_path
            target_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(file_path, target_file)
            files_copied += 1
    return files_copied > 0

create_zip_archive

Creates a compressed zip archive of all files in the source directory.
Parameters:
  • source_dir (Path, required) - Directory containing files to compress
  • zip_path (Path, required) - Output path for the zip file
Returns:
  • bool - True if successful, False otherwise

Example Usage

from pathlib import Path
from download_datasets import create_zip_archive

data_dir = Path("./data")
zip_file = Path("./data.zip")

success = create_zip_archive(data_dir, zip_file)
if success:
    print(f"Archive created: {zip_file}")

Implementation Details

import zipfile
from pathlib import Path

def create_zip_archive(source_dir: Path, zip_path: Path) -> bool:
    # Remove an existing zip if present
    if zip_path.exists():
        zip_path.unlink()

    # Create a new compressed archive
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        files_added = 0
        for file_path in sorted(source_dir.rglob("*")):
            if file_path.is_file():
                # Store paths relative to source_dir so the archive
                # has no leading directory component
                arcname = file_path.relative_to(source_dir)
                zipf.write(file_path, arcname)
                files_added += 1
    return files_added > 0

main

Main execution function that orchestrates the download and archiving process.
Returns:
  • int - 0 on success, 1 on failure

Workflow

  1. Creates the data/ directory if it doesn’t exist
  2. Downloads all configured Kaggle datasets
  3. Validates that files were downloaded successfully
  4. Creates a consolidated data.zip archive
  5. Reports success/failure status
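
The steps above can be sketched as a small orchestration function. This is a hedged illustration, not the script's actual main(): run_pipeline is a hypothetical name, and the injected download and archive callables stand in for download_dataset and create_zip_archive so the sketch stays self-contained.

```python
from pathlib import Path

def run_pipeline(datasets, download, archive,
                 data_dir=Path("data"), zip_path=Path("data.zip")) -> int:
    """Sketch of the workflow: download each dataset, validate, then zip."""
    # 1. Create the data/ directory if it doesn't exist
    data_dir.mkdir(parents=True, exist_ok=True)

    # 2. Download all configured datasets, continuing past failures
    ok = [d for d in datasets if download(d, data_dir)]
    print(f"Download Summary: {len(ok)}/{len(datasets)} datasets successful")

    # 3. Validate that at least one file was downloaded
    if not any(p.is_file() for p in data_dir.rglob("*")):
        print("No files downloaded; aborting")
        return 1

    # 4. Create the consolidated archive; 5. report status via exit code
    return 0 if archive(data_dir, zip_path) else 1
```

Passing the two helpers in as parameters is purely for illustration; the real script calls them directly.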

Usage

Running the Script

python download_datasets.py

Command Line Execution

The script can also be executed directly, provided it has a shebang line and execute permission:
./download_datasets.py

Output Structure

After execution, you’ll have:
project/
├── data/                    # Downloaded CSV files
│   ├── circuits.csv
│   ├── constructors.csv
│   ├── drivers.csv
│   └── ...
└── data.zip                 # Compressed archive
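
To spot-check the result, the standard-library zipfile module can list the archive's contents. A quick verification sketch, assuming data.zip sits in the current directory:

```python
from pathlib import Path
import zipfile

zip_path = Path("data.zip")
if zip_path.exists():
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        print(f"{len(names)} file(s) in archive")
        for name in sorted(names)[:5]:  # show the first few entries
            print(f"  {name}")
```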

Configuration

Kaggle Credentials

The script requires Kaggle API credentials to download datasets. Set up your credentials:
  1. Create a Kaggle account at kaggle.com
  2. Go to Account Settings → API → Create New Token
  3. This downloads kaggle.json with your credentials
  4. Place the file at ~/.kaggle/kaggle.json
Or set environment variables:
export KAGGLE_USERNAME="your-username"
export KAGGLE_KEY="your-api-key"
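
A short preflight check can confirm that one of these credential sources is in place before running the script. The helper below (have_kaggle_credentials is a hypothetical name, not part of download_datasets.py) only checks for presence, not validity:

```python
import os
from pathlib import Path

def have_kaggle_credentials() -> bool:
    """Return True if kaggle.json or the env-var pair appears to be set."""
    if (Path.home() / ".kaggle" / "kaggle.json").exists():
        return True
    return bool(os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"))

if not have_kaggle_credentials():
    print("No Kaggle credentials found; see the setup steps above.")
```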

Adding New Datasets

To download additional datasets, modify the datasets list in the main() function:
datasets = [
    "jtrotman/formula-1-race-data",
    "jtrotman/formula-1-race-events",
    "your-user/your-dataset"  # Add new datasets here
]

Dependencies

From pyproject.toml:
dependencies = [
    "kagglehub==0.3.13",
    "huggingface-hub==0.35.3",
]

Required Packages

  • kagglehub (0.3.13) - Kaggle dataset download library
  • pathlib - File path operations (Python standard library)
  • zipfile - ZIP archive creation (Python standard library)
  • shutil - High-level file operations (Python standard library)

GitHub Actions Integration

This script is designed to work seamlessly in GitHub Actions workflows:
- name: Download datasets
  env:
    KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
    KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
  run: python download_datasets.py

Error Handling

The script includes comprehensive error handling:
  • Missing datasets: Reports which datasets failed to download
  • Empty data directory: Exits if no files were downloaded
  • Zip creation failure: Reports if archive creation fails
  • Partial success: Continues processing even if some datasets fail
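
The partial-success behavior can be illustrated with a generic continue-on-failure loop. run_all and the stand-in tasks below are illustrative only, not code from the script:

```python
def run_all(tasks):
    """Run each (name, callable) task; collect failures instead of
    stopping at the first one, mirroring the script's behavior."""
    failures = []
    for name, task in tasks:
        try:
            if not task():
                failures.append(name)
        except Exception as exc:
            print(f"✗ {name}: {exc}")
            failures.append(name)
    return failures

# Stand-in tasks: one succeeds, one reports failure
failed = run_all([("dataset-a", lambda: True), ("dataset-b", lambda: False)])
print(failed)  # ['dataset-b']
```

Catching exceptions per task (rather than around the whole loop) is what lets the remaining datasets still be attempted.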

Console Output

The script provides detailed progress information:
Formula 1 Dataset Download Script
============================================================
Data directory: /path/to/data

============================================================
Downloading dataset: jtrotman/formula-1-race-data
============================================================
✓ Downloaded to cache: /cache/path
  → Copied: circuits.csv
  → Copied: constructors.csv
✓ Copied 18 file(s) from jtrotman/formula-1-race-data

============================================================
Download Summary: 2/2 datasets successful
============================================================

Found 18 file(s) in data directory

============================================================
Creating zip archive: data.zip
============================================================
✓ Created zip with 18 file(s) (4.25 MB)

============================================================
✓ All operations completed successfully!
============================================================
