The RaceData project includes Python scripts for automated downloading and updating of Formula 1 datasets. This guide shows you how to use these scripts for programmatic access.

Overview

The repository includes two main scripts for programmatic data access:
  • download_datasets.py - Downloads F1 datasets from Kaggle and creates consolidated archives
  • upload_to_hf.py - Uploads datasets to HuggingFace Hub (for maintainers)

Prerequisites

Step 1: Install Required Libraries

Install the necessary Python packages:
pip install kagglehub huggingface-hub
kagglehub is required for downloading from Kaggle, while huggingface-hub is needed for HuggingFace uploads.
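A quick way to confirm both packages are importable before running the scripts (note that the huggingface-hub package is imported as huggingface_hub):

```python
import importlib.util

def missing_packages(packages=("kagglehub", "huggingface_hub")):
    """Return the names of any packages that cannot be imported."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    for pkg in missing_packages():
        print(f"Missing: {pkg} - install it with pip")
```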

Step 2: Set Up Kaggle API Credentials

To download datasets from Kaggle, you need API credentials:
  1. Go to your Kaggle account settings
  2. Scroll to the “API” section
  3. Click “Create New Token” to download kaggle.json
  4. Place the file in ~/.kaggle/kaggle.json (Linux/Mac) or C:\Users\<Windows-username>\.kaggle\kaggle.json (Windows)
# Linux/Mac
mkdir -p ~/.kaggle
cp /path/to/downloaded/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
Keep your kaggle.json file secure and never commit it to version control.
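As a sanity check, this sketch verifies the credentials file exists and (on Linux/Mac) that its permissions are owner-only; the path argument is a convenience added here for testing, not part of any RaceData script:

```python
import stat
from pathlib import Path

def kaggle_credentials_ok(path=None):
    """Check that kaggle.json exists and is not readable by other users."""
    cred = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not cred.is_file():
        print(f"Missing credentials file: {cred}")
        return False
    # Any group/other permission bits set means the file is too open
    mode = stat.S_IMODE(cred.stat().st_mode)
    if mode & 0o077:
        print(f"Warning: {cred} has loose permissions ({oct(mode)}); run chmod 600")
    return True
```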

Using download_datasets.py

The download_datasets.py script automates the process of downloading Formula 1 datasets from Kaggle.

Basic Usage

# Clone the repository
git clone https://github.com/TracingInsights/RaceData.git
cd RaceData

# Run the download script
python download_datasets.py

How It Works

The script performs the following operations:

Step 1: Download from Kaggle

Downloads the latest versions of two Kaggle datasets:
  • jtrotman/formula-1-race-data
  • jtrotman/formula-1-race-events

Step 2: Copy to Data Directory

Copies all downloaded CSV files to the data/ directory in the repository.

Step 3: Create Zip Archive

Creates a consolidated data.zip file containing all datasets.
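The archive step can be sketched with the standard library (directory and archive names follow the repository layout described above; this is an illustration, not the script's exact code):

```python
import zipfile
from pathlib import Path

def create_archive(data_dir="data", archive_name="data.zip"):
    """Bundle every file under data_dir into a single compressed zip."""
    root = Path(data_dir)
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to data/ so the archive unpacks cleanly
                zf.write(path, path.relative_to(root))
    return archive_name
```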

Script Source Code

Here’s the core functionality from download_datasets.py:
def download_dataset(dataset_path: str, target_dir: Path) -> bool:
    """
    Download a Kaggle dataset and copy to target directory.

    Args:
        dataset_path: Kaggle dataset path (e.g., 'user/dataset-name')
        target_dir: Directory to copy files to

    Returns:
        True if successful, False otherwise
    """
    print(f"\n{'=' * 60}")
    print(f"Downloading dataset: {dataset_path}")
    print(f"{'=' * 60}")

    try:
        # Download using kagglehub (downloads to cache)
        download_path = kagglehub.dataset_download(dataset_path)
        print(f"✓ Downloaded to cache: {download_path}")

        # Copy files from cache to target directory
        source_path = Path(download_path)
        if not source_path.exists():
            print(f"✗ Error: Downloaded path does not exist: {source_path}")
            return False

        # Copy all files from the downloaded dataset
        files_copied = 0
        for file_path in source_path.rglob("*"):
            if file_path.is_file():
                # Preserve relative structure if needed, or flatten
                relative_path = file_path.relative_to(source_path)
                target_file = target_dir / relative_path

                # Create parent directories if needed
                target_file.parent.mkdir(parents=True, exist_ok=True)

                # Copy file
                shutil.copy2(file_path, target_file)
                print(f"  → Copied: {relative_path}")
                files_copied += 1

        print(f"✓ Copied {files_copied} file(s) from {dataset_path}")
        return True

    except Exception as e:
        print(f"✗ Error downloading {dataset_path}: {e}")
        return False
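
The script's entry point presumably calls download_dataset once per dataset and reports failures; a minimal sketch of that flow (the downloader is passed in as an argument here so the loop can be exercised without hitting Kaggle):

```python
from pathlib import Path

DATASETS = ["jtrotman/formula-1-race-data", "jtrotman/formula-1-race-events"]

def run_downloads(downloader, target_dir="data", datasets=DATASETS):
    """Invoke downloader for each dataset; return the paths that failed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [ds for ds in datasets if not downloader(ds, target)]
```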

Custom Integration

You can integrate the download functionality into your own Python projects:

Example: Custom Download Script

custom_download.py
import kagglehub
from pathlib import Path
import shutil

def download_f1_data(output_dir: str = "./f1_data"):
    """
    Download Formula 1 datasets to a custom directory.
    
    Args:
        output_dir: Directory to save the downloaded files
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download F1 race data
    print("Downloading Formula 1 race data...")
    race_data = kagglehub.dataset_download("jtrotman/formula-1-race-data")
    
    # Copy files to output directory
    source = Path(race_data)
    for file in source.glob("*.csv"):
        dest = output_path / file.name
        shutil.copy2(file, dest)
        print(f"Copied: {file.name}")
    
    print(f"\nData downloaded to: {output_path.absolute()}")

if __name__ == "__main__":
    download_f1_data()

Example: Selective Download

selective_download.py
import kagglehub
from pathlib import Path
import shutil
import pandas as pd

def download_specific_tables(tables: list[str], output_dir: str = "./f1_data"):
    """
    Download only specific F1 data tables.
    
    Args:
        tables: List of CSV filenames to download (e.g., ['drivers.csv', 'races.csv'])
        output_dir: Directory to save the files
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download the dataset
    print("Downloading Formula 1 datasets...")
    data_path = kagglehub.dataset_download("jtrotman/formula-1-race-data")
    source = Path(data_path)
    
    # Copy only specified tables
    for table in tables:
        src_file = source / table
        if src_file.exists():
            dest_file = output_path / table
            shutil.copy2(src_file, dest_file)
            print(f"✓ Copied: {table}")
            
            # Optionally load into pandas
            df = pd.read_csv(dest_file)
            print(f"  Rows: {len(df):,}")
        else:
            print(f"✗ Not found: {table}")

if __name__ == "__main__":
    # Download only drivers, races, and results
    download_specific_tables([
        'drivers.csv',
        'races.csv',
        'results.csv',
        'lap_times.csv'
    ])
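
Once the tables are on disk they can be combined with pandas; the join below assumes the Kaggle dataset's Ergast-style column names (raceId, year, name):

```python
import pandas as pd

def results_with_race_names(data_dir="./f1_data"):
    """Attach the race year and name to each result row via raceId."""
    races = pd.read_csv(f"{data_dir}/races.csv")
    results = pd.read_csv(f"{data_dir}/results.csv")
    # Left join keeps every result even if a race row is missing
    return results.merge(races[["raceId", "year", "name"]], on="raceId", how="left")
```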

Automated Updates

The RaceData repository uses GitHub Actions to automatically update the dataset within 3 hours after each race. You can implement similar automation:

Example: Scheduled Updates with the schedule Library

scheduled_update.py
import schedule
import time
from datetime import datetime
import subprocess

def update_f1_data():
    """
    Run the download script to update F1 data.
    """
    print(f"\n[{datetime.now()}] Starting F1 data update...")
    
    try:
        # Run the download script
        result = subprocess.run(
            ['python', 'download_datasets.py'],
            capture_output=True,
            text=True,
            check=True
        )
        print(result.stdout)
        print("✓ Update completed successfully")
    except subprocess.CalledProcessError as e:
        print(f"✗ Update failed: {e}")
        print(e.stderr)

# Schedule updates every Monday at 10:00 AM
schedule.every().monday.at("10:00").do(update_f1_data)

# Or schedule after every race (customize based on F1 calendar)
schedule.every().sunday.at("20:00").do(update_f1_data)  # After typical race time

print("F1 Data Update Scheduler Started")
print("Press Ctrl+C to stop")

while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute
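
If a long-running Python process is overkill, the same job can be driven by system cron; the schedule and paths below are illustrative and should be adapted to your checkout:

```shell
# Add with: crontab -e
# Run the download script every Monday at 10:00, logging output to update.log
0 10 * * 1 cd /path/to/RaceData && /usr/bin/python3 download_datasets.py >> update.log 2>&1
```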

Upload to HuggingFace (Maintainers)

For maintainers who want to upload data to HuggingFace, use the upload_to_hf.py script:

Setup

# Install HuggingFace Hub library
pip install huggingface-hub

# Set environment variables
export HF_TOKEN="your_huggingface_token"
export HF_REPO_ID="username/dataset-name"

# Run upload script
python upload_to_hf.py
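
A small pre-flight check can confirm both environment variables are set before the script runs:

```python
import os

def missing_env(required=("HF_TOKEN", "HF_REPO_ID")):
    """Return the names of any required environment variables that are unset."""
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    for name in missing_env():
        print(f"Missing environment variable: {name}")
```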

Upload Function

From upload_to_hf.py:
def upload_to_huggingface(
    source_dir: Path, repo_id: str, token: str | None = None
) -> bool:
    """
    Upload datasets to HuggingFace Hub.

    Args:
        source_dir: Directory containing files to upload
        repo_id: HuggingFace repository ID (e.g., 'username/dataset-name')
        token: HuggingFace API token (if None, will use HF_TOKEN env var)

    Returns:
        True if successful, False otherwise
    """
    print(f"\n{'=' * 60}")
    print(f"Uploading to HuggingFace: {repo_id}")
    print(f"{'=' * 60}")

    # Check if token is provided
    if token is None:
        token = os.environ.get("HF_TOKEN")

    if not token:
        print("✗ No HuggingFace token provided. Skipping upload.")
        return False

    try:
        api = HfApi()

        # Check if dataset exists, create if not
        try:
            api.dataset_info(repo_id, token=token)
            print(f"✓ Dataset repository exists: {repo_id}")
        except Exception:
            print(f"  Creating new dataset repository: {repo_id}")
            api.create_repo(
                repo_id=repo_id, repo_type="dataset", token=token, exist_ok=True
            )

        # Upload all files from the data directory
        upload_folder(
            folder_path=str(source_dir),
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            commit_message=f"Update F1 datasets - {os.environ.get('COMMIT_DATE', 'manual update')}",
        )

        print(f"✓ Successfully uploaded to HuggingFace")
        print(f"  View at: https://huggingface.co/datasets/{repo_id}")
        return True

    except Exception as e:
        print(f"✗ Error uploading to HuggingFace: {e}")
        return False

Troubleshooting

Ensure your kaggle.json file is in the correct location:
# Check if file exists
ls -la ~/.kaggle/kaggle.json

# Verify permissions (should be 600)
chmod 600 ~/.kaggle/kaggle.json
If the download fails, try:
  1. Verify your Kaggle API credentials are valid
  2. Check your internet connection
  3. Ensure you’ve accepted the dataset license on Kaggle’s website
  4. Update kagglehub to the latest version: pip install --upgrade kagglehub
Make sure you have write permissions to the output directory:
# Create directory with proper permissions
mkdir -p ./data
chmod 755 ./data

Next Steps

  • Direct Download - Download the dataset as a zip file
  • HuggingFace Access - Use HuggingFace Datasets library
  • Quick Start - Start analyzing F1 data in minutes
  • Data Schema - Learn about the data structure
