The RaceData project includes Python scripts for automated downloading and updating of Formula 1 datasets. This guide shows you how to use these scripts for programmatic access.

Overview

The repository includes two main scripts for programmatic data access:
  • download_datasets.py - Downloads F1 datasets from Kaggle and creates consolidated archives
  • upload_to_hf.py - Uploads datasets to HuggingFace Hub (for maintainers)

Prerequisites

Step 1: Install Required Libraries

Install the necessary Python packages:
pip install kagglehub huggingface-hub
kagglehub is required for downloading from Kaggle, while huggingface-hub is needed for HuggingFace uploads.
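A quick way to confirm both packages are importable before running the scripts (note that the huggingface-hub package is imported as huggingface_hub):

```python
import importlib.util

def missing_packages(packages=("kagglehub", "huggingface_hub")):
    """Return the names of any packages that cannot be imported."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    for pkg in missing_packages():
        print(f"Missing: {pkg} - install it with pip")
```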

Step 2: Set Up Kaggle API Credentials

To download datasets from Kaggle, you need API credentials:
  1. Go to your Kaggle account settings
  2. Scroll to the “API” section
  3. Click “Create New Token” to download kaggle.json
  4. Place the file in ~/.kaggle/kaggle.json (Linux/Mac) or C:\Users\<Windows-username>\.kaggle\kaggle.json (Windows)
# Linux/Mac
mkdir -p ~/.kaggle
cp /path/to/downloaded/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
Keep your kaggle.json file secure and never commit it to version control.
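As a sanity check, this sketch verifies the credentials file exists and (on Linux/Mac) that its permissions are owner-only; the path argument is a convenience added here for testing, not part of any RaceData script:

```python
import stat
from pathlib import Path

def kaggle_credentials_ok(path=None):
    """Check that kaggle.json exists and is not readable by other users."""
    cred = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not cred.is_file():
        print(f"Missing credentials file: {cred}")
        return False
    # Any group/other permission bits set means the file is too open
    mode = stat.S_IMODE(cred.stat().st_mode)
    if mode & 0o077:
        print(f"Warning: {cred} has loose permissions ({oct(mode)}); run chmod 600")
    return True
```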

Using download_datasets.py

The download_datasets.py script automates the process of downloading Formula 1 datasets from Kaggle.

Basic Usage

# Clone the repository
git clone https://github.com/TracingInsights/RaceData.git
cd RaceData

# Run the download script
python download_datasets.py

How It Works

The script performs the following operations:

Step 1: Download from Kaggle

Downloads the latest versions of two Kaggle datasets:
  • jtrotman/formula-1-race-data
  • jtrotman/formula-1-race-events

Step 2: Copy to Data Directory

Copies all downloaded CSV files to the data/ directory in the repository.

Step 3: Create Zip Archive

Creates a consolidated data.zip file containing all datasets.
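The archive step can be sketched with the standard library (directory and archive names follow the repository layout described above; this is an illustration, not the script's exact code):

```python
import zipfile
from pathlib import Path

def create_archive(data_dir="data", archive_name="data.zip"):
    """Bundle every file under data_dir into a single compressed zip."""
    root = Path(data_dir)
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to data/ so the archive unpacks cleanly
                zf.write(path, path.relative_to(root))
    return archive_name
```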

Script Source Code

Here’s the core functionality from download_datasets.py:
def download_dataset(dataset_path: str, target_dir: Path) -> bool:
    """
    Download a Kaggle dataset and copy to target directory.

    Args:
        dataset_path: Kaggle dataset path (e.g., 'user/dataset-name')
        target_dir: Directory to copy files to

    Returns:
        True if successful, False otherwise
    """
    print(f"\n{'=' * 60}")
    print(f"Downloading dataset: {dataset_path}")
    print(f"{'=' * 60}")

    try:
        # Download using kagglehub (downloads to cache)
        download_path = kagglehub.dataset_download(dataset_path)
        print(f"✓ Downloaded to cache: {download_path}")

        # Copy files from cache to target directory
        source_path = Path(download_path)
        if not source_path.exists():
            print(f"✗ Error: Downloaded path does not exist: {source_path}")
            return False

        # Copy all files from the downloaded dataset
        files_copied = 0
        for file_path in source_path.rglob("*"):
            if file_path.is_file():
                # Preserve relative structure if needed, or flatten
                relative_path = file_path.relative_to(source_path)
                target_file = target_dir / relative_path

                # Create parent directories if needed
                target_file.parent.mkdir(parents=True, exist_ok=True)

                # Copy file
                shutil.copy2(file_path, target_file)
                print(f"  → Copied: {relative_path}")
                files_copied += 1

        print(f"✓ Copied {files_copied} file(s) from {dataset_path}")
        return True

    except Exception as e:
        print(f"✗ Error downloading {dataset_path}: {e}")
        return False
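
The script's entry point presumably calls download_dataset once per dataset and reports failures; a minimal sketch of that flow (the downloader is passed in as an argument here so the loop can be exercised without hitting Kaggle):

```python
from pathlib import Path

DATASETS = ["jtrotman/formula-1-race-data", "jtrotman/formula-1-race-events"]

def run_downloads(downloader, target_dir="data", datasets=DATASETS):
    """Invoke downloader for each dataset; return the paths that failed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [ds for ds in datasets if not downloader(ds, target)]
```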

Custom Integration

You can integrate the download functionality into your own Python projects:

Example: Custom Download Script

custom_download.py
import kagglehub
from pathlib import Path
import shutil

def download_f1_data(output_dir: str = "./f1_data"):
    """
    Download Formula 1 datasets to a custom directory.
    
    Args:
        output_dir: Directory to save the downloaded files
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download F1 race data
    print("Downloading Formula 1 race data...")
    race_data = kagglehub.dataset_download("jtrotman/formula-1-race-data")
    
    # Copy files to output directory
    source = Path(race_data)
    for file in source.glob("*.csv"):
        dest = output_path / file.name
        shutil.copy2(file, dest)
        print(f"Copied: {file.name}")
    
    print(f"\nData downloaded to: {output_path.absolute()}")

if __name__ == "__main__":
    download_f1_data()

Example: Selective Download

selective_download.py
import kagglehub
from pathlib import Path
import shutil
import pandas as pd

def download_specific_tables(tables: list[str], output_dir: str = "./f1_data"):
    """
    Download only specific F1 data tables.
    
    Args:
        tables: List of CSV filenames to download (e.g., ['drivers.csv', 'races.csv'])
        output_dir: Directory to save the files
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download the dataset
    print("Downloading Formula 1 datasets...")
    data_path = kagglehub.dataset_download("jtrotman/formula-1-race-data")
    source = Path(data_path)
    
    # Copy only specified tables
    for table in tables:
        src_file = source / table
        if src_file.exists():
            dest_file = output_path / table
            shutil.copy2(src_file, dest_file)
            print(f"✓ Copied: {table}")
            
            # Optionally load into pandas
            df = pd.read_csv(dest_file)
            print(f"  Rows: {len(df):,}")
        else:
            print(f"✗ Not found: {table}")

if __name__ == "__main__":
    # Download only drivers, races, and results
    download_specific_tables([
        'drivers.csv',
        'races.csv',
        'results.csv',
        'lap_times.csv'
    ])
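
Once the tables are on disk they can be combined with pandas; the join below assumes the Kaggle dataset's Ergast-style column names (raceId, year, name):

```python
import pandas as pd

def results_with_race_names(data_dir="./f1_data"):
    """Attach the race year and name to each result row via raceId."""
    races = pd.read_csv(f"{data_dir}/races.csv")
    results = pd.read_csv(f"{data_dir}/results.csv")
    # Left join keeps every result even if a race row is missing
    return results.merge(races[["raceId", "year", "name"]], on="raceId", how="left")
```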

Automated Updates

The RaceData repository uses GitHub Actions to automatically update the dataset within 3 hours after each race. You can implement similar automation:

Example: Scheduled Updates with the schedule Library

scheduled_update.py
import schedule
import time
from datetime import datetime
import subprocess

def update_f1_data():
    """
    Run the download script to update F1 data.
    """
    print(f"\n[{datetime.now()}] Starting F1 data update...")
    
    try:
        # Run the download script
        result = subprocess.run(
            ['python', 'download_datasets.py'],
            capture_output=True,
            text=True,
            check=True
        )
        print(result.stdout)
        print("✓ Update completed successfully")
    except subprocess.CalledProcessError as e:
        print(f"✗ Update failed: {e}")
        print(e.stderr)

# Schedule updates every Monday at 10:00 AM
schedule.every().monday.at("10:00").do(update_f1_data)

# Or schedule after every race (customize based on F1 calendar)
schedule.every().sunday.at("20:00").do(update_f1_data)  # After typical race time

print("F1 Data Update Scheduler Started")
print("Press Ctrl+C to stop")

while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute
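
If a long-running Python process is overkill, the same job can be driven by system cron; the schedule and paths below are illustrative and should be adapted to your checkout:

```shell
# Add with: crontab -e
# Run the download script every Monday at 10:00, logging output to update.log
0 10 * * 1 cd /path/to/RaceData && /usr/bin/python3 download_datasets.py >> update.log 2>&1
```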

Upload to HuggingFace (Maintainers)

For maintainers who want to upload data to HuggingFace, use the upload_to_hf.py script:

Setup

# Install HuggingFace Hub library
pip install huggingface-hub

# Set environment variables
export HF_TOKEN="your_huggingface_token"
export HF_REPO_ID="username/dataset-name"

# Run upload script
python upload_to_hf.py
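
A small pre-flight check can confirm both environment variables are set before the script runs:

```python
import os

def missing_env(required=("HF_TOKEN", "HF_REPO_ID")):
    """Return the names of any required environment variables that are unset."""
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    for name in missing_env():
        print(f"Missing environment variable: {name}")
```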

Upload Function

From upload_to_hf.py:
def upload_to_huggingface(
    source_dir: Path, repo_id: str, token: str | None = None
) -> bool:
    """
    Upload datasets to HuggingFace Hub.

    Args:
        source_dir: Directory containing files to upload
        repo_id: HuggingFace repository ID (e.g., 'username/dataset-name')
        token: HuggingFace API token (if None, will use HF_TOKEN env var)

    Returns:
        True if successful, False otherwise
    """
    print(f"\n{'=' * 60}")
    print(f"Uploading to HuggingFace: {repo_id}")
    print(f"{'=' * 60}")

    # Check if token is provided
    if token is None:
        token = os.environ.get("HF_TOKEN")

    if not token:
        print("✗ No HuggingFace token provided. Skipping upload.")
        return False

    try:
        api = HfApi()

        # Check if dataset exists, create if not
        try:
            api.dataset_info(repo_id, token=token)
            print(f"✓ Dataset repository exists: {repo_id}")
        except Exception:
            print(f"  Creating new dataset repository: {repo_id}")
            api.create_repo(
                repo_id=repo_id, repo_type="dataset", token=token, exist_ok=True
            )

        # Upload all files from the data directory
        upload_folder(
            folder_path=str(source_dir),
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            commit_message=f"Update F1 datasets - {os.environ.get('COMMIT_DATE', 'manual update')}",
        )

        print(f"✓ Successfully uploaded to HuggingFace")
        print(f"  View at: https://huggingface.co/datasets/{repo_id}")
        return True

    except Exception as e:
        print(f"✗ Error uploading to HuggingFace: {e}")
        return False

Troubleshooting

Ensure your kaggle.json file is in the correct location:
# Check if file exists
ls -la ~/.kaggle/kaggle.json

# Verify permissions (should be 600)
chmod 600 ~/.kaggle/kaggle.json
If the download fails, try:
  1. Verify your Kaggle API credentials are valid
  2. Check your internet connection
  3. Ensure you’ve accepted the dataset license on Kaggle’s website
  4. Update kagglehub to the latest version: pip install --upgrade kagglehub
Make sure you have write permissions to the output directory:
# Create directory with proper permissions
mkdir -p ./data
chmod 755 ./data

Next Steps

  • Direct Download - Download the dataset as a zip file
  • HuggingFace Access - Use HuggingFace Datasets library
  • Quick Start - Start analyzing F1 data in minutes
  • Data Schema - Learn about the data structure
