
Overview

The upload_to_hf.py script automates uploading Formula 1 datasets to the HuggingFace Hub, making the data accessible through HuggingFace’s dataset platform.

What It Does

  1. Validates required environment variables and authentication tokens
  2. Checks if the target HuggingFace dataset repository exists
  3. Creates a new repository if it doesn’t exist
  4. Uploads all files from the local data/ directory to HuggingFace
  5. Provides detailed console output for monitoring the upload process

Functions

upload_to_huggingface

Uploads datasets to HuggingFace Hub with automatic repository creation.
Parameters:
  • source_dir (Path, required): Directory containing files to upload
  • repo_id (str, required): HuggingFace repository ID in the format username/dataset-name (e.g., "racedata/formula-1")
  • token (str | None, default None): HuggingFace API token. If None, the function reads the HF_TOKEN environment variable

Returns:
  • bool: True if the upload succeeds, False otherwise

Example Usage

from pathlib import Path
from upload_to_hf import upload_to_huggingface

data_dir = Path("./data")
repo_id = "your-username/formula-1-dataset"
token = "hf_your_token_here"  # Or set HF_TOKEN env var

success = upload_to_huggingface(data_dir, repo_id, token)
if success:
    print(f"Dataset available at: https://huggingface.co/datasets/{repo_id}")

Implementation Details

import os
from pathlib import Path

from huggingface_hub import HfApi, upload_folder


def upload_to_huggingface(
    source_dir: Path, repo_id: str, token: str | None = None
) -> bool:
    # Fall back to the HF_TOKEN environment variable if no token is passed
    if token is None:
        token = os.environ.get("HF_TOKEN")
    
    if not token:
        print("✗ No HuggingFace token provided. Skipping upload.")
        return False
    
    api = HfApi()
    
    # Check if the dataset repository exists; create it if not
    try:
        api.dataset_info(repo_id, token=token)
        print(f"✓ Dataset repository exists: {repo_id}")
    except Exception:
        api.create_repo(
            repo_id=repo_id, repo_type="dataset", token=token, exist_ok=True
        )
    
    # Upload all files from the data directory in a single commit
    try:
        upload_folder(
            folder_path=str(source_dir),
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            commit_message=f"Update F1 datasets - {os.environ.get('COMMIT_DATE', 'manual update')}",
        )
    except Exception as e:
        print(f"✗ Error uploading to HuggingFace: {e}")
        return False
    
    print("✓ Successfully uploaded to HuggingFace")
    return True

main

Main execution function that orchestrates the upload process.
Returns:
  • int: 0 on success, 1 on failure

Workflow

  1. Validates that HF_REPO_ID environment variable is set
  2. Checks that the data/ directory exists and contains files
  3. Calls upload_to_huggingface() to perform the upload
  4. Reports success/failure status
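
The workflow above can be sketched as follows. This is a minimal illustration, not the script’s exact implementation: the stub stands in for the full upload_to_huggingface function shown under Implementation Details (which also accepts a token parameter), and the error messages mirror the ones documented under Error Handling.

```python
import os
from pathlib import Path


def upload_to_huggingface(source_dir: Path, repo_id: str) -> bool:
    # Placeholder for the real function shown under Implementation Details
    raise NotImplementedError


def main() -> int:
    # Step 1: the target repository must be named via HF_REPO_ID
    repo_id = os.environ.get("HF_REPO_ID")
    if not repo_id:
        print("✗ HF_REPO_ID environment variable not set. Exiting.")
        return 1

    # Step 2: the data/ directory must exist and contain at least one file
    data_dir = Path("./data")
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        print(f"✗ Data directory '{data_dir}' not found or is empty. Exiting.")
        return 1

    # Step 3: delegate the upload and map the bool result to an exit code
    return 0 if upload_to_huggingface(data_dir, repo_id) else 1
```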

Usage

Running the Script

export HF_REPO_ID="your-username/formula-1-dataset"
export HF_TOKEN="hf_your_token_here"
python upload_to_hf.py

Command Line Execution

The script can also be executed directly, provided the file has a shebang line and execute permission:
chmod +x upload_to_hf.py
./upload_to_hf.py

Required Environment Variables

The script requires two environment variables, and accepts a third optional one:

HF_TOKEN

Your HuggingFace API token for authentication (string, required).
How to get your token:
  1. Create a HuggingFace account at huggingface.co
  2. Go to Settings → Access Tokens
  3. Click “New token” and create a token with write permissions
  4. Copy the token (starts with hf_)
Setting the token:
export HF_TOKEN="hf_your_token_here"
Or in GitHub Actions:
env:
  HF_TOKEN: ${{ secrets.HF_TOKEN }}
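
Since user access tokens conventionally start with hf_, a lightweight sanity check can catch copy-paste mistakes before any network call is made. looks_like_hf_token is a hypothetical helper for illustration, not part of the script:

```python
def looks_like_hf_token(token: str) -> bool:
    # HuggingFace user access tokens conventionally start with "hf_";
    # this is a quick format check, not real validation against the Hub
    return token.startswith("hf_") and len(token) > 3
```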

HF_REPO_ID

The HuggingFace repository ID where datasets will be uploaded (string, required).
Format: username/dataset-name or organization/dataset-name
Example:
export HF_REPO_ID="racedata/formula-1"

COMMIT_DATE (Optional)

Used in the commit message to track when the upload occurred (string, optional). Defaults to ‘manual update’ if not set.
export COMMIT_DATE="$(date +'%Y-%m-%d')"
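
Within the script, COMMIT_DATE only feeds the commit message passed to upload_folder. The fallback behaviour can be sketched as follows (build_commit_message is a hypothetical helper; the actual script inlines the f-string):

```python
import os


def build_commit_message() -> str:
    # COMMIT_DATE tags the upload; 'manual update' is the documented fallback
    return f"Update F1 datasets - {os.environ.get('COMMIT_DATE', 'manual update')}"
```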

Configuration

Repository Structure

The script expects the following directory structure:
project/
├── data/                    # Source directory
│   ├── circuits.csv
│   ├── constructors.csv
│   ├── drivers.csv
│   └── ...
└── upload_to_hf.py

HuggingFace Repository

The script will:
  • Create a new dataset repository if it doesn’t exist
  • Update an existing repository with new files
  • Preserve existing repository files unless a file of the same name overwrites them (upload_folder accepts a delete_patterns argument if stale files should be removed)

Dependencies

From pyproject.toml:
dependencies = [
    "kagglehub==0.3.13",
    "huggingface-hub==0.35.3",
]

Required Packages

  • huggingface-hub (0.35.3) - HuggingFace Hub API client
    • HfApi - API client for repository management
    • upload_folder - Bulk file upload functionality
  • pathlib - File path operations (Python standard library)
  • os - Environment variable access (Python standard library)

GitHub Actions Integration

This script is designed to work in GitHub Actions workflows:
- name: Upload to HuggingFace
  env:
    HF_TOKEN: ${{ secrets.HF_TOKEN }}
    HF_REPO_ID: ${{ vars.HF_REPO_ID }}
    COMMIT_DATE: ${{ github.event.head_commit.timestamp }}
  run: python upload_to_hf.py

Setting Up Secrets

  1. Go to your GitHub repository → Settings → Secrets and variables → Actions
  2. Add a new repository secret named HF_TOKEN
  3. Add a new repository variable named HF_REPO_ID

Error Handling

The script includes comprehensive error handling:

Missing Token

✗ No HuggingFace token provided. Skipping upload.
  Set HF_TOKEN environment variable or pass token parameter.

Missing Repository ID

✗ HF_REPO_ID environment variable not set. Exiting.

Empty Data Directory

✗ Data directory './data' not found or is empty. Exiting.

Upload Failure

✗ Error uploading to HuggingFace: [error details]

Console Output

The script provides detailed progress information:
Formula 1 Dataset Upload Script
============================================================
Target HuggingFace repo: racedata/formula-1
Data directory: /path/to/data

============================================================
Uploading to HuggingFace: racedata/formula-1
============================================================
✓ Dataset repository exists: racedata/formula-1
  Uploading files from /path/to/data...
✓ Successfully uploaded to HuggingFace
  View at: https://huggingface.co/datasets/racedata/formula-1

============================================================
✓ Upload completed successfully!
============================================================

Example Workflow

Here’s a complete example combining both scripts:
name: Update F1 Dataset

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  update-dataset:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11.14'
      
      - name: Install dependencies
        run: pip install kagglehub==0.3.13 huggingface-hub==0.35.3
      
      - name: Download datasets from Kaggle
        env:
          KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
          KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
        run: python download_datasets.py
      
      - name: Upload to HuggingFace
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          HF_REPO_ID: ${{ vars.HF_REPO_ID }}
          COMMIT_DATE: ${{ github.event.head_commit.timestamp }}
        run: python upload_to_hf.py

Accessing Your Dataset

After successful upload, your dataset will be available at:
https://huggingface.co/datasets/{your-username}/{dataset-name}
Users can then load your dataset using:
from datasets import load_dataset

dataset = load_dataset("your-username/dataset-name")
