The RaceData Formula 1 dataset is available on HuggingFace Datasets, making it easy to load and use with popular data science libraries.

Overview

Access the dataset directly through the HuggingFace Datasets library without manual downloads. The dataset is automatically synchronized with the latest race data.

View on HuggingFace

Browse the dataset on HuggingFace Hub: tracinginsights/RaceData

Installation

Step 1: Install HuggingFace Datasets

Install the datasets library using pip:
pip install datasets
If you plan to convert tables to DataFrames, install pandas alongside it (the datasets package does not define a pandas extra, so install both explicitly):
pip install datasets pandas
Step 2: Verify Installation

Verify the installation by importing the library:
from datasets import load_dataset
print("HuggingFace Datasets installed successfully!")

Loading the Dataset

Basic Usage

Load the entire RaceData dataset with a single line of code:
from datasets import load_dataset

# Load the complete dataset
dataset = load_dataset("tracinginsights/RaceData")

# Access the data files
print(dataset)
The first time you load the dataset, it will be downloaded and cached locally. Subsequent loads will use the cached version.

Loading Specific Tables

Since the dataset consists of multiple CSV files, you can load specific tables:
from datasets import load_dataset

# Load specific data files
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files={
        "drivers": "drivers.csv",
        "races": "races.csv",
        "results": "results.csv"
    }
)

# Access individual tables (each entry is a datasets Dataset, not yet a DataFrame)
drivers = dataset["drivers"]
races = dataset["races"]
results = dataset["results"]

Convert to Pandas DataFrame

Easily convert HuggingFace datasets to pandas DataFrames for analysis:
from datasets import load_dataset
import pandas as pd

# Load a specific table
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="lap_times.csv",
    split="train"
)

# Convert to pandas DataFrame
lap_times_df = dataset.to_pandas()

# Now use standard pandas operations
print(lap_times_df.head())
print(lap_times_df.describe())
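Once the table is a DataFrame, all of pandas applies. As a self-contained sketch on toy data (the column names `driverId`, `lap`, and `milliseconds` are assumptions about the lap_times schema, not confirmed against the dataset), here is how you might find each driver's fastest lap:

```python
import pandas as pd

# Toy stand-in for lap_times_df; real column names may differ.
lap_times_df = pd.DataFrame({
    "driverId": [1, 1, 2, 2, 2],
    "lap": [1, 2, 1, 2, 3],
    "milliseconds": [92500, 91800, 93100, 91200, 91950],
})

# Fastest lap per driver: group by driver, take the minimum lap time
fastest = lap_times_df.groupby("driverId")["milliseconds"].min()
print(fastest)
```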

Usage Examples

from datasets import load_dataset
import pandas as pd

# Load multiple tables
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files={
        "drivers": "drivers.csv",
        "results": "results.csv",
        "races": "races.csv"
    }
)

# Convert to pandas DataFrames
drivers = dataset["drivers"].to_pandas()
results = dataset["results"].to_pandas()
races = dataset["races"].to_pandas()

# Example analysis: Most wins by driver
# Note: depending on the export, 'position' may be stored as a string ('1');
# adjust the comparison if the filter below returns no rows.
wins = results[results['position'] == 1].copy()
wins = wins.merge(drivers, on='driverId')
top_winners = (
    wins.groupby(['forename', 'surname'])
    .size()
    .sort_values(ascending=False)
    .head(10)
)

print("Top 10 Drivers by Race Wins:")
print(top_winners)
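The merge-then-aggregate pattern above works the same way on any pair of tables. The following self-contained version reproduces the logic on toy data (the columns `driverId`, `position`, `forename`, and `surname` mirror the example above but are assumptions about the real schema), so you can verify it offline:

```python
import pandas as pd

# Toy stand-ins for the real tables; schemas are assumptions.
results = pd.DataFrame({
    "driverId": [1, 2, 1, 1, 2],
    "position": [1, 2, 1, 2, 1],
})
drivers = pd.DataFrame({
    "driverId": [1, 2],
    "forename": ["Lewis", "Max"],
    "surname": ["Hamilton", "Verstappen"],
})

# Filter to wins, attach driver names, then count wins per driver
wins = results[results["position"] == 1].merge(drivers, on="driverId")
top_winners = wins.groupby(["forename", "surname"]).size().sort_values(ascending=False)
print(top_winners)
```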

Advanced Features

Caching and Performance

The HuggingFace Datasets library automatically caches downloaded data:
from datasets import load_dataset

# First load: downloads and caches data
dataset = load_dataset("tracinginsights/RaceData", data_files="races.csv")

# Subsequent loads: uses cached version (much faster)
dataset = load_dataset("tracinginsights/RaceData", data_files="races.csv")
To force a fresh download:
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="races.csv",
    download_mode="force_redownload"
)

Data Splits

By default, CSV files are loaded as a single “train” split:
from datasets import load_dataset

dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="drivers.csv"
)

# Access the default split
drivers = dataset["train"]

# Or specify the split explicitly
drivers = load_dataset(
    "tracinginsights/RaceData",
    data_files="drivers.csv",
    split="train"
)

Integration with Other Libraries

HuggingFace Datasets uses Apache Arrow under the hood for efficient data handling:
from datasets import load_dataset

dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="results.csv",
    split="train"
)

# Get the underlying Arrow table (a datasets Table wrapper around a pyarrow.Table)
arrow_table = dataset.data
print(type(arrow_table))

Troubleshooting

If you encounter a “Dataset not found” error, ensure you’re using the correct repository ID:
# Correct format
dataset = load_dataset("tracinginsights/RaceData")

# Not: "TracingInsights/RaceData" (case-sensitive)
For large files like lap_times.csv, use streaming mode:
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="lap_times.csv",
    streaming=True
)
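Streaming returns an IterableDataset that yields rows lazily instead of materializing the whole file. If you have already downloaded a CSV locally, pandas' chunked reader gives similar memory behavior; a self-contained sketch using a toy in-memory CSV (the `milliseconds` column is an assumed lap_times field):

```python
import io
import pandas as pd

# Toy CSV standing in for a large local lap_times.csv
csv_data = io.StringIO("driverId,milliseconds\n1,91800\n2,91200\n1,92500\n")

# Process the file in fixed-size chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["milliseconds"].sum()

print(total)  # 275500
```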
By default, datasets are cached in ~/.cache/huggingface/datasets/. To use a different location:
from datasets import load_dataset

dataset = load_dataset(
    "tracinginsights/RaceData",
    cache_dir="/path/to/custom/cache"
)

Benefits of HuggingFace Access

No Manual Downloads: load data directly in your code without managing files.

Automatic Caching: downloaded data is cached locally for fast subsequent access.

Streaming Support: process large datasets without loading everything into memory.

Easy Integration: works seamlessly with pandas, PyArrow, Polars, and more.

Next Steps

Direct Download: download the complete dataset as a zip file.

Programmatic Access: use Python scripts to automate data downloads.

Data Schema: learn about the structure of each table.

Quick Start: start analyzing F1 data in minutes.
