The RaceData Formula 1 dataset is available on HuggingFace Datasets, making it easy to load and use with popular data science libraries.

Overview

Access the dataset directly through the HuggingFace Datasets library without manual downloads. The dataset is automatically synchronized with the latest race data.

View on HuggingFace

Browse the dataset on HuggingFace Hub: tracinginsights/RaceData

Installation

Step 1: Install HuggingFace Datasets

Install the datasets library using pip:
pip install datasets
If you plan to convert tables to DataFrames, install pandas alongside it (the datasets package does not define a pandas extra, so install both explicitly):
pip install datasets pandas
Step 2: Verify Installation

Verify the installation by importing the library:
from datasets import load_dataset
print("HuggingFace Datasets installed successfully!")

Loading the Dataset

Basic Usage

Load the entire RaceData dataset with a single line of code:
from datasets import load_dataset

# Load the complete dataset
dataset = load_dataset("tracinginsights/RaceData")

# Access the data files
print(dataset)
The first time you load the dataset, it will be downloaded and cached locally. Subsequent loads will use the cached version.

Loading Specific Tables

Since the dataset consists of multiple CSV files, you can load specific tables:
from datasets import load_dataset

# Load specific data files
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files={
        "drivers": "drivers.csv",
        "races": "races.csv",
        "results": "results.csv"
    }
)

# Access individual tables (each entry is a datasets Dataset, not yet a DataFrame)
drivers = dataset["drivers"]
races = dataset["races"]
results = dataset["results"]

Convert to Pandas DataFrame

Easily convert HuggingFace datasets to pandas DataFrames for analysis:
from datasets import load_dataset
import pandas as pd

# Load a specific table
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="lap_times.csv",
    split="train"
)

# Convert to pandas DataFrame
lap_times_df = dataset.to_pandas()

# Now use standard pandas operations
print(lap_times_df.head())
print(lap_times_df.describe())
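Once the table is a DataFrame, all of pandas applies. As a self-contained sketch on toy data (the column names `driverId`, `lap`, and `milliseconds` are assumptions about the lap_times schema, not confirmed against the dataset), here is how you might find each driver's fastest lap:

```python
import pandas as pd

# Toy stand-in for lap_times_df; real column names may differ.
lap_times_df = pd.DataFrame({
    "driverId": [1, 1, 2, 2, 2],
    "lap": [1, 2, 1, 2, 3],
    "milliseconds": [92500, 91800, 93100, 91200, 91950],
})

# Fastest lap per driver: group by driver, take the minimum lap time
fastest = lap_times_df.groupby("driverId")["milliseconds"].min()
print(fastest)
```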

Usage Examples

from datasets import load_dataset
import pandas as pd

# Load multiple tables
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files={
        "drivers": "drivers.csv",
        "results": "results.csv",
        "races": "races.csv"
    }
)

# Convert to pandas DataFrames
drivers = dataset["drivers"].to_pandas()
results = dataset["results"].to_pandas()
races = dataset["races"].to_pandas()

# Example analysis: Most wins by driver
# Note: depending on the export, 'position' may be stored as a string ('1');
# adjust the comparison if the filter below returns no rows.
wins = results[results['position'] == 1].copy()
wins = wins.merge(drivers, on='driverId')
top_winners = (
    wins.groupby(['forename', 'surname'])
    .size()
    .sort_values(ascending=False)
    .head(10)
)

print("Top 10 Drivers by Race Wins:")
print(top_winners)
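The merge-then-aggregate pattern above works the same way on any pair of tables. The following self-contained version reproduces the logic on toy data (the columns `driverId`, `position`, `forename`, and `surname` mirror the example above but are assumptions about the real schema), so you can verify it offline:

```python
import pandas as pd

# Toy stand-ins for the real tables; schemas are assumptions.
results = pd.DataFrame({
    "driverId": [1, 2, 1, 1, 2],
    "position": [1, 2, 1, 2, 1],
})
drivers = pd.DataFrame({
    "driverId": [1, 2],
    "forename": ["Lewis", "Max"],
    "surname": ["Hamilton", "Verstappen"],
})

# Filter to wins, attach driver names, then count wins per driver
wins = results[results["position"] == 1].merge(drivers, on="driverId")
top_winners = wins.groupby(["forename", "surname"]).size().sort_values(ascending=False)
print(top_winners)
```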

Advanced Features

Caching and Performance

The HuggingFace Datasets library automatically caches downloaded data:
from datasets import load_dataset

# First load: downloads and caches data
dataset = load_dataset("tracinginsights/RaceData", data_files="races.csv")

# Subsequent loads: uses cached version (much faster)
dataset = load_dataset("tracinginsights/RaceData", data_files="races.csv")
To force a fresh download:
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="races.csv",
    download_mode="force_redownload"
)

Data Splits

By default, CSV files are loaded as a single “train” split:
from datasets import load_dataset

dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="drivers.csv"
)

# Access the default split
drivers = dataset["train"]

# Or specify the split explicitly
drivers = load_dataset(
    "tracinginsights/RaceData",
    data_files="drivers.csv",
    split="train"
)

Integration with Other Libraries

HuggingFace Datasets uses Apache Arrow under the hood for efficient data handling:
from datasets import load_dataset

dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="results.csv",
    split="train"
)

# Get the underlying Arrow table (a datasets Table wrapper around a pyarrow.Table)
arrow_table = dataset.data
print(type(arrow_table))

Troubleshooting

If you encounter a “Dataset not found” error, ensure you’re using the correct repository ID:
# Correct format
dataset = load_dataset("tracinginsights/RaceData")

# Not: "TracingInsights/RaceData" (case-sensitive)
For large files like lap_times.csv, use streaming mode:
dataset = load_dataset(
    "tracinginsights/RaceData",
    data_files="lap_times.csv",
    streaming=True
)
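Streaming returns an IterableDataset that yields rows lazily instead of materializing the whole file. If you have already downloaded a CSV locally, pandas' chunked reader gives similar memory behavior; a self-contained sketch using a toy in-memory CSV (the `milliseconds` column is an assumed lap_times field):

```python
import io
import pandas as pd

# Toy CSV standing in for a large local lap_times.csv
csv_data = io.StringIO("driverId,milliseconds\n1,91800\n2,91200\n1,92500\n")

# Process the file in fixed-size chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["milliseconds"].sum()

print(total)  # 275500
```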
By default, datasets are cached in ~/.cache/huggingface/datasets/. To use a different location:
from datasets import load_dataset

dataset = load_dataset(
    "tracinginsights/RaceData",
    cache_dir="/path/to/custom/cache"
)

Benefits of HuggingFace Access

No Manual Downloads: load data directly in your code without managing files.

Automatic Caching: downloaded data is cached locally for fast subsequent access.

Streaming Support: process large datasets without loading everything into memory.

Easy Integration: works seamlessly with pandas, PyArrow, Polars, and more.

Next Steps

Direct Download: download the complete dataset as a zip file.

Programmatic Access: use Python scripts to automate data downloads.

Data Schema: learn about the structure of each table.

Quick Start: start analyzing F1 data in minutes.
