Data Overview

The F1 ML Prediction System is built on 7 years of historical Formula 1 data (2018-2024) collected via the FastF1 API.

Lap Times

139,135 laps - Complete lap-by-lap telemetry data

Race Results

2,537 results - Finishing positions and points

Pit Stops

4,512 stops - Pit stop timing and duration

Weather Records

127 records - Race day weather conditions

Data Sources

Primary Data Source: FastF1 API

FastF1 is an unofficial, community-maintained Python library for accessing Formula 1 timing data, telemetry, and session information.
The data collection script uses FastF1 to gather:
  • Race results and qualifying positions
  • Lap-by-lap timing data
  • Pit stop information
  • Weather conditions per session
  • Driver and team metadata
Collection Script: source/src/data/f1_data_collector.py

Dataset Files

All data files are stored in CSV format for easy processing and analysis.

Raw Data Files

Location: data/raw/
2,537 records - Race finishing positions and championship points
Columns:
Year, Round, RaceName, Driver, DriverCode, Team, 
GridPosition, Position, Points, Status, Laps
Sample Data:
Year,Round,RaceName,Driver,DriverCode,Team,GridPosition,Position,Points,Status,Laps
2024,1,Bahrain Grand Prix,Max Verstappen,VER,Red Bull Racing,1,1,25,Finished,57
2024,1,Bahrain Grand Prix,Sergio Perez,PER,Red Bull Racing,2,2,18,Finished,57
Key Statistics:
  • Races: 127 (7 years × ~18 races/year)
  • Drivers: 40+ unique drivers
  • Teams: 12 teams across the period
  • DNF Rate: ~12% of all results
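As a rough illustration of how these statistics can be derived from the results file, the sketch below parses rows in the column layout shown above and computes a DNF rate. It uses the two sample rows plus one made-up retirement row (hypothetical, added only so the rate is non-zero). Treating statuses that start with "+" as lapped finishers is an assumption about the Status column, not something the dataset description confirms.

```python
import csv
import io

# Two sample rows from the section above, plus one hypothetical retirement row.
SAMPLE_CSV = """Year,Round,RaceName,Driver,DriverCode,Team,GridPosition,Position,Points,Status,Laps
2024,1,Bahrain Grand Prix,Max Verstappen,VER,Red Bull Racing,1,1,25,Finished,57
2024,1,Bahrain Grand Prix,Sergio Perez,PER,Red Bull Racing,2,2,18,Finished,57
2024,1,Bahrain Grand Prix,Example Driver,EXA,Example Team,20,20,0,Retired,12
"""

def load_results(handle):
    """Parse race results and convert the numeric columns from text."""
    rows = []
    for row in csv.DictReader(handle):
        for column in ("Year", "Round", "GridPosition", "Position", "Laps"):
            # int(float(...)) also accepts values stored as "1.0"
            row[column] = int(float(row[column]))
        row["Points"] = float(row["Points"])
        rows.append(row)
    return rows

def dnf_rate(rows):
    """Share of results that are not a classified finish."""
    # Assumption: lapped finishers carry a status such as "+1 Lap".
    finished = [
        r for r in rows
        if r["Status"] == "Finished" or r["Status"].startswith("+")
    ]
    return 1 - len(finished) / len(rows)

results = load_results(io.StringIO(SAMPLE_CSV))
print(f"{len(results)} results, DNF rate {dnf_rate(results):.0%}")
```

To run this against the real file, pass an open handle for the results CSV in data/raw/ instead of the in-memory sample.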

Processed Data Files

Location: data/processed/
Engineered features ready for machine learning:
Main feature dataset for the winner prediction model
Created by: source/feature_engineering.py
Features: 21 columns including:
  • Driver historical stats (wins, podiums, avg position)
  • Team performance metrics
  • Circuit-specific experience
  • Weather conditions
  • Grid position advantages
Records: Same as race_results (2,537) but with engineered features
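The driver historical stats listed above could be built along the following lines. This is a minimal sketch, not the contents of source/feature_engineering.py: the function name and the `prior_*` feature names are hypothetical. The point it illustrates is that each row should only see races that happened before it, so the features do not leak the result being predicted.

```python
from collections import defaultdict

def add_driver_history(results):
    """Attach each driver's career stats *before* the race to every result row.

    `results` is a list of dicts with Year, Round, DriverCode and Position keys.
    """
    history = defaultdict(
        lambda: {"starts": 0, "wins": 0, "podiums": 0, "position_sum": 0}
    )
    featured = []
    for row in sorted(results, key=lambda r: (r["Year"], r["Round"])):
        past = history[row["DriverCode"]]
        starts = past["starts"]
        featured.append({
            **row,
            "prior_wins": past["wins"],
            "prior_podiums": past["podiums"],
            # None marks a debut, where no average exists yet
            "prior_avg_position": past["position_sum"] / starts if starts else None,
        })
        # Update the running totals only after the features were recorded
        past["starts"] += 1
        past["wins"] += row["Position"] == 1
        past["podiums"] += row["Position"] <= 3
        past["position_sum"] += row["Position"]
    return featured
```

Team metrics and circuit-specific experience could follow the same pattern with a different grouping key, for example the team name or a (driver, race name) pair.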

Data Collection Process

1. Set Up FastF1 Cache

Configure local cache directory for faster data access:
import fastf1
fastf1.Cache.enable_cache('data/f1_cache/')
Location: data/f1_cache/ (automatically created)
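Depending on the FastF1 version, `enable_cache` may reject a directory that does not exist yet, so it can be safer to create the folder explicitly first. The helper below is a small sketch of that idea; the function name `enable_f1_cache` is hypothetical and not part of the project.

```python
from pathlib import Path

def enable_f1_cache(cache_dir="data/f1_cache"):
    """Create the cache directory if needed, then point FastF1 at it."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)  # no-op when it already exists
    try:
        import fastf1  # imported lazily so the helper also loads without FastF1
    except ImportError:
        return path  # directory is ready, caching is simply not enabled
    fastf1.Cache.enable_cache(str(path))
    return path
```

Calling `enable_f1_cache()` once at the top of the collection script keeps later `session.load()` calls from downloading the same session twice.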
2. Fetch Race Sessions

Collect data for each race from 2018-2024:
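for year in range(2018, 2025):
    # Season length varies, so take the round numbers from the event schedule
    schedule = fastf1.get_event_schedule(year, include_testing=False)
    for round_number in schedule['RoundNumber']:
        session = fastf1.get_session(year, int(round_number), 'R')
        session.load()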
Duration: 2-4 hours depending on internet speed
3. Extract & Transform

Process raw session data into structured CSV files:
  • Race results from session.results
  • Lap times from session.laps
  • Pit stops from the PitInTime and PitOutTime columns of session.laps
  • Weather from session.weather_data
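FastF1's lap table carries a PitInTime on the lap where a car enters the pits and a PitOutTime on the lap where it leaves, so one way to derive pit stops is to pair each in-lap with the driver's next lap. The sketch below shows that pairing on plain dicts with times in seconds, a simplified stand-in for the timedelta columns. The function name and the output keys are hypothetical.

```python
def extract_pit_stops(laps):
    """Pair each in-lap with the following out-lap of the same driver.

    `laps` is a list of dicts with Driver, LapNumber, PitInTime and PitOutTime,
    where the times are session seconds or None.
    """
    stops = []
    last_lap = {}
    for lap in sorted(laps, key=lambda l: (l["Driver"], l["LapNumber"])):
        previous = last_lap.get(lap["Driver"])
        if (
            previous is not None
            and previous["PitInTime"] is not None
            and lap["PitOutTime"] is not None
        ):
            stops.append({
                "Driver": lap["Driver"],
                "Lap": previous["LapNumber"],
                # Time between pit entry and pit exit
                "Duration": lap["PitOutTime"] - previous["PitInTime"],
            })
        last_lap[lap["Driver"]] = lap
    return stops
```

A stop where either timestamp is missing produces no row here; a real collector would more likely keep the stop and leave the duration empty.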
4. Save to CSV

Export all data to CSV format:
race_df.to_csv('data/raw/race_results.csv', index=False)
laps_df.to_csv('data/raw/lap_times.csv', index=False)
pits_df.to_csv('data/raw/pit_stops.csv', index=False)
weather_df.to_csv('data/raw/weather.csv', index=False)
Data Collection Notes:
  • First-time collection takes 2-4 hours
  • Requires stable internet connection
  • Some older races may have incomplete data
  • FastF1 API occasionally rate-limits requests
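Because a full collection run takes hours and requests are sometimes rate-limited, wrapping each download in a retry with exponential backoff is a common safeguard. The helper below is a generic sketch; `with_retries` and its parameters are hypothetical names, and nothing here is specific to FastF1.

```python
import time

def with_retries(action, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
```

In the collection loop this could be used as `with_retries(session.load)`. The `sleep` parameter exists so the waiting can be replaced in tests.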

Data Statistics

Dataset Completeness

All 127 races from 2018-2024 have complete results:
  • Total Results: 2,537
  • Finished Results: 2,233 (88%)
  • DNFs: 304 (12%)
  • Missing Data: 0%
Every race has full driver positions, points, and status information.
139,135 laps recorded with minor gaps:
  • Complete Laps: 136,357 (98%)
  • Missing Sector Times: 2,778 (2%)
  • Reasons: Data transmission issues, red flags
Gaps filled using interpolation or removed from training data.
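One plausible reading of "interpolation" here is linear interpolation between the nearest laps that do have a value. The sketch below applies that to a single driver's sector times in lap order; whether the project interpolates per driver, per stint, or in some other way is not stated, so treat the details as assumptions.

```python
def interpolate_gaps(values):
    """Fill None entries by linear interpolation between the nearest known values.

    Leading or trailing gaps have only one neighbour and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

print(interpolate_gaps([30.0, None, 31.0]))
```

Laps that still contain None after this step are the natural candidates for removal from the training data.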
4,512 pit stops with some duration data missing:
  • Complete Records: 4,286 (95%)
  • Missing Duration: 226 (5%)
  • Reasons: Penalty stops, drive-through penalties
Missing durations estimated using race average (~23s).
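A race-average estimate of this kind can be sketched as a per-race mean imputation. The row keys (`Year`, `Round`, `Duration`) and the extra `Estimated` flag are hypothetical; the flag is included so that estimated values remain distinguishable from measured ones downstream.

```python
from collections import defaultdict
from statistics import mean

def fill_missing_durations(stops):
    """Replace a missing Duration with the mean of the known stops in the same race."""
    known = defaultdict(list)
    for stop in stops:
        if stop["Duration"] is not None:
            known[(stop["Year"], stop["Round"])].append(stop["Duration"])
    filled = []
    for stop in stops:
        race = (stop["Year"], stop["Round"])
        duration = stop["Duration"]
        if duration is None and known[race]:
            duration = mean(known[race])
        # A race with no known durations at all keeps None
        filled.append({**stop, "Duration": duration,
                       "Estimated": stop["Duration"] is None})
    return filled
```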
127 race weather records with some sensor gaps:
  • Complete Weather: 114 races (90%)
  • Partial Data: 13 races (10%)
  • Missing Fields: Usually wind speed or pressure
Missing weather data filled with circuit historical averages.

Top Drivers (2018-2024)

Max Verstappen

52 wins | 78 podiums - Dominant 2021-2024 era

Lewis Hamilton

36 wins | 71 podiums - Mercedes dynasty 2018-2021

Charles Leclerc

6 wins | 32 podiums - Ferrari's lead driver

Top Teams (2018-2024)

  • Wins: 78
  • Podiums: 156
  • Championships: 3 (2021, 2022, 2023)
  • Average Position: 2.1

Data Quality & Validation

Automated data quality checks:
  • No duplicate records - Each driver × race combination is unique
  • Valid ranges - Lap times between 0-300 seconds
  • Position consistency - Finishing positions 1-20 per race
  • Points validation - Points match FIA rules (25-18-15-12-10-8-6-4-2-1)
  • Team continuity - Team names normalized across seasons
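Checks like these are straightforward to express as a function that returns a list of problems. The sketch below covers the duplicate, position, and points rules for race results. It is an illustration rather than the project's validator, and it deliberately ignores fastest-lap bonus points, sprint points, and half-points races, all of which a strict comparison against the 25-18-15-... table would flag.

```python
FIA_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]

def validate_results(rows):
    """Return a list of human-readable problems found in race result rows."""
    problems = []
    seen = set()
    for row in rows:
        key = (row["Year"], row["Round"], row["DriverCode"])
        if key in seen:
            problems.append(f"duplicate entry: {key}")
        seen.add(key)
        position = row["Position"]
        if not 1 <= position <= 20:
            problems.append(f"position out of range: {key}")
        # Only the top ten score; everyone else is expected to have zero points
        expected = FIA_POINTS[position - 1] if 1 <= position <= 10 else 0
        if row["Points"] != expected:
            problems.append(f"unexpected points: {key}")
    return problems
```

An empty list means the rows passed every check, which makes the function easy to use in an automated pipeline: fail the run whenever the list is non-empty.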

Tire Compound Statistics

  • Usage: 35% of stints
  • Degradation: +0.08 sec/lap average
  • Optimal Stint: 15-20 laps
  • Best Circuits: Monaco, Singapore, Hungary
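To put the degradation figure in context, the sketch below adds up the time lost over a whole stint under the simplest possible model: every lap is a fixed 0.08 s slower than the one before it. The linear model is an assumption made for illustration; real tire wear is usually not this regular.

```python
def stint_time_loss(stint_laps, degradation_per_lap=0.08):
    """Total seconds lost to tire wear over a stint, assuming linear degradation.

    Lap n of the stint (counting from 0) is degradation_per_lap * n slower than
    the first lap, so the total is the rate times 0 + 1 + ... + (stint_laps - 1).
    """
    return degradation_per_lap * stint_laps * (stint_laps - 1) / 2

for laps in (15, 20):
    print(f"{laps}-lap stint: {stint_time_loss(laps):.1f} s lost in total")
```

Under this model the cumulative loss grows with the square of the stint length, which is one reason a stint cannot be extended indefinitely before a pit stop becomes the faster option.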

Updating the Dataset

1. Recollect Recent Data

Update with latest 2024/2025 races:
python source/recollect_data.py
This incremental update is much faster than full collection.
2. Re-run Feature Engineering

Regenerate processed features:
python source/feature_engineering.py
3. Retrain Models

Update ML models with new data:
python train_all_models.py
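The three update steps can also be chained in one small driver script so that a failure in an early step stops the later ones. The sketch below assumes the script paths exactly as written in the steps above and that it is run from the project root. `run_pipeline` is a hypothetical helper, not an existing project file.

```python
import subprocess
import sys

# Script paths as given in the update steps
PIPELINE = [
    "source/recollect_data.py",
    "source/feature_engineering.py",
    "train_all_models.py",
]

def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order and stop at the first failure."""
    for script in scripts:
        # check=True raises CalledProcessError on a non-zero exit code
        runner([sys.executable, script], check=True)
```

Calling `run_pipeline()` then performs the full refresh with the same interpreter that runs the driver script.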

API Endpoints for Data

The Flask app serves data via REST API:
# Get driver comparison stats
GET /api/compare?driver1=VER&driver2=HAM

# Simulate race with conditions
GET /api/simulate_race?weather=DRY

# Get 2026 predictions
GET /api/season_2026

# Run lap-by-lap race
GET /api/lap_race?weather=DRY&circuit=STANDARD
Server: source/src/app.py or source/app.py
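A client for these endpoints might look like the following sketch. The base URL is an assumption (Flask's default development address), and the response is assumed to be JSON, which the endpoint list above does not confirm. `build_url` and `get_json` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumption: Flask's default development address

def build_url(path, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"

def get_json(path, **params):
    """Fetch an endpoint and decode the body, assuming it returns JSON."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)

print(build_url("/api/compare", driver1="VER", driver2="HAM"))
```

With the server running, `get_json("/api/simulate_race", weather="DRY")` would issue the second request from the list.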

Additional Resources

FastF1 Documentation

Official API documentation and examples

FIA Official Data

Formula 1 official timing and results

Data Collector Script

source/src/data/f1_data_collector.py

Feature Engineering

source/feature_engineering.py