
Overview

The data collection system uses the FastF1 API to gather comprehensive Formula 1 race data from multiple seasons. The primary collection script is collect_working.py, which reliably extracts race results and driver information.
FastF1 is a Python package that provides access to F1 timing data, telemetry, and race information from the official F1 API.

Data Sources

FastF1 API

The system leverages FastF1’s comprehensive data access:

Race Results

Final positions, points, and grid positions for all drivers

Driver Info

Full names, team assignments, and driver codes (e.g., VER, HAM)

Session Data

Lap times, pit stops, and sector times

Event Metadata

Event names, years, rounds, and circuit information

Data Coverage

import fastf1

for year in [2023, 2024]:  # Recent seasons with complete data
    schedule = fastf1.get_event_schedule(year)
    # Process each race in the season
The system focuses on the 2023 and 2024 seasons, which have:
  • Complete race results
  • Reliable position data
  • Full driver roster information
  • Minimal data quality issues

Collection Process

Step 1: Initialize Cache

import fastf1

fastf1.Cache.enable_cache('./data/f1_cache')
Caching is critical for performance. Without it, each API call can take 30 seconds or more. With it, repeated calls for the same session are served from disk in under 0.1 seconds.
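FastF1's documentation states that the cache path needs to exist before `enable_cache` is called. A minimal sketch, assuming hypothetical helper names `prepare_cache_dir` and `enable_cache` (neither is taken from collect_working.py), creates the directory first:

```python
from pathlib import Path


def prepare_cache_dir(path="./data/f1_cache"):
    """Create the cache directory if it is missing and return it as a Path."""
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return cache_dir


def enable_cache(path="./data/f1_cache"):
    """Enable FastF1 caching on a directory that is guaranteed to exist.

    fastf1 is imported lazily so prepare_cache_dir can be used on its own.
    """
    import fastf1

    fastf1.Cache.enable_cache(str(prepare_cache_dir(path)))
```

Calling `enable_cache()` once at the top of a collection script, before any other FastF1 call, would be the intended usage.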

Step 2: Fetch Event Schedule

For each season, retrieve the full race calendar:
schedule = fastf1.get_event_schedule(year)

for idx, event in schedule.iterrows():
    if event['EventFormat'] == 'testing':
        continue  # Skip pre-season testing
    
    round_num = event['RoundNumber']
    event_name = event['EventName']
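The filtering step can be separated from the API call so that it is easy to check in isolation. The sketch below is an illustration, not the script's own code: `iter_race_events` is a hypothetical helper, and the sample rows are made-up stand-ins for schedule rows (for example from `schedule.to_dict("records")`).

```python
def iter_race_events(events):
    """Yield (round_number, event_name) for every non-testing event.

    `events` is any iterable of mappings that use the schedule's column names.
    """
    for event in events:
        if event["EventFormat"] == "testing":
            continue  # skip pre-season testing
        yield int(event["RoundNumber"]), event["EventName"]


# Stand-in rows (illustrative values, not real API output)
sample = [
    {"RoundNumber": 0, "EventName": "Pre-Season Testing", "EventFormat": "testing"},
    {"RoundNumber": 1, "EventName": "Bahrain Grand Prix", "EventFormat": "conventional"},
]
races = list(iter_race_events(sample))
```

FastF1's `get_event_schedule` also documents an `include_testing` argument, which may make the manual check unnecessary.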

Step 3: Load Race Session

Load the race session (not practice or qualifying):
session = fastf1.get_session(year, round_num, 'R')  # 'R' = Race
session.load()  # Fetches all session data
Available session identifiers:
  • 'FP1', 'FP2', 'FP3': Free Practice sessions
  • 'Q': Qualifying
  • 'S': Sprint race
  • 'R': Main race (what we use)

Step 4: Extract Race Results

results = session.results
# Assumption: `drivers` holds the driver codes to look up; here it is taken from the results
drivers = results['Abbreviation'].tolist()

for driver_code in drivers:
    driver_result = results[results['Abbreviation'] == driver_code]

    if len(driver_result) > 0:
        row = driver_result.iloc[0]  # first matching row for this driver
        position = row['Position']
        grid = row['GridPosition']
        points = row['Points']
        team = row['TeamName']
        full_name = row['FullName']

Step 5: Data Validation

Critical validation step to ensure data quality:
# Only save if we have a valid position
if pd.notna(position):
    all_results.append({
        'Year': year,
        'Round': round_num,
        'EventName': event_name,
        'DriverCode': driver_code,
        'FullName': full_name,
        'TeamName': team,
        'Position': int(position),
        'GridPosition': int(grid) if pd.notna(grid) else 20,
        'Points': float(points) if pd.notna(points) else 0.0
    })
Rows with a missing position are excluded because the model needs a valid target value for every training example. A missing grid position defaults to 20 and missing points default to 0.0.
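The same rule can be written as a small pure function, which makes the defaults explicit. This is a sketch under assumptions: `is_missing` and `build_row` are hypothetical names, and `is_missing` is a standard-library stand-in for `pd.isna` on scalar values.

```python
import math


def is_missing(value):
    """True for None or NaN (stand-in for pd.isna on scalars)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_row(year, round_num, event_name, driver_code, full_name, team,
              position, grid, points):
    """Return a result dict, or None when the finishing position is missing."""
    if is_missing(position):
        return None  # no target value, so the row is dropped
    return {
        "Year": year,
        "Round": round_num,
        "EventName": event_name,
        "DriverCode": driver_code,
        "FullName": full_name,
        "TeamName": team,
        "Position": int(position),
        "GridPosition": 20 if is_missing(grid) else int(grid),
        "Points": 0.0 if is_missing(points) else float(points),
    }
```

Returning `None` instead of raising keeps the caller's loop simple: it appends the row only when one comes back.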

Data Schema

Raw Race Results

Stored in: data/raw/race_results.csv
| Column | Type | Description | Example |
|---|---|---|---|
| Year | int | Season year | 2024 |
| Round | int | Race number in season | 5 |
| EventName | str | Race name | "Monaco Grand Prix" |
| DriverCode | str | 3-letter driver code | "VER" |
| FullName | str | Driver full name | "Max Verstappen" |
| TeamName | str | Constructor team | "Red Bull" |
| Position | int | Final race position | 1 |
| GridPosition | int | Starting grid position | 1 |
| Points | float | Championship points | 25.0 |
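A reader of the CSV can use the schema to check the header and convert each field. The following is a minimal sketch using only the standard library; `SCHEMA` and `read_results` are hypothetical names, and it assumes integer columns are written without a decimal part (for example `1`, not `1.0`).

```python
import csv
import io

# Column name -> converter, following the schema listed for race_results.csv
SCHEMA = {
    "Year": int,
    "Round": int,
    "EventName": str,
    "DriverCode": str,
    "FullName": str,
    "TeamName": str,
    "Position": int,
    "GridPosition": int,
    "Points": float,
}


def read_results(handle):
    """Parse CSV rows and convert each field to the type listed in SCHEMA."""
    reader = csv.DictReader(handle)
    missing = set(SCHEMA) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{name: cast(row[name]) for name, cast in SCHEMA.items()} for row in reader]


# One row built from the schema's example values
sample_csv = io.StringIO(
    "Year,Round,EventName,DriverCode,FullName,TeamName,Position,GridPosition,Points\n"
    "2024,5,Monaco Grand Prix,VER,Max Verstappen,Red Bull,1,1,25.0\n"
)
rows = read_results(sample_csv)
```

With a real file, `read_results(open("data/raw/race_results.csv", newline=""))` would be the equivalent call.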

Additional Data Files

The system also collects (when available):
Lap Times (data/raw/lap_times.csv):
  • Individual lap times for each driver
  • Used for pace analysis
Pit Stops (data/raw/pit_stops.csv):
  • Pit stop timing and duration
  • Used for strategy analysis
Weather (data/raw/weather.csv):
  • Track temperature
  • Air temperature
  • Humidity
  • Rain conditions

Error Handling

Graceful Degradation

The collection script handles errors at multiple levels:
try:
    # Try to process race
    session = fastf1.get_session(year, round_num, 'R')
    session.load()
    # ... process data
    print(f"✓ ({len(drivers)} drivers)")
except Exception as e:
    print(f"✗ {str(e)[:40]}")
    # Continue to next race
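The per-race try/except pattern can be wrapped in a loop that also records which races failed. This is a sketch, not the script's actual structure: `collect_all` and `fake_process` are hypothetical names, and `fake_process` only simulates a race with unavailable data.

```python
def collect_all(races, process):
    """Run process(year, round_num) for each race, continuing past failures.

    Returns (results, failures) so that skipped races stay visible after the run.
    """
    results, failures = [], []
    for year, round_num in races:
        try:
            results.extend(process(year, round_num))
        except Exception as exc:  # broad on purpose: one bad race must not stop the run
            failures.append((year, round_num, str(exc)[:40]))
    return results, failures


def fake_process(year, round_num):
    """Simulated processing step: round 2 fails, every other round yields one row."""
    if round_num == 2:
        raise RuntimeError("session data not available")
    return [{"Year": year, "Round": round_num}]


results, failures = collect_all([(2024, 1), (2024, 2), (2024, 3)], fake_process)
```

Keeping the failures in a list, rather than only printing them, makes it possible to retry just those races later.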

Network Issues

Retries API calls with exponential backoff

Missing Data

Skips incomplete races, logs warnings

API Rate Limits

Uses caching to minimize API calls

Format Changes

Handles different data schemas gracefully
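The retry behavior listed under Network Issues is not shown in the snippets above. A generic sketch of exponential backoff, assuming a hypothetical helper `retry_with_backoff` and illustrative default values, could look like this:

```python
import time


def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, waiting base_delay * 2**attempt between tries.

    Re-raises the last exception once `retries` attempts have failed.
    `sleep` is injectable so the delays can be observed without waiting.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

A call such as `retry_with_backoff(session.load)` would then wait 1 second after the first failure and 2 seconds after the second before giving up on the third.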

Collection Statistics

Typical collection run (2023-2024 seasons):
Total results: 880
Valid positions: 880 (100%)
Years: [2023, 2024]
Races per season: ~22
Drivers per race: 20

Data Quality Checks

Before saving, the system validates:
print(f"Position not null: {df['Position'].notna().sum()}")
print(f"Position null: {df['Position'].isna().sum()}")


# Remove rows where Position is missing
features_df = features_df[features_df['Position'].notna()]

# Grid positions must be 1-20; a missing value defaults to 20
GridPosition = int(grid) if pd.notna(grid) else 20

# Inspect the distribution of finishing positions
print(df['Position'].value_counts().sort_index().head(10))
# When checking points against F1 scoring: P1 should have 25 points, P2 = 18, etc.
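The scoring check can be made explicit with a lookup of the standard race points. This sketch is an illustration rather than part of the collection script: `RACE_POINTS` and `points_plausible` are hypothetical names, and the extra point allowed for top-ten finishers reflects the fastest-lap bonus that applied in the 2023 and 2024 seasons.

```python
# Standard race points for the top ten finishers
RACE_POINTS = {1: 25, 2: 18, 3: 15, 4: 12, 5: 10, 6: 8, 7: 6, 8: 4, 9: 2, 10: 1}


def points_plausible(position, points):
    """True if `points` matches the standard award for `position`,
    or that award plus the 1-point fastest-lap bonus (top ten only)."""
    base = RACE_POINTS.get(position, 0)
    allowed = {float(base)}
    if position <= 10:
        allowed.add(float(base + 1))  # fastest-lap bonus
    return float(points) in allowed
```

Rows that fail the check are worth reviewing rather than dropping automatically, since races that award reduced points would also be flagged.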

Performance Optimization

Caching Strategy

fastf1.Cache.enable_cache('./data/f1_cache')
Impact:
  • First API call: ~30 seconds
  • Cached calls: Less than 0.1 seconds
  • Cache size: ~500 MB per season
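The cache size on a given machine can be checked directly. A minimal sketch, assuming a hypothetical helper `dir_size_mb` and the cache path used in this guide:

```python
from pathlib import Path


def dir_size_mb(path):
    """Total size of all files under `path`, in megabytes (10**6 bytes)."""
    total = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total / 1_000_000
```

For example, `print(f"{dir_size_mb('./data/f1_cache'):.0f} MB")` after a collection run shows how much disk space the cached seasons occupy.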

Batch Processing

Processes entire seasons in one run:
  • Reduces connection overhead
  • Better error recovery
  • Progress tracking

Selective Loading

if event['EventFormat'] == 'testing':
    continue  # Skip non-race events
Reduces unnecessary API calls by 20-30%.

Running Data Collection

Basic Usage

python collect_working.py

Expected Output

============================================================
F1 DATA COLLECTION - WORKING VERSION
============================================================

Year 2023...
  Round 1: Bahrain Grand Prix... ✓ (20 drivers)
  Round 2: Saudi Arabian Grand Prix... ✓ (20 drivers)
  Round 3: Australian Grand Prix... ✓ (20 drivers)
  ...

Year 2024...
  Round 1: Bahrain Grand Prix... ✓ (20 drivers)
  ...

============================================================
Total results: 880
Valid positions: 880
Years: [2023, 2024]
============================================================

✓ Saved to: data/raw/race_results_WORKING.csv

Output Files

After collection, files are stored in data/raw/:
  • race_results.csv or race_results_WORKING.csv
  • lap_times.csv (optional)
  • pit_stops.csv (optional)
  • weather.csv (optional)
The _WORKING suffix indicates files from the reliable collection script (collect_working.py).
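Because the results may be saved under either name, downstream code can look for both. The sketch below uses a hypothetical helper `find_results_file`; preferring the _WORKING file when both exist is an assumption, not documented behavior.

```python
from pathlib import Path


def find_results_file(raw_dir="data/raw"):
    """Return the first results file that exists, checking the _WORKING name first."""
    for name in ("race_results_WORKING.csv", "race_results.csv"):
        candidate = Path(raw_dir) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no race results file found in {raw_dir}")
```

Raising `FileNotFoundError` when neither file exists gives an early, clear signal that data collection has not been run yet.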

Next Steps

After data collection:
  1. Feature Engineering → Transform raw data into ML features (see Feature Engineering)
  2. Model Training → Train prediction models (see Models)
  3. Validation → Check data quality with debug_data.py

Troubleshooting

Cause: API connection issues or cache corruption
Solution:
rm -rf ./data/f1_cache
mkdir -p ./data/f1_cache  # FastF1 expects the cache directory to exist
python collect_working.py

Cause: Race session not completed or data not available
Solution: Check whether the race actually took place, or try a different year/round

Cause: Cache not enabled
Solution: Verify fastf1.Cache.enable_cache() is called before API calls
