Data Collection

Overview

The data collection system uses the FastF1 API to gather comprehensive Formula 1 race data from multiple seasons. The primary collection script is collect_working.py, which reliably extracts race results and driver information.

FastF1 is a Python package that provides access to F1 lap timing, telemetry, session results, and event schedules, drawing mainly on the official F1 live timing data.

Data Sources

FastF1 API

The system leverages FastF1’s comprehensive data access:

Race Results

Final positions, points, and grid positions for all drivers

Driver Info

Full names, team assignments, and driver codes (e.g., VER, HAM)

Session Data

Lap times, pit stops, and sector times

Event Metadata

Event names, years, rounds, and circuit information

Data Coverage

for year in [2023, 2024]:  # Recent seasons with complete data
    schedule = fastf1.get_event_schedule(year)
    # Process each race in the season

The system focuses on the 2023 and 2024 seasons, which have:

Complete race results
Reliable position data
Full driver roster information
Minimal data quality issues

Collection Process

Step 1: Initialize Cache

import fastf1

fastf1.Cache.enable_cache('./data/f1_cache')

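Caching is critical for performance. Without it, loading a session can take 30 seconds or more, because the data is downloaded again on every run. With caching enabled, repeat loads of the same session are read from disk and typically finish in under 0.1 seconds.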
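A minimal sketch of a cache setup helper, assuming that enable_cache may reject a directory that does not exist yet. The helper names ensure_cache_dir and enable_fastf1_cache are hypothetical and are not part of collect_working.py or FastF1.

```python
from pathlib import Path


def ensure_cache_dir(path: str) -> Path:
    """Create the cache directory (and any parents) if it is missing."""
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


def enable_fastf1_cache(path: str = "./data/f1_cache") -> Path:
    """Create the directory first, then point FastF1's cache at it."""
    import fastf1  # imported lazily so ensure_cache_dir has no dependencies

    cache_dir = ensure_cache_dir(path)
    fastf1.Cache.enable_cache(str(cache_dir))
    return cache_dir
```

Calling enable_fastf1_cache() once at the top of the script, before any session is loaded, keeps the setup in one place.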

Step 2: Fetch Event Schedule

For each season, retrieve the full race calendar:

schedule = fastf1.get_event_schedule(year)

for idx, event in schedule.iterrows():
    if event['EventFormat'] == 'testing':
        continue  # Skip pre-season testing
    
    round_num = event['RoundNumber']
    event_name = event['EventName']
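The filtering step above can be sketched as a small generator. This is an illustration only: iter_race_events is a hypothetical helper, and the sample rows are stand-ins with the same three keys the loop reads. It accepts any row object that supports key access, so pandas rows should work as well as plain dicts.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple


def iter_race_events(events: Iterable[Mapping[str, Any]]) -> Iterator[Tuple[int, str]]:
    """Yield (round_number, event_name) for every non-testing event."""
    for event in events:
        if event["EventFormat"] == "testing":
            continue  # pre-season testing has no race session
        yield int(event["RoundNumber"]), str(event["EventName"])


# Stand-in rows; with FastF1 you would pass
# (event for _, event in schedule.iterrows()) instead.
sample_schedule = [
    {"RoundNumber": 0, "EventName": "Pre-Season Testing", "EventFormat": "testing"},
    {"RoundNumber": 1, "EventName": "Bahrain Grand Prix", "EventFormat": "conventional"},
    {"RoundNumber": 2, "EventName": "Saudi Arabian Grand Prix", "EventFormat": "conventional"},
]

race_events = list(iter_race_events(sample_schedule))
print(race_events)
# [(1, 'Bahrain Grand Prix'), (2, 'Saudi Arabian Grand Prix')]
```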

Step 3: Load Race Session

Load the race session (not practice or qualifying):

session = fastf1.get_session(year, round_num, 'R')  # 'R' = Race
session.load()  # Fetches all session data

Session Types

‘FP1’, ‘FP2’, ‘FP3’: Free Practice sessions
‘Q’: Qualifying
‘S’: Sprint race
‘R’: Main race (what we use)

Step 4: Extract Race Results

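results = session.results  # DataFrame with one row per driver

# Assumed: the original snippet does not show how `drivers` is built.
# Here it is the list of driver codes in the results, e.g. ['VER', 'HAM', ...]
drivers = results['Abbreviation'].tolist()

for driver_code in drivers:
    driver_result = results[results['Abbreviation'] == driver_code]

    if len(driver_result) > 0:
        row = driver_result.iloc[0]  # first (and normally only) match
        position = row['Position']   # may be NaN when data is missing
        grid = row['GridPosition']
        points = row['Points']
        team = row['TeamName']
        full_name = row['FullName']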

Step 5: Data Validation

Critical validation step to ensure data quality:

import pandas as pd

# Only save if we have a valid position
if pd.notna(position):
    all_results.append({
        'Year': year,
        'Round': round_num,
        'EventName': event_name,
        'DriverCode': driver_code,
        'FullName': full_name,
        'TeamName': team,
        'Position': int(position),
        # Missing grid slot: fall back to 20, the back of a 20-car grid
        'GridPosition': int(grid) if pd.notna(grid) else 20,
        # Missing points are treated as a non-scoring finish
        'Points': float(points) if pd.notna(points) else 0.0
    })

Rows with missing positions are excluded because the model requires a valid target value for every row; a missing Position would cause problems during training.
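The same rule can be written as a standalone function, which makes it easy to test without pandas. This is a sketch under assumptions: build_record and is_missing are hypothetical names, and is_missing only covers None and float NaN, which is narrower than pd.notna.

```python
import math
from typing import Any, Dict, Optional


def is_missing(value: Any) -> bool:
    """True for None and float NaN, a simplified stand-in for pd.isna."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_record(year: int, round_num: int, event_name: str, driver_code: str,
                 full_name: str, team: str, position: Any, grid: Any,
                 points: Any, default_grid: int = 20) -> Optional[Dict[str, Any]]:
    """Return a result row, or None when the position (the target) is missing."""
    if is_missing(position):
        return None
    return {
        "Year": year,
        "Round": round_num,
        "EventName": event_name,
        "DriverCode": driver_code,
        "FullName": full_name,
        "TeamName": team,
        "Position": int(position),
        "GridPosition": default_grid if is_missing(grid) else int(grid),
        "Points": 0.0 if is_missing(points) else float(points),
    }
```

Returning None instead of raising lets the caller skip the row with a simple check and keep processing the rest of the race.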

Data Schema

Raw Race Results

Stored in: data/raw/race_results.csv (or data/raw/race_results_WORKING.csv when produced by collect_working.py)

Column	Type	Description	Example
Year	int	Season year	2024
Round	int	Race number in season	5
EventName	str	Race name	"Monaco Grand Prix"
DriverCode	str	3-letter driver code	"VER"
FullName	str	Driver full name	"Max Verstappen"
TeamName	str	Constructor team	"Red Bull"
Position	int	Final race position	1
GridPosition	int	Starting grid position	1
Points	float	Championship points	25.0
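As an illustration of the schema, the sketch below reads one CSV row and casts each column to the type listed in the table. SCHEMA and parse_row are hypothetical names, and the sample row simply reuses the example values from the table rather than a real race result.

```python
import csv
import io

# Column name -> Python type, mirroring the table above
SCHEMA = {
    "Year": int,
    "Round": int,
    "EventName": str,
    "DriverCode": str,
    "FullName": str,
    "TeamName": str,
    "Position": int,
    "GridPosition": int,
    "Points": float,
}


def parse_row(row: dict) -> dict:
    """Cast every schema column; raise ValueError if a column is absent."""
    missing = [col for col in SCHEMA if col not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {col: cast(row[col]) for col, cast in SCHEMA.items()}


sample_csv = (
    "Year,Round,EventName,DriverCode,FullName,TeamName,Position,GridPosition,Points\n"
    "2024,5,Monaco Grand Prix,VER,Max Verstappen,Red Bull,1,1,25.0\n"
)
rows = [parse_row(r) for r in csv.DictReader(io.StringIO(sample_csv))]
print(rows[0]["Position"], rows[0]["Points"])
# 1 25.0
```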

Additional Data Files

The system also collects (when available):

Lap Times (data/raw/lap_times.csv):

Individual lap times for each driver
Used for pace analysis

Pit Stops (data/raw/pit_stops.csv):

Pit stop timing and duration
Used for strategy analysis

Weather (data/raw/weather.csv):

Track temperature
Air temperature
Humidity
Rain conditions

Error Handling

Graceful Degradation

The collection script handles errors at multiple levels:

try:
    # Try to process race
    session = fastf1.get_session(year, round_num, 'R')
    session.load()
    # ... process data
    print(f"✓ ({len(drivers)} drivers)")
except Exception as e:
    print(f"✗ {str(e)[:40]}")
    # Continue to next race

Network Issues

Retries API calls with exponential backoff

Missing Data

Skips incomplete races, logs warnings

API Rate Limits

Uses caching to minimize API calls

Format Changes

Handles different data schemas gracefully
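The Network Issues item above mentions retries with exponential backoff. The source does not show how the script implements this, so the following is only a minimal sketch of the technique; retry_with_backoff and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any exception with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # delay doubles each time
    raise ValueError("max_attempts must be at least 1")
```

A session load could then be wrapped in a zero-argument callable and passed to retry_with_backoff. The sleep parameter exists so that tests can replace the real delay with a recorder.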

Collection Statistics

Typical collection run (2023-2024 seasons):

Total results: 880
Valid positions: 880 (100%)
Years: [2023, 2024]
Races per season: ~22
Drivers per race: 20
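Statistics like these can be derived from the collected rows. The sketch below is one way to do it with the standard library; summarize is a hypothetical helper, and it assumes each row is a dict with the Year, Round, and Position keys from the schema.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count results, valid positions, and distinct races per season."""
    races = {(r["Year"], r["Round"]) for r in records}   # unique races
    per_season = Counter(year for year, _ in races)      # races per year
    valid = sum(1 for r in records if r.get("Position") is not None)
    return {
        "total_results": len(records),
        "valid_positions": valid,
        "years": sorted(per_season),
        "races_per_season": dict(per_season),
    }
```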

Data Quality Checks

Before saving, the system validates:

Position Validation

print(f"Position not null: {df['Position'].notna().sum()}")
print(f"Position null: {df['Position'].isna().sum()}")

# Remove rows where Position is missing
df = df[df['Position'].notna()]

Grid Position Bounds

# Grid positions are expected to be 1-20; a missing value defaults to 20
GridPosition = int(grid) if pd.notna(grid) else 20

Points Distribution

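# Verify points follow F1 scoring: show the min and max points per finishing position
print(df.groupby('Position')['Points'].agg(['min', 'max']).head(10))
# P1 should have 25 points, P2 = 18, etc. (one more if the driver also set the fastest lap)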
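The points check can also be made explicit. The sketch below encodes the standard race scoring table for the top ten and allows one extra point for the fastest-lap bonus that applied in the 2023 and 2024 seasons. points_are_plausible is a hypothetical helper, and it deliberately ignores reduced points for shortened races.

```python
# Standard race points for finishing positions 1-10
RACE_POINTS = {1: 25, 2: 18, 3: 15, 4: 12, 5: 10, 6: 8, 7: 6, 8: 4, 9: 2, 10: 1}
FASTEST_LAP_BONUS = 1  # only available to drivers finishing in the top ten


def points_are_plausible(position: int, points: float) -> bool:
    """True if points match the table, with or without the fastest-lap bonus."""
    base = RACE_POINTS.get(position, 0)  # positions outside the top ten score 0
    allowed = {float(base)}
    if position in RACE_POINTS:
        allowed.add(float(base + FASTEST_LAP_BONUS))
    return float(points) in allowed
```

Rows that fail this check are worth inspecting by hand rather than dropping automatically, since the helper does not model every scoring rule.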

Performance Optimization

Caching Strategy

fastf1.Cache.enable_cache('./data/f1_cache')

Impact:

First API call: ~30 seconds
Cached calls: Less than 0.1 seconds
Cache size: ~500 MB per season

Batch Processing

Processes entire seasons in one run:

Reduces connection overhead
Better error recovery
Progress tracking
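A rough sketch of such a batch loop follows, with per-race error isolation and a progress line for each race. collect_batch and the load_results callback are hypothetical; in the real script the callback would wrap the FastF1 session loading shown earlier.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Race = Tuple[int, int]  # (year, round number)


def collect_batch(
    races: Iterable[Race],
    load_results: Callable[[int, int], List[Dict]],
) -> Tuple[List[Dict], List[Tuple[int, int, str]]]:
    """Load every race; a failure is recorded and the loop moves on."""
    all_results: List[Dict] = []
    failures: List[Tuple[int, int, str]] = []
    for year, round_num in races:
        try:
            rows = load_results(year, round_num)
        except Exception as exc:
            message = str(exc)[:40]  # keep the log line short
            failures.append((year, round_num, message))
            print(f"  {year} round {round_num}: failed ({message})")
            continue
        all_results.extend(rows)
        print(f"  {year} round {round_num}: ok ({len(rows)} drivers)")
    return all_results, failures
```

Returning the failures alongside the results lets a later pass retry only the races that did not load.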

Selective Loading

if event['EventFormat'] == 'testing':
    continue  # Skip non-race events

Reduces unnecessary API calls by 20-30%.

Running Data Collection

Basic Usage

python collect_working.py

Expected Output

============================================================
F1 DATA COLLECTION - WORKING VERSION
============================================================

Year 2023...
  Round 1: Bahrain Grand Prix... ✓ (20 drivers)
  Round 2: Saudi Arabian Grand Prix... ✓ (20 drivers)
  Round 3: Australian Grand Prix... ✓ (20 drivers)
  ...

Year 2024...
  Round 1: Bahrain Grand Prix... ✓ (20 drivers)
  ...

============================================================
Total results: 880
Valid positions: 880
Years: [2023, 2024]
============================================================

✓ Saved to: data/raw/race_results_WORKING.csv

Output Files

After collection, files are stored in data/raw/:

race_results.csv or race_results_WORKING.csv
lap_times.csv (optional)
pit_stops.csv (optional)
weather.csv (optional)

The _WORKING suffix indicates files from the reliable collection script (collect_working.py).
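Because the results file can have either name, downstream code may need to check for both. The sketch below is one way to do that; find_results_file is a hypothetical helper, and preferring the _WORKING file is an assumption, not something the source specifies.

```python
from pathlib import Path
from typing import Optional

# Assumed preference order: the _WORKING file first, then the plain name
CANDIDATES = ("race_results_WORKING.csv", "race_results.csv")


def find_results_file(raw_dir: str = "data/raw") -> Optional[Path]:
    """Return the first results file that exists in raw_dir, else None."""
    for name in CANDIDATES:
        path = Path(raw_dir) / name
        if path.is_file():
            return path
    return None
```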

Next Steps

After data collection:

Feature Engineering → Transform raw data into ML features (see Feature Engineering)
Model Training → Train prediction models (see Models)
Validation → Check data quality with debug_data.py

Troubleshooting

No data collected

Cause: API connection issues or cache corruption

Solution: Clear the cache and rerun the collection script:

rm -rf ./data/f1_cache
python collect_working.py

Missing positions

Cause: Race session not completed or data not available

Solution: Check whether the race actually took place, or try a different year/round

Slow performance

Cause: Cache not enabled

Solution: Verify that fastf1.Cache.enable_cache() is called before any API calls
