Data Collector API

Overview

The Data Collector module uses the FastF1 API to fetch historical Formula 1 race data, including race results, driver information, and grid positions. It processes data from multiple seasons and exports it to CSV format for feature engineering and model training.

Source File: collect_working.py

FastF1 Integration

Cache Configuration

The collector uses FastF1’s caching mechanism to improve performance and reduce API calls:

import fastf1
fastf1.Cache.enable_cache('./data/f1_cache')

Caching matters for performance: the cache directory stores downloaded session data, so repeated runs read from disk instead of calling the API again.
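FastF1 expects the cache directory to exist before `enable_cache` is called, so a fresh checkout may need it created first. A minimal sketch, assuming the `./data/f1_cache` path used above; the helper names `ensure_cache_dir` and `enable_fastf1_cache` are hypothetical and not part of collect_working.py:

```python
from pathlib import Path


def ensure_cache_dir(path: str) -> Path:
    """Create the cache directory (and any parents) if it is missing."""
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


def enable_fastf1_cache(path: str = "./data/f1_cache") -> Path:
    """Create the directory, then point FastF1's cache at it."""
    import fastf1  # imported lazily so ensure_cache_dir works without FastF1

    cache_dir = ensure_cache_dir(path)
    fastf1.Cache.enable_cache(str(cache_dir))
    return cache_dir
```

Calling `enable_fastf1_cache()` once at the top of the script, before any session is requested, is enough.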

Data Collection Pipeline

Main Collection Loop

The collector iterates through years and race events to gather comprehensive race data:

all_results = []

for year in [2023, 2024]:
    schedule = fastf1.get_event_schedule(year)
    
    for idx, event in schedule.iterrows():
        if event['EventFormat'] == 'testing':
            continue
        
        round_num = event['RoundNumber']
        session = fastf1.get_session(year, round_num, 'R')
        session.load()
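The filtering step of the loop can be separated from the network calls, which makes it easy to check on its own. The following is a sketch only: `race_rounds` is a hypothetical helper, and it assumes each event exposes the `EventFormat`, `RoundNumber`, and `EventName` fields described on this page.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple


def race_rounds(events: Iterable[Mapping[str, Any]]) -> Iterator[Tuple[int, str]]:
    """Yield (round number, event name) for every non-testing event."""
    for event in events:
        if event["EventFormat"] == "testing":
            continue  # pre-season tests have no race session
        yield int(event["RoundNumber"]), str(event["EventName"])
```

With a real schedule, the rows could be passed in as `race_rounds(schedule.to_dict("records"))`.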

Core Functions

get_event_schedule()

Parameters:

year (int, required): The Formula 1 season year (e.g., 2023, 2024)

Returns: DataFrame containing the event schedule with columns:

RoundNumber: Race round in the season
EventName: Name of the Grand Prix
EventFormat: Format type, e.g. ‘conventional’, ‘testing’, or a sprint variant such as ‘sprint’ (the exact sprint label varies by season)

Example:

schedule = fastf1.get_event_schedule(2024)
# Returns: DataFrame with all 2024 race events

get_session()

Parameters:

year (int, required): The Formula 1 season year
round_num (int, required): The round number from the event schedule
session_type (str, required): Session identifier: ‘R’ (Race), ‘Q’ (Qualifying), ‘FP1’, ‘FP2’, ‘FP3’

Returns: Session object containing lap data, results, and timing information

Example:

session = fastf1.get_session(2024, 1, 'R')
session.load()  # Loads all session data
laps = session.laps  # Access lap-by-lap data
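When only the results table is needed, as in this collector, loading laps and telemetry is extra work. The sketch below assumes the installed FastF1 version accepts the `laps`, `telemetry`, `weather`, and `messages` keyword flags on `load()`; `load_results_only` is a hypothetical helper, not a FastF1 function.

```python
def load_results_only(session):
    """Load a session while skipping data the collector does not read.

    `session` is assumed to be the object returned by fastf1.get_session().
    """
    session.load(laps=False, telemetry=False, weather=False, messages=False)
    return session.results
```

Note that `session.laps` is not available after a load like this, so the lap-by-lap example above still needs a plain `session.load()`.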

get_driver()

Parameters:

driver_code (str, required): Three-letter driver code (e.g., ‘VER’, ‘HAM’, ‘LEC’)

Returns: Driver information object containing:

Driver name
Team affiliation
Driver number

Example:

driver_info = session.get_driver('VER')
# Returns: Driver info for Max Verstappen
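The returned object can be read by field name. As an illustration, the sketch below formats a one-line summary; it assumes the `FullName`, `DriverNumber`, and `TeamName` fields are present, and `driver_summary` is a hypothetical helper.

```python
from typing import Any, Mapping


def driver_summary(info: Mapping[str, Any]) -> str:
    """Format name, car number, and team from a driver info record."""
    return f"{info['FullName']} (#{info['DriverNumber']}, {info['TeamName']})"
```

For example, `driver_summary(session.get_driver('VER'))` would produce a string built from that driver's record.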

Data Extraction

Race Results Extraction

The collector extracts detailed race results for each driver:

results = session.results
driver_result = results[results['Abbreviation'] == driver_code]

if len(driver_result) > 0:
    position = driver_result.iloc[0]['Position']
    grid = driver_result.iloc[0]['GridPosition']
    points = driver_result.iloc[0]['Points']
    team = driver_result.iloc[0]['TeamName']
    full_name = driver_result.iloc[0]['FullName']

Results DataFrame Fields

Position (int): Final race finishing position (1-20)
GridPosition (int): Starting grid position (1-20)
Points (float): Championship points earned (25 for win, 18 for 2nd, etc.)
TeamName (str): Constructor/team name (e.g., ‘Red Bull Racing’, ‘Mercedes’)
FullName (str): Driver’s full name
Abbreviation (str): Three-letter driver code
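The "25 for win, 18 for 2nd, etc." scale can be written out as a lookup, which is handy for sanity-checking the Points column. This is a sketch of the standard Grand Prix scale only: it includes the one-point fastest-lap bonus that applied to top-ten finishers in the 2023 and 2024 seasons, and it ignores sprint points and reduced-points races. `expected_race_points` is a hypothetical helper.

```python
# Points for finishing positions 1 through 10 in a full-distance Grand Prix.
RACE_POINTS = (25, 18, 15, 12, 10, 8, 6, 4, 2, 1)


def expected_race_points(position: int, fastest_lap: bool = False) -> int:
    """Return the standard points for a finishing position.

    Positions outside the top ten score nothing, and the fastest-lap
    bonus is only counted for drivers who finish in the top ten.
    """
    if not 1 <= position <= len(RACE_POINTS):
        return 0
    return RACE_POINTS[position - 1] + (1 if fastest_lap else 0)
```

A winner who also set the fastest lap scores 26, which is why the Points column can differ from the bare scale.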

Data Structure

Output Record Format

Each race result is stored as a dictionary with the following structure:

{
    'Year': 2024,
    'Round': 1,
    'EventName': 'Bahrain Grand Prix',
    'DriverCode': 'VER',
    'FullName': 'Max Verstappen',
    'TeamName': 'Red Bull Racing',
    'Position': 1,
    'GridPosition': 1,
    'Points': 25.0
}

Year (int): Season year (2023, 2024)
Round (int): Race round number in the season (1-24)
EventName (str): Full Grand Prix name
DriverCode (str): Three-letter driver abbreviation
FullName (str): Driver’s complete name
TeamName (str): Team/constructor name
Position (int): Final classification position
GridPosition (int): Starting position on grid (default: 20 if missing)
Points (float): Points scored (default: 0.0 if none)
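One way to apply the documented defaults while building a record is sketched below. `build_record` and `_is_missing` are hypothetical names, and the sketch assumes a results row can be read like a mapping with a `.get()` method; the actual logic in collect_working.py may differ.

```python
import math
from typing import Any, Optional


def _is_missing(value: Any) -> bool:
    """True for None or a float NaN, the usual markers for missing data."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_record(year: int, round_num: int, event_name: str, row) -> Optional[dict]:
    """Build one output record, or return None when Position is missing."""
    position = row.get("Position")
    if _is_missing(position):
        return None  # rows without a valid position are not saved

    grid = row.get("GridPosition")
    points = row.get("Points")
    return {
        "Year": year,
        "Round": int(round_num),
        "EventName": event_name,
        "DriverCode": row["Abbreviation"],
        "FullName": row["FullName"],
        "TeamName": row["TeamName"],
        "Position": int(position),
        "GridPosition": 20 if _is_missing(grid) else int(grid),  # default: 20
        "Points": 0.0 if _is_missing(points) else float(points),  # default: 0.0
    }
```

Returning `None` for unusable rows lets the caller filter them with a simple `if record is not None` before appending.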

CSV Output Format

race_results_WORKING.csv

The collected data is exported to a CSV file with the following structure:

Year,Round,EventName,DriverCode,FullName,TeamName,Position,GridPosition,Points
2024,1,Bahrain Grand Prix,VER,Max Verstappen,Red Bull Racing,1,1,25.0
2024,1,Bahrain Grand Prix,PER,Sergio Perez,Red Bull Racing,2,2,18.0
2024,1,Bahrain Grand Prix,SAI,Carlos Sainz,Ferrari,3,3,15.0

The CSV is saved to ./data/raw/race_results_WORKING.csv and contains results from all processed races.
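For downstream steps that do not use pandas, the file can be read back with the standard library. A minimal sketch, assuming the header shown above; `read_results` and `INT_FIELDS` are hypothetical names.

```python
import csv
from typing import Iterable, List

# Columns that hold whole numbers in the exported file.
INT_FIELDS = ("Year", "Round", "Position", "GridPosition")


def read_results(lines: Iterable[str]) -> List[dict]:
    """Parse exported CSV lines into dicts with numeric columns converted."""
    rows = []
    for raw in csv.DictReader(lines):
        row = dict(raw)
        for name in INT_FIELDS:
            # int(float(...)) accepts both "1" and "1.0"
            row[name] = int(float(row[name]))
        row["Points"] = float(row["Points"])
        rows.append(row)
    return rows
```

A file handle works as the input, for example `with open('./data/raw/race_results_WORKING.csv', newline='') as f: rows = read_results(f)`.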

Error Handling

Event-Level Error Handling

The collector implements robust error handling for individual race events:

try:
    session = fastf1.get_session(year, round_num, 'R')
    session.load()
    # Process session data
    print(f"✓ ({len(drivers)} drivers)")
except Exception as e:
    print(f"✗ {str(e)[:40]}")
    # Continue with next event

Failed events are skipped, and the collector continues with remaining races to maximize data collection.
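Printing a truncated message is enough for a console run, but keeping a list of failed events makes it possible to retry them later. The sketch below shows one such variation; `collect_events` and the `load_event` callable are hypothetical and stand in for the session loading and processing shown above.

```python
from typing import Callable, Iterable, List, Tuple


def collect_events(
    events: Iterable[Tuple[int, int]],
    load_event: Callable[[int, int], List[dict]],
) -> Tuple[List[dict], List[Tuple[int, int, str]]]:
    """Run load_event for each (year, round) and record failures.

    Returns the combined result rows and a list of
    (year, round, truncated error message) for events that raised.
    """
    results: List[dict] = []
    failures: List[Tuple[int, int, str]] = []
    for year, round_num in events:
        try:
            results.extend(load_event(year, round_num))
        except Exception as exc:  # broad on purpose, mirroring the collector
            failures.append((year, round_num, str(exc)[:40]))
    return results, failures
```

The failures list can then be printed in the final summary or fed back into a second pass.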

Data Validation

The collector validates data before saving:

# Only save if we have a valid position
if pd.notna(position):
    all_results.append({...})

Usage Example

import fastf1
import pandas as pd

# Enable caching
fastf1.Cache.enable_cache('./data/f1_cache')

all_results = []

# Collect data for 2023-2024 seasons
for year in [2023, 2024]:
    schedule = fastf1.get_event_schedule(year)
    
    for idx, event in schedule.iterrows():
        # Skip testing events
        if event['EventFormat'] == 'testing':
            continue
        
        round_num = event['RoundNumber']
        session = fastf1.get_session(year, round_num, 'R')
        session.load()
        
        # Extract results
        results = session.results
        for _, driver in results.iterrows():
            # Skip drivers without a valid position; int() fails on NaN
            if pd.isna(driver['Position']):
                continue
            all_results.append({
                'Year': year,
                'Round': round_num,
                'DriverCode': driver['Abbreviation'],
                'Position': int(driver['Position']),
                'Points': float(driver['Points'])
            })

# Save to CSV
df = pd.DataFrame(all_results)
df.to_csv('./data/raw/race_results.csv', index=False)

Data Sources

FastF1 API

The collector relies on the FastF1 library which fetches data from:

Official F1 Timing Data: Live timing and telemetry
Ergast API: Historical race results and standings (Ergast has been deprecated; newer FastF1 releases use the compatible Jolpica-F1 API instead)
FIA Documents: Official race classifications

Supported Data Types

Race results and classifications
Grid positions and qualifying results
Championship points allocations
Driver and team information
Lap-by-lap timing data

Performance Considerations

Cache Usage: Always enable caching to avoid redundant downloads. The cache significantly speeds up repeated data access.

API Rate Limits: The FastF1 library handles rate limiting automatically, but collecting large datasets may take several minutes.

Network Requirements: Active internet connection required for initial data fetch. Cached data can be accessed offline.

Output Statistics

The collector provides summary statistics upon completion:

print(f"Total results: {len(df)}")
print(f"Valid positions: {df['Position'].notna().sum()}")
print(f"Years: {df['Year'].unique()}")
print(f"Position distribution:")
print(df['Position'].value_counts().sort_index())

Example Output:

Total results: 840
Valid positions: 840
Years: [2023 2024]
✓ Saved to: data/raw/race_results_WORKING.csv
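The same summary can be computed from the list of record dictionaries before it is turned into a DataFrame, which is useful in tests. A sketch, assuming records shaped like the output record format on this page and already validated; `summarize` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def summarize(records: Iterable[dict]) -> dict:
    """Compute totals, the seasons covered, and a count per position."""
    records = list(records)
    positions = [r["Position"] for r in records if r.get("Position") is not None]
    return {
        "total": len(records),
        "valid_positions": len(positions),
        "years": sorted({r["Year"] for r in records}),
        # Sorted by position, like value_counts().sort_index()
        "position_counts": dict(sorted(Counter(positions).items())),
    }
```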
