Overview
The data collection system uses the FastF1 API to gather comprehensive Formula 1 race data from multiple seasons. The primary collection script is collect_working.py, which reliably extracts race results and driver information.
FastF1 is a Python package that provides access to F1 timing data, telemetry, and race information from the official F1 API.
Data Sources
FastF1 API
The system leverages FastF1’s comprehensive data access:
- Race Results: Final positions, points, and grid positions for all drivers
- Driver Info: Full names, team assignments, and driver codes (e.g., VER, HAM)
- Session Data: Lap times, pit stops, and sector times
- Event Metadata: Event names, years, rounds, and circuit information
Data Coverage
- Complete race results
- Reliable position data
- Full driver roster information
- Minimal data quality issues
Collection Process
Step 1: Initialize Cache
Caching is critical for performance. Without it, each API call takes 30+ seconds; with caching, repeat calls for the same data return almost instantly.
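A minimal sketch of this step, assuming the cache lives in a local directory (the name cache is a placeholder, not taken from collect_working.py). fastf1.Cache.enable_cache() is the FastF1 call that turns caching on; the directory is created first so the call does not depend on it already existing.

```python
from pathlib import Path


def ensure_cache_dir(cache_dir="cache"):
    """Create the cache directory if needed and return it as a Path."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path


def init_cache(cache_dir="cache"):
    """Enable FastF1's on-disk cache. Call once, before any data request."""
    # Imported lazily so ensure_cache_dir() is usable without FastF1 installed.
    import fastf1

    path = ensure_cache_dir(cache_dir)
    fastf1.Cache.enable_cache(str(path))
    return path
```

Calling init_cache() at the top of the collection script, before the first schedule or session request, is what makes later runs fast.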
Step 2: Fetch Event Schedule
For each season, retrieve the full race calendar.
Step 3: Load Race Session
Load the race session (not practice or qualifying).
Session Types
- 'FP1', 'FP2', 'FP3': Free Practice sessions
- 'Q': Qualifying
- 'S': Sprint race
- 'R': Main race (what we use)
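Steps 2 and 3 can be sketched together as follows. This is an illustration, not the code from collect_working.py: the helper names race_rounds and load_race and the api parameter are hypothetical. fastf1.get_event_schedule() and fastf1.get_session() are the FastF1 calls assumed here, with 'R' selecting the main race.

```python
def race_rounds(year, api=None):
    """Return (round_number, event_name) pairs for one season's race calendar."""
    if api is None:
        import fastf1 as api  # real FastF1 unless a stand-in is passed for testing

    # include_testing=False leaves pre-season testing events out of the calendar.
    schedule = api.get_event_schedule(year, include_testing=False)
    return [
        (int(event["RoundNumber"]), str(event["EventName"]))
        for _, event in schedule.iterrows()
    ]


def load_race(year, round_number, api=None):
    """Load the main race session ('R') for the given season and round."""
    if api is None:
        import fastf1 as api

    session = api.get_session(year, round_number, "R")
    session.load()  # downloads the data, or reads it from the cache if enabled
    return session
```

Passing the API object as a parameter is only a convenience for testing without network access; in normal use both helpers fall back to the real fastf1 module.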
Step 4: Extract Race Results
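A hedged sketch of the extraction step. It assumes the loaded session exposes its classification as session.results with FastF1's Abbreviation, FullName, TeamName, Position, GridPosition, and Points columns, and that the output uses this document's column names (Year, Round, EventName, DriverCode, and so on). The function name extract_rows is hypothetical.

```python
def extract_rows(result_records, year, round_number, event_name):
    """Map per-driver result records to flat rows in the raw results layout.

    result_records is an iterable of dict-like records, for example
    session.results.to_dict("records") on a loaded FastF1 session.
    """
    rows = []
    for rec in result_records:
        rows.append({
            "Year": year,
            "Round": round_number,
            "EventName": event_name,
            "DriverCode": rec.get("Abbreviation"),  # 3-letter code, e.g. VER
            "FullName": rec.get("FullName"),
            "TeamName": rec.get("TeamName"),
            "Position": rec.get("Position"),
            "GridPosition": rec.get("GridPosition"),
            "Points": rec.get("Points"),
        })
    return rows
```

Working on plain records rather than the DataFrame itself keeps the mapping easy to test; missing fields come through as None and are left for the validation step to handle.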
Step 5: Data Validation
This validation step is critical for data quality. Rows with missing positions are excluded to prevent training issues, because the model requires valid target values.
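The filter described above might look like the following sketch. It assumes rows are plain dicts with a Position key and treats None, NaN, and non-numeric values as missing; the helper names are illustrative.

```python
import math


def has_valid_position(row):
    """True if the row carries a usable numeric finishing position."""
    pos = row.get("Position")
    if pos is None:
        return False
    try:
        return not math.isnan(float(pos))
    except (TypeError, ValueError):
        return False  # non-numeric values count as missing


def drop_missing_positions(rows):
    """Return (kept_rows, dropped_count), excluding rows without a position."""
    kept = [row for row in rows if has_valid_position(row)]
    return kept, len(rows) - len(kept)
```

Returning the dropped count alongside the kept rows makes it easy to log how much data each race lost.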
Data Schema
Raw Race Results
Stored in: data/raw/race_results.csv
| Column | Type | Description | Example |
|---|---|---|---|
| Year | int | Season year | 2024 |
| Round | int | Race number in season | 5 |
| EventName | str | Race name | "Monaco Grand Prix" |
| DriverCode | str | 3-letter driver code | "VER" |
| FullName | str | Driver full name | "Max Verstappen" |
| TeamName | str | Constructor team | "Red Bull" |
| Position | int | Final race position | 1 |
| GridPosition | int | Starting grid position | 1 |
| Points | float | Championship points | 25.0 |
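As an illustration of the schema, the sketch below writes rows to a CSV file with exactly these columns using Python's standard csv module. The function name save_race_results is hypothetical, and the real script may well use pandas instead.

```python
import csv
from pathlib import Path

# Column order taken from the schema table above.
COLUMNS = [
    "Year", "Round", "EventName", "DriverCode", "FullName",
    "TeamName", "Position", "GridPosition", "Points",
]


def save_race_results(rows, path="data/raw/race_results.csv"):
    """Write dict rows to CSV in schema order, creating parent folders."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops any keys that are not in the schema.
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return out
```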
Additional Data Files
The system also collects (when available):
Lap Times (data/raw/lap_times.csv):
- Individual lap times for each driver
- Used for pace analysis
Pit Stops (data/raw/pit_stops.csv):
- Pit stop timing and duration
- Used for strategy analysis
Weather (data/raw/weather.csv):
- Track temperature
- Air temperature
- Humidity
- Rain conditions
Error Handling
Graceful Degradation
The collection script handles errors at multiple levels:
- Network Issues: Retries API calls with exponential backoff
- Missing Data: Skips incomplete races, logs warnings
- API Rate Limits: Uses caching to minimize API calls
- Format Changes: Handles different data schemas gracefully
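Retrying with exponential backoff can be sketched as below. The attempt count and base delay are illustrative defaults, not values taken from the collection script, and a real script would probably catch only network-related exceptions rather than Exception.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on failure with delays of 1x, 2x, 4x... base_delay.

    The final failure is re-raised so the caller can skip the race and log it.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)
```

A session load could then be wrapped as with_retries(lambda: session.load()). The sleep parameter exists only so the delays can be replaced in tests.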
Collection Statistics
Typical collection run (2023-2024 seasons):
Data Quality Checks
Before saving, the system validates:
- Position Validation
- Grid Position Bounds
- Points Distribution
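The three checks could be sketched as follows for a single race. The specific rules are assumptions made for illustration: finishing positions form a unique 1..N sequence, grid positions fall between 0 and max_grid (0 is allowed on the assumption that it marks a pit-lane start), and points fall between 0 and max_points. The sketch also assumes rows with missing positions have already been excluded.

```python
def check_race(rows, max_grid=20, max_points=26.0):
    """Return a list of problem descriptions for one race (empty if clean).

    max_grid and max_points are illustrative bounds, not values from the
    collection script; adjust them for seasons with larger grids.
    """
    problems = []

    # Position validation: each finishing position 1..N appears exactly once.
    positions = sorted(int(row["Position"]) for row in rows)
    if positions != list(range(1, len(rows) + 1)):
        problems.append("positions are not a unique 1..N sequence")

    # Grid position bounds.
    if any(not 0 <= int(row["GridPosition"]) <= max_grid for row in rows):
        problems.append("grid position out of bounds")

    # Points distribution: no negative or implausibly large values.
    if any(not 0 <= float(row["Points"]) <= max_points for row in rows):
        problems.append("points outside expected range")

    return problems
```

Returning descriptions instead of raising lets the caller decide whether a failed check should skip the race or only log a warning.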
Performance Optimization
Caching Strategy
- First API call: ~30 seconds
- Cached calls: Less than 0.1 seconds
- Cache size: ~500 MB per season
Batch Processing
Processes entire seasons in one run:
- Reduces connection overhead
- Better error recovery
- Progress tracking
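A season-level loop with per-race error recovery and progress logging might look like this sketch. The callables list_rounds and collect_race are hypothetical stand-ins for whatever the script uses to fetch the calendar and collect one race.

```python
def collect_seasons(years, list_rounds, collect_race, log=print):
    """Collect every race of every season, skipping races that fail.

    list_rounds(year) returns the round numbers for a season;
    collect_race(year, rnd) returns a list of result rows for one race.
    Returns (all_rows, failed), where failed lists (year, round) pairs.
    """
    rows, failed = [], []
    for year in years:
        rounds = list_rounds(year)
        for done, rnd in enumerate(rounds, start=1):
            try:
                rows.extend(collect_race(year, rnd))
            except Exception as exc:
                # One bad race should not abort the whole run.
                failed.append((year, rnd))
                log(f"{year} round {rnd}: skipped ({exc})")
                continue
            log(f"{year}: {done}/{len(rounds)} rounds processed")
    return rows, failed
```

Keeping the failed list makes it possible to retry only the missing races later instead of repeating the full run.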
Selective Loading
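One way to load selectively, assuming only race results are needed, is to switch off the heavier data groups when loading a session. The keyword arguments below (laps, telemetry, weather, messages) are those of FastF1's Session.load(); the helper name load_results_only is hypothetical.

```python
# Data groups to skip when only the final classification is needed.
RESULTS_ONLY = {
    "laps": False,
    "telemetry": False,
    "weather": False,
    "messages": False,
}


def load_results_only(session):
    """Load a session without laps, telemetry, weather, or messages."""
    session.load(**RESULTS_ONLY)
    return session.results
```

If lap times or weather files are also being collected, the corresponding flags would need to stay enabled for those sessions.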
Running Data Collection
Basic Usage
Expected Output
Output Files
After collection, files are stored in data/raw/:
- race_results.csv or race_results_WORKING.csv
- lap_times.csv (optional)
- pit_stops.csv (optional)
- weather.csv (optional)
The _WORKING suffix indicates files from the reliable collection script (collect_working.py).
Next Steps
After data collection:
- Feature Engineering → Transform raw data into ML features (see Feature Engineering)
- Model Training → Train prediction models (see Models)
- Validation → Check data quality with debug_data.py
Troubleshooting
No data collected
Cause: API connection issues or cache corruption
Solution:
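One way to rule out cache corruption, assuming the cache lives in a local directory (the name cache below is a placeholder), is to delete and recreate it before re-running the collection. This sketch uses only the standard library and is not necessarily the remedy the script's authors intended.

```python
import shutil
from pathlib import Path


def reset_cache(cache_dir="cache"):
    """Delete the cache directory and recreate it empty."""
    path = Path(cache_dir)
    if path.exists():
        shutil.rmtree(path)  # removes all cached responses
    path.mkdir(parents=True)
    return path
```

After a reset the next run has to download everything again, so expect first-call timings rather than cached ones.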
Missing positions
Cause: Race session not completed or data not available
Solution: Check whether the race actually took place, or try a different year/round
Slow performance
Cause: Cache not enabled
Solution: Verify fastf1.Cache.enable_cache() is called before any API calls