Overview
The Feature Engineering module transforms raw Formula 1 race data into machine learning features. It provides two versions: a base implementation with core historical features and an enhanced version (V2) with weather, tire strategy, and circuit-specific features.
Source Files:
- feature_engineering.py - Base feature engineering
- feature_engineering_v2.py - Enhanced feature engineering with advanced metrics
Feature Engineering Pipeline
Data Loading
Both versions load race data from CSV files.
Core Feature Types
Base Features
All feature sets include these fundamental race attributes:
Season year of the race
Race round number in the season
Grand Prix name (e.g., “Monaco Grand Prix”)
Three-letter driver abbreviation
Constructor/team name
Starting position on grid (default: 10.0 if missing)
Final race finishing position (target variable)
Championship points earned (default: 0.0)
Historical Features
Driver Historical Performance
Features calculated from driver’s past race results:
Average finishing position across all previous races
  Calculation: mean(historical['Position'])
  Default: 10.0 if no history
Average points per race from previous performances
  Calculation: mean(historical['Points'])
  Default: 0.0 if no history
Total number of race wins before current race
  Calculation: sum(historical['Position'] == 1)
Total podium finishes (positions 1-3) before current race
  Calculation: sum(historical['Position'] <= 3)
Historical Data Window
Features are calculated using only data from before the target race. This prevents data leakage by ensuring the model only uses information that would have been available at race time.
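The window and the four driver features can be sketched together as follows. This is a minimal illustration, not the module's actual code: the helper name `driver_history_features`, the output keys, and the `Year` and `Round` columns used for ordering are assumptions. Only `Position`, `Points`, and the default values come from the descriptions above.

```python
import pandas as pd

def driver_history_features(driver_races: pd.DataFrame, race: pd.Series) -> dict:
    """Hypothetical helper: driver features from races strictly before `race`."""
    # Assumption: 'Year' and 'Round' order the races chronologically.
    earlier_season = driver_races["Year"] < race["Year"]
    earlier_round = (driver_races["Year"] == race["Year"]) & (
        driver_races["Round"] < race["Round"]
    )
    historical = driver_races[earlier_season | earlier_round]

    if historical.empty:
        # Documented defaults when the driver has no history.
        return {"avg_position": 10.0, "avg_points": 0.0, "wins": 0, "podiums": 0}

    return {
        "avg_position": float(historical["Position"].mean()),
        "avg_points": float(historical["Points"].mean()),
        "wins": int((historical["Position"] == 1).sum()),
        "podiums": int((historical["Position"] <= 3).sum()),
    }

# Illustrative data only: three races for one driver.
races = pd.DataFrame({
    "Year": [2023, 2023, 2023],
    "Round": [1, 2, 3],
    "Position": [1, 3, 5],
    "Points": [25.0, 15.0, 10.0],
})
features = driver_history_features(races, races.iloc[2])
```

Because the filter is strict (`<`), the target race's own result never contributes to its features.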
Team Performance Features
Average team finishing position (simplified to 10.0 in base version)
Total team victories (simplified to 0 in base version)
Average team points per race (simplified to 0.0 in base version)
Enhanced Features (V2)
Weather Impact Features
V2 adds comprehensive weather modeling:
Weather condition: ‘DRY’, ‘LIGHT_RAIN’, ‘HEAVY_RAIN’
  Distribution: 80% dry, 15% light rain, 5% heavy rain
Lap time multiplier based on conditions
  Values:
  - DRY: 1.0 (baseline)
  - LIGHT_RAIN: 1.05 (+5% lap time)
  - HEAVY_RAIN: 1.15 (+15% lap time)
Binary flag: 1 if rain present, 0 if dry
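A rough sketch of how these three weather features could be derived is shown below. It assumes the condition is drawn at random from the stated distribution. The function names, output keys, and the use of the standard-library `random` module are illustrative choices, not the V2 implementation.

```python
import random

# Probabilities and multipliers taken from the description above.
WEATHER_PROBS = {"DRY": 0.80, "LIGHT_RAIN": 0.15, "HEAVY_RAIN": 0.05}
WEATHER_MULTIPLIER = {"DRY": 1.0, "LIGHT_RAIN": 1.05, "HEAVY_RAIN": 1.15}

def sample_weather(rng: random.Random) -> str:
    """Draw one condition according to the 80/15/5 distribution."""
    conditions = list(WEATHER_PROBS)
    weights = list(WEATHER_PROBS.values())
    return rng.choices(conditions, weights=weights, k=1)[0]

def weather_features(condition: str) -> dict:
    """Hypothetical helper: map a condition to its three features."""
    return {
        "weather": condition,
        "lap_time_multiplier": WEATHER_MULTIPLIER[condition],
        "is_wet": int(condition != "DRY"),
    }

rng = random.Random(42)  # seeded so repeated runs give the same draw
example = weather_features(sample_weather(rng))
```

Seeding the generator keeps the simulated weather reproducible between runs, which matters if the feature file is regenerated before retraining.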
Tire Strategy Features
Advanced tire compound and degradation modeling:
Starting tire compound: ‘SOFT’, ‘MEDIUM’, ‘HARD’
  Distribution: 50% soft, 40% medium, 10% hard
Tire performance loss per lap (in seconds)
  Values:
  - SOFT: 0.08 seconds/lap
  - MEDIUM: 0.05 seconds/lap
  - HARD: 0.03 seconds/lap
Calculated optimal lap for pit stop
  Formula: int(20 / degradation_rate)
  Example: Soft tires → lap 250 (20/0.08)
Initial tire performance advantage
  Values:
  - SOFT: 1.0 (fastest)
  - MEDIUM: 0.8
  - HARD: 0.6 (slowest)
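The tire values and the pit-stop formula above translate directly into lookups. In this sketch the constants come from the description, while the helper name `tire_features` and the output keys are assumptions made for illustration.

```python
# Constants taken from the description above.
TIRE_DEGRADATION = {"SOFT": 0.08, "MEDIUM": 0.05, "HARD": 0.03}  # seconds/lap
TIRE_INITIAL_ADVANTAGE = {"SOFT": 1.0, "MEDIUM": 0.8, "HARD": 0.6}

def tire_features(compound: str) -> dict:
    """Hypothetical helper: derive tire features for one starting compound."""
    rate = TIRE_DEGRADATION[compound]
    return {
        "compound": compound,
        "degradation_rate": rate,
        # Documented formula: int(20 / degradation_rate)
        "optimal_pit_lap": int(20 / rate),
        "initial_advantage": TIRE_INITIAL_ADVANTAGE[compound],
    }

soft = tire_features("SOFT")
```

With the documented constants the formula gives 250, 400, and 666 for soft, medium, and hard. These exceed the length of any Grand Prix, so the value is probably best read as a relative score per compound rather than a literal lap number.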
Circuit-Specific Features
Track type and familiarity metrics:
Circuit classification
  Categories:
  - STREET: Monaco, Singapore, Baku, Melbourne
  - DESERT: Bahrain, Abu Dhabi, Saudi Arabia
  - FAST: Silverstone, Monza, Spa
  - TECHNICAL: Catalunya, Hungaroring
  - STANDARD: All others
Number of times driver has raced at this circuit
  Calculation: Count of previous races at same circuit
Driver’s average position at this specific circuit
  Fallback: Overall average if no circuit history
Binary flag: 1 if street circuit, 0 otherwise
Binary flag: 1 if high-speed circuit, 0 otherwise
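The circuit features can be sketched as a dictionary lookup plus a filter on the driver's history. Several things here are assumptions: the `Circuit` column and its exact-match keys, the function name `circuit_features_sketch`, the output keys, and the 10.0 default used when the driver has no history at all. The history frame is assumed to contain only races before the current one.

```python
import pandas as pd

# Mapping built from the categories listed above; anything else is STANDARD.
CIRCUIT_TYPES = {
    "Monaco": "STREET", "Singapore": "STREET", "Baku": "STREET", "Melbourne": "STREET",
    "Bahrain": "DESERT", "Abu Dhabi": "DESERT", "Saudi Arabia": "DESERT",
    "Silverstone": "FAST", "Monza": "FAST", "Spa": "FAST",
    "Catalunya": "TECHNICAL", "Hungaroring": "TECHNICAL",
}

def circuit_features_sketch(history: pd.DataFrame, race: pd.Series, circuit_types: dict) -> dict:
    """Hypothetical helper: circuit type and driver familiarity for one race."""
    circuit_type = circuit_types.get(race["Circuit"], "STANDARD")
    at_circuit = history[history["Circuit"] == race["Circuit"]]

    if not at_circuit.empty:
        avg_here = float(at_circuit["Position"].mean())
    elif not history.empty:
        # Documented fallback: overall average if no circuit history.
        avg_here = float(history["Position"].mean())
    else:
        avg_here = 10.0  # assumption: reuse the driver-level default

    return {
        "circuit_type": circuit_type,
        "circuit_experience": len(at_circuit),
        "circuit_avg_position": avg_here,
        "is_street": int(circuit_type == "STREET"),
        "is_fast": int(circuit_type == "FAST"),
    }

# Illustrative data only.
history = pd.DataFrame({"Circuit": ["Monaco", "Monza", "Monaco"], "Position": [2, 6, 4]})
monaco = circuit_features_sketch(history, pd.Series({"Circuit": "Monaco"}), CIRCUIT_TYPES)
spa = circuit_features_sketch(history, pd.Series({"Circuit": "Spa"}), CIRCUIT_TYPES)
```

In this example the driver has two previous Monaco starts, so the Monaco average is used directly, while Spa falls back to the overall average of all three races.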
Feature Engineering Functions
create_driver_features()
Generates historical performance features for each driver. Parameters:
- Filtered DataFrame containing all races for a specific driver
- Single race record for which to generate features
create_circuit_features() (V2)
Generates circuit-specific performance metrics. Parameters:
- Driver’s historical race data
- Current race record
- Mapping of circuit names to types
Data Processing
Missing Value Handling
Both versions fill missing values with fixed defaults. Missing grid positions default to 10.0 (mid-grid) to avoid biasing the model with extreme values, and missing championship points default to 0.0.
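One way to apply such defaults in pandas is a column-wise `fillna`. The column names `Grid` and `Points` are assumptions here; only the default values 10.0 and 0.0 come from this document.

```python
import pandas as pd

# Illustrative data only: one record is missing both values.
raw = pd.DataFrame({
    "Grid": [1.0, None, 7.0],
    "Points": [25.0, None, 6.0],
})

# Passing a dict to fillna applies a separate default to each column.
filled = raw.fillna({"Grid": 10.0, "Points": 0.0})
```

Keeping the defaults in a single dict makes it easy to see, and change, which value stands in for each missing field.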
Data Validation
Ensures target variable integrity by filtering out invalid records.
Categorical Encoding (V2)
V2 one-hot encodes categorical variables:
- Weather_DRY, Weather_LIGHT_RAIN, Weather_HEAVY_RAIN
- Tire_SOFT, Tire_MEDIUM, Tire_HARD
- Circuit_STREET, Circuit_DESERT, Circuit_FAST, Circuit_TECHNICAL, Circuit_STANDARD
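A sketch of how these columns could be produced with `pandas.get_dummies` follows. The source column names `Weather`, `Tire`, and `Circuit` are inferred from the output prefixes and are assumptions. Declaring the categories up front is a design choice of this sketch so that every listed column exists even when a value, such as heavy rain, never occurs in the data.

```python
import pandas as pd

# Category lists taken from the column names above.
CATEGORIES = {
    "Weather": ["DRY", "LIGHT_RAIN", "HEAVY_RAIN"],
    "Tire": ["SOFT", "MEDIUM", "HARD"],
    "Circuit": ["STREET", "DESERT", "FAST", "TECHNICAL", "STANDARD"],
}

# Illustrative data only.
df = pd.DataFrame({
    "Weather": ["DRY", "LIGHT_RAIN", "DRY"],
    "Tire": ["SOFT", "MEDIUM", "HARD"],
    "Circuit": ["STREET", "FAST", "STANDARD"],
})

# Fixed categories make get_dummies emit a column for unobserved values too.
for column, values in CATEGORIES.items():
    df[column] = pd.Categorical(df[column], categories=values)

encoded = pd.get_dummies(df, columns=list(CATEGORIES))
```

Without the fixed categories, a training file and a prediction file could end up with different column sets, which would break a model that expects a stable feature layout.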
Output Format
race_features.csv (Base)
./data/processed/race_features.csv
race_features_v2.csv (Enhanced)
./data/processed/race_features_v2.csv
V2 output includes significantly more columns due to one-hot encoding of categorical variables.
Usage Examples
Feature Statistics
Base Version Output
Enhanced V2 Output
Performance Considerations
Processing Time: Base version processes ~420 records in seconds. V2 takes slightly longer due to additional calculations.
Memory Usage: V2 uses more memory due to one-hot encoding. Expect ~3x column count compared to base version.
Data Quality: Always validate that Position column has no null values before training. Invalid records are automatically filtered.
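A quick pre-training check along these lines might look like the following. It treats only null values as invalid, which is an assumption, since this document does not spell out the full validity rule. The `Driver` column and the sample rows are illustrative.

```python
import pandas as pd

# Illustrative data only: the second record has no finishing position.
features = pd.DataFrame({
    "Driver": ["VER", "HAM", "LEC"],
    "Position": [1.0, None, 3.0],
})

# Drop records without a target value, then confirm none remain.
clean = features.dropna(subset=["Position"]).reset_index(drop=True)
assert clean["Position"].notna().all(), "Position still contains nulls"
```

Running the assertion immediately before training catches cases where a feature file was produced by an older run or edited by hand.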
Next Steps
After feature engineering, the data is ready for model training. The enhanced (V2) features support:
- Weather impact analysis
- Tire strategy optimization
- Circuit-specific predictions
- More robust feature set for complex modeling