
Overview

The Feature Engineering module transforms raw Formula 1 race data into machine learning features. It provides two versions: a base implementation with core historical features and an enhanced version (V2) that adds weather, tire strategy, and circuit-specific features.

Source Files:
  • feature_engineering.py - Base feature engineering
  • feature_engineering_v2.py - Enhanced feature engineering with advanced metrics

Feature Engineering Pipeline

Data Loading

Both versions load race data from CSV files:
import pandas as pd

# Load race results
race_results = pd.read_csv('./data/raw/race_results.csv')
lap_times = pd.read_csv('./data/raw/lap_times.csv')
pit_stops = pd.read_csv('./data/raw/pit_stops.csv')
weather = pd.read_csv('./data/raw/weather.csv')

Core Feature Types

Base Features

All feature sets include these fundamental race attributes:
  • Year (int): Season year of the race
  • Round (int): Race round number in the season
  • EventName (str): Grand Prix name (e.g., “Monaco Grand Prix”)
  • DriverCode (str): Three-letter driver abbreviation
  • TeamName (str): Constructor/team name
  • GridPosition (float): Starting position on the grid (default: 10.0 if missing)
  • Position (float): Final race finishing position (target variable)
  • Points (float): Championship points earned (default: 0.0)

Historical Features

Driver Historical Performance

Features calculated from driver’s past race results:
  • Driver_AvgPosition (float): Average finishing position across all previous races. Calculation: mean(historical['Position']). Default: 10.0 if no history.
  • Driver_AvgPoints (float): Average points per race from previous performances. Calculation: mean(historical['Points']). Default: 0.0 if no history.
  • Driver_TotalWins (int): Total number of race wins before the current race. Calculation: sum(historical['Position'] == 1).
  • Driver_TotalPodiums (int): Total podium finishes (positions 1-3) before the current race. Calculation: sum(historical['Position'] <= 3).

Historical Data Window

Features are calculated using only data before the target race:
# Get historical data BEFORE this race
historical = driver_data[
    (driver_data['Year'] < race['Year']) |
    ((driver_data['Year'] == race['Year']) & (driver_data['Round'] < race['Round']))
]
This prevents data leakage by ensuring the model only uses information that would have been available at race time.
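
The same cutoff can also be expressed without filtering once per race. The following is a minimal sketch rather than the module's implementation: it assumes one row per driver per race, and the helper name add_prior_avg_position is hypothetical. Shifting each driver's results by one row before taking an expanding mean keeps the current race out of its own feature.

```python
import pandas as pd

def add_prior_avg_position(results: pd.DataFrame, default: float = 10.0) -> pd.DataFrame:
    """Hypothetical helper: mean finishing position over strictly earlier races."""
    # Chronological order within each driver is what makes shift(1) mean "previous race"
    out = results.sort_values(['DriverCode', 'Year', 'Round']).copy()
    out['Driver_AvgPosition'] = (
        out.groupby('DriverCode')['Position']
        # shift(1) drops the current race; expanding().mean() averages what came before
        .transform(lambda s: s.shift(1).expanding().mean())
        # A driver's first race has no history, so use the documented default
        .fillna(default)
    )
    return out

demo = pd.DataFrame({
    'DriverCode': ['VER', 'VER', 'VER', 'HAM', 'HAM'],
    'Year': [2023, 2024, 2024, 2024, 2024],
    'Round': [22, 1, 2, 1, 2],
    'Position': [2.0, 1.0, 3.0, 7.0, 9.0],
})
result = add_prior_avg_position(demo)
print(result[['DriverCode', 'Year', 'Round', 'Driver_AvgPosition']])
```

In this toy data, VER's third race sees the average of the first two results only (1.5), and the current result of 3.0 never enters its own row.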

Team Performance Features

  • Team_AvgPosition (float): Average team finishing position (simplified to 10.0 in base version)
  • Team_TotalWins (int): Total team victories (simplified to 0 in base version)
  • Team_AvgPoints (float): Average team points per race (simplified to 0.0 in base version)

Enhanced Features (V2)

Weather Impact Features

V2 adds comprehensive weather modeling:
  • Weather (str): Weather condition: 'DRY', 'LIGHT_RAIN', or 'HEAVY_RAIN'. Distribution: 80% dry, 15% light rain, 5% heavy rain.
  • Weather_Impact (float): Lap time multiplier based on conditions. Values:
      • DRY: 1.0 (baseline)
      • LIGHT_RAIN: 1.05 (+5% lap time)
      • HEAVY_RAIN: 1.15 (+15% lap time)
  • Is_Wet_Race (int): Binary flag: 1 if rain is present, 0 if dry
Weather Impact Constants:
WEATHER_IMPACT = {
    'DRY': 1.0,
    'LIGHT_RAIN': 1.05,
    'HEAVY_RAIN': 1.15
}
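
As a rough illustration of how these columns relate, the sketch below derives Weather_Impact and Is_Wet_Race from a condition label. It assumes the condition is drawn at random with the documented 80/15/5 split; the names WEATHER_PROBS and weather_features, and the seed, are hypothetical and not taken from the source files.

```python
import numpy as np

WEATHER_IMPACT = {'DRY': 1.0, 'LIGHT_RAIN': 1.05, 'HEAVY_RAIN': 1.15}
# Assumption: sampling probabilities mirror the documented distribution
WEATHER_PROBS = {'DRY': 0.80, 'LIGHT_RAIN': 0.15, 'HEAVY_RAIN': 0.05}

def weather_features(condition: str) -> dict:
    """Hypothetical helper: derive the numeric weather columns from a label."""
    return {
        'Weather': condition,
        'Weather_Impact': WEATHER_IMPACT[condition],
        'Is_Wet_Race': 0 if condition == 'DRY' else 1,
    }

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility
condition = str(rng.choice(list(WEATHER_PROBS), p=list(WEATHER_PROBS.values())))
sampled = weather_features(condition)

# The multiplier scales lap time: a 90 s dry lap becomes 90 * 1.15 s in heavy rain
heavy_rain_lap = 90.0 * WEATHER_IMPACT['HEAVY_RAIN']
print(sampled, round(heavy_rain_lap, 2))
```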

Tire Strategy Features

Advanced tire compound and degradation modeling:
  • Starting_Tire (str): Starting tire compound: 'SOFT', 'MEDIUM', or 'HARD'. Distribution: 50% soft, 40% medium, 10% hard.
  • Tire_Degradation_Rate (float): Tire performance loss per lap (in seconds). Values:
      • SOFT: 0.08 seconds/lap
      • MEDIUM: 0.05 seconds/lap
      • HARD: 0.03 seconds/lap
  • Optimal_Pit_Lap (int): Calculated optimal lap for the pit stop. Formula: int(20 / degradation_rate). Example: soft tires → lap 250 (20 / 0.08).
  • Tire_Advantage (float): Initial tire performance advantage. Values:
      • SOFT: 1.0 (fastest)
      • MEDIUM: 0.8
      • HARD: 0.6 (slowest)
Tire Degradation Constants:
TIRE_DEGRADATION = {
    'SOFT': 0.08,
    'MEDIUM': 0.05,
    'HARD': 0.03
}
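
To make the arithmetic concrete, here is a small sketch that applies the documented formula to each compound. TIRE_ADVANTAGE and tire_features are hypothetical names used only for illustration; the numeric values are the ones listed above.

```python
TIRE_DEGRADATION = {'SOFT': 0.08, 'MEDIUM': 0.05, 'HARD': 0.03}  # seconds lost per lap
TIRE_ADVANTAGE = {'SOFT': 1.0, 'MEDIUM': 0.8, 'HARD': 0.6}       # hypothetical constant name

def tire_features(compound: str) -> dict:
    """Hypothetical helper: derive the tire columns for one starting compound."""
    rate = TIRE_DEGRADATION[compound]
    return {
        'Starting_Tire': compound,
        'Tire_Degradation_Rate': rate,
        'Optimal_Pit_Lap': int(20 / rate),  # documented formula; int() truncates toward zero
        'Tire_Advantage': TIRE_ADVANTAGE[compound],
    }

pit_laps = {c: tire_features(c)['Optimal_Pit_Lap'] for c in TIRE_DEGRADATION}
print(pit_laps)
```

With these constants the formula gives 250 for SOFT and 666 for HARD, both well beyond the length of any Grand Prix, so the column is probably best read as a relative indicator of tire life rather than a literal lap number.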

Circuit-Specific Features

Track type and familiarity metrics:
  • Circuit_Type (str): Circuit classification. Categories:
      • STREET: Monaco, Singapore, Baku, Melbourne
      • DESERT: Bahrain, Abu Dhabi, Saudi Arabia
      • FAST: Silverstone, Monza, Spa
      • TECHNICAL: Catalunya, Hungaroring
      • STANDARD: All others
  • Circuit_Familiarity (int): Number of times the driver has raced at this circuit. Calculation: count of previous races at the same circuit.
  • Circuit_AvgPosition (float): Driver's average position at this specific circuit. Fallback: overall average if no circuit history.
  • Is_Street_Circuit (int): Binary flag: 1 if street circuit, 0 otherwise
  • Is_High_Speed (int): Binary flag: 1 if high-speed circuit, 0 otherwise
Circuit Type Mapping:
CIRCUIT_TYPES = {
    'Monaco': 'STREET',
    'Singapore': 'STREET',
    'Silverstone': 'FAST',
    'Monza': 'FAST',
    'Catalunya': 'TECHNICAL',
    # ... more circuits
}
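
The sketch below shows how a lookup against this mapping behaves, assuming the first-word key used by create_circuit_features() in the next section. Only the five entries shown above are included because the rest of the mapping is elided; circuit_flags is a hypothetical helper name.

```python
CIRCUIT_TYPES = {  # only the entries shown above; the full mapping is elided
    'Monaco': 'STREET',
    'Singapore': 'STREET',
    'Silverstone': 'FAST',
    'Monza': 'FAST',
    'Catalunya': 'TECHNICAL',
}

def circuit_flags(event_name: str, circuit_types: dict) -> dict:
    """Hypothetical helper: classify an event by the first word of its name."""
    key = event_name.split(' ')[0]
    circuit_type = circuit_types.get(key, 'STANDARD')  # unknown keys fall back to STANDARD
    return {
        'Circuit_Type': circuit_type,
        'Is_Street_Circuit': 1 if circuit_type == 'STREET' else 0,
        'Is_High_Speed': 1 if circuit_type == 'FAST' else 0,
    }

monaco = circuit_flags('Monaco Grand Prix', CIRCUIT_TYPES)
british = circuit_flags('British Grand Prix', CIRCUIT_TYPES)
print(monaco, british)
```

If EventName holds values such as "British Grand Prix", the key becomes "British", which does not match "Silverstone", so the race is classified as STANDARD. Keys keyed by venue name only take effect when they match the first word of the event name.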

Feature Engineering Functions

create_driver_features()

Generates historical performance features for each driver:
def create_driver_features(driver_data, race):
    """Create driver-specific features from historical data"""
    # Keep only races that took place before the current one
    historical = driver_data[
        (driver_data['Year'] < race['Year']) |
        ((driver_data['Year'] == race['Year']) & 
         (driver_data['Round'] < race['Round']))
    ]
    has_history = len(historical) > 0
    
    return {
        # mean() of an empty selection is NaN, so apply the documented defaults
        'Driver_AvgPosition': float(historical['Position'].mean()) if has_history else 10.0,
        'Driver_AvgPoints': float(historical['Points'].mean()) if has_history else 0.0,
        'Driver_TotalWins': int((historical['Position'] == 1).sum()),
        'Driver_TotalPodiums': int((historical['Position'] <= 3).sum())
    }
Parameters:
  • driver_data (DataFrame, required): Filtered DataFrame containing all races for a specific driver
  • race (Series, required): Single race record for which to generate features
Returns: Dictionary of driver historical features

create_circuit_features() (V2)

Generates circuit-specific performance metrics:
def create_circuit_features(historical, race, circuit_types):
    """Create circuit-specific features"""
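    # The lookup key is the first word of EventName (e.g. 'Monaco' from
    # 'Monaco Grand Prix'), so circuit_types keys must match that token
    circuit_name = race['EventName'].split(' ')[0]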
    circuit_type = circuit_types.get(circuit_name, 'STANDARD')
    
    circuit_races = historical[historical['EventName'] == race['EventName']]
    circuit_avg_pos = (
        circuit_races['Position'].mean() 
        if len(circuit_races) > 0 
        else historical['Position'].mean()
    )
    
    return {
        'Circuit_Type': circuit_type,
        'Circuit_Familiarity': len(circuit_races),
        'Circuit_AvgPosition': circuit_avg_pos,
        'Is_Street_Circuit': 1 if circuit_type == 'STREET' else 0,
        'Is_High_Speed': 1 if circuit_type == 'FAST' else 0
    }
Parameters:
  • historical (DataFrame, required): Driver's historical race data
  • race (Series, required): Current race record
  • circuit_types (dict, required): Mapping of circuit names to types
Returns: Dictionary of circuit-related features

Data Processing

Missing Value Handling

Both versions replace missing values with fixed defaults:
# Handle missing grid positions and points when building each feature row
features = {
    'GridPosition': float(race['GridPosition']) if pd.notna(race['GridPosition']) else 10.0,
    'Points': float(race['Points']) if pd.notna(race['Points']) else 0.0,
}

# Fill any remaining missing numeric values
for col in features_df.columns:
    if features_df[col].dtype in ['float64', 'int64']:
        features_df[col] = features_df[col].fillna(0)
Missing grid positions default to 10.0 (mid-grid) to avoid biasing the model with extreme values.
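
The same defaults can be applied to a whole DataFrame at once. This is a sketch under stated assumptions, not the module's code: DEFAULTS and fill_missing are hypothetical names, and leaving Position out of the zero-fill is a deliberate choice made here so that rows without a target can still be detected and dropped during validation.

```python
import numpy as np
import pandas as pd

DEFAULTS = {'GridPosition': 10.0, 'Points': 0.0}  # documented defaults

def fill_missing(features_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: column defaults first, then 0 for other numeric gaps."""
    out = features_df.fillna(value=DEFAULTS)
    # Assumption: the target column is excluded so missing targets stay visible
    numeric_cols = out.select_dtypes(include='number').columns.difference(['Position'])
    out[numeric_cols] = out[numeric_cols].fillna(0)
    return out

demo = pd.DataFrame({
    'GridPosition': [1.0, np.nan],
    'Points': [25.0, np.nan],
    'Driver_AvgPoints': [np.nan, 4.0],
    'Position': [1.0, np.nan],
})
filled = fill_missing(demo)
print(filled)
```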

Data Validation

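Rows without a finishing position cannot be used for training, so they are counted and then removed:
# Report how many rows have a target value before filtering
print(f"Position not null: {features_df['Position'].notna().sum()}")
print(f"Position null: {features_df['Position'].isna().sum()}")

# Remove rows where Position is missing
features_df = features_df[features_df['Position'].notna()]
print(f"Final records: {len(features_df)}")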

Categorical Encoding (V2)

V2 one-hot encodes categorical variables:
# dtype=int keeps the 0/1 values shown in the output files
# (pandas 2.x returns booleans by default)

# One-hot encode Weather
weather_dummies = pd.get_dummies(features_df['Weather'], prefix='Weather', dtype=int)
features_df = pd.concat([features_df, weather_dummies], axis=1)

# One-hot encode Starting Tire
tire_dummies = pd.get_dummies(features_df['Starting_Tire'], prefix='Tire', dtype=int)
features_df = pd.concat([features_df, tire_dummies], axis=1)

# One-hot encode Circuit Type
circuit_dummies = pd.get_dummies(features_df['Circuit_Type'], prefix='Circuit', dtype=int)
features_df = pd.concat([features_df, circuit_dummies], axis=1)

# Drop original categorical columns
features_df = features_df.drop(['Weather', 'Starting_Tire', 'Circuit_Type'], axis=1)
Resulting Columns:
  • Weather_DRY, Weather_LIGHT_RAIN, Weather_HEAVY_RAIN
  • Tire_SOFT, Tire_MEDIUM, Tire_HARD
  • Circuit_STREET, Circuit_DESERT, Circuit_FAST, Circuit_TECHNICAL, Circuit_STANDARD
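
Note that pd.get_dummies only creates a column for each value that actually occurs in the data, so a dataset with no heavy-rain race would lack Weather_HEAVY_RAIN. One way to guarantee the full column list is to declare the categories up front. The sketch below assumes that approach; CATEGORIES and encode_fixed are hypothetical names.

```python
import pandas as pd

# Column name -> (prefix, full list of allowed values), taken from the lists above
CATEGORIES = {
    'Weather': ('Weather', ['DRY', 'LIGHT_RAIN', 'HEAVY_RAIN']),
    'Starting_Tire': ('Tire', ['SOFT', 'MEDIUM', 'HARD']),
    'Circuit_Type': ('Circuit', ['STREET', 'DESERT', 'FAST', 'TECHNICAL', 'STANDARD']),
}

def encode_fixed(features_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode with a fixed set of output columns."""
    out = features_df.copy()
    for col, (_, levels) in CATEGORIES.items():
        # A Categorical with declared levels yields a dummy column per level,
        # including levels that never occur in this particular dataset
        out[col] = pd.Categorical(out[col], categories=levels)
    prefixes = [prefix for prefix, _ in CATEGORIES.values()]
    return pd.get_dummies(out, columns=list(CATEGORIES), prefix=prefixes, dtype=int)

demo = pd.DataFrame({
    'DriverCode': ['VER', 'HAM'],
    'Weather': ['DRY', 'DRY'],
    'Starting_Tire': ['SOFT', 'HARD'],
    'Circuit_Type': ['DESERT', 'STREET'],
})
encoded = encode_fixed(demo)
print(list(encoded.columns))
```

Fixing the columns this way keeps the training and prediction inputs aligned even when a category is absent from one of them.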

Output Format

race_features.csv (Base)

Year,Round,EventName,DriverCode,TeamName,GridPosition,Position,Points,Driver_AvgPosition,Driver_AvgPoints,Driver_TotalWins,Driver_TotalPodiums,Team_AvgPosition,Team_TotalWins,Team_AvgPoints
2024,2,Saudi Arabian Grand Prix,VER,Red Bull Racing,1.0,1.0,25.0,1.5,24.5,15,18,10.0,0,0.0
File Location: ./data/processed/race_features.csv
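
Before training, it can help to confirm that a loaded file matches this layout. The following sketch parses the sample row above from an in-memory string in place of the real file; EXPECTED_COLUMNS and check_base_schema are hypothetical names, not part of the module.

```python
import io
import pandas as pd

EXPECTED_COLUMNS = [
    'Year', 'Round', 'EventName', 'DriverCode', 'TeamName',
    'GridPosition', 'Position', 'Points',
    'Driver_AvgPosition', 'Driver_AvgPoints', 'Driver_TotalWins', 'Driver_TotalPodiums',
    'Team_AvgPosition', 'Team_TotalWins', 'Team_AvgPoints',
]

def check_base_schema(df: pd.DataFrame) -> list:
    """Hypothetical helper: return the expected columns that are missing from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# The sample row stands in for ./data/processed/race_features.csv
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "2024,2,Saudi Arabian Grand Prix,VER,Red Bull Racing,"
    "1.0,1.0,25.0,1.5,24.5,15,18,10.0,0,0.0\n"
)
base = pd.read_csv(sample_csv)
missing = check_base_schema(base)
print(missing, base.shape)
```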

race_features_v2.csv (Enhanced)

Year,Round,EventName,DriverCode,TeamName,GridPosition,Position,Points,Driver_AvgPosition,Driver_AvgPoints,Driver_TotalWins,Driver_TotalPodiums,Weather_Impact,Is_Wet_Race,Tire_Degradation_Rate,Optimal_Pit_Lap,Tire_Advantage,Circuit_Familiarity,Circuit_AvgPosition,Is_Street_Circuit,Is_High_Speed,Team_AvgPosition,Team_TotalWins,Team_AvgPoints,Weather_DRY,Weather_LIGHT_RAIN,Weather_HEAVY_RAIN,Tire_SOFT,Tire_MEDIUM,Tire_HARD,Circuit_STREET,Circuit_DESERT,Circuit_FAST,Circuit_TECHNICAL,Circuit_STANDARD
2024,2,Saudi Arabian Grand Prix,VER,Red Bull Racing,1.0,1.0,25.0,1.5,24.5,15,18,1.0,0,0.08,250,1.0,2,1.0,0,0,10.0,0,0.0,1,0,0,1,0,0,0,1,0,0,0
File Location: ./data/processed/race_features_v2.csv
V2 output includes significantly more columns due to one-hot encoding of categorical variables.

Usage Examples

import pandas as pd

# Load data
race_results = pd.read_csv('./data/raw/race_results.csv')

features_list = []

# Process each driver
for driver in race_results['DriverCode'].unique():
    driver_data = race_results[race_results['DriverCode'] == driver]
    driver_data = driver_data.sort_values(['Year', 'Round'])
    
    for idx, race in driver_data.iterrows():
        # Get historical data
        historical = driver_data[
            (driver_data['Year'] < race['Year']) |
            ((driver_data['Year'] == race['Year']) & 
             (driver_data['Round'] < race['Round']))
        ]
        
        # Skip a driver's first race: there is no history to build features from
        if len(historical) == 0:
            continue
        
        # Create features
        features = {
            'Year': race['Year'],
            'DriverCode': driver,
            'GridPosition': float(race['GridPosition']) if pd.notna(race['GridPosition']) else 10.0,
            'Position': float(race['Position']),
            'Driver_AvgPosition': float(historical['Position'].mean()),
            'Driver_TotalWins': int((historical['Position'] == 1).sum())
        }
        
        features_list.append(features)

# Create DataFrame and save
df = pd.DataFrame(features_list)
df.to_csv('./data/processed/race_features.csv', index=False)

Feature Statistics

Base Version Output

✓ Created 420 records
Position not null: 420
Position null: 0
Final records: 420

💾 Saved: data/processed/race_features.csv
   Records: 420
   Columns: 15

Enhanced V2 Output

✓ Created 420 records
New features added:
   • Weather: 3 conditions
   • Tire types: 3 compounds
   • Circuit types: 5 categories

✓ Encoded variables
   Total features now: 34

💾 Saved: data/processed/race_features_v2.csv
   Records: 420
   Features: 34

Performance Considerations

Processing Time: Base version processes ~420 records in seconds. V2 takes slightly longer due to additional calculations.
Memory Usage: V2 uses more memory because of the added numeric features and the one-hot encoded columns. The logs above report 34 features for V2 against 15 columns for the base version, roughly 2.3x.
Data Quality: Always validate that the Position column has no null values before training. Rows with a missing Position are filtered out automatically.

Next Steps

After feature engineering, the data is ready for model training:
# Base features
python train_model.py

# Enhanced V2 features
python train_model_v2.py
Enhanced V2 Capabilities:
  • Weather impact analysis
  • Tire strategy optimization
  • Circuit-specific predictions
  • More robust feature set for complex modeling
