
Overview

Feature engineering transforms raw race data into meaningful predictive features. The system uses time-series feature engineering to create historical statistics without data leakage.

20+ Features

Driver, team, circuit, weather, and tire features

Time-Aware

Features use only historical data before each race

Zero Leakage

Strict time-based splitting prevents future data usage

Auto-Scaling

Features automatically adapt to new seasons

Feature Engineering Pipeline

Version 1: Basic Features

File: feature_engineering.py

Creates foundational driver and team features from historical performance.

Version 2: Enhanced Features

File: feature_engineering_v2.py

Adds weather, tire strategy, and circuit-specific features for improved accuracy.

Feature Categories

1. Driver Performance Features

Historical performance metrics for each driver:
features = {
    'Driver_AvgPosition': float(historical['Position'].mean()),
    'Driver_AvgPoints': float(historical['Points'].mean()),
    'Driver_TotalWins': int((historical['Position'] == 1).sum()),
    'Driver_TotalPodiums': int((historical['Position'] <= 3).sum()),
    'Driver_DNFRate': float(historical['Position'].isna().mean())
}
Driver_AvgPosition
  • Average finishing position across all previous races
  • Lower is better (1.0 = always wins)
  • Default: 10.0 for rookies
Driver_AvgPoints
  • Average points per race
  • Includes zero-point finishes
  • Indicates consistency
Driver_TotalWins
  • Career wins before this race
  • Strong predictor of future wins
  • Champions typically have 20+ wins
Driver_TotalPodiums
  • Career top-3 finishes
  • More stable than wins alone
  • Good indicator of peak performance
Driver_DNFRate
  • Percentage of races with DNF (Did Not Finish)
  • Indicates reliability/consistency
  • Lower is better
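The snippet above assumes a `historical` DataFrame of the driver's prior results. As a minimal self-contained illustration with made-up numbers (one DNF recorded as a missing `Position`):

```python
import pandas as pd

# Hypothetical prior results for one driver; NaN Position marks a DNF
historical = pd.DataFrame({
    'Position': [1, 3, 2, None, 5],
    'Points':   [25, 15, 18, 0, 10],
})

features = {
    'Driver_AvgPosition': float(historical['Position'].mean()),      # NaN is skipped
    'Driver_AvgPoints': float(historical['Points'].mean()),
    'Driver_TotalWins': int((historical['Position'] == 1).sum()),
    'Driver_TotalPodiums': int((historical['Position'] <= 3).sum()),  # NaN compares False
    'Driver_DNFRate': float(historical['Position'].isna().mean()),
}
print(features)
```

Note that `mean()` skips the DNF row for the position average, while the DNF rate counts it explicitly via `isna()`.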

2. Recent Form Features

Capture driver momentum using rolling windows:
# Last 5 races performance
last_5 = historical.tail(5)
features['Driver_Last5_AvgPosition'] = last_5['Position'].mean()
features['Driver_Last5_AvgPoints'] = last_5['Points'].mean()
Recent form (last 5 races) is often more predictive than career statistics, especially after mid-season car upgrades.
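`tail(5)` works when building features for one upcoming race. To compute the same feature for every race in a sorted career at once, a shifted rolling mean avoids leaking the current race into its own feature (sketch with toy positions; `shift(1)` and `min_periods` are the key details):

```python
import pandas as pd

# Toy career, already sorted chronologically
df = pd.DataFrame({'Position': [4, 2, 6, 1, 3, 5, 2]})

# shift(1) excludes the current race, so each row sees only earlier races;
# min_periods=1 lets early-career rows use whatever history exists
df['Driver_Last5_AvgPosition'] = (
    df['Position'].shift(1).rolling(window=5, min_periods=1).mean()
)
print(df)
```

The first race has no history and comes out as NaN, matching the "skip if no history" rule used later in the pipeline.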

3. Circuit-Specific Features

Driver performance at specific circuits:
circuit_history = historical[historical['EventName'] == race['EventName']]

features['Driver_CircuitExperience'] = len(circuit_history)
features['Driver_CircuitAvgPosition'] = circuit_history['Position'].mean()
Why it matters:
  • Monaco specialists (e.g., Alonso) outperform expectations
  • Monza favorites (e.g., McLaren) have circuit-specific advantages
  • Street circuits reward experience

4. Grid Position Features

features['GridPosition'] = float(race['GridPosition']) if pd.notna(race['GridPosition']) else 10.0
features['Driver_AvgGridPosition'] = historical['GridPosition'].mean()
features['Driver_GridGain'] = features['Driver_AvgGridPosition'] - features['Driver_AvgPosition']
Grid Gain = Average starting position - Average finishing position
  • Positive Grid Gain: Driver typically gains positions (overtaker)
  • Negative Grid Gain: Driver loses positions (qualifier but not racer)
  • Zero Grid Gain: Maintains grid position
Example:
  • Starts P10 on average, finishes P6 on average → Grid Gain = +4
  • Indicates strong race pace and overtaking ability

5. Team Performance Features

# Team-level aggregations
team_history = race_results[
    (race_results['TeamName'] == race['TeamName']) &
    ((race_results['Year'] < race['Year']) |
     ((race_results['Year'] == race['Year']) &
      (race_results['Round'] < race['Round'])))
]

features['Team_AvgPosition'] = team_history['Position'].mean()
features['Team_TotalWins'] = (team_history['Position'] == 1).sum()
features['Team_AvgPoints'] = team_history['Points'].mean()

6. Weather Features (V2)

Enhanced model includes weather-aware features:
features['Weather_Impact'] = 1.0  # DRY baseline
if weather == 'LIGHT_RAIN':
    features['Weather_Impact'] = 1.05
elif weather == 'HEAVY_RAIN':
    features['Weather_Impact'] = 1.15

features['Is_Wet_Race'] = 1 if weather != 'DRY' else 0

# One-hot encoding
for w in ['DRY', 'LIGHT_RAIN', 'HEAVY_RAIN']:
    features[f'Weather_{w}'] = 1 if weather == w else 0

DRY
  • Impact: 1.0x
  • Grid position dominant
  • Pole wins ~42%

LIGHT_RAIN
  • Impact: 1.05x
  • Skill matters more
  • Pole wins ~35%

HEAVY_RAIN
  • Impact: 1.15x
  • High chaos factor
  • Pole wins ~23%

7. Tire Strategy Features (V2)

Tire compound characteristics:
TIRE_DEGRADATION = {
    'SOFT': 0.08,   # seconds per lap
    'MEDIUM': 0.05,
    'HARD': 0.03
}

OPTIMAL_PIT_LAP = {
    'SOFT': 25,
    'MEDIUM': 35,
    'HARD': 45
}

features['Tire_Degradation_Rate'] = TIRE_DEGRADATION[tire]
features['Optimal_Pit_Lap'] = OPTIMAL_PIT_LAP[tire]
features['Tire_Advantage'] = 1.0 if tire == 'SOFT' else 0.8 if tire == 'MEDIUM' else 0.6

# One-hot encoding
for t in ['SOFT', 'MEDIUM', 'HARD']:
    features[f'Tire_{t}'] = 1 if tire == t else 0

8. Circuit Type Features (V2)

features['Is_Street_Circuit'] = 1 if circuit == 'STREET' else 0
features['Is_High_Speed'] = 1 if circuit == 'FAST' else 0

# One-hot encoding
for c in ['DESERT', 'FAST', 'STANDARD', 'STREET', 'TECHNICAL']:
    features[f'Circuit_{c}'] = 1 if circuit == c else 0

Feature Creation Process

Time-Series Feature Engineering

The critical innovation is preventing data leakage:
for driver in race_results['DriverCode'].unique():
    driver_data = race_results[race_results['DriverCode'] == driver].copy()
    driver_data = driver_data.sort_values(['Year', 'Round'])
    
    for idx, race in driver_data.iterrows():
        # Get historical data BEFORE this race
        historical = driver_data[
            (driver_data['Year'] < race['Year']) |
            ((driver_data['Year'] == race['Year']) & 
             (driver_data['Round'] < race['Round']))
        ]
        
        # Skip if no history
        if len(historical) == 0:
            continue
        
        # Create features using ONLY historical data
        features = create_features(historical, race)
Critical: Features for race N use only data from races 1 to N-1. This prevents the model from “seeing the future” during training.
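The historical filter can be sanity-checked on toy data (hypothetical two-season schedule):

```python
import pandas as pd

# Four races for one driver across two seasons
driver_data = pd.DataFrame({
    'Year':  [2023, 2023, 2024, 2024],
    'Round': [1, 2, 1, 2],
})

race = driver_data.iloc[2]  # 2024, Round 1

# Same filter as in the pipeline: strictly earlier races only
historical = driver_data[
    (driver_data['Year'] < race['Year']) |
    ((driver_data['Year'] == race['Year']) &
     (driver_data['Round'] < race['Round']))
]
print(len(historical))  # only the two 2023 races qualify
```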

Missing Value Handling

# Rookie drivers have no history - use defaults
features['Driver_AvgPosition'] = historical['Position'].mean() if len(historical) > 0 else 10.0
features['Driver_AvgPoints'] = historical['Points'].mean() if len(historical) > 0 else 0.0
features['Driver_TotalWins'] = int((historical['Position'] == 1).sum())

# Fill any remaining NaN values in the assembled feature DataFrame
for col in features_df.columns:
    if features_df[col].dtype in ['float64', 'int64']:
        features_df[col] = features_df[col].fillna(0)

Feature Importance

From trained models, top features by importance:

Top 10 Most Important Features

| Rank | Feature | Importance | Description |
|------|---------|------------|-------------|
| 1 | GridPosition | 0.2847 | Starting grid position |
| 2 | Driver_AvgPosition | 0.1523 | Historical average finish |
| 3 | Driver_TotalWins | 0.0892 | Career wins |
| 4 | Team_AvgPosition | 0.0745 | Team performance |
| 5 | Driver_Last5_AvgPosition | 0.0634 | Recent form |
| 6 | Driver_CircuitAvgPosition | 0.0521 | Circuit-specific performance |
| 7 | Weather_Impact | 0.0487 | Weather multiplier |
| 8 | Tire_Degradation_Rate | 0.0412 | Tire compound effect |
| 9 | Driver_AvgPoints | 0.0398 | Average points per race |
| 10 | Is_Wet_Race | 0.0367 | Rain flag |

Feature Groups by Importance

  • GridPosition: 28.5%
  • Driver_AvgPosition: 15.2%
These two features alone account for 43.7% of total feature importance.
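Values like those in the table come from the fitted model's `feature_importances_` attribute. A sketch of producing such a ranking, using toy data and a scikit-learn RandomForestRegressor in place of the real training set:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the real feature matrix, with grid position
# deliberately dominating the target
rng = np.random.default_rng(0)
feature_cols = ['GridPosition', 'Driver_AvgPosition', 'Driver_TotalWins']
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=feature_cols)
y = 2 * X['GridPosition'] + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Rank features as in the table above; importances sum to 1.0
ranking = (
    pd.Series(model.feature_importances_, index=feature_cols)
    .sort_values(ascending=False)
)
print(ranking)
```

With the real features, the same ranking call reproduces the GridPosition-first ordering shown above.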

Feature Validation

Quality Checks

print(f"Checking Position column:")
print(f"   Total records: {len(features_df)}")
print(f"   Position not null: {features_df['Position'].notna().sum()}")
print(f"   Position null: {features_df['Position'].isna().sum()}")

# Remove rows where Position is missing
features_df = features_df[features_df['Position'].notna()]

Feature Statistics

print(f"Final records: {len(features_df)}")
print(f"Columns: {len(features_df.columns)}")
print(f"\nFeature ranges:")
for col in numeric_features:
    print(f"   {col}: [{features_df[col].min()}, {features_df[col].max()}]")

Output Files

V1 Features: data/processed/race_features.csv
  • Basic features (driver, team, grid)
  • ~880 records (2023-2024 seasons)
  • ~15 feature columns
V2 Features: data/processed/race_features_v2.csv
  • Enhanced features (weather, tires, circuits)
  • Same record count
  • ~30+ feature columns

Running Feature Engineering

Basic Version

python feature_engineering.py

Enhanced Version (V2)

python feature_engineering_v2.py

Expected Output

============================================================
F1 FEATURE ENGINEERING - FIXED
============================================================

📂 Loading data...
✓ Loaded 880 race records

🏎️ Creating features...
✓ Created 756 records

Checking Position column:
   Total records: 756
   Position not null: 756
   Position null: 0

💾 Saved: data/processed/race_features.csv
   Records: 756
   Columns: 15

✅ COMPLETE! Ready for training!
============================================================

Next Steps

After feature engineering:
  1. Validate Features → Check distributions and correlations
  2. Train Models → Use processed features for ML training (see Models)
  3. Feature Selection → Optionally remove low-importance features
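The validation step can start with a simple correlation check of each feature against the target. This sketch uses toy data standing in for the real `race_features.csv`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table
rng = np.random.default_rng(1)
features_df = pd.DataFrame({
    'GridPosition': rng.integers(1, 21, size=200).astype(float),
})
# Finishing position loosely tracks grid position in this toy example
features_df['Position'] = (
    features_df['GridPosition'] + rng.normal(scale=3, size=200)
)

# Correlation of each feature with the target (drop the target itself)
corr = features_df.corr()['Position'].drop('Position')
print(corr.sort_values(ascending=False))
```

A strong positive correlation here is expected; a feature with near-zero correlation (and low model importance) is a candidate for removal in step 3.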

Advanced Techniques

Feature Scaling

Not required for tree-based models (Random Forest, XGBoost), but available if needed:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

Feature Selection

Remove low-importance features (less than 1% importance):
importances = model.feature_importances_
mask = importances > 0.01
selected_features = feature_cols[mask]  # feature_cols must be a NumPy array or pandas Index

Interaction Features

Create composite features (future enhancement):
features['Grid_x_DriverAvg'] = features['GridPosition'] * features['Driver_AvgPosition']
features['Weather_x_WetSkill'] = features['Is_Wet_Race'] * features['Driver_WetSkill']
