Overview

The WinnerPredictor class provides machine learning-based predictions for Formula 1 race winners using Random Forest and XGBoost models. It predicts top-K finishes (typically top-3) based on historical driver performance, team statistics, weather conditions, and circuit characteristics. Source: winner_predictor.py:17

Class: WinnerPredictor

Constructor

predictor = WinnerPredictor()
Initializes a new winner predictor instance with empty model states. Attributes:
  • rf_model: Random Forest classifier (initially None)
  • xgb_model: XGBoost classifier (initially None)
  • feature_columns: List of feature names used for training
  • label_map: Label mapping dictionary

Methods

load_data

data = predictor.load_data(data_path='./data/processed/race_features.csv')
Loads engineered race features from a CSV file and filters for finished races only.
Parameters:
  • data_path (string, default: "./data/processed/race_features.csv"): Path to the CSV file containing processed race features
Returns:
  • data (pandas.DataFrame): DataFrame containing filtered race results with valid Position values
Example:
predictor = WinnerPredictor()
data = predictor.load_data('./data/processed/race_features.csv')
print(f"Loaded {len(data)} race results")
Output:
📂 Loading data...
   ✓ Loaded 2847 race results
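The "finished races only" filter most likely amounts to dropping rows without a valid Position value. A minimal sketch of that step (the toy `raw` frame here is invented for illustration; only the `Position` column name comes from the source):

```python
import pandas as pd

# Hypothetical raw results: Position is NaN for DNF/DNS entries
raw = pd.DataFrame({
    "Driver": ["VER", "HAM", "SAI"],
    "Position": [1.0, None, 3.0],
})

# Keep only rows with a valid finishing position
data = raw[raw["Position"].notna()].copy()
print(f"Loaded {len(data)} race results")  # Loaded 2 race results
```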

prepare_features

features = predictor.prepare_features()
Selects and prepares feature columns for model training. Includes driver statistics, team performance, and weather data.
Returns:
  • features (list): List of available feature column names used for training
Features included:
  • Grid Data: GridPosition
  • Driver Stats: Driver_AvgPosition, Driver_AvgPoints, Driver_TotalWins, Driver_TotalPodiums, Driver_DNFRate, Driver_Last5_AvgPosition, Driver_Last5_AvgPoints, Driver_CircuitExperience, Driver_CircuitAvgPosition, Driver_AvgGridPosition, Driver_GridGain
  • Team Stats: Team_AvgPosition, Team_TotalWins, Team_TotalPodiums, Team_AvgPoints, Team_Last5_AvgPosition, Team_Last5_AvgPoints
  • Weather: AvgAirTemp, AvgTrackTemp, AvgHumidity, IsRaining
Example:
features = predictor.prepare_features()
print(f"Using {len(features)} features")
Output:
🔧 Preparing features...
   ✓ Using 22 features
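Since the method reports the number of *available* features, it presumably intersects the documented candidate list with the columns actually present in the loaded data, so absent columns are skipped rather than raising errors. A hedged sketch (the column subset and toy frame are illustrative):

```python
import pandas as pd

# A subset of the documented candidates; the real class uses all 22
candidates = ["GridPosition", "Driver_AvgPosition", "AvgTrackTemp", "IsRaining"]

# Toy frame that happens to be missing the IsRaining column
data = pd.DataFrame(columns=["GridPosition", "Driver_AvgPosition", "AvgTrackTemp"])

# Keep only candidates that exist in the data
features = [c for c in candidates if c in data.columns]
print(f"Using {len(features)} features")  # Using 3 features
```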

create_target

target = predictor.create_target(top_k=3)
Creates binary classification target variable: 1 for top-K finish, 0 otherwise.
Parameters:
  • top_k (int, default: 3): Number of top positions to classify as positive (e.g., 3 for podium prediction)
Returns:
  • target (pandas.Series): Binary series where 1 = top-K finish, 0 = not top-K
Example:
target = predictor.create_target(top_k=3)
print(f"Top-3: {target.sum()}, Rest: {(1 - target).sum()}")
Output:
🎯 Creating target: Top 3 finish
   Top 3: 854 instances
   Rest: 1993 instances
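The target construction described above reduces to a single vectorized comparison; a minimal sketch with invented positions:

```python
import pandas as pd

# Finishing positions for five toy entries
positions = pd.Series([1, 5, 3, 12, 2])
top_k = 3

# 1 for a top-K finish, 0 otherwise
target = (positions <= top_k).astype(int)
print(target.tolist())  # [1, 0, 1, 0, 1]
```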

split_data

X_train, X_test, y_train, y_test = predictor.split_data(test_size=0.2)
Splits data into training and test sets using time-based splitting (chronological order).
Parameters:
  • test_size (float, default: 0.2): Fraction of data to use for testing (0.0 to 1.0)
Returns:
  • X_train (pandas.DataFrame): Training feature matrix
  • X_test (pandas.DataFrame): Test feature matrix
  • y_train (pandas.Series): Training target labels
  • y_test (pandas.Series): Test target labels
Example:
X_train, X_test, y_train, y_test = predictor.split_data(test_size=0.2)
Output:
✂️ Splitting data (80% train, 20% test)...
   Training set: 2277 samples
   Test set: 570 samples
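A time-based split keeps chronological order and slices at a cutoff index instead of shuffling. A sketch of the likely mechanics (the toy data is invented):

```python
import pandas as pd

# Eight samples already sorted chronologically (earliest race first)
X = pd.DataFrame({"GridPosition": range(1, 9)})
y = pd.Series([1, 0, 1, 0, 0, 1, 0, 1])

test_size = 0.25
split_idx = int(len(X) * (1 - test_size))

# No shuffling: every test sample comes from a later race than any training sample
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]
print(len(X_train), len(X_test))  # 6 2
```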

train_random_forest

model = predictor.train_random_forest(n_estimators=100)
Trains a Random Forest classifier for winner prediction.
Parameters:
  • n_estimators (int, default: 100): Number of trees in the random forest
Returns:
  • model (RandomForestClassifier): Trained Random Forest model instance
Model Configuration:
  • n_estimators: 100 (configurable)
  • max_depth: 10
  • min_samples_split: 10
  • min_samples_leaf: 5
  • random_state: 42
  • n_jobs: -1 (use all CPU cores)
Example:
rf_model = predictor.train_random_forest(n_estimators=150)
Output:
🌲 Training Random Forest...
   Training accuracy: 0.892
   Test accuracy: 0.847
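Given the configuration listed above, the underlying estimator is presumably constructed along these lines (a sketch using scikit-learn's RandomForestClassifier, not the class's verbatim source):

```python
from sklearn.ensemble import RandomForestClassifier

# Mirrors the documented configuration; n_estimators is the tunable knob
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,  # use all CPU cores
)
# The predictor would then call rf_model.fit(X_train, y_train)
```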

train_xgboost

model = predictor.train_xgboost()
Trains an XGBoost classifier for winner prediction.
Returns:
  • model (XGBClassifier): Trained XGBoost model instance
Model Configuration:
  • n_estimators: 100
  • max_depth: 6
  • learning_rate: 0.1
  • random_state: 42
  • eval_metric: 'logloss'
Example:
xgb_model = predictor.train_xgboost()
Output:
🚀 Training XGBoost...
   Training accuracy: 0.905
   Test accuracy: 0.853

evaluate

predictor.evaluate()
Performs comprehensive model evaluation, including classification reports and confusion matrices for both models. Saves visualizations to ./models/confusion_matrices.png.
Example:
predictor.evaluate()
Output:
📊 Model Evaluation:
============================================================

🌲 RANDOM FOREST:
              precision    recall  f1-score   support
  Not Top-3       0.93      0.95      0.94       456
      Top-3       0.78      0.71      0.74       114
   accuracy                           0.90       570

🚀 XGBOOST:
              precision    recall  f1-score   support
  Not Top-3       0.94      0.96      0.95       456
      Top-3       0.82      0.75      0.78       114
   accuracy                           0.92       570

   ✓ Confusion matrices saved to ./models/confusion_matrices.png

feature_importance

predictor.feature_importance()
Analyzes and visualizes feature importance for both models. Saves plots to ./models/feature_importance.png.
Example:
predictor.feature_importance()
Output:
📈 Feature Importance:
   ✓ Feature importance saved to ./models/feature_importance.png

🔝 Top 5 Features (Random Forest):
   GridPosition                  : 0.2847
   Driver_AvgPosition            : 0.1653
   Team_AvgPosition              : 0.1205
   Driver_TotalWins              : 0.0892
   Driver_Last5_AvgPosition      : 0.0734

save_models

predictor.save_models(output_dir='./models/saved_models')
Saves trained models and feature columns to disk using joblib.
Parameters:
  • output_dir (string, default: "./models/saved_models"): Directory path where models will be saved
Saved Files:
  • winner_predictor_rf.pkl: Random Forest model
  • winner_predictor_xgb.pkl: XGBoost model
  • feature_columns.pkl: List of feature column names
Example:
predictor.save_models('./models/production')
Output:
💾 Saving models...
   ✓ Models saved successfully
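To use the saved artifacts later, load them back with joblib. The sketch below round-trips a stand-in artifact through a temporary directory the same way save_models persists its files; in production you would point joblib.load at the .pkl files under ./models/saved_models:

```python
import tempfile
from pathlib import Path

import joblib

# Stand-in artifact persisted the same way save_models writes its files
feature_columns = ["GridPosition", "Driver_AvgPosition"]
out_dir = Path(tempfile.mkdtemp())
joblib.dump(feature_columns, out_dir / "feature_columns.pkl")

# Reload the artifact from disk
restored = joblib.load(out_dir / "feature_columns.pkl")
print(restored)
```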

predict_race

prediction = predictor.predict_race(race_features)
Predicts race outcome for new race data using an ensemble of both models.
Parameters:
  • race_features (pandas.DataFrame): DataFrame containing feature values for the race to predict
Returns:
  • prediction (dict): Dictionary containing prediction results with the following keys:
    • rf_probability (float): Random Forest probability for top-K finish
    • xgb_probability (float): XGBoost probability for top-K finish
    • ensemble_probability (float): Average probability from both models
    • prediction (int): Binary prediction (1 = top-K, 0 = not top-K)
Example:
# Create sample race features
race_data = pd.DataFrame([{
    'GridPosition': 3,
    'Driver_AvgPosition': 4.2,
    'Driver_AvgPoints': 12.5,
    'Driver_TotalWins': 5,
    'Team_AvgPosition': 3.1,
    # ... other features
}])

prediction = predictor.predict_race(race_data)
print(f"Ensemble probability: {prediction['ensemble_probability']:.2%}")
print(f"Prediction: {'Top-3' if prediction['prediction'] else 'Not Top-3'}")
Output:
{
    'rf_probability': 0.78,
    'xgb_probability': 0.82,
    'ensemble_probability': 0.80,
    'prediction': 1
}
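The ensemble step itself is a plain average of the two model probabilities, with the 0.5 threshold noted below applied to the result. A sketch with illustrative probabilities standing in for the models' predict_proba outputs:

```python
# Illustrative probabilities in place of real model outputs
rf_probability = 0.78
xgb_probability = 0.82

# Average the two models and apply the documented 0.5 threshold
ensemble_probability = (rf_probability + xgb_probability) / 2
prediction = int(ensemble_probability >= 0.5)
print(prediction)  # 1
```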

Complete Usage Example

from winner_predictor import WinnerPredictor

# Initialize predictor
predictor = WinnerPredictor()

# Load and prepare data
predictor.load_data('./data/processed/race_features.csv')
predictor.prepare_features()
predictor.create_target(top_k=3)
predictor.split_data(test_size=0.2)

# Train models
predictor.train_random_forest(n_estimators=100)
predictor.train_xgboost()

# Evaluate performance
predictor.evaluate()
predictor.feature_importance()

# Save for production
predictor.save_models('./models/saved_models')

Model Performance

Typical Performance:
  • Random Forest Test Accuracy: ~85-90%
  • XGBoost Test Accuracy: ~86-92%
  • Ensemble Performance: ~87-93%
Class Balance:
  • Top-3 finishes: ~30% of data
  • Non-podium: ~70% of data

Notes

  • The model uses time-based splitting to prevent data leakage: training data always comes from earlier races than test data
  • Missing values are filled with sensible defaults (0 for most features, 10.0 for positions)
  • Both models are trained with fixed random seeds (random_state=42) for reproducibility
  • The ensemble prediction averages probabilities from both Random Forest and XGBoost models
  • Classification threshold is 0.5 for binary predictions
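The missing-value rule in the notes above can be sketched with pandas fillna; the "position-like column" heuristic and toy frame here are assumptions for illustration:

```python
import pandas as pd

X = pd.DataFrame({
    "GridPosition": [1.0, None, 5.0],
    "Driver_AvgPoints": [12.5, 3.0, None],
})

# Per the notes: position-like features default to 10.0, everything else to 0
fill_values = {c: (10.0 if "Position" in c else 0.0) for c in X.columns}
X_filled = X.fillna(fill_values)
print(X_filled["GridPosition"].tolist())  # [1.0, 10.0, 5.0]
```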
