Overview
The WinnerPredictor class provides machine learning-based predictions for Formula 1 race winners using Random Forest and XGBoost models. It predicts top-K finishes (typically top-3) based on historical driver performance, team statistics, weather conditions, and circuit characteristics.
Source: winner_predictor.py:17
Class: WinnerPredictor
Constructor
predictor = WinnerPredictor()
Initializes a new winner predictor instance with empty model states.
Attributes:
rf_model: Random Forest classifier (initially None)
xgb_model: XGBoost classifier (initially None)
feature_columns: List of feature names used for training
label_map: Label mapping dictionary
Methods
load_data
data = predictor.load_data(data_path='./data/processed/race_features.csv')
Loads engineered race features from a CSV file and filters for finished races only.
Parameters:
data_path (string, default "./data/processed/race_features.csv"): Path to the CSV file containing processed race features
Returns:
DataFrame containing filtered race results with valid Position values
Example:
predictor = WinnerPredictor()
data = predictor.load_data('./data/processed/race_features.csv')
print(f"Loaded {len(data)} race results")
Output:
📂 Loading data...
✓ Loaded 2847 race results
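The filtering step likely amounts to dropping rows without a valid finishing position (DNFs and non-classified entries). A minimal sketch of that behavior, assuming pandas and a Position column — not the library's exact code:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for race_features.csv;
# the DNF row has no Position value.
csv_text = """Driver,Position,GridPosition
VER,1,1
HAM,2,3
SAI,,5
"""

def load_finished_races(source) -> pd.DataFrame:
    """Load race features and keep only rows with a valid Position."""
    data = pd.read_csv(source)
    finished = data[data["Position"].notna()].copy()
    finished["Position"] = finished["Position"].astype(int)
    return finished

data = load_finished_races(io.StringIO(csv_text))
print(f"Loaded {len(data)} race results")  # the SAI DNF row is dropped
```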
prepare_features
features = predictor.prepare_features()
Selects and prepares feature columns for model training. Includes driver statistics, team performance, and weather data.
Returns:
List of available feature column names used for training
Features included:
Grid Data: GridPosition
Driver Stats: Driver_AvgPosition, Driver_AvgPoints, Driver_TotalWins, Driver_TotalPodiums, Driver_DNFRate, Driver_Last5_AvgPosition, Driver_Last5_AvgPoints, Driver_CircuitExperience, Driver_CircuitAvgPosition, Driver_AvgGridPosition, Driver_GridGain
Team Stats: Team_AvgPosition, Team_TotalWins, Team_TotalPodiums, Team_AvgPoints, Team_Last5_AvgPosition, Team_Last5_AvgPoints
Weather: AvgAirTemp, AvgTrackTemp, AvgHumidity, IsRaining
Example:
features = predictor.prepare_features()
print ( f "Using { len (features) } features" )
Output:
🔧 Preparing features...
✓ Using 22 features
create_target
target = predictor.create_target(top_k=3)
Creates binary classification target variable: 1 for top-K finish, 0 otherwise.
Parameters:
top_k (int, default 3): Number of top positions to classify as positive (e.g., 3 for podium prediction)
Returns:
Binary Series where 1 = top-K finish, 0 = not top-K
Example:
target = predictor.create_target(top_k=3)
print(f"Top-3: {target.sum()}, Rest: {(1 - target).sum()}")
Output:
🎯 Creating target: Top 3 finish
Top 3: 854 instances
Rest: 1993 instances
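Under the hood this is likely a one-line comparison on the Position column; a sketch under that assumption, with made-up positions:

```python
import pandas as pd

# Hypothetical finishing positions for six race entries
positions = pd.Series([1, 4, 2, 7, 3, 12], name="Position")

def create_target(positions: pd.Series, top_k: int = 3) -> pd.Series:
    """1 if the finish is within the top K positions, else 0."""
    return (positions <= top_k).astype(int)

target = create_target(positions, top_k=3)
print(f"Top-3: {target.sum()}, Rest: {(1 - target).sum()}")  # Top-3: 3, Rest: 3
```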
split_data
X_train, X_test, y_train, y_test = predictor.split_data(test_size=0.2)
Splits data into training and test sets using time-based splitting (chronological order).
Parameters:
test_size (float, default 0.2): Fraction of data to use for testing (0.0 to 1.0)
Example:
X_train, X_test, y_train, y_test = predictor.split_data(test_size=0.2)
Output:
✂️ Splitting data (80% train, 20% test)...
Training set: 2277 samples
Test set: 570 samples
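A time-based split keeps chronological order instead of shuffling, so the model never trains on races that happen after its test races. Roughly — a sketch with dummy data, not the class's exact code:

```python
import pandas as pd

# Hypothetical feature matrix and target, already sorted chronologically
X = pd.DataFrame({"GridPosition": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
y = pd.Series([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])

def time_based_split(X, y, test_size=0.2):
    """Earlier races go to training, later races to test; no shuffling."""
    split_idx = int(len(X) * (1 - test_size))
    return X.iloc[:split_idx], X.iloc[split_idx:], y.iloc[:split_idx], y.iloc[split_idx:]

X_train, X_test, y_train, y_test = time_based_split(X, y, test_size=0.2)
print(f"Training set: {len(X_train)} samples")  # 8
print(f"Test set: {len(X_test)} samples")       # 2
```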
train_random_forest
model = predictor.train_random_forest(n_estimators=100)
Trains a Random Forest classifier for winner prediction.
Parameters:
n_estimators (int, default 100): Number of trees in the random forest
Returns:
Trained Random Forest model instance
Model Configuration:
n_estimators: 100 (configurable)
max_depth: 10
min_samples_split: 10
min_samples_leaf: 5
random_state: 42
n_jobs: -1 (use all CPU cores)
Example:
rf_model = predictor.train_random_forest(n_estimators=150)
Output:
🌲 Training Random Forest...
Training accuracy: 0.892
Test accuracy: 0.847
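The configuration listed above maps directly onto scikit-learn's RandomForestClassifier. A sketch of the equivalent setup, using random stand-in data rather than real race features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in feature matrix and binary target (not real race data)
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] > 0).astype(int)

rf_model = RandomForestClassifier(
    n_estimators=100,     # configurable via the method argument
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,            # use all CPU cores
)
rf_model.fit(X_train, y_train)
print(f"Training accuracy: {rf_model.score(X_train, y_train):.3f}")
```

The depth and leaf-size limits keep the trees from memorizing individual races, which matters on a dataset of only a few thousand rows.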
train_xgboost
model = predictor.train_xgboost()
Trains an XGBoost classifier for winner prediction.
Returns:
Trained XGBoost model instance
Model Configuration:
n_estimators: 100
max_depth: 6
learning_rate: 0.1
random_state: 42
eval_metric: 'logloss'
Example:
xgb_model = predictor.train_xgboost()
Output:
🚀 Training XGBoost...
Training accuracy: 0.905
Test accuracy: 0.853
evaluate
predictor.evaluate()
Performs comprehensive model evaluation, including classification reports and confusion matrices for both models. Saves visualizations to ./models/confusion_matrices.png.
Example:
predictor.evaluate()
Output:
📊 Model Evaluation:
============================================================
🌲 RANDOM FOREST:
precision recall f1-score support
Not Top-3 0.93 0.95 0.94 456
Top-3 0.78 0.71 0.74 114
accuracy 0.90 570
🚀 XGBOOST:
precision recall f1-score support
Not Top-3 0.94 0.96 0.95 456
Top-3 0.82 0.75 0.78 114
accuracy 0.92 570
✓ Confusion matrices saved to ./models/confusion_matrices.png
feature_importance
predictor.feature_importance()
Analyzes and visualizes feature importance for both models. Saves plots to ./models/feature_importance.png.
Example:
predictor.feature_importance()
Output:
📈 Feature Importance:
✓ Feature importance saved to ./models/feature_importance.png
🔝 Top 5 Features (Random Forest):
GridPosition : 0.2847
Driver_AvgPosition : 0.1653
Team_AvgPosition : 0.1205
Driver_TotalWins : 0.0892
Driver_Last5_AvgPosition : 0.0734
save_models
predictor.save_models(output_dir='./models/saved_models')
Saves trained models and feature columns to disk using joblib.
Parameters:
output_dir (string, default "./models/saved_models"): Directory path where models will be saved
Saved Files:
winner_predictor_rf.pkl: Random Forest model
winner_predictor_xgb.pkl: XGBoost model
feature_columns.pkl: List of feature column names
Example:
predictor.save_models('./models/production')
Output:
💾 Saving models...
✓ Models saved successfully
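Persisting with joblib is a straightforward dump/load round trip. A sketch with placeholder objects (the paths match the file names above, but the variables are illustrative stand-ins for the fitted models):

```python
import os
import tempfile

import joblib

# Stand-ins for the trained artifacts
rf_model = {"kind": "rf-placeholder"}   # would be the fitted RandomForestClassifier
feature_columns = ["GridPosition", "Driver_AvgPosition"]

output_dir = tempfile.mkdtemp()         # stands in for './models/saved_models'
os.makedirs(output_dir, exist_ok=True)

joblib.dump(rf_model, os.path.join(output_dir, "winner_predictor_rf.pkl"))
joblib.dump(feature_columns, os.path.join(output_dir, "feature_columns.pkl"))

# Loading later for inference
restored = joblib.load(os.path.join(output_dir, "feature_columns.pkl"))
print(restored)  # ['GridPosition', 'Driver_AvgPosition']
```

Saving feature_columns alongside the models matters: at inference time the incoming DataFrame must be reordered to the exact column order seen during training.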
predict_race
prediction = predictor.predict_race(race_features)
Predicts race outcome for new race data using an ensemble of both models.
Parameters:
race_features (DataFrame): Feature values for the race to predict
Returns:
Dictionary containing prediction results with the following keys:
rf_probability (float): Random Forest probability for top-K finish
xgb_probability (float): XGBoost probability for top-K finish
ensemble_probability (float): Average probability from both models
prediction (int): Binary prediction (1 = top-K, 0 = not top-K)
Example:
# Create sample race features
race_data = pd.DataFrame([{
    'GridPosition': 3,
    'Driver_AvgPosition': 4.2,
    'Driver_AvgPoints': 12.5,
    'Driver_TotalWins': 5,
    'Team_AvgPosition': 3.1,
    # ... other features
}])

prediction = predictor.predict_race(race_data)
print(f"Ensemble probability: {prediction['ensemble_probability']:.2%}")
print(f"Prediction: {'Top-3' if prediction['prediction'] else 'Not Top-3'}")
Output:
{
    'rf_probability': 0.78,
    'xgb_probability': 0.82,
    'ensemble_probability': 0.80,
    'prediction': 1
}
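The ensemble step itself reduces to averaging the two top-K probabilities and applying the 0.5 threshold. A sketch, assuming both models expose a class-1 probability (the function name is illustrative):

```python
def ensemble_predict(rf_probability: float, xgb_probability: float,
                     threshold: float = 0.5) -> dict:
    """Average two top-K probabilities and apply a binary threshold."""
    ensemble_probability = (rf_probability + xgb_probability) / 2
    return {
        "rf_probability": rf_probability,
        "xgb_probability": xgb_probability,
        "ensemble_probability": ensemble_probability,
        "prediction": int(ensemble_probability >= threshold),
    }

result = ensemble_predict(0.78, 0.82)
print(result)  # ensemble probability ~0.80, prediction 1
```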
Complete Usage Example
Full Training Pipeline
from winner_predictor import WinnerPredictor

# Initialize predictor
predictor = WinnerPredictor()

# Load and prepare data
predictor.load_data('./data/processed/race_features.csv')
predictor.prepare_features()
predictor.create_target(top_k=3)
predictor.split_data(test_size=0.2)

# Train models
predictor.train_random_forest(n_estimators=100)
predictor.train_xgboost()

# Evaluate performance
predictor.evaluate()
predictor.feature_importance()

# Save for production
predictor.save_models('./models/saved_models')
Accuracy Metrics
Typical Performance:
Random Forest Test Accuracy: ~85-90%
XGBoost Test Accuracy: ~86-92%
Ensemble Performance: ~87-93%
Class Balance:
Top-3 finishes: ~30% of data
Non-podium: ~70% of data
Key Features
Most Important Features:
GridPosition (28-30% importance)
Driver_AvgPosition (15-17%)
Team_AvgPosition (11-13%)
Driver_TotalWins (8-10%)
Driver_Last5_AvgPosition (7-9%)
Grid position is the strongest predictor of race outcome.
Training Time
Typical Training Duration:
Data Loading: ~1-2 seconds
Feature Engineering: ~2-3 seconds
Random Forest: ~5-10 seconds
XGBoost: ~8-15 seconds
Total Pipeline: ~20-30 seconds
Hardware: 4-core CPU, 8GB RAM
Notes
The model uses time-based splitting to prevent data leakage - training data is always from earlier races than test data
Missing values are filled with sensible defaults (0 for most features, 10.0 for positions)
Both models are trained with fixed random seeds (random_state=42) for reproducibility
The ensemble prediction averages probabilities from both Random Forest and XGBoost models
Classification threshold is 0.5 for binary predictions
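The missing-value rule from the notes (0 for most features, 10.0 for positions) can be sketched as a per-column fillna; treating any column whose name contains "Position" as position-like is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in position and points features
df = pd.DataFrame({
    "GridPosition": [3.0, np.nan],
    "Driver_AvgPosition": [np.nan, 5.5],
    "Driver_AvgPoints": [12.5, np.nan],
})

# Position-like columns default to 10.0 (roughly mid-field), everything else to 0
defaults = {col: 10.0 if "Position" in col else 0.0 for col in df.columns}
filled = df.fillna(defaults)
print(filled)
```

Defaulting unknown positions to mid-field rather than 0 avoids accidentally telling the model a driver with missing history qualified on pole.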