Overview

The F1 ML Prediction System uses an ensemble of machine learning models to predict race winners. It averages the predicted probabilities of a Random Forest classifier and an XGBoost classifier, which gives more stable predictions than either model alone.

Model Accuracy

Training Accuracy

85-90%
Performance on historical training data (2018-2023)

Test Accuracy

75-80%
Real-world validation on held-out 2024 data

Top-3 Accuracy

80-85%
Podium prediction accuracy (Top 3 finishers)

The enhanced V2 model, which adds weather, tire, and circuit factors, achieves 85.9% accuracy.

Performance Metrics by Model

Random Forest Classifier

The Random Forest model uses the following configuration:
from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(
    n_estimators=100,        # 100 decision trees
    max_depth=10,            # Maximum tree depth
    min_samples_split=10,    # Minimum samples to split node
    min_samples_leaf=5,      # Minimum samples per leaf
    random_state=42,         # Fixed seed for reproducible results
    n_jobs=-1                # Use all CPU cores
)
Location: source/winner_predictor.py:98-109
Accuracy: ~85% (training), ~78% (test)
Precision/Recall:
  • Top-3 Finish: 0.82 precision, 0.79 recall
  • Outside Top-3: 0.91 precision, 0.93 recall
F1-Score: 0.80 for Top-3 predictions
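
The F1-score is the harmonic mean of precision and recall. As a quick check, the minimal sketch below reproduces the Random Forest Top-3 figure from the precision and recall listed above. The helper name `f1_score_from` is illustrative and not part of the project.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Random Forest, Top-3 class: values taken from the metrics above.
rf_top3_f1 = f1_score_from(0.82, 0.79)
print(f"{rf_top3_f1:.2f}")  # prints 0.80
```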

XGBoost Classifier

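from xgboost import XGBClassifier

XGBClassifier(
    n_estimators=100,          # 100 boosting rounds
    max_depth=6,               # Maximum tree depth
    learning_rate=0.1,         # Shrinkage applied to each tree's contribution
    random_state=42,           # Fixed seed for reproducible results
    use_label_encoder=False,   # Deprecated in newer XGBoost releases
    eval_metric='logloss'      # Log loss as the evaluation metric
)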
Location: source/winner_predictor.py:122-133
Accuracy: ~87% (training), ~76% (test)
Precision/Recall:
  • Top-3 Finish: 0.80 precision, 0.81 recall
  • Outside Top-3: 0.92 precision, 0.91 recall
F1-Score: 0.81 for Top-3 predictions

Ensemble Model

The final prediction uses an average ensemble of Random Forest and XGBoost:
ensemble_proba = (rf_proba + xgb_proba) / 2
Combined Accuracy: ~80% on test set
The ensemble approach reduces overfitting and provides more stable predictions than individual models.
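
To make the averaging step concrete, here is a minimal pure-Python sketch. The probabilities are made up for illustration, the helper `average_ensemble` is hypothetical, and the 0.5 decision threshold is an assumption rather than something the project documents.

```python
from typing import List


def average_ensemble(rf_proba: List[float], xgb_proba: List[float]) -> List[float]:
    """Element-wise mean of two models' positive-class probabilities."""
    if len(rf_proba) != len(xgb_proba):
        raise ValueError("both models must score the same drivers")
    return [(rf + xgb) / 2 for rf, xgb in zip(rf_proba, xgb_proba)]


# Hypothetical probabilities for three drivers (not real model output).
rf_proba = [0.70, 0.40, 0.10]
xgb_proba = [0.90, 0.30, 0.20]

ensemble_proba = average_ensemble(rf_proba, xgb_proba)

# Assumed decision rule: predict the positive class at probability >= 0.5.
predictions = [p >= 0.5 for p in ensemble_proba]
print(predictions)  # prints [True, False, False]
```

Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one, which is one reason this kind of ensemble tends to be more stable.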

Feature Importance Analysis

Top 10 Predictive Features

These features have the highest impact on race winner predictions. The top five are:
1. GridPosition (35% importance)

Most important factor! Starting position directly correlates with race outcomes.
  • Pole position converts to wins ~40% of the time
  • Top 3 grid positions account for 65% of race wins
File: Features extracted in source/feature_engineering.py
2. Driver_TotalWins (18% importance)

Historical win count indicates driver skill and experience.
  • Verstappen: 50+ wins
  • Hamilton: 103 wins
  • Past success predicts future performance
3. Team_AvgPosition (12% importance)

Team/car performance is crucial for competitive results.
  • Red Bull Racing: Avg position 2.3
  • Ferrari: Avg position 3.8
  • Mercedes: Avg position 4.1
4. Driver_Last5_AvgPoints (10% importance)

Recent form matters: drivers in good form perform better. This feature tracks a rolling 5-race average of championship points.
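
A rolling 5-race average can be sketched as follows. This is an illustration under stated assumptions, not the code in source/feature_engineering.py: the function name `last5_avg_points` is hypothetical, the current race is excluded so the feature only uses results known before the start, and a driver with no prior races gets 0.0.

```python
from collections import deque
from typing import List


def last5_avg_points(points_by_race: List[float], window: int = 5) -> List[float]:
    """For each race, average the points scored in up to `window` earlier races.

    The current race is excluded so the feature only uses information
    available before the race starts (an assumption about the pipeline).
    """
    recent = deque(maxlen=window)  # keeps only the most recent `window` results
    feature = []
    for points in points_by_race:
        feature.append(sum(recent) / len(recent) if recent else 0.0)
        recent.append(points)
    return feature


# Hypothetical season: points scored in seven consecutive races.
points = [25, 18, 0, 25, 15, 12, 26]
form = last5_avg_points(points)
print(form)
```

Excluding the current race matters: including it would leak the outcome being predicted into the input features.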
5. Driver_CircuitAvgPosition (8% importance)

Track-specific experience improves performance.
  • Some drivers excel at street circuits
  • Others dominate high-speed tracks

Complete Feature List

The model uses 21 features across 4 categories. The driver and grid features are:
  • Driver_AvgPosition - Career average finishing position
  • Driver_AvgPoints - Average points per race
  • Driver_TotalWins - Total career wins
  • Driver_TotalPodiums - Total podium finishes
  • Driver_DNFRate - Did Not Finish percentage
  • Driver_Last5_AvgPosition - Recent 5-race average position
  • Driver_Last5_AvgPoints - Recent 5-race points
  • Driver_CircuitExperience - Races at this circuit
  • Driver_CircuitAvgPosition - Average position at circuit
  • Driver_AvgGridPosition - Average starting position
  • Driver_GridGain - Average positions gained from grid
  • GridPosition - Current race starting position
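
Two of these features, Driver_DNFRate and Driver_GridGain, are simple aggregates over a driver's race history. The sketch below shows one plausible way to compute them. The record layout, the helper names `dnf_rate` and `grid_gain`, and the choice to skip DNFs when averaging grid gain are all assumptions, not the project's actual implementation.

```python
from typing import Dict, List, Optional

# Hypothetical race records for one driver; finish=None marks a DNF.
races: List[Dict[str, Optional[int]]] = [
    {"grid": 3, "finish": 1},
    {"grid": 5, "finish": 4},
    {"grid": 2, "finish": None},
    {"grid": 8, "finish": 6},
]


def dnf_rate(records) -> float:
    """Share of races the driver did not finish (0.0 with no history)."""
    if not records:
        return 0.0
    return sum(r["finish"] is None for r in records) / len(records)


def grid_gain(records) -> float:
    """Mean of (grid - finish) over finished races; positive means places gained."""
    gains = [r["grid"] - r["finish"] for r in records if r["finish"] is not None]
    return sum(gains) / len(gains) if gains else 0.0


print(dnf_rate(races))   # prints 0.25
print(grid_gain(races))  # mean of the gains 2, 1 and 2
```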

Evaluation Visualizations

The training process generates detailed evaluation charts:

Confusion Matrices

Location: models/confusion_matrices.png
Shows prediction accuracy for both Random Forest and XGBoost models:
  • True Positives: Correctly predicted Top-3
  • True Negatives: Correctly predicted outside Top-3
  • False Positives: Incorrectly predicted Top-3
  • False Negatives: Missed Top-3 predictions
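
The four cells above can be counted directly from true and predicted labels, and the accuracy, precision, and recall figures follow from them. The sketch below uses made-up labels for ten drivers; the helper name `confusion_counts` is hypothetical.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels where 1 means a Top-3 finish."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


# Hypothetical labels for ten drivers in one race (not real model output).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)          # of predicted Top-3, how many were right
recall = tp / (tp + fn)             # of actual Top-3, how many were found
accuracy = (tp + tn) / len(y_true)  # share of all predictions that were right
print(tp, tn, fp, fn)  # prints 2 6 1 1
```

Because only three of every twenty or so drivers finish on the podium, accuracy alone can look high even when Top-3 recall is weak, which is why the per-class precision and recall are reported separately.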

Feature Importance Charts

Location: models/feature_importance.png
Horizontal bar charts showing:
  • Top 10 features by importance score
  • Comparison between RF and XGBoost feature rankings
  • Relative contribution percentages
Confusion matrices and feature importance plots are generated after running:
python train_all_models.py
Ensure models are trained before viewing visualizations.
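
The relative contribution percentages in the feature importance chart can be derived from raw importance scores, such as those exposed by a fitted model's `feature_importances_` attribute. The sketch below is only an illustration: the helper `top_features` is hypothetical and the raw scores are invented.

```python
from typing import Dict, List, Tuple


def top_features(importances: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    """Normalize raw scores to percentages and return the k largest."""
    total = sum(importances.values())
    if total == 0:
        raise ValueError("importance scores sum to zero")
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [(name, 100 * score / total) for name, score in ranked[:k]]


# Hypothetical raw scores (not the project's real importances).
raw_scores = {
    "Driver_DNFRate": 1.0,
    "GridPosition": 7.0,
    "Driver_TotalWins": 2.0,
}

top_two = top_features(raw_scores, k=2)
for name, pct in top_two:
    print(f"{name}: {pct:.1f}%")
```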

Model Performance by Conditions

Weather Impact on Accuracy

Accuracy: 82%
Highest accuracy in dry races where grid position dominates:
  • Pole position win rate: 42%
  • Top 3 grid → 65% podium rate
  • Predictable tire strategies

Key Performance Insights

Grid Position Dominance

The GridPosition feature accounts for 35% of model importance - where you start matters most!

Rain Equalizer

Wet races increase prediction uncertainty by 15-20% but create opportunities for underdogs.

Team vs Driver

Team performance (12%) + Driver skill (18%) = 30% combined importance. Both matter significantly.

Recent Form

Last 5 races account for 10% importance - momentum and confidence are real factors.

Limitations & Future Improvements

Limitations:
  1. Limited to 2018-2024 data - Only 7 years of historical races (~140 events)
  2. No qualifying data - Grid position used instead of qualifying times
  3. Basic weather modeling - Binary rain indicator rather than detailed conditions
  4. No safety car events - Race interruptions not modeled
  5. Static tire strategy - Rule-based rather than ML-predicted

Future improvements:
  • Add qualifying telemetry - Sector times, speed traps, mini-sector analysis
  • LSTM for tire degradation - Time-series modeling of compound performance
  • Neural networks - Deep learning for complex feature interactions
  • Safety car prediction - Probability model for race interruptions
  • Real-time updates - Live race prediction updates during sessions

Model Files & Locations

Saved model files in models/saved_models/:
  • winner_predictor_rf.pkl - Random Forest model (V1)
  • winner_predictor_xgb.pkl - XGBoost model (V1)
  • winner_predictor_v2.pkl - Enhanced ensemble model (V2)
  • feature_columns.pkl - Feature list for V1
  • feature_columns_v2.pkl - Feature list for V2
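
Loading one of these files might look like the sketch below. It assumes the artifacts were written with Python's standard `pickle` module, which the .pkl extension suggests but does not guarantee; if the project saves them with joblib, `joblib.load` would be the counterpart. The helper `load_artifact` is hypothetical, and unpickling a saved model also requires the library that created it (scikit-learn or XGBoost) to be installed.

```python
import pickle
from pathlib import Path
from typing import Any

MODEL_DIR = Path("models/saved_models")  # directory named in this section


def load_artifact(filename: str, model_dir: Path = MODEL_DIR) -> Any:
    """Unpickle one saved artifact; only load files you trust."""
    path = model_dir / filename
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; train the models first")
    with path.open("rb") as handle:
        return pickle.load(handle)


# Usage, once train_all_models.py has produced the files:
# model = load_artifact("winner_predictor_v2.pkl")
# feature_columns = load_artifact("feature_columns_v2.pkl")
```

Keeping the feature list next to the model lets prediction code put input columns in the same order the model was trained on.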

Next Steps

1. Train Models

Generate fresh models with latest data:
python train_all_models.py
2. Evaluate Performance

Review confusion matrices and feature importance:
open models/confusion_matrices.png
open models/feature_importance.png
3. Test Predictions

Use the web dashboard to validate predictions:
python src/app.py
# Open http://localhost:5000