This guide walks you through collecting F1 data, engineering features, training models, and making your first race prediction. The entire process takes approximately 2-4 hours, mostly for data collection.
The data collector gathers race results, lap times, pit stops, and weather data for the 2018-2024 seasons. The listing below shows the race-results portion:
src/data/f1_data_collector.py
import os

import fastf1
import pandas as pd

# Create the output directories first so caching and the final save do not fail
os.makedirs('./data/cache', exist_ok=True)
os.makedirs('./data/raw', exist_ok=True)

# Enable FastF1 cache
fastf1.Cache.enable_cache('./data/cache')

# Collect data for seasons 2018-2024
seasons = range(2018, 2025)
all_results = []

for year in seasons:
    print(f"\n📅 Collecting {year} season...")
    # Skip pre-season testing events, which have no race session
    schedule = fastf1.get_event_schedule(year, include_testing=False)

    for _, event in schedule.iterrows():
        try:
            session = fastf1.get_session(year, event['EventName'], 'R')
            session.load()

            # Copy so the added columns do not modify the session's own results
            results = session.results.copy()
            results['Year'] = year
            results['EventName'] = event['EventName']
            all_results.append(results)
            print(f"  ✓ {event['EventName']}")
        except Exception as e:
            print(f"  ✗ Error: {event['EventName']} - {e}")

# Save collected data
df = pd.concat(all_results, ignore_index=True)
df.to_csv('./data/raw/race_results.csv', index=False)
print(f"\n✅ Collected {len(df):,} race results")
Execute the data collector:
python src/data/f1_data_collector.py
This will take 2-4 hours! The script downloads telemetry data for 7 seasons (~150 races). Consider running it overnight.
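Because a run this long can be interrupted, it may help to record which races have already been collected so that a restart can skip them. The sketch below is one possible approach and is not part of the original collector: the checkpoint file `collected_events.json` and the helpers `load_done` and `mark_done` are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path

# Hypothetical checkpoint file; not part of the original collector.
CHECKPOINT = Path("./data/raw/collected_events.json")


def load_done(path=CHECKPOINT):
    """Return the set of (year, event_name) pairs already collected."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        return {(int(year), name) for year, name in json.load(fh)}


def mark_done(year, event_name, path=CHECKPOINT):
    """Record one finished race so a restarted run can skip it."""
    done = load_done(path)
    done.add((int(year), event_name))
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # Tuples are written as JSON lists: [[2024, "Bahrain Grand Prix"], ...]
        json.dump(sorted(done), fh, indent=2)
```

Inside the collection loop, one could check `(year, event['EventName']) in load_done()` before loading a session and call `mark_done(...)` after a race has been collected. For the skip to be safe, each race's results would also need to be written to disk as they arrive rather than only at the end of the run.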
Expected Output:
📅 Collecting 2024 season...
  ✓ Bahrain Grand Prix
  ✓ Saudi Arabian Grand Prix
  ✓ Australian Grand Prix
  ...

✅ Collected 2,537 race results
💾 Saved: data/raw/race_results.csv
💾 Saved: data/raw/lap_times.csv (139,135 laps)
💾 Saved: data/raw/pit_stops.csv (4,512 pit stops)
💾 Saved: data/raw/weather.csv (127 records)
3. Verify Data Collection
Check that all data files were created successfully:
Race Results: 2,537 records
Lap Times: 139,135 laps
Pit Stops: 4,512 stops
Weather: 127 races

Sample Data:
   Year  EventName           DriverCode  TeamName  GridPosition  Position  Points
0  2024  Bahrain Grand Prix  VER         Red Bull  1             1         25.0
1  2024  Bahrain Grand Prix  PER         Red Bull  2             2         18.0
2  2024  Bahrain Grand Prix  SAI         Ferrari   3             3         15.0
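The guide does not show the script that produces this summary. A minimal sketch of such a check, assuming the four file names from the collector's expected output and using hypothetical helper names (`count_rows`, `verify_raw_data`), might look like this:

```python
import csv
from pathlib import Path

# File names taken from the collector's expected output.
EXPECTED_FILES = ["race_results.csv", "lap_times.csv", "pit_stops.csv", "weather.csv"]


def count_rows(csv_path):
    """Count data rows in a CSV file, excluding the header line."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def verify_raw_data(raw_dir="./data/raw"):
    """Return {file name: row count}, with None for a missing file."""
    report = {}
    for name in EXPECTED_FILES:
        path = Path(raw_dir) / name
        report[name] = count_rows(path) if path.exists() else None
    return report


if __name__ == "__main__":
    for name, rows in verify_raw_data().items():
        status = "MISSING" if rows is None else f"{rows:,} rows"
        print(f"{name}: {status}")
```

A `None` entry points to a file the collector never wrote, which usually means the run stopped early or that part of the collector raised an error.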
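Build the prediction models using Random Forest and XGBoost. The listing below trains the Random Forest model, which predicts whether a driver finishes in the top 3: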
src/models/winner_predictor.py
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # for the XGBoost model; its training step is not shown in this listing


class WinnerPredictor:
    def __init__(self):
        self.rf_model = None
        self.xgb_model = None
        self.feature_columns = None

    def load_data(self, data_path='./data/processed/race_features.csv'):
        """Load engineered features"""
        data = pd.read_csv(data_path)
        # Keep rows with a recorded finishing position; copy so that adding
        # columns later does not trigger pandas chained-assignment warnings
        self.data = data[data['Position'].notna()].copy()
        print(f"✓ Loaded {len(self.data):,} race results")
        return self.data

    def prepare_features(self):
        """Select features for modeling"""
        feature_cols = [
            'GridPosition',
            'Driver_AvgPosition',
            'Driver_AvgPoints',
            'Driver_TotalWins',
            'Driver_TotalPodiums'
        ]
        self.feature_columns = feature_cols
        print(f"✓ Using {len(feature_cols)} features")
        return feature_cols

    def create_target(self, top_k=3):
        """Create target: top-k finish (1) or not (0)"""
        self.data['IsTopK'] = (self.data['Position'] <= top_k).astype(int)
        print(f"✓ Top {top_k}: {self.data['IsTopK'].sum()} instances")
        return self.data['IsTopK']

    def train_random_forest(self, n_estimators=100):
        """Train Random Forest classifier"""
        print("\n🌲 Training Random Forest...")
        self.rf_model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=10,
            random_state=42,
            n_jobs=-1
        )
        X = self.data[self.feature_columns]
        y = self.data['IsTopK']
        self.rf_model.fit(X, y)

        # Accuracy on the training rows themselves, not on held-out data
        accuracy = self.rf_model.score(X, y)
        print(f"✓ Training accuracy: {accuracy:.3f}")
        return self.rf_model

    def save_models(self):
        """Save trained models"""
        os.makedirs('./models/saved_models', exist_ok=True)
        joblib.dump(self.rf_model, './models/saved_models/winner_predictor_rf.pkl')
        joblib.dump(self.feature_columns, './models/saved_models/feature_columns.pkl')
        print("\n💾 Models saved to ./models/saved_models/")
2. Execute Model Training
Create and run the master training script:
train_all_models.py
from src.models.winner_predictor import WinnerPredictor

print("=" * 70)
print("🏎️ F1 MACHINE LEARNING - TRAINING PIPELINE")
print("=" * 70)

# Initialize predictor
predictor = WinnerPredictor()

# Load and prepare data
predictor.load_data('./data/processed/race_features.csv')
predictor.prepare_features()
predictor.create_target(top_k=3)

# Train models
predictor.train_random_forest(n_estimators=100)

# Save models
predictor.save_models()

print("\n🎉 Training complete!")
print("\nNext: Run predictions with the trained model")
Run the training:
python train_all_models.py
Expected Output:
======================================================================
🏎️ F1 MACHINE LEARNING - TRAINING PIPELINE
======================================================================
✓ Loaded 2,134 race results
✓ Using 5 features
✓ Top 3: 641 instances

🌲 Training Random Forest...
✓ Training accuracy: 0.847

💾 Models saved to ./models/saved_models/

🎉 Training complete!
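The 0.847 figure is measured on the same rows the model was trained on, so it probably overstates how well the model generalizes. For context, with 641 top-3 rows out of 2,134, always predicting "not top 3" would already score about 1 - 641/2134 ≈ 0.70. The sketch below shows one way to get a more honest estimate by holding out the most recent season. It assumes the feature file carries a `Year` column (not confirmed by the guide), uses a hypothetical helper `evaluate_holdout`, and runs on synthetic stand-in data rather than the real CSV.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

FEATURES = ['GridPosition', 'Driver_AvgPosition', 'Driver_AvgPoints',
            'Driver_TotalWins', 'Driver_TotalPodiums']


def evaluate_holdout(data, test_year, features=FEATURES, top_k=3):
    """Train on seasons before test_year and score on test_year only."""
    data = data.copy()
    data['IsTopK'] = (data['Position'] <= top_k).astype(int)
    train = data[data['Year'] < test_year]
    test = data[data['Year'] == test_year]

    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(train[features], train['IsTopK'])

    accuracy = accuracy_score(test['IsTopK'], model.predict(test[features]))
    # Baseline: always predict the majority class seen in training
    majority = int(train['IsTopK'].mean() >= 0.5)
    baseline = float((test['IsTopK'] == majority).mean())
    return accuracy, baseline


# Synthetic stand-in data so the sketch runs without the real CSV
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'Year': rng.choice([2022, 2023, 2024], size=n),
    'GridPosition': rng.integers(1, 21, size=n),
    'Driver_AvgPosition': rng.uniform(1, 20, size=n),
    'Driver_AvgPoints': rng.uniform(0, 25, size=n),
    'Driver_TotalWins': rng.integers(0, 60, size=n),
    'Driver_TotalPodiums': rng.integers(0, 120, size=n),
})
# Toy finishing position: grid position plus a little noise
demo['Position'] = np.clip(demo['GridPosition'] + rng.integers(-3, 4, size=n), 1, 20)

acc, base = evaluate_holdout(demo, test_year=2024)
print(f"Held-out accuracy: {acc:.3f} (majority baseline: {base:.3f})")
```

Splitting by season rather than at random keeps future races out of the training set, which is closer to how the model would be used for an upcoming race.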
predict.py
import joblib
import pandas as pd

# Load the trained model and the feature list saved during training
model = joblib.load('./models/saved_models/winner_predictor_rf.pkl')
features = joblib.load('./models/saved_models/feature_columns.pkl')

print("🏎️ F1 Race Winner Predictor\n")
print("=" * 50)

# Example: predict Max Verstappen starting from pole position
driver_data = {
    'GridPosition': 1,           # Pole position
    'Driver_AvgPosition': 2.5,   # Historical average
    'Driver_AvgPoints': 18.2,    # Average points
    'Driver_TotalWins': 50,      # Career wins
    'Driver_TotalPodiums': 95    # Career podiums
}

# Build a one-row DataFrame with columns in the order used for training
X = pd.DataFrame([driver_data])[features]

# Predict: column 1 of predict_proba is the probability of a top-3 finish
probability = model.predict_proba(X)[0][1]
prediction = model.predict(X)[0]

print("Driver: Max Verstappen")
print(f"Grid Position: P{driver_data['GridPosition']}")
print(f"\nPrediction: {'Top 3 Finish ✅' if prediction else 'Outside Top 3'}")
print(f"Confidence: {probability*100:.1f}%")
print("=" * 50)
2. Run Your First Prediction
Execute the prediction:
python predict.py
Expected Output:
🏎️ F1 Race Winner Predictor

==================================================
Driver: Max Verstappen
Grid Position: P1

Prediction: Top 3 Finish ✅
Confidence: 89.3%
==================================================
3. Try Different Scenarios
Modify the driver data to test different scenarios:
# Scenario 1: Rookie driver starting from back of grid
rookie_data = {
    'GridPosition': 20,
    'Driver_AvgPosition': 15.0,
    'Driver_AvgPoints': 1.2,
    'Driver_TotalWins': 0,
    'Driver_TotalPodiums': 0
}

# Scenario 2: Mid-field driver with good form
midfield_data = {
    'GridPosition': 8,
    'Driver_AvgPosition': 9.5,
    'Driver_AvgPoints': 5.8,
    'Driver_TotalWins': 2,
    'Driver_TotalPodiums': 12
}
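Several scenarios can be scored in a single call by stacking them into one DataFrame. The following is a sketch only: `score_scenarios` is a hypothetical helper, and a small model trained on synthetic data with a toy target stands in for the saved `winner_predictor_rf.pkl` so that the example runs on its own. Its probabilities are therefore not meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ['GridPosition', 'Driver_AvgPosition', 'Driver_AvgPoints',
            'Driver_TotalWins', 'Driver_TotalPodiums']


def score_scenarios(model, features, scenarios):
    """Return the probability of class 1 (top-3 finish) per named scenario."""
    X = pd.DataFrame(list(scenarios.values()), index=list(scenarios.keys()))[features]
    # Look up which predict_proba column belongs to class 1
    positive_col = list(model.classes_).index(1)
    proba = model.predict_proba(X)[:, positive_col]
    return pd.Series(proba, index=X.index, name='Top3Probability')


# Stand-in model trained on synthetic data (toy target, illustration only)
rng = np.random.default_rng(1)
n = 300
train = pd.DataFrame({
    'GridPosition': rng.integers(1, 21, size=n),
    'Driver_AvgPosition': rng.uniform(1, 20, size=n),
    'Driver_AvgPoints': rng.uniform(0, 25, size=n),
    'Driver_TotalWins': rng.integers(0, 60, size=n),
    'Driver_TotalPodiums': rng.integers(0, 120, size=n),
})
toy_target = (train['GridPosition'] <= 3).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(train[FEATURES], toy_target)

scenarios = {
    'rookie': {'GridPosition': 20, 'Driver_AvgPosition': 15.0, 'Driver_AvgPoints': 1.2,
               'Driver_TotalWins': 0, 'Driver_TotalPodiums': 0},
    'midfield': {'GridPosition': 8, 'Driver_AvgPosition': 9.5, 'Driver_AvgPoints': 5.8,
                 'Driver_TotalWins': 2, 'Driver_TotalPodiums': 12},
}
result = score_scenarios(model, FEATURES, scenarios)
print(result)
```

With the real artifacts, `model` and `FEATURES` would instead come from the two `joblib.load(...)` calls in `predict.py`.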
Grid position is the strongest predictor. Drivers starting in the top 3 have a ~40% chance of finishing on the podium.
GridPosition (35% importance) - Starting position is crucial!
Driver_TotalWins (18%) - Past success predicts future performance
Driver_AvgPosition (12%) - Consistency matters
Driver_TotalPodiums (10%) - Experience on the podium
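The guide does not show how these percentages were obtained. One common source is the `feature_importances_` attribute of a fitted scikit-learn `RandomForestClassifier`, which holds impurity-based importances that sum to 1. The sketch below assumes that approach, uses a hypothetical helper `rank_importances`, and fits a stand-in model on synthetic data with a toy target, so the numbers it prints will not match the ones listed above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ['GridPosition', 'Driver_AvgPosition', 'Driver_AvgPoints',
            'Driver_TotalWins', 'Driver_TotalPodiums']


def rank_importances(model, feature_names):
    """Return impurity-based feature importances, largest first."""
    return (pd.Series(model.feature_importances_, index=feature_names)
              .sort_values(ascending=False))


# Stand-in model on synthetic data; the toy target depends only on grid position
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    'GridPosition': rng.integers(1, 21, size=n),
    'Driver_AvgPosition': rng.uniform(1, 20, size=n),
    'Driver_AvgPoints': rng.uniform(0, 25, size=n),
    'Driver_TotalWins': rng.integers(0, 60, size=n),
    'Driver_TotalPodiums': rng.integers(0, 120, size=n),
})
y = (X['GridPosition'] <= 3).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X[FEATURES], y)

ranked = rank_importances(model, FEATURES)
for name, value in ranked.items():
    print(f"{name}: {value:.1%}")
```

Impurity-based importances can be skewed toward features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check.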
Grid Position Impact: Drivers starting from pole position have a ~40% chance of winning, while those starting outside the top 10 have a less than 5% chance of a podium finish.
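Figures like these can be checked directly against the collected results. The sketch below assumes the `GridPosition` and `Position` columns shown in the sample data, uses a hypothetical helper `podium_rate_by_grid`, and runs on a tiny hand-made frame rather than the real file.

```python
import pandas as pd


def podium_rate_by_grid(results, top_k=3):
    """Share of top-k finishes for each starting position."""
    # Drop rows without a recorded finishing position
    finished = results.dropna(subset=['Position'])
    is_podium = finished['Position'] <= top_k
    return is_podium.groupby(finished['GridPosition']).mean()


# Hand-made example: four starts from P1, four from P12 (one without a result)
demo = pd.DataFrame({
    'GridPosition': [1, 1, 1, 1, 12, 12, 12, 12],
    'Position':     [1, 2, 5, 1, 10, 3, 14, None],
})
rates = podium_rate_by_grid(demo)
print(rates)
```

On the real data, the call would be `podium_rate_by_grid(pd.read_csv('./data/raw/race_results.csv'))`. Passing `top_k=1` gives the win rate per starting position instead.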
Congratulations! You’ve successfully:
✅ Collected 7 years of F1 data
✅ Engineered predictive features
✅ Trained machine learning models
✅ Made your first race prediction
Review error messages carefully - they often indicate missing dependencies
Verify data files exist in data/raw/ directory
Ensure models are saved in models/saved_models/ before prediction
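The two file checks above can be automated before running `predict.py`. This is a minimal sketch using paths taken from this guide; the helper name `missing_artifacts` is hypothetical.

```python
from pathlib import Path

# Paths used earlier in this guide, relative to the project root
REQUIRED = [
    "data/raw/race_results.csv",
    "models/saved_models/winner_predictor_rf.pkl",
    "models/saved_models/feature_columns.pkl",
]


def missing_artifacts(project_root=".", required=REQUIRED):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  - {rel}")
    else:
        print("All required files are present.")
```

A missing `.pkl` file means the training script has not been run successfully; a missing CSV points back to the data collection step.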
Remember: F1 races are inherently unpredictable! Even the best models can’t account for crashes, mechanical failures, or strategic surprises. Use predictions as guidance, not guarantees.