System Overview
The Formula 1 ML Prediction System is a comprehensive race prediction platform that combines historical race data, machine learning models, and real-time simulation to predict race outcomes with high accuracy.
- 85.9% Accuracy: Achieved through ensemble learning and feature engineering
- Real-time Simulation: Lap-by-lap race engine with pit stops and safety cars
- Weather-Aware: Adapts predictions based on weather conditions
- 20+ Features: Driver, team, circuit, weather, and tire strategy features
Architecture Components
The system consists of five main components that work together to deliver accurate race predictions:
1. Data Collection Layer
File: collect_working.py
Collects historical F1 data using the FastF1 API:
- Race results (positions, points, grid positions)
- Driver and team information
- Event metadata (year, round, event name)
- Multiple seasons (2023-2024)
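As a minimal sketch of what this collection step might look like, the code below fetches race results per round and flattens them into rows. The function names `results_to_rows` and `collect_season` and the output keys are hypothetical and are not taken from collect_working.py; only `fastf1.get_event_schedule`, `fastf1.get_session`, `Session.load`, and `Session.results` are FastF1 calls.

```python
def results_to_rows(results, year, round_number, event_name):
    """Flatten one race's result records into plain dicts with event metadata."""
    rows = []
    for r in results:
        rows.append({
            "year": year,
            "round": round_number,
            "event_name": event_name,
            "driver": r["Abbreviation"],
            "team": r["TeamName"],
            "grid_position": r["GridPosition"],
            "position": r["Position"],
            "points": r["Points"],
            "status": r["Status"],
        })
    return rows


def collect_season(year):
    """Fetch race results for every round of one season (needs network access)."""
    import fastf1  # imported lazily so the helper above works without FastF1

    schedule = fastf1.get_event_schedule(year, include_testing=False)
    rows = []
    for _, event in schedule.iterrows():
        round_number = int(event["RoundNumber"])
        session = fastf1.get_session(year, round_number, "R")
        # Only results are needed here, so skip laps, telemetry, weather, messages.
        session.load(laps=False, telemetry=False, weather=False, messages=False)
        rows.extend(results_to_rows(
            session.results.to_dict("records"),
            year, round_number, event["EventName"],
        ))
    return rows
```

Calling `collect_season(2023) + collect_season(2024)` would then cover the two seasons mentioned above.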
2. Feature Engineering Pipeline
Files: feature_engineering.py, feature_engineering_v2.py
Transforms raw data into ML-ready features:
Features are grouped into four categories: Driver, Team, Weather, and Tire. The driver features are:
- Average position (historical)
- Average points per race
- Total wins and podiums
- DNF rate
- Circuit-specific performance
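The per-driver aggregates in the list above could be computed roughly as follows. This is an illustrative sketch, not the code in feature_engineering.py: the function name `driver_features` and the row keys (`driver`, `position`, `points`, `finished`) are assumptions.

```python
from collections import defaultdict


def driver_features(rows):
    """Aggregate each driver's race history into summary features.

    Each row is a dict with 'driver', 'position' (None if unclassified),
    'points', and 'finished' (False for a DNF). These keys are assumptions.
    """
    by_driver = defaultdict(list)
    for row in rows:
        by_driver[row["driver"]].append(row)

    features = {}
    for driver, races in by_driver.items():
        classified = [r["position"] for r in races if r["position"] is not None]
        n = len(races)
        features[driver] = {
            "avg_position": sum(classified) / len(classified) if classified else None,
            "avg_points": sum(r["points"] for r in races) / n,
            "wins": sum(1 for p in classified if p == 1),
            "podiums": sum(1 for p in classified if p <= 3),
            "dnf_rate": sum(1 for r in races if not r["finished"]) / n,
        }
    return features
```

Circuit-specific performance could be derived the same way by grouping on a (driver, circuit) pair instead of the driver alone.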
3. ML Model Layer
Files: winner_predictor.py, train_model_v2.py
Two ensemble models working together:
Random Forest
- 150 estimators
- Max depth: 12
- Min samples split: 8
- Primary prediction model
XGBoost
- 100 estimators
- Max depth: 6
- Learning rate: 0.1
- Gradient boosting ensemble
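For reference, the hyperparameters above map onto scikit-learn and XGBoost constructor arguments as in the sketch below. The names `RF_PARAMS`, `XGB_PARAMS`, and `build_models` and the `random_state` value are assumptions added for illustration; the project's training scripts may set further options.

```python
# Hyperparameters as listed in this section.
RF_PARAMS = {"n_estimators": 150, "max_depth": 12, "min_samples_split": 8}
XGB_PARAMS = {"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1}


def build_models(random_state=42):
    """Instantiate both classifiers; random_state=42 is an assumed default."""
    # Imported lazily so the parameter dicts can be inspected without the ML stack.
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier

    rf = RandomForestClassifier(random_state=random_state, **RF_PARAMS)
    xgb = XGBClassifier(random_state=random_state, **XGB_PARAMS)
    return rf, xgb
```

Both objects expose `fit(X, y)` and `predict_proba(X)`, so they can be trained and queried interchangeably.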
4. Race Simulation Engine
File: race_engine.py
Lap-by-lap race simulator with realistic physics:
- Real lap times with tire degradation
- Pit stop strategy (undercut/overcut)
- Safety car periods
- Weather changes mid-race
- DNFs and mechanical failures
- Driver skill ratings and team performance
5. Web Application
File: app.py
Flask-based web interface with multiple features:
- Winner prediction with probabilities
- Weather impact analysis
- Tire strategy optimizer
- Feature importance visualization
- Driver head-to-head comparison
- Full race simulation
- 2026 season prediction
- Lap-by-lap race viewer
Data Flow Architecture
The system uses time-based data splitting to prevent data leakage. Training uses only historical data that occurred before the test races.
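A time-based split can be sketched in a few lines. The function name `time_based_split` and the `year`/`round` keys are assumptions for illustration; the point is that rows are partitioned by race date order rather than shuffled.

```python
def time_based_split(rows, test_from):
    """Split rows so every training race precedes every test race.

    rows: dicts with 'year' and 'round' keys (assumed names).
    test_from: (year, round) tuple; races at or after it go to the test set.
    """
    train = [r for r in rows if (r["year"], r["round"]) < test_from]
    test = [r for r in rows if (r["year"], r["round"]) >= test_from]
    return train, test
```

Any features built from history, such as a driver's average position, should likewise be computed from the training rows only; otherwise results from the test races leak back into the inputs.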
Component Interaction
Prediction Flow
- User Input → Grid position, weather, tire, circuit type
- Feature Creation → Convert inputs to feature vector
- Model Ensemble → RF and XGBoost predict independently
- Probability Averaging → Average predictions for final result
- Response → Return probability and insights
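The feature-creation and averaging steps of this flow might look like the sketch below. Both function names are hypothetical, and defaulting a missing input to 0 is an assumption rather than documented behavior.

```python
def build_feature_vector(inputs, feature_columns):
    """Order user inputs to match the training column order.

    Missing inputs default to 0.0 (an assumption made for this sketch).
    """
    return [float(inputs.get(col, 0.0)) for col in feature_columns]


def ensemble_win_probability(rf_proba, xgb_proba):
    """Average the two models' win probabilities, one value per driver."""
    if len(rf_proba) != len(xgb_proba):
        raise ValueError("both models must score the same drivers")
    return [(a + b) / 2 for a, b in zip(rf_proba, xgb_proba)]
```

With scikit-learn style binary classifiers, the per-driver inputs would typically come from `model.predict_proba(X)[:, 1]`, the probability of the positive ("wins") class.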
Simulation Flow
- Setup → Initialize 20 drivers with skill ratings
- Qualifying → Determine starting grid based on driver/team performance
- Strategy Assignment → Assign tire strategies based on weather
- Lap Loop → Simulate 50 laps with:
  - Tire degradation
  - Pit stops
  - Safety cars
  - Weather changes
  - DNF events
- Results → Generate final standings and statistics
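The lap loop can be illustrated with a heavily simplified simulator. This is a sketch under stated assumptions, not the logic in race_engine.py: every numeric constant is made up for illustration, each car runs a single fixed one-stop strategy, a safety car only slows the lap rather than bunching the field, and qualifying and weather changes are left out.

```python
import random


def simulate_race(drivers, laps=50, seed=0):
    """Simulate a race lap by lap and return cars in finishing order.

    drivers: list of (name, skill) with skill in [0, 1]; list order is the grid.
    All numeric constants below are illustrative, not calibrated values.
    """
    rng = random.Random(seed)
    cars = [
        {"name": name, "skill": skill, "time": 0.3 * slot, "tire_age": 0,
         "pitted": False, "running": True, "laps": 0}
        for slot, (name, skill) in enumerate(drivers)
    ]
    for lap in range(1, laps + 1):
        safety_car = rng.random() < 0.03          # ~3% chance per lap
        for car in cars:
            if not car["running"]:
                continue
            if rng.random() < 0.002:              # mechanical failure (DNF)
                car["running"] = False
                continue
            if safety_car:
                lap_time = 120.0                  # fixed slow pace for everyone
            else:
                lap_time = (
                    90.0
                    - 1.5 * car["skill"]          # better drivers lap faster
                    + 0.05 * car["tire_age"]      # linear tire degradation
                    + rng.gauss(0.0, 0.2)         # lap-to-lap variation
                )
            # One-stop strategy: pit once when the tires reach 25 laps of age.
            if not car["pitted"] and car["tire_age"] >= 25:
                lap_time += 22.0                  # time lost in the pit lane
                car["tire_age"] = 0
                car["pitted"] = True
            else:
                car["tire_age"] += 1
            car["time"] += lap_time
            car["laps"] = lap
    # Classify by laps completed (descending), then total time (ascending).
    return sorted(cars, key=lambda c: (-c["laps"], c["time"]))
```

Passing a fixed `seed` makes a run reproducible, which helps when comparing strategies across otherwise identical races.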
Technology Stack
Backend & ML
- Python 3.8+: Core language
- FastF1: F1 data API
- scikit-learn: Random Forest classifier
- XGBoost: Gradient boosting
- pandas: Data manipulation
- numpy: Numerical computing
- Flask: Web framework
- joblib: Model serialization
Frontend
- HTML/CSS/JavaScript: UI
- Plotly.js: Interactive charts
- Custom CSS: Responsive design
Data Storage
- CSV files: Raw and processed data
- JSON: Model outputs and predictions
- Pickle files: Serialized ML models
Model Files
The system maintains two model versions:
V1 Models (Basic):
- winner_predictor_rf.pkl: Random Forest
- winner_predictor_xgb.pkl: XGBoost
- feature_columns.pkl: Feature list
V2 Models:
- winner_predictor_v2.pkl: Enhanced RF with weather/tire features
- feature_columns_v2.pkl: Extended feature list
The app automatically falls back to V1 models if V2 is unavailable (see app.py:10-20).
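A fallback of this kind could be written as shown below. This is a hedged sketch and not the code at app.py:10-20: the names `select_model_version` and `load_models` are hypothetical, and treating a model set as available only when all of its files exist is an assumption.

```python
from pathlib import Path

V2_FILES = ["winner_predictor_v2.pkl", "feature_columns_v2.pkl"]
V1_FILES = ["winner_predictor_rf.pkl", "winner_predictor_xgb.pkl", "feature_columns.pkl"]


def select_model_version(model_dir):
    """Return ('v2', paths) if all V2 files exist, otherwise ('v1', paths)."""
    model_dir = Path(model_dir)
    for version, names in (("v2", V2_FILES), ("v1", V1_FILES)):
        paths = [model_dir / name for name in names]
        if all(p.is_file() for p in paths):
            return version, paths
    raise FileNotFoundError(f"no complete model set found in {model_dir}")


def load_models(model_dir):
    """Load whichever model set is available using joblib."""
    import joblib  # imported lazily; only needed when actually loading

    version, paths = select_model_version(model_dir)
    return version, [joblib.load(p) for p in paths]
```

Returning the version alongside the loaded objects lets the caller build the matching feature vector, since V1 and V2 expect different column lists.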
Performance Characteristics
- Training Time: ~5-10 seconds on a typical dataset
- Prediction Time: Less than 50ms per prediction
- Race Simulation: ~2-3 seconds for 50 laps
- Model Size: ~2-5 MB (pickled)
- Memory Usage: ~200-300 MB during training
Scalability Considerations
The architecture supports:
- Adding new features without retraining (forward compatibility)
- Multiple model versions running concurrently
- Batch predictions for entire race grids
- Season-long simulations (24 races)