System Architecture

System Overview

The Formula 1 ML Prediction System is a race prediction platform that combines historical race data, machine learning models, and lap-by-lap simulation to predict race outcomes.

85.9% Accuracy: Achieved through ensemble learning and feature engineering
Real-time Simulation: Lap-by-lap race engine with pit stops and safety cars
Weather-Aware: Adapts predictions based on weather conditions
20+ Features: Driver, team, circuit, weather, and tire strategy features

Architecture Components

The system consists of five main components that work together to deliver accurate race predictions:

1. Data Collection Layer

File: collect_working.py

Collects historical F1 data using the FastF1 API:

Race results (positions, points, grid positions)
Driver and team information
Event metadata (year, round, event name)
Multiple seasons (2023-2024)

import fastf1

# 'R' selects the race session for the given season and round
session = fastf1.get_session(year, round_num, 'R')
session.load()
results = session.results

2. Feature Engineering Pipeline

Files: feature_engineering.py, feature_engineering_v2.py

Transforms raw data into ML-ready features in four groups: driver, team, weather, and tire features.

Driver Features

Average position (historical)
Average points per race
Total wins and podiums
DNF rate
Circuit-specific performance
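
As a rough illustration of how the driver features listed above could be derived, the sketch below aggregates per-race result rows into per-driver statistics. The row keys (`driver`, `position`, `points`, `dnf`) and the function name `driver_features` are assumptions made for this example, not the schema used in feature_engineering.py, and circuit-specific performance is left out.

```python
from collections import defaultdict
from statistics import mean

def driver_features(rows):
    """Aggregate per-race result rows into per-driver features.

    Each row is a dict with hypothetical keys: driver, position, points, dnf.
    """
    by_driver = defaultdict(list)
    for row in rows:
        by_driver[row["driver"]].append(row)

    features = {}
    for driver, races in by_driver.items():
        features[driver] = {
            "avg_position": mean(r["position"] for r in races),
            "avg_points": mean(r["points"] for r in races),
            "wins": sum(1 for r in races if r["position"] == 1),
            "podiums": sum(1 for r in races if r["position"] <= 3),
            "dnf_rate": sum(1 for r in races if r["dnf"]) / len(races),
        }
    return features

# Made-up rows, for illustration only
rows = [
    {"driver": "AAA", "position": 1, "points": 25, "dnf": False},
    {"driver": "AAA", "position": 3, "points": 15, "dnf": False},
    {"driver": "BBB", "position": 2, "points": 18, "dnf": False},
    {"driver": "BBB", "position": 20, "points": 0, "dnf": True},
]
features = driver_features(rows)
```

To respect the time-based split described under Data Flow Architecture, aggregates like these would need to be computed only from races that precede the race being predicted.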

3. ML Model Layer

Files: winner_predictor.py, train_model_v2.py

Two ensemble models working together:

Random Forest

150 estimators
Max depth: 12
Min samples split: 8
Primary prediction model

XGBoost

100 estimators
Max depth: 6
Learning rate: 0.1
Gradient boosting ensemble

4. Race Simulation Engine

File: race_engine.py

Lap-by-lap race simulator with realistic physics:

RACE_LAPS = 50
PIT_LOSS = 22.0  # seconds

TIRE_DEG = {
    "SOFT":   0.085,  # seconds per lap
    "MEDIUM": 0.050,
    "HARD":   0.028
}

Features:

Real lap times with tire degradation
Pit stop strategy (undercut/overcut)
Safety car periods
Weather changes mid-race
DNFs and mechanical failures
Driver skill ratings and team performance
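
The constants above suggest a per-lap degradation model. The sketch below assumes the simplest version, in which lap time grows linearly with tire age, and uses it to compare two one-stop strategies. `BASE_LAP_TIME` and the three function names are hypothetical, and the sketch ignores the base pace difference between compounds, so it should not be read as the logic in race_engine.py.

```python
RACE_LAPS = 50
PIT_LOSS = 22.0  # seconds lost per pit stop
TIRE_DEG = {"SOFT": 0.085, "MEDIUM": 0.050, "HARD": 0.028}  # seconds per lap

BASE_LAP_TIME = 90.0  # hypothetical clean-air lap time in seconds

def lap_time(compound, tire_age):
    """Lap time assuming degradation grows linearly with tire age (in laps)."""
    return BASE_LAP_TIME + TIRE_DEG[compound] * tire_age

def stint_time(compound, laps):
    """Total time for a stint that starts on fresh tires."""
    return sum(lap_time(compound, age) for age in range(laps))

def one_stop_race_time(first, second, pit_lap):
    """Race time for a one-stop strategy that pits after pit_lap laps."""
    return (stint_time(first, pit_lap)
            + PIT_LOSS
            + stint_time(second, RACE_LAPS - pit_lap))

medium_hard = one_stop_race_time("MEDIUM", "HARD", pit_lap=25)
soft_hard = one_stop_race_time("SOFT", "HARD", pit_lap=25)
```

Under these assumptions the medium-hard strategy is quicker than soft-hard for the same pit lap, because the soft tire loses more time per lap of age and the sketch gives it no pace advantage when fresh.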

5. Web Application

File: app.py

Flask-based web interface with multiple features:

Winner prediction with probabilities
Weather impact analysis
Tire strategy optimizer
Feature importance visualization
Driver head-to-head comparison
Full race simulation
2026 season prediction
Lap-by-lap race viewer

Data Flow Architecture

The system splits data by time to prevent data leakage: models are trained only on races that took place before the test races.
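
A minimal sketch of such a split is shown here, assuming each race record carries the `year` and `round` metadata collected earlier. The function name `time_based_split` and the chosen cutoff are illustrative; the source does not state where the actual cutoff lies.

```python
def time_based_split(races, test_year, test_round):
    """Split races so every training race precedes the first test race."""
    cutoff = (test_year, test_round)
    train = [r for r in races if (r["year"], r["round"]) < cutoff]
    test = [r for r in races if (r["year"], r["round"]) >= cutoff]
    return train, test

# Made-up schedule: three rounds in each of two seasons
races = [{"year": y, "round": rd} for y in (2023, 2024) for rd in range(1, 4)]
train, test = time_based_split(races, test_year=2024, test_round=2)
```

Comparing `(year, round)` tuples keeps the ordering chronological across seasons, which a random shuffle-based split would not guarantee.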

Component Interaction

Prediction Flow

User Input → Grid position, weather, tire, circuit type
Feature Creation → Convert inputs to feature vector
Model Ensemble → RF and XGBoost predict independently
Probability Averaging → Average predictions for final result
Response → Return probability and insights
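
The feature creation and probability averaging steps can be sketched as follows. The column names, the default of 0.0 for missing inputs, and the unweighted mean are assumptions made for illustration. In a real setup the two probability lists would come from each model's `predict_proba` method rather than being hard-coded.

```python
def build_feature_vector(inputs, feature_columns):
    """Order user inputs by the saved column list; missing features default to 0.0."""
    return [float(inputs.get(name, 0.0)) for name in feature_columns]

def average_probabilities(rf_proba, xgb_proba):
    """Unweighted mean of the two models' class probabilities."""
    return [(a + b) / 2 for a, b in zip(rf_proba, xgb_proba)]

# Hypothetical column names and model outputs, for illustration only
feature_columns = ["grid_position", "is_wet", "tire_soft"]
vector = build_feature_vector({"grid_position": 3, "is_wet": 1}, feature_columns)

rf_proba = [0.70, 0.30]   # [P(not win), P(win)] from the Random Forest
xgb_proba = [0.60, 0.40]  # same classes from XGBoost
final = average_probabilities(rf_proba, xgb_proba)
win_probability = final[1]
```

Building the vector from a stored column list keeps the input order identical to the order used at training time, which is presumably why the feature list is saved alongside the models.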

Simulation Flow

Setup → Initialize 20 drivers with skill ratings
Qualifying → Determine starting grid based on driver/team performance
Strategy Assignment → Assign tire strategies based on weather
Lap Loop → Simulate 50 laps with:
- Tire degradation
- Pit stops
- Safety cars
- Weather changes
- DNF events
Results → Generate final standings and statistics
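
The overall shape of the lap loop can be sketched as below. This is a deliberately reduced version that models only safety cars and DNFs. Every probability and lap time is a hypothetical placeholder, and qualifying, tire strategy, pit stops, and weather are omitted, so it is not a reproduction of race_engine.py.

```python
import random

RACE_LAPS = 50
DNF_PROB_PER_LAP = 0.001     # hypothetical per-driver, per-lap retirement chance
SAFETY_CAR_PROB = 0.02       # hypothetical per-lap safety car chance
BASE_LAP_TIME = 90.0         # hypothetical lap time in seconds
SAFETY_CAR_LAP_TIME = 120.0  # hypothetical slower lap behind the safety car

def simulate_race(drivers, seed=0):
    """Return (finishers ordered by total time, list of retirements)."""
    rng = random.Random(seed)
    total_time = {d: 0.0 for d in drivers}
    retired = []
    for lap in range(1, RACE_LAPS + 1):
        safety_car = rng.random() < SAFETY_CAR_PROB
        for driver in list(total_time):
            if rng.random() < DNF_PROB_PER_LAP:
                retired.append(driver)
                del total_time[driver]
                continue
            if safety_car:
                total_time[driver] += SAFETY_CAR_LAP_TIME
            else:
                # Small random variation stands in for skill, tires, and traffic
                total_time[driver] += BASE_LAP_TIME + rng.uniform(-0.5, 0.5)
    standings = sorted(total_time, key=total_time.get)
    return standings, retired

drivers = [f"DRIVER_{i:02d}" for i in range(1, 21)]
standings, retired = simulate_race(drivers, seed=42)
```

Passing a seed to a dedicated `random.Random` instance makes a run reproducible, which helps when comparing strategies across otherwise identical races.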

Technology Stack

Backend & ML

Python 3.8+: Core language
FastF1: F1 data API
scikit-learn: Random Forest classifier
XGBoost: Gradient boosting
pandas: Data manipulation
numpy: Numerical computing
Flask: Web framework
joblib: Model serialization

Frontend

HTML/CSS/JavaScript: UI
Plotly.js: Interactive charts
Custom CSS: Responsive design

Data Storage

CSV files: Raw and processed data
JSON: Model outputs and predictions
Pickle files: Serialized ML models

Model Files

The system maintains two model versions:

V1 Models (Basic):

winner_predictor_rf.pkl - Random Forest
winner_predictor_xgb.pkl - XGBoost
feature_columns.pkl - Feature list

V2 Models (Enhanced):

winner_predictor_v2.pkl - Enhanced RF with weather/tire features
feature_columns_v2.pkl - Extended feature list

The app automatically falls back to V1 models if V2 is unavailable (see app.py:10-20).
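
One way such a fallback could look is sketched here. The code in app.py is not shown in this document, so the function name `load_models`, the returned dictionary layout, and the "both V2 files must exist" rule are assumptions. The sketch uses the standard-library `pickle` module to stay self-contained; a project that saves its models with joblib would call `joblib.load` instead.

```python
import pickle
from pathlib import Path

def load_models(model_dir):
    """Load V2 artifacts if both files exist, otherwise fall back to V1."""
    model_dir = Path(model_dir)

    def load(name):
        with open(model_dir / name, "rb") as fh:
            return pickle.load(fh)

    v2_files = ["winner_predictor_v2.pkl", "feature_columns_v2.pkl"]
    if all((model_dir / name).exists() for name in v2_files):
        return {
            "version": "v2",
            "rf": load("winner_predictor_v2.pkl"),
            "xgb": None,  # no V2 XGBoost file is listed above
            "feature_columns": load("feature_columns_v2.pkl"),
        }
    return {
        "version": "v1",
        "rf": load("winner_predictor_rf.pkl"),
        "xgb": load("winner_predictor_xgb.pkl"),
        "feature_columns": load("feature_columns.pkl"),
    }
```

Returning the version label alongside the models lets the caller decide which inputs, such as weather and tire features, to request from the user.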

Performance Characteristics

Training Time: ~5-10 seconds on typical dataset
Prediction Time: Less than 50ms per prediction
Race Simulation: ~2-3 seconds for 50 laps
Model Size: ~2-5 MB (pickled)
Memory Usage: ~200-300 MB during training

Scalability Considerations

The architecture supports:

Adding new features without retraining (forward compatibility)
Multiple model versions running concurrently
Batch predictions for entire race grids
Season-long simulations (24 races)
