Implementation Guide: This guide demonstrates reference implementations and best practices for training AQI prediction models. Code examples show typical patterns used in production ML systems.
Overview
This guide covers training AQI prediction models using different machine learning approaches, from traditional algorithms to deep learning models. The examples demonstrate how to configure training pipelines, optimize hyperparameters, and save trained models for deployment.
Prerequisites
Before training models, ensure you have completed the data preparation steps and have training, validation, and test datasets ready.
Training Approaches
AQI Predictor supports multiple model architectures:
Gradient Boosting: XGBoost, LightGBM (recommended for structured data)
Random Forests: Robust ensemble method
Neural Networks: Deep learning for complex patterns
LSTM: Recurrent networks for temporal sequences
Quick Start Training
Load Prepared Data
Load the processed datasets created during data preparation.

import pandas as pd
import numpy as np
import joblib
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load datasets
train_df = pd.read_parquet('data/processed/train.parquet')
val_df = pd.read_parquet('data/processed/val.parquet')
test_df = pd.read_parquet('data/processed/test.parquet')

# Separate features and target
X_train = train_df.drop('aqi', axis=1)
y_train = train_df['aqi']
X_val = val_df.drop('aqi', axis=1)
y_val = val_df['aqi']
X_test = test_df.drop('aqi', axis=1)
y_test = test_df['aqi']

print(f"Training samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")
print(f"Features: {len(X_train.columns)}")
Train Baseline Model
Start with a simple baseline to establish performance expectations.

from sklearn.ensemble import RandomForestRegressor

# Simple Random Forest baseline
baseline_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

print("Training baseline model...")
baseline_model.fit(X_train, y_train)

# Evaluate on validation set
val_pred = baseline_model.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
val_mae = mean_absolute_error(y_val, val_pred)
val_r2 = r2_score(y_val, val_pred)

print("\nBaseline Performance:")
print(f"  RMSE: {val_rmse:.2f}")
print(f"  MAE: {val_mae:.2f}")
print(f"  R²: {val_r2:.3f}")
A baseline model helps you judge whether more complex models provide meaningful improvements. As a rule of thumb, aim for an RMSE below 15 for good AQI predictions.
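Since the same three metrics recur for every model below, a small helper (illustrative, not part of any AQI Predictor API) keeps the comparisons consistent:

def evaluate(name, y_true, y_pred):
    """Print and return RMSE, MAE, and R² for one set of predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}, R²={r2:.3f}")
    return rmse, mae, r2

# Example: re-score the baseline with the helper
evaluate("Random Forest baseline", y_val, val_pred)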
Advanced Training Methods
XGBoost with Hyperparameter Tuning
XGBoost typically provides the best performance for AQI prediction.
Basic XGBoost
import xgboost as xgb

# Prepare DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Training parameters
params = {
    'objective': 'reg:squarederror',
    'max_depth': 8,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'seed': 42
}

# Train with early stopping
evals = [(dtrain, 'train'), (dval, 'val')]
xgb_model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=evals,
    early_stopping_rounds=50,
    verbose_eval=50
)

print(f"Best iteration: {xgb_model.best_iteration}")
print(f"Best score: {xgb_model.best_score:.4f}")
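The fixed hyperparameters above are sensible defaults rather than tuned values. A minimal tuning sketch, scoring an illustrative grid with xgboost's built-in xgb.cv (the grid values and round counts here are assumptions, not recommendations; plain k-fold CV also ignores time order, so prefer time-series splits for final estimates):

from itertools import product

# Illustrative grid search scored by cross-validated RMSE
best_cv_rmse, best_params = float('inf'), None
for max_depth, learning_rate in product([6, 8, 10], [0.03, 0.05, 0.1]):
    trial_params = {**params, 'max_depth': max_depth, 'learning_rate': learning_rate}
    cv_results = xgb.cv(
        trial_params,
        dtrain,
        num_boost_round=500,
        nfold=5,
        metrics='rmse',
        early_stopping_rounds=25,
        seed=42
    )
    cv_rmse = cv_results['test-rmse-mean'].min()
    if cv_rmse < best_cv_rmse:
        best_cv_rmse, best_params = cv_rmse, trial_params

print(f"Best CV RMSE: {best_cv_rmse:.4f}")
print(f"Best params: {best_params}")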
LightGBM for Fast Training
LightGBM is faster than XGBoost and often achieves similar performance.
import lightgbm as lgb

# Prepare datasets
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

# LightGBM parameters
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 63,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 20,
    'lambda_l1': 0.1,
    'lambda_l2': 1.0,
    'verbose': -1,
    'seed': 42
}

# Train model
print("Training LightGBM model...")
lgb_model = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[train_data, val_data],
    valid_names=['train', 'val'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=50)
    ]
)

print(f"\nBest iteration: {lgb_model.best_iteration}")
print(f"Best score: {lgb_model.best_score['val']['rmse']:.4f}")

# Feature importance
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': lgb_model.feature_importance()
}).sort_values('importance', ascending=False)

print("\nTop 10 Important Features:")
print(importance.head(10))
Deep Learning with Neural Networks
For complex non-linear patterns, neural networks can be effective.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks

# Build neural network
def build_nn_model(input_dim, hidden_units=(256, 128, 64)):
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.BatchNormalization(),
        layers.Dense(hidden_units[0], activation='relu'),
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.Dense(hidden_units[1], activation='relu'),
        layers.Dropout(0.2),
        layers.BatchNormalization(),
        layers.Dense(hidden_units[2], activation='relu'),
        layers.Dropout(0.1),
        layers.Dense(1, activation='linear')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    return model

# Create and train model
nn_model = build_nn_model(input_dim=X_train.shape[1])
nn_model.summary()

# Callbacks
early_stop = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True
)
reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=10,
    min_lr=1e-6
)
model_checkpoint = callbacks.ModelCheckpoint(
    'models/nn_model_best.keras',
    monitor='val_loss',
    save_best_only=True
)

# Train
history = nn_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200,
    batch_size=256,
    callbacks=[early_stop, reduce_lr, model_checkpoint],
    verbose=1
)

print(f"\nBest validation loss: {min(history.history['val_loss']):.4f}")
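To compare the network with the tree models on the same scale, score it on the validation set (a short sketch using the metrics imported earlier):

# Evaluate the network on the validation set
nn_val_pred = nn_model.predict(X_val.values, verbose=0).flatten()
nn_rmse = np.sqrt(mean_squared_error(y_val, nn_val_pred))
nn_mae = mean_absolute_error(y_val, nn_val_pred)
print(f"Neural network validation RMSE: {nn_rmse:.2f}, MAE: {nn_mae:.2f}")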
LSTM for Temporal Sequences
LSTM networks can capture long-term temporal dependencies.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout

def create_sequences(X, y, sequence_length=24):
    """Create sliding-window sequences for LSTM input."""
    X_seq, y_seq = [], []
    for i in range(len(X) - sequence_length):
        X_seq.append(X[i:i + sequence_length])
        y_seq.append(y[i + sequence_length])
    return np.array(X_seq), np.array(y_seq)

# Prepare sequences (24 hours of history)
sequence_length = 24
X_train_seq, y_train_seq = create_sequences(X_train.values, y_train.values, sequence_length)
X_val_seq, y_val_seq = create_sequences(X_val.values, y_val.values, sequence_length)
print(f"Sequence shape: {X_train_seq.shape}")

# Build LSTM model
lstm_model = Sequential([
    Input(shape=(sequence_length, X_train.shape[1])),
    LSTM(128, return_sequences=True),
    Dropout(0.3),
    LSTM(64, return_sequences=False),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1)
])

lstm_model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)
lstm_model.summary()

# Train LSTM
history = lstm_model.fit(
    X_train_seq, y_train_seq,
    validation_data=(X_val_seq, y_val_seq),
    epochs=100,
    batch_size=128,
    callbacks=[early_stop, reduce_lr],
    verbose=1
)
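Note that create_sequences drops the first sequence_length targets, so evaluate against y_val_seq rather than y_val:

# Evaluate the LSTM on the aligned validation sequences
lstm_val_pred = lstm_model.predict(X_val_seq, verbose=0).flatten()
lstm_rmse = np.sqrt(mean_squared_error(y_val_seq, lstm_val_pred))
print(f"LSTM validation RMSE: {lstm_rmse:.2f}")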
Model Ensemble
Combine multiple models for improved predictions.
class AQIEnsemble:
    """Weighted average of several trained regressors."""

    def __init__(self, models, weights=None):
        self.models = models
        self.weights = weights if weights else [1.0 / len(models)] * len(models)

    def predict(self, X):
        predictions = []
        for model, weight in zip(self.models, self.weights):
            # Boosters trained with xgb.train expect a DMatrix;
            # sklearn and LightGBM models accept the DataFrame directly
            if isinstance(model, xgb.Booster):
                pred = model.predict(xgb.DMatrix(X))
            else:
                pred = model.predict(X)
            predictions.append(pred * weight)
        return np.sum(predictions, axis=0)

# Create ensemble
ensemble = AQIEnsemble(
    models=[xgb_model, lgb_model, baseline_model],
    weights=[0.5, 0.3, 0.2]
)

# Evaluate ensemble
ensemble_pred = ensemble.predict(X_val)
ensemble_rmse = np.sqrt(mean_squared_error(y_val, ensemble_pred))
ensemble_mae = mean_absolute_error(y_val, ensemble_pred)

print("\nEnsemble Performance:")
print(f"  RMSE: {ensemble_rmse:.2f}")
print(f"  MAE: {ensemble_mae:.2f}")
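The fixed weights above are a starting guess. If you want data-driven weights, one simple (hypothetical) approach is to grid-search them on the validation set:

import itertools

# Collect each model's validation predictions once
model_preds = np.stack([
    xgb_model.predict(xgb.DMatrix(X_val)),
    lgb_model.predict(X_val),
    baseline_model.predict(X_val)
])

# Brute-force search over weight triples that sum to 1
best_rmse, best_weights = float('inf'), None
grid = np.round(np.arange(0.0, 1.01, 0.1), 2)
for w1, w2 in itertools.product(grid, grid):
    w3 = round(1.0 - w1 - w2, 2)
    if w3 < 0:
        continue
    blend = w1 * model_preds[0] + w2 * model_preds[1] + w3 * model_preds[2]
    rmse = np.sqrt(mean_squared_error(y_val, blend))
    if rmse < best_rmse:
        best_rmse, best_weights = rmse, (w1, w2, w3)

print(f"Best weights: {best_weights}, validation RMSE: {best_rmse:.2f}")

Weights tuned this way can overfit the validation set, so confirm the final blend on the untouched test set.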
Save Trained Models
Save Model Artifacts
import joblib
import os

# Create models directory
os.makedirs('models', exist_ok=True)

# Save XGBoost
xgb_model.save_model('models/xgboost_model.json')

# Save LightGBM
lgb_model.save_model('models/lightgbm_model.txt')

# Save sklearn models
joblib.dump(baseline_model, 'models/random_forest_model.pkl')

# Save neural network
nn_model.save('models/neural_network_model.keras')

# Save LSTM
lstm_model.save('models/lstm_model.keras')

print("All models saved successfully!")
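At deployment time, each artifact loads back with its own library's loader. A short sketch (the `_loaded` names are illustrative; paths match the save calls above):

# Reload the saved artifacts for inference
xgb_loaded = xgb.Booster()
xgb_loaded.load_model('models/xgboost_model.json')

lgb_loaded = lgb.Booster(model_file='models/lightgbm_model.txt')
rf_loaded = joblib.load('models/random_forest_model.pkl')

nn_loaded = keras.models.load_model('models/neural_network_model.keras')
lstm_loaded = keras.models.load_model('models/lstm_model.keras')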
Save Model Metadata
import json
from datetime import datetime

# Compute XGBoost validation predictions once for the metadata
xgb_val_pred = xgb_model.predict(xgb.DMatrix(X_val))

# Model metadata
metadata = {
    'training_date': datetime.now().isoformat(),
    'train_samples': len(X_train),
    'val_samples': len(X_val),
    'features': X_train.columns.tolist(),
    'models': {
        'xgboost': {
            'rmse': float(np.sqrt(mean_squared_error(y_val, xgb_val_pred))),
            'mae': float(mean_absolute_error(y_val, xgb_val_pred)),
            'best_iteration': int(xgb_model.best_iteration)
        },
        'lightgbm': {
            'rmse': float(lgb_model.best_score['val']['rmse']),
            'best_iteration': int(lgb_model.best_iteration)
        }
    }
}

with open('models/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("Metadata saved!")
Always version your models and track their performance metrics. Use tools like MLflow or Weights & Biases for experiment tracking in production environments.
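As a minimal illustration of that tip, a hypothetical MLflow sketch (assumes MLflow is installed and logs to a local ./mlruns directory; run names and metric keys are examples):

import mlflow

# Log the run's parameters, headline metric, and metadata artifact
with mlflow.start_run(run_name='aqi-training'):
    mlflow.log_params(params)                    # here: the LightGBM params dict
    mlflow.log_metric('ensemble_val_rmse', ensemble_rmse)
    mlflow.log_artifact('models/metadata.json')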
Training Best Practices
Start simple: Begin with baseline models before trying complex architectures
Monitor overfitting: Always validate on held-out data
Feature importance: Analyze which features drive predictions
Early stopping: Prevent overfitting by stopping when validation metrics plateau
Regularization: Use L1/L2 regularization and dropout to improve generalization
Cross-validation: Use time-series CV for robust performance estimates (see the sketch after this list)
Ensemble: Combine multiple models for better predictions
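As an example of the cross-validation point above, a minimal sketch using sklearn's TimeSeriesSplit on the baseline model (assumes the training rows are already in chronological order):

from sklearn.model_selection import TimeSeriesSplit

# Expanding-window splits that never train on future data
tscv = TimeSeriesSplit(n_splits=5)
fold_rmses = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_train)):
    fold_model = RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=42, n_jobs=-1
    )
    fold_model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    fold_pred = fold_model.predict(X_train.iloc[test_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y_train.iloc[test_idx], fold_pred)))
    print(f"Fold {fold}: RMSE = {fold_rmses[-1]:.2f}")

print(f"Mean CV RMSE: {np.mean(fold_rmses):.2f}")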
Next Steps
After training your models: