The Premier League library includes powerful features for creating machine learning datasets. The create_dataset() method transforms raw match data into feature-engineered datasets perfect for training predictive models.

The create_dataset Method

This method generates a CSV file where each row represents a match with aggregated team statistics from previous games.
from premier_league import MatchStatistics

stats = MatchStatistics()

stats.create_dataset(
    output_path="ml_data/training_set.csv",
    rows_count=5000,
    lag=10,
    weights="exp",
    params=0.9
)

Parameters

output_path
str
required
The file path where the CSV will be saved. Creates parent directories if they don’t exist.
output_path="datasets/premier_league_2024.csv"
rows_count
int
default:"None"
Maximum number of rows to include. If None, all available data is included (currently up to 17,520 matches). When specified, the most recent n rows are kept after sorting by date.
rows_count=1000  # Most recent 1000 matches
lag
int
default:"10"
Number of previous games to use for calculating team statistics. Must be at least 1.
  • lag=10: Each row uses stats from the team’s past 10 games
  • lag=5: More recent but less stable statistics
  • lag=20: More stable but less responsive to form changes
lag=10  # Recommended for balanced results
weights
Literal['lin', 'exp']
default:"None"
Weighting strategy to give importance to more recent games.
  • None: All games weighted equally (simple average)
  • "lin": Linear weights (most recent game gets highest weight)
  • "exp": Exponential weights (requires params)
weights="exp"  # Exponential decay for recent emphasis
params
float
default:"None"
Parameter for exponential weighting. Required when weights="exp". Value between 0 and 1:
  • 0.9: Strong emphasis on recent games
  • 0.95: Moderate emphasis
  • 0.99: Slight emphasis
weights="exp"
params=0.9  # Decay factor

Understanding Lag

Lag determines how many previous games are used to calculate each team’s statistics for a given match.

How Lag Works

# Example: Arsenal vs Liverpool on 2024-02-15 with lag=10

# For Arsenal (home team):
# - Find Arsenal's last 10 games before 2024-02-15 in the same season
# - Calculate average stats across those 10 games
# - Prefix with "home_": home_xG, home_shots_total_FW, etc.

# For Liverpool (away team):
# - Find Liverpool's last 10 games before 2024-02-15 in the same season
# - Calculate average stats across those 10 games
# - Prefix with "away_": away_xG, away_shots_total_FW, etc.
If a team hasn’t played enough games in the season (less than lag games), that match is excluded from the dataset.
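The lag aggregation can be sketched with pandas: for each match, average the team's previous lag games, and leave matches without enough history empty (which the library then excludes). Column names here are illustrative, not the library's exact schema.

```python
import pandas as pd

# Toy per-team game log; 'xG' stands in for any aggregated statistic
games = pd.DataFrame({
    "team": ["ARS"] * 4,
    "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"]),
    "xG": [1.0, 2.0, 3.0, 4.0],
})

lag = 2
games = games.sort_values("date")
# Rolling mean over the previous `lag` games; shift(1) ensures the current
# match only sees earlier games, and rows with fewer than `lag` prior
# games come out as NaN (i.e., excluded from the dataset)
games["xG_form"] = (
    games.groupby("team")["xG"]
    .transform(lambda s: s.shift(1).rolling(lag).mean())
)
print(games["xG_form"].tolist())  # [nan, nan, 1.5, 2.5]
```

The first two rows stay NaN because fewer than two prior games exist, mirroring the exclusion rule above.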

Choosing the Right Lag

Short Lag (lag=5)

Best for: Capturing current team form, momentum, recent tactical changes

Pros:
  • Highly responsive to recent performance
  • Captures “hot streaks” and slumps
  • Better for in-season predictions
Cons:
  • More volatile/noisy
  • Smaller sample size
  • Can overfit to recent anomalies
stats.create_dataset(
    output_path="short_term_form.csv",
    lag=5
)
Long Lag (lag=20)

Best for: Season-long trends, team quality assessment, stable predictions

Pros:
  • Very stable statistics
  • Reduces noise from individual performances
  • Better represents “true” team strength
Cons:
  • Slow to adapt to changes
  • Requires more games (limits dataset size)
  • May miss tactical shifts or new signings
stats.create_dataset(
    output_path="long_term_trends.csv",
    lag=20
)

Weighting Strategies

Weighting allows you to emphasize recent games over older ones within the lag window.

No Weights (Default)

All games within the lag are weighted equally:
stats.create_dataset(
    output_path="equal_weights.csv",
    lag=10
    # No weights parameter = equal weighting
)

# Calculation for lag=10:
# avg_xG = (game1_xG + game2_xG + ... + game10_xG) / 10
Use when: You want a simple average without recency bias.

Linear Weights

More recent games get linearly higher weights:
stats.create_dataset(
    output_path="linear_weights.csv",
    lag=10,
    weights="lin"
)

# Weights for lag=10: [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
# Most recent game = weight 10
# Oldest game = weight 1

# Calculation:
# weighted_avg = (game1*10 + game2*9 + ... + game10*1) / (10+9+8+...+1)
# weighted_avg = (game1*10 + game2*9 + ... + game10*1) / 55
Use when: You want a moderate, predictable recency bias.

Exponential Weights

Recent games get exponentially higher weights:
stats.create_dataset(
    output_path="exp_weights.csv",
    lag=10,
    weights="exp",
    params=0.9  # Decay factor
)

# For params=0.9 and lag=10:
# Weights: [0.9^1, 0.9^2, 0.9^3, ..., 0.9^10]
# Weights: [0.90, 0.81, 0.73, 0.66, 0.59, 0.53, 0.48, 0.43, 0.39, 0.35]

# Calculation:
# weighted_avg = Σ(game_i * 0.9^i) / Σ(0.9^i)
Use when: You want strong emphasis on recent performance with smooth decay.

Comparing Weight Strategies

import pandas as pd

# Create datasets with different weighting strategies
stats = MatchStatistics()

# No weights
stats.create_dataset("comparison/no_weights.csv", lag=10)

# Linear weights
stats.create_dataset("comparison/linear.csv", lag=10, weights="lin")

# Strong exponential (0.85)
stats.create_dataset("comparison/exp_strong.csv", lag=10, weights="exp", params=0.85)

# Moderate exponential (0.9)
stats.create_dataset("comparison/exp_moderate.csv", lag=10, weights="exp", params=0.9)

# Weak exponential (0.95)
stats.create_dataset("comparison/exp_weak.csv", lag=10, weights="exp", params=0.95)

Dataset Structure

The generated CSV contains comprehensive features for each match:

Metadata Columns

# Game identification
game_id: str            # Unique match identifier
date: datetime          # Match date and time
season: str             # Season (e.g., "2023-2024")
match_week: int         # Week number in season

# Teams
home_team_id: str       # Home team ID
away_team_id: str       # Away team ID
home_team: str          # Home team name
away_team: str          # Away team name

# Pre-match form
home_points: int        # Home team's points before match
away_points: int        # Away team's points before match

Feature Columns (80+ per team)

Each team has 80+ aggregated statistics, prefixed with home_ or away_:
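Because every team feature carries a home_ or away_ prefix, you can slice or pair the feature columns programmatically; a minimal sketch (the column names below are illustrative, not the full schema):

```python
# Split columns by prefix; works on any frame that follows this naming scheme
columns = ["game_id", "home_xG", "home_saves", "away_xG", "away_saves"]

home_cols = [c for c in columns if c.startswith("home_")]
away_cols = [c for c in columns if c.startswith("away_")]

# Pair each home feature with its away counterpart, e.g. for differencing
pairs = [(h, "away_" + h[len("home_"):]) for h in home_cols]
print(pairs)  # [('home_xG', 'away_xG'), ('home_saves', 'away_saves')]
```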

Expected Goals

home_xG: float          # Expected goals
home_xA: float          # Expected assists
home_xAG: float         # Expected assisted goals

Shooting (by position)

home_shots_total_FW: int        # Forward shots
home_shots_total_MF: int        # Midfielder shots
home_shots_total_DF: int        # Defender shots
home_shots_on_target_FW: int    # On-target shots by forwards
home_shots_on_target_MF: int    # On-target shots by midfielders
home_shots_on_target_DF: int    # On-target shots by defenders

Chance Creation

home_shot_creating_chances_FW: int   # Shot-creating actions by forwards
home_goal_creating_actions_FW: int   # Goal-creating actions by forwards
# ... similar for MF, DF

Passing

home_passes_completed_FW: int             # Completed passes by forwards
home_pass_completion_percentage_FW: float # Pass accuracy for forwards
home_key_passes: int                      # Passes leading to shots
home_passes_into_final_third: int         # Progressive passes
home_passes_into_penalty_area: int        # Passes into box
home_progressive_passes: int              # Forward-moving passes

Defense

home_tackles_won_FW: int               # Successful tackles by forwards
home_blocks_FW: int                    # Shot blocks by forwards
home_interceptions_FW: int             # Interceptions by forwards
home_clearances_FW: int                # Clearances by forwards
home_errors_leading_to_goal: int       # Defensive errors
# ... similar for MF, DF

Possession

home_possession_rate: int              # Ball possession %
home_touches_FW: int                   # Ball touches by forwards
home_touches_att_pen_area_FW: int      # Touches in opponent's box
home_take_ons_FW: int                  # Dribble attempts
home_successful_take_ons_FW: int       # Successful dribbles
home_carries_FW: int                   # Number of carries
home_total_carrying_distance_FW: int   # Distance carried (yards)
home_dispossessed_FW: int              # Times dispossessed

Goalkeeping

home_save_percentage: float            # Save success rate
home_saves: int                        # Total saves
home_PSxG: float                       # Post-shot xG faced
home_passes_completed_GK: int          # GK pass attempts
home_crosses_stopped: int              # Crosses intercepted

Discipline

home_yellow_card: int                  # Yellow cards
home_red_card: int                     # Red cards
home_fouls_committed_FW: int           # Fouls by forwards
home_fouls_drawn_FW: int               # Fouls won by forwards
home_offside_FW: int                   # Offside calls
# ... similar for MF, DF

Target Columns (at the end)

home_goals: int         # Actual goals scored by home team (TARGET)
away_goals: int         # Actual goals scored by away team (TARGET)
Target columns are placed at the end for ML convenience. Most libraries expect targets in the final columns.
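Since the two target columns come last, features and targets can be split positionally; a sketch assuming that column order (toy data, not the real schema):

```python
import pandas as pd

# Toy frame mirroring the layout: features first, targets last
df = pd.DataFrame({
    "home_xG": [1.2, 0.8],
    "away_xG": [0.9, 1.5],
    "home_goals": [2, 1],
    "away_goals": [1, 3],
})

X = df.iloc[:, :-2]  # everything except the last two columns
y = df.iloc[:, -2:]  # home_goals, away_goals
print(list(y.columns))  # ['home_goals', 'away_goals']
```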

Machine Learning Use Cases

Match Result Prediction

Step 1: Generate dataset with balanced parameters

stats = MatchStatistics()

stats.create_dataset(
    output_path="ml/match_prediction.csv",
    lag=10,
    weights="exp",
    params=0.9
)
Step 2: Load and prepare data

import pandas as pd
import numpy as np

df = pd.read_csv("ml/match_prediction.csv")

# Create result labels: 1=home win, 0=draw, -1=away win
df['result'] = np.where(
    df['home_goals'] > df['away_goals'], 1,
    np.where(df['home_goals'] < df['away_goals'], -1, 0)
)

# Feature columns (all except metadata and targets)
feature_cols = [col for col in df.columns 
                if col.startswith(('home_', 'away_')) 
                and col not in ['home_team', 'away_team', 'home_team_id',
                                'away_team_id', 'home_goals', 'away_goals']]

X = df[feature_cols]
y = df['result']
Step 3: Train a model

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=['Away Win', 'Draw', 'Home Win']))

Goal Scoring Prediction (Regression)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load dataset
df = pd.read_csv("ml/match_prediction.csv")

# Features
feature_cols = [col for col in df.columns 
                if col.startswith(('home_', 'away_')) 
                and col not in ['home_team', 'away_team', 'home_team_id',
                                'away_team_id', 'home_goals', 'away_goals']]

X = df[feature_cols]
y_home = df['home_goals']  # Predict home goals
y_away = df['away_goals']  # Or predict away goals

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y_home, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f} goals")
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # squared=False was removed in newer scikit-learn
print(f"RMSE: {rmse:.3f} goals")

Over/Under Goals Market

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("ml/match_prediction.csv")

# Create binary target: Over 2.5 goals?
df['over_2_5'] = (df['home_goals'] + df['away_goals'] > 2.5).astype(int)

# Use offensive features
offensive_features = [
    'home_xG', 'away_xG',
    'home_shots_total_FW', 'away_shots_total_FW',
    'home_shots_on_target_FW', 'away_shots_on_target_FW',
    'home_shot_creating_chances_FW', 'away_shot_creating_chances_FW',
    'home_goal_creating_actions_FW', 'away_goal_creating_actions_FW',
    'home_key_passes', 'away_key_passes'
]

X = df[offensive_features]
y = df['over_2_5']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic regression for probabilities
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Get probabilities
probs = model.predict_proba(X_test)[:, 1]
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
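For the over/under market, the model's probability is often compared against a bookmaker-style implied probability; a hedged sketch where the odds, model probability, and helper function are all illustrative:

```python
# Convert decimal odds to implied probability and compare with model output
def implied_probability(decimal_odds: float) -> float:
    return 1.0 / decimal_odds

model_prob_over = 0.62                        # e.g. from model.predict_proba(...)[:, 1]
market_prob_over = implied_probability(1.80)  # ~0.556

# A positive edge means the model rates "over 2.5" more likely than the market
edge = model_prob_over - market_prob_over
print(f"edge: {edge:.3f}")
```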

Feature Importance Analysis

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("ml/match_prediction.csv")

# Prepare data
df['result'] = (df['home_goals'] > df['away_goals']).astype(int)
feature_cols = [col for col in df.columns 
                if col.startswith(('home_', 'away_')) 
                and col not in ['home_team', 'away_team', 'home_team_id',
                                'away_team_id', 'home_goals', 'away_goals']]

X = df[feature_cols]
y = df['result']

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importance
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(10, 8))
plt.barh(importance_df['feature'][:20], importance_df['importance'][:20])
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Important Features for Match Prediction')
plt.tight_layout()
plt.savefig('feature_importance.png')

print(importance_df.head(20))
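Impurity-based importances can overstate noisy, high-cardinality features, so it is worth cross-checking with permutation importance on held-out data. A self-contained sketch on synthetic data (not the match dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Only feature 0 drives the label; features 1 and 2 are pure noise
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the accuracy drop
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
print(result.importances_mean.argmax())  # feature 0 should dominate
```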

Advanced Dataset Configurations

Recent Form Only (Last 1000 Matches)

stats.create_dataset(
    output_path="recent_matches.csv",
    rows_count=1000,  # Most recent 1000 matches
    lag=5,           # Short-term form
    weights="exp",
    params=0.85      # Strong recency bias
)
Use case: In-season prediction models that prioritize current form.

Season-Long Trends

stats.create_dataset(
    output_path="season_trends.csv",
    lag=20,          # Long-term average
    weights="lin"    # Moderate recency bias
)
Use case: Pre-season analysis, team strength assessment.

Ensemble Dataset Generation

Create multiple datasets with different parameters for ensemble models:
configs = [
    {"lag": 5, "weights": None, "file": "ensemble_lag5_noweight.csv"},
    {"lag": 10, "weights": "lin", "file": "ensemble_lag10_linear.csv"},
    {"lag": 10, "weights": "exp", "params": 0.9, "file": "ensemble_lag10_exp.csv"},
    {"lag": 15, "weights": "exp", "params": 0.95, "file": "ensemble_lag15_exp.csv"},
]

for config in configs:
    params = config.get('params')
    stats.create_dataset(
        output_path=f"ensemble/{config['file']}",
        lag=config['lag'],
        weights=config.get('weights'),
        params=params
    )
    print(f"Generated {config['file']}")
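Models trained on the differently configured datasets can then be combined by averaging their predicted probabilities (soft voting); a minimal sketch with hypothetical per-model outputs:

```python
import numpy as np

# Hypothetical class probabilities from three models on the same two matches
probs_per_model = [
    np.array([0.70, 0.20]),
    np.array([0.60, 0.30]),
    np.array([0.80, 0.10]),
]

# Simple soft-voting ensemble: average the probabilities element-wise
ensemble_probs = np.mean(probs_per_model, axis=0)
print(ensemble_probs)  # [0.7 0.2]
```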

Best Practices

Handle Missing Values

The dataset excludes matches where teams haven’t played enough games, but a few columns can still contain missing values:
import pandas as pd

df = pd.read_csv("training_data.csv")

# Check for any remaining NaN values
print(df.isnull().sum())

# Handle save_percentage which can be NaN
df['home_save_percentage'] = df['home_save_percentage'].fillna(0)
df['away_save_percentage'] = df['away_save_percentage'].fillna(0)
Use Time-Based Splits

Don’t use random splits for match data; train on earlier matches and test on later ones so the model never sees future information:
df = pd.read_csv("training_data.csv")
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

# Train on earlier matches, test on later matches
split_date = '2024-01-01'
train = df[df['date'] < split_date]
test = df[df['date'] >= split_date]
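For cross-validation under the same constraint, scikit-learn's TimeSeriesSplit always trains on earlier folds and tests on later ones; a sketch on a toy chronologically ordered array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten chronologically ordered samples (sort by date first, as above)
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index
    print(train_idx.max(), "<", test_idx.min())
```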
Engineer Derived Features

Combine existing features for better predictions:
# Expected attacking balance (xG minus xA)
df['home_avg_gd'] = df['home_xG'] - df['home_xA']
df['away_avg_gd'] = df['away_xG'] - df['away_xA']

# Total shots
df['home_total_shots'] = df['home_shots_total_FW'] + df['home_shots_total_MF'] + df['home_shots_total_DF']

# Relative strength
df['xG_difference'] = df['home_xG'] - df['away_xG']
df['points_difference'] = df['home_points'] - df['away_points']
Scale Features

Normalize features for algorithms that are sensitive to feature scale (e.g., logistic regression, SVMs, neural networks):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
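Wrapping the scaler and model in a Pipeline guarantees the scaler is fit only on training data, which also prevents leakage inside cross-validation; a self-contained sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: the first feature perfectly separates the classes
X = np.array([[0.0, 10.0], [1.0, 20.0], [0.0, 30.0], [1.0, 40.0]])
y = np.array([0, 1, 0, 1])

# fit() scales then trains; predict() applies the stored scaling automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
print(pipe.predict(X))
```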

Troubleshooting

Small Dataset Size

Issue: Dataset has fewer rows than expected
Cause: High lag value excludes early-season games
Solution: Reduce lag or use more seasons of data
import pandas as pd

# Check how many games are excluded
print(f"Total games in database: {stats.get_total_game_count()}")

# Generate dataset and check size
stats.create_dataset("test.csv", lag=10)
df = pd.read_csv("test.csv")
print(f"Games in dataset: {len(df)}")

Parameter Validation Errors

Error: ValueError: Exponential parameter must be specified for exponential Weights.
Solution: Add params when using exponential weights:
# Wrong
stats.create_dataset("data.csv", lag=10, weights="exp")

# Correct
stats.create_dataset("data.csv", lag=10, weights="exp", params=0.9)

Understanding Weight Calculations

If your weighted averages seem unexpected, verify the weight calculation:
lag = 10
params = 0.9

# Linear weights
lin_weights = [i for i in range(lag, 0, -1)]
print(f"Linear weights: {lin_weights}")
print(f"Sum: {sum(lin_weights)}")

# Exponential weights
exp_weights = [params ** k for k in range(1, lag + 1)]
print(f"Exponential weights: {[f'{w:.3f}' for w in exp_weights]}")
print(f"Sum: {sum(exp_weights):.3f}")
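np.average applies the same formula, so you can cross-check a weighted statistic by hand; a sketch for two games with params=0.9:

```python
import numpy as np

xg = np.array([2.0, 1.0])                 # most recent game first
weights = np.array([0.9 ** 1, 0.9 ** 2])  # exponential weights, params=0.9

# np.average computes sum(x_i * w_i) / sum(w_i), matching the formula above
weighted = np.average(xg, weights=weights)
manual = (2.0 * 0.9 + 1.0 * 0.81) / (0.9 + 0.81)
print(round(weighted, 4))  # 1.5263
```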
You now know how to create ML-ready datasets with the Premier League library!
