The Premier League library uses a SQLite database to store match statistics, team information, and game data. This guide covers how to initialize, update, query, and maintain that database.

Database Initialization

The database is automatically initialized when you create a MatchStatistics instance.
from premier_league import MatchStatistics

# Initialize with default settings
stats = MatchStatistics(
    db_filename="premier_league.db",
    db_directory="data"
)

Default Configuration

  • Database file: premier_league.db
  • Storage location: ./data/ (created in your current working directory)
  • Database type: SQLite
  • Initial leagues: Premier League, La Liga, Serie A, Bundesliga, Ligue 1, EFL Championship

Database Location

The database is stored in a directory relative to your current working directory:
import os

# Default location
db_path = os.path.join(os.getcwd(), "data", "premier_league.db")
# Example: /home/user/project/data/premier_league.db
The library creates the directory named by db_directory if it doesn’t exist, so you don’t need to create it manually.

Custom Database Location

You can specify a custom location for your database:
stats = MatchStatistics(
    db_filename="my_stats.db",
    db_directory="custom_data"
)
# Creates: ./custom_data/my_stats.db
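
To check where a given db_directory and db_filename pair will end up before creating anything, you can compute the path yourself. This is a minimal sketch that assumes the library joins the current working directory, db_directory, and db_filename as described above; resolve_db_path is a hypothetical helper and is not part of the library.

```python
from pathlib import Path


def resolve_db_path(db_filename="premier_league.db", db_directory="data", base=None):
    """Return the path the database is expected to occupy.

    Hypothetical helper: it mirrors the cwd/db_directory/db_filename layout
    described in this guide and does not call the library.
    """
    base = Path(base) if base is not None else Path.cwd()
    return base / db_directory / db_filename


print(resolve_db_path("my_stats.db", "custom_data"))
```

The optional base argument only exists so the helper can be checked against a directory other than the current one.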

Database Schema

The database consists of four main tables:

League Table

Stores information about football leagues:
  • id - Primary key
  • name - League name (e.g., “Premier League”)
  • up_to_date_season - Latest season with data (e.g., “2023-2024”)
  • up_to_date_match_week - Latest match week scraped

Team Table

Stores team information:
  • id - Unique team identifier (from FBRef)
  • name - Team name
  • league_id - Foreign key to league table

Game Table

Stores match results and metadata:
  • id - Unique game identifier
  • home_team_id, away_team_id - Foreign keys to team table
  • league_id - Foreign key to league table
  • home_goals, away_goals - Match score
  • home_team_points, away_team_points - Points before the match
  • date - Match date and time
  • match_week - Week number in the season
  • season - Season (e.g., “2023-2024”)

GameStats Table

Stores detailed statistics for each team in a game (80+ metrics):
  • id - Primary key
  • game_id - Foreign key to game table
  • team_id - Foreign key to team table
  • Expected Goals: xG, xA, xAG
  • Shooting: shots_total_FW/MF/DF, shots_on_target_FW/MF/DF
  • Passing: passes_completed_FW/MF/DF, pass_completion_percentage_FW/MF/DF, key_passes
  • Defense: tackles_won_FW/MF/DF, blocks_FW/MF/DF, interceptions_FW/MF/DF
  • Possession: possession_rate, touches_FW/MF/DF, carries_FW/MF/DF
  • Goalkeeping: save_percentage, saves, PSxG
  • Discipline: yellow_card, red_card, fouls_committed_FW/MF/DF
Statistics are split by position (FW=Forwards, MF=Midfielders, DF=Defenders, GK=Goalkeeper) to capture tactical nuances.
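
To make the relationships between the four tables easier to picture, here is a rough sketch of the schema built in an in-memory SQLite database. The column names come from the lists above; the table names, SQL types, and constraints are assumptions for illustration, and only two of the GameStats metrics are included.

```python
import sqlite3

# Column names follow the tables described above. Table names, SQL types,
# and constraints are assumptions made for this sketch only.
SCHEMA = """
CREATE TABLE league (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    up_to_date_season TEXT,
    up_to_date_match_week INTEGER
);
CREATE TABLE team (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    league_id INTEGER REFERENCES league(id)
);
CREATE TABLE game (
    id TEXT PRIMARY KEY,
    home_team_id TEXT REFERENCES team(id),
    away_team_id TEXT REFERENCES team(id),
    league_id INTEGER REFERENCES league(id),
    home_goals INTEGER,
    away_goals INTEGER,
    home_team_points INTEGER,
    away_team_points INTEGER,
    date TEXT,
    match_week INTEGER,
    season TEXT
);
CREATE TABLE game_stats (
    id INTEGER PRIMARY KEY,
    game_id TEXT REFERENCES game(id),
    team_id TEXT REFERENCES team(id),
    xG REAL,
    possession_rate REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(tables)
```

Each game row points at two team rows and one league row, and each game is expected to have one game_stats row per team.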

Updating the Database

1. Initialize MatchStatistics

Create a MatchStatistics instance with your database:
from premier_league import MatchStatistics

stats = MatchStatistics()
2. Run the update method

Call update_data_set() to fetch new match data:
stats.update_data_set()
This method:
  • Determines the current season automatically
  • Fetches all matches since the last update
  • Scrapes detailed statistics for each new game
  • Updates league tracking information
3. Wait for completion

The update process respects rate limits and may take several minutes:
Fetching Season Schedule: 100%|██████████| 15/15
Fetching Match Details: 100%|██████████| 127/127
Data Updated!
The update_data_set() method can take considerable time due to rate limiting (4 requests per second). A full season update may take 10-30 minutes.
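
To illustrate what a limit of 4 requests per second means in practice, here is a minimal throttle sketch. It is not the library's implementation; the Throttle class and its parameters are hypothetical, and the loop stands in for real HTTP requests.

```python
import time


class Throttle:
    """Allow at most `rate` calls per second by spacing calls evenly."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


throttle = Throttle(rate=4)
start = time.monotonic()
for _ in range(5):
    throttle.wait()  # a real scraper would issue one request here
elapsed = time.monotonic() - start
print(f"5 calls took {elapsed:.2f}s")
```

The clock and sleep arguments are injectable so the spacing logic can be tested without actually waiting.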

How Updates Work

  1. Season Detection: Automatically determines the current season based on the current date
    • If current month >= August: Current season = {year}-{year+1}
    • Otherwise: Current season = {year-1}-{year}
  2. Gap Identification: Compares up_to_date_season for each league with the current season
  3. URL Generation: Creates URLs for all missing seasons and match weeks
  4. Duplicate Prevention: Filters out games already in the database by checking game IDs
  5. Data Scraping: Fetches match details including:
    • Team statistics by position
    • Expected goals (xG)
    • Passing, defensive, and possession metrics
    • Goalkeeper statistics
  6. League Update: Updates each league’s up_to_date_season and up_to_date_match_week
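
The season detection and gap identification steps can be sketched as follows. The August cutoff comes from step 1 above; the function names are hypothetical, and the choice to include the last stored season in the result (because it may be only partly scraped) is an assumption rather than confirmed library behavior.

```python
from datetime import date


def current_season(today):
    """Return the season label for a date, using the August cutoff above."""
    if today.month >= 8:
        return f"{today.year}-{today.year + 1}"
    return f"{today.year - 1}-{today.year}"


def missing_seasons(up_to_date_season, target_season):
    """List season labels from the last stored season through the target.

    Assumption: the last stored season is included because it may be incomplete.
    """
    start = int(up_to_date_season.split("-")[0])
    end = int(target_season.split("-")[0])
    return [f"{year}-{year + 1}" for year in range(start, end + 1)]


print(current_season(date(2024, 3, 15)))  # 2023-2024
print(current_season(date(2024, 8, 1)))   # 2024-2025
print(missing_seasons("2022-2023", "2024-2025"))
```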

Example: Regular Updates

from premier_league import MatchStatistics
import schedule
import time

def update_database():
    """Update the database with latest match data"""
    stats = MatchStatistics()
    print("Starting database update...")
    stats.update_data_set()
    print("Database updated successfully!")

# Schedule weekly updates every Monday at 9 AM
schedule.every().monday.at("09:00").do(update_database)

while True:
    schedule.run_pending()
    time.sleep(60)  # Check once a minute so the job starts close to 09:00

Querying the Database

The library provides convenient methods to query your data:

Get Total Game Count

total_games = stats.get_total_game_count()
print(f"Database contains {total_games:,} matches")
# Example output: Database contains 15,847 matches

Get Games by Season

matches = stats.get_games_by_season(
    season="2023-2024",
    match_week=10
)

for match in matches:
    print(f"{match['home_team']['name']} {match['home_goals']}-{match['away_goals']} {match['away_team']['name']}")

Get Team Games

liverpool_games = stats.get_team_games("Liverpool")

for game in liverpool_games:
    print(f"Season {game['season']}, Week {game['match_week']}")
    print(f"{game['home_team']['name']} vs {game['away_team']['name']}")

Get Historical Data

from datetime import datetime

# Get last 5 games before a specific date
recent_games = stats.get_games_before_date(
    date=datetime(2024, 3, 15),
    limit=5,
    team="Arsenal"
)

for game in recent_games:
    print(f"{game['date']}: {game['home_team']['name']} vs {game['away_team']['name']}")

Database Maintenance

Checking Database Status

from premier_league import MatchStatistics
from premier_league.data.models import League

stats = MatchStatistics()

# Print each league with the latest season stored for it
for league_name in stats.get_all_leagues():
    league = stats.session.query(League).filter_by(name=league_name).first()
    print(f"{league_name}: {league.up_to_date_season}")

Backing Up Your Database

import os
import shutil
from datetime import datetime

# Make sure the backup directory exists before copying
os.makedirs("backups", exist_ok=True)

# Create a timestamped backup
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
shutil.copy2(
    "data/premier_league.db",
    f"backups/premier_league_{timestamp}.db"
)
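
A plain file copy can capture a half-written database if an update is running at the same moment. As an alternative, here is a sketch that uses the online backup API from Python's standard sqlite3 module, which copies a consistent snapshot. backup_database is a hypothetical helper, and the backups directory name is only an example.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_database(src_path, backup_dir="backups"):
    """Copy a SQLite database with the online backup API and return the new path."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest_path = backup_dir / f"{Path(src_path).stem}_{timestamp}.db"

    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies a consistent snapshot even if another connection is writing
        src.backup(dest)
    finally:
        dest.close()
        src.close()
    return dest_path
```

Calling backup_database("data/premier_league.db") would write a file such as backups/premier_league_<timestamp>.db.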

Resetting the Database

To start fresh, simply delete the database file:
import os

db_path = "data/premier_league.db"
if os.path.exists(db_path):
    os.remove(db_path)
    print("Database deleted. Will be recreated on next initialization.")

Best Practices

Keep your database in a dedicated directory and use environment variables for production:
import os

db_dir = os.getenv("PL_DB_DIR", "data")
db_file = os.getenv("PL_DB_FILE", "premier_league.db")

stats = MatchStatistics(
    db_filename=db_file,
    db_directory=db_dir
)
Don’t wait until you need data. Schedule weekly or daily updates:
# Update every Sunday evening after the match week
schedule.every().sunday.at("22:00").do(update_database)
The database grows as you add more data. Monitor its size:
import os

db_size = os.path.getsize("data/premier_league.db") / (1024 * 1024)
print(f"Database size: {db_size:.2f} MB")
If you access the database concurrently, consider using SQLAlchemy’s connection pooling. Keep in mind that each process gets its own pool and that SQLite still allows only one writer at a time:
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite:///data/premier_league.db",
    poolclass=QueuePool,
    pool_size=5,
    max_overflow=10
)

Troubleshooting

“All Data is up to Date!” Message

If you see this message when running update_data_set(), it means:
  • Your database contains all available matches
  • No new matches have been played since your last update
  • The current season hasn’t started yet (if checking in summer)

Database Locked Errors

SQLite databases can only handle one write operation at a time:
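from threading import Thread
from premier_league import MatchStatistics

# Bad: Multiple simultaneous updates
thread1 = Thread(target=stats.update_data_set)
thread2 = Thread(target=stats.update_data_set)  # Will cause locks

# Good: Sequential updates
stats2 = MatchStatistics()  # A second instance pointing at the same database
stats.update_data_set()
# Wait for completion, then:
stats2.update_data_set()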
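
If updates can be triggered from several threads in the same process, one way to keep them sequential is to guard the call with a lock. This is a minimal sketch: run_update_serialized is a hypothetical helper, fake_update stands in for stats.update_data_set, and the lock does not protect against updates started from other processes.

```python
import threading
import time

update_lock = threading.Lock()


def run_update_serialized(update_fn):
    """Run update_fn while holding a process-wide lock so updates never overlap."""
    with update_lock:
        return update_fn()


# Demonstration with a stand-in for stats.update_data_set
active = 0
max_active = 0


def fake_update():
    """Pretend to update while tracking how many updates run at once."""
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    time.sleep(0.05)
    active -= 1


threads = [
    threading.Thread(target=run_update_serialized, args=(fake_update,))
    for _ in range(3)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(max_active)  # 1: the updates ran one at a time
```

In real use you would call run_update_serialized(stats.update_data_set) from each thread.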

Missing Data After Update

If games are missing after an update:
  1. Check if the season format is correct (“YYYY-YYYY” with regular hyphen)
  2. Verify the league name matches exactly (use get_all_leagues())
  3. Ensure your internet connection is stable during the update
  4. Check the console for error messages during scraping

Advanced: Direct Database Access

For advanced queries, access the SQLAlchemy session directly:
from premier_league.data.models import Game, GameStats, Team, League
from sqlalchemy import func

# Get average goals per game by season
results = stats.session.query(
    Game.season,
    func.avg(Game.home_goals + Game.away_goals).label('avg_goals')
).group_by(Game.season).all()

for season, avg_goals in results:
    print(f"{season}: {avg_goals:.2f} goals per game")
You now know how to initialize, manage, update, and query your Premier League database!
