
Collection Overview

The NBA Statistics Data Platform employs an automated data collection architecture organized by season years. Scripts are scheduled to run daily, fetching data from multiple sources and organizing it into year-based directories.

Directory Structure

Data is organized hierarchically by season:
source/
├── 2014/
│   ├── defense/
│   ├── player_shooting/
│   ├── tracking/
│   └── playoffs/
├── 2015/
...
├── 2025/
│   ├── defense/
│   ├── player_shooting/
│   └── tracking/
Each season directory contains:
  • defense/: Defensive metrics (rim protection, DFG%, frequency stats)
  • player_shooting/: Shot quality data by defender distance
  • tracking/: NBA tracking stats (drives, touches, passes, catch & shoot)
  • playoffs/: Separate playoff versions of all datasets
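The layout above can be (re)created with a short Python sketch; the `ensure_season_dirs` helper and its argument names are hypothetical illustrations, not part of the actual pipeline:

```python
from pathlib import Path

# Subdirectories created under each season (playoffs/ may be absent
# for an in-progress season, as with 2025 above)
SUBDIRS = ["defense", "player_shooting", "tracking", "playoffs"]

def ensure_season_dirs(root: str, first_year: int, last_year: int) -> None:
    """Create <root>/<year>/<subdir> for each season in the range."""
    for year in range(first_year, last_year + 1):
        for sub in SUBDIRS:
            Path(root, str(year), sub).mkdir(parents=True, exist_ok=True)
```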

Automation Workflow

The entire collection process is orchestrated through script.sh, which runs each step sequentially:

Daily Execution Script

#!/bin/bash
# Convert Jupyter notebooks to Python scripts
jupyter nbconvert --to script *.ipynb

# Execute collection scripts in sequence
python player_shooting.py
python scrape_shooting.py
python misc.py
python player_level.py

python team_shooting.py
python update_lebron.py
python dribble.py
python underground.py
python hustle.py
python new_tracking.py
python dates.py
python make_index2.py
python salary2.py
python price.py

# Commit and push updates
git add --all
git commit -m 'Daily Update'
git push origin master

1. Script Conversion

Converts any Jupyter notebooks (.ipynb) to executable Python scripts using nbconvert

2. Data Collection

Executes individual collection scripts in sequence:
  • Player shooting metrics by defender distance
  • Team shooting statistics
  • Tracking data (drives, touches, passes)
  • Hustle stats (deflections, loose balls, charges)
  • Defense metrics (rim protection, DFG%)
  • Salary and contract information

3. Index Generation

Runs make_index2.py to create master player index with ID mappings (Basketball Reference → NBA.com)
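The ID-mapping join that make_index2.py performs can be illustrated with a minimal, dependency-free sketch; the column names and sample IDs below are illustrative, not taken from the real index:

```python
def build_index(bbref_names, bbref_to_nba):
    """Join player names keyed by Basketball Reference ID with NBA.com IDs."""
    rows = []
    for bbref_id, name in bbref_names.items():
        rows.append({
            "player": name,
            "bbref_id": bbref_id,
            "nba_id": bbref_to_nba.get(bbref_id),  # None when no mapping exists
        })
    return rows

# Illustrative sample data, not real index contents
names = {"curryst01": "Stephen Curry"}
ids = {"curryst01": 201939}
index = build_index(names, ids)
```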

4. Version Control

Automatically commits all updated CSV files and pushes to the remote repository

Data Distribution

After collection, processed CSV files are distributed to consumer applications:
# Distribution script (copy.sh)
WEB_APP_DIR="../web_app/data"
DISCORD_DIR="../discord/data"

# Copy master files to both destinations
cp nba_salaries.csv $WEB_APP_DIR/
cp index_master.csv $WEB_APP_DIR/
cp hustle.csv $WEB_APP_DIR/
cp player_shooting.csv $WEB_APP_DIR/
# ... (50+ additional files)

Distribution Targets

Web Application

Main web interface consuming all datasets for interactive visualizations

Discord Bot

Subset of datasets for quick stat lookups via Discord commands

Player Sheets

Specialized lineups and advanced metrics module
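The shell fan-out in copy.sh could equivalently be sketched in Python; the directory paths and file names here are illustrative stand-ins:

```python
import shutil
from pathlib import Path

# Illustrative consumer directories and master files (see copy.sh)
TARGETS = ["../web_app/data", "../discord/data"]
MASTER_FILES = ["nba_salaries.csv", "index_master.csv", "hustle.csv"]

def distribute(src_dir, files=MASTER_FILES, targets=TARGETS):
    """Copy each master CSV into every consumer's data directory."""
    for target in targets:
        Path(target).mkdir(parents=True, exist_ok=True)
        for name in files:
            shutil.copy(Path(src_dir, name), Path(target, name))
```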

Script Organization

Scripts are categorized by data type:
Category         | Scripts                                            | Output Files
-----------------|----------------------------------------------------|---------------------------------------------------
Player Shooting  | player_shooting.py, scrape_shooting.py, dribble.py | player_shooting.csv, dribbleshot.csv, shotzone.csv
Defense          | defense.py, organize_defense.py                    | dfg.csv, rimdfg.csv, rim_acc.csv, rimfreq.csv
Tracking         | new_tracking.py, passing.py, hustle.py             | tracking.csv, passing.csv, hustle.csv
Team Stats       | team_shooting.py, team_average_scrape.py           | team_shooting.csv, team_avg.csv
Salary Data      | salary_scrape.py, salary2.py                       | nba_salaries.csv, salary_spread.csv
Player Index     | make_index.py, make_index2.py                      | index_master.csv

Collection Frequency

Scripts are designed to run daily during the NBA season. Rate limiting and sleep delays are implemented to respect API usage policies.
  • Regular Season: Daily updates (October - April)
  • Playoffs: Daily updates with ps=True parameter (April - June)
  • Offseason: Weekly updates for contract/roster changes (July - September)
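The cadence above could be encoded in a small scheduling guard; the month boundaries and the choice of Monday for weekly runs are assumptions for illustration, not the pipeline's actual logic:

```python
from datetime import date

def update_due(today: date) -> bool:
    """Daily during the season (Oct-Jun), weekly (Mondays) in the offseason."""
    if today.month >= 10 or today.month <= 6:
        return True              # regular season or playoffs: run every day
    return today.weekday() == 0  # Jul-Sep offseason: run on Mondays only
```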

Season Configuration

Most scripts use a year-based loop configuration:
# Regular season data collection
for year in range(2024, 2026):  # Covers the 2024-25 and 2025-26 seasons
    season = f"{year}-{str(year+1)[-2:]}"
    # Fetch and process data...
The ps (playoffs) parameter toggles between regular season and playoff data:
ps = False  # Regular season
ps = True   # Playoffs
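Combining the season string with the ps toggle, a request-parameter builder might look like this sketch (the parameter names mirror NBA.com stats conventions but are assumptions here):

```python
def season_params(year: int, ps: bool) -> dict:
    """Build query parameters for one season's fetch."""
    season = f"{year}-{str(year + 1)[-2:]}"           # e.g. 2024 -> "2024-25"
    season_type = "Playoffs" if ps else "Regular Season"
    return {"Season": season, "SeasonType": season_type}
```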

Error Handling

Scripts implement basic retry logic and data validation:
for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        continue  # Skip to the next URL
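Beyond skipping failed URLs, the retry behavior mentioned above can be captured in a generic helper; `with_retry` and its defaults are a hypothetical sketch, not the scripts' actual implementation:

```python
import time

def with_retry(fetch, retries=3, delay=1.0):
    """Call fetch(); on failure, wait with exponential backoff and retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise                      # out of attempts: surface the error
            time.sleep(delay * (2 ** attempt))
```

A collection script could wrap a request as `with_retry(lambda: requests.get(url, headers=headers, timeout=30))`.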

Next Steps

Scraping Pipeline

Deep dive into web scraping mechanics and API integrations

Data Processing

Learn how raw data is transformed into analysis-ready datasets
