
Installation

1. Clone the Repository

Clone the project to your local machine:
git clone <repository-url>
cd workspace/source
2. Install Dependencies

Install required Python packages using pip:
pip install -r requirements.txt
Key dependencies:
  • pandas==1.5.3 - Data manipulation
  • requests==2.32.3 - HTTP requests
  • beautifulsoup4==4.12.3 - Web scraping
  • nba_api==1.6.1 - NBA.com API wrapper
  • plotly==5.23.0 - Visualization (optional)
3. Verify Installation

Test that all packages are installed correctly:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nba_api.stats.endpoints import commonallplayers

print("All dependencies installed successfully!")
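The import check above confirms that the packages load, but not which versions are installed. The following is a minimal sketch, assuming the pins shown under "Key dependencies", that compares installed versions against them with the standard library's importlib.metadata. The names PINNED and check_versions are illustrative and do not come from the repository.

```python
from importlib import metadata

# Pins copied from the "Key dependencies" list (plotly is optional).
PINNED = {
    "pandas": "1.5.3",
    "requests": "2.32.3",
    "beautifulsoup4": "4.12.3",
    "nba_api": "1.6.1",
    "plotly": "5.23.0",
}

def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Report anything that differs from the pinned versions.
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package matches its pin. A mismatch is not necessarily a problem, but it is the first thing to check if a script behaves differently than described here.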

Your First Data Collection

Let’s collect hustle statistics for the 2024-25 season.

Running the Hustle Stats Script

import pandas as pd
import requests

def get_hustle(year, ps=False):
    """Fetch hustle statistics from NBA.com Stats API.
    
    Args:
        year: Season ending year (e.g., 2025 for 2024-25 season)
        ps: Boolean, True for playoffs, False for regular season
    """
    stype = "Playoffs" if ps else "Regular%20Season"
    season = str(year-1) + '-' + str(year)[-2:]
    
    # NBA.com Stats API endpoint
    url = (
        'https://stats.nba.com/stats/leaguehustlestatsplayer'
        '?College=&Conference=&Country=&DateFrom=&DateTo=&Division='
        '&DraftPick=&DraftYear=&GameScope=&Height=&ISTRound='
        '&LastNGames=0&LeagueID=00&Location=&Month=0'
        '&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N'
        '&PerMode=Totals&PlayerExperience=&PlayerPosition='
        f'&PlusMinus=N&Rank=N&Season={season}&SeasonSegment='
        f'&SeasonType={stype}&TeamID=0&VsConference=&VsDivision=&Weight='
    )
    
    # Required headers for NBA.com API
    headers = {
        "Host": "stats.nba.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Referer": "https://stats.nba.com/"
    }
    
    # Make the request; fail fast on timeouts and HTTP errors
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    json_data = response.json()
    
    # Extract data from JSON response
    data = json_data["resultSets"][0]["rowSet"]
    columns = json_data["resultSets"][0]["headers"]
    
    # Create DataFrame
    df = pd.DataFrame(data, columns=columns)
    df['year'] = year
    
    return df

# Collect 2024-25 regular season hustle stats
df = get_hustle(2025, ps=False)
print(df.head())
print(f"\nCollected {len(df)} players")
print(f"\nColumns: {df.columns.tolist()}")

# Save to CSV
df.to_csv('hustle_2025.csv', index=False)
You’ve successfully collected your first dataset! The CSV file contains hustle metrics like deflections, charges drawn, screen assists, and loose balls recovered.
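The year argument uses a season-ending-year convention, which is easy to get off by one. As a small illustration, the helper below reproduces the conversion that get_hustle performs internally. The name season_label is chosen for this example only; the script does not define it as a separate function.

```python
def season_label(year: int) -> str:
    """Convert a season-ending year to the 'YYYY-YY' form used in the Season parameter."""
    return f"{year - 1}-{str(year)[-2:]}"

print(season_label(2025))  # 2024-25
print(season_label(2000))  # 1999-00
```

So get_hustle(2025) requests the 2024-25 season, and get_hustle(2024) requests 2023-24.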

Understanding the Data

Data Structure

Each script outputs CSV files with a consistent structure:
Field              Description                      Type
PLAYER_ID          Unique NBA player identifier     Integer
PLAYER_NAME        Player’s full name               String
TEAM_ABBREVIATION  Team acronym (e.g., LAL, BOS)    String
GP                 Games played                     Integer
year               Season ending year               Integer
[stat columns]     Various statistics               Float/Integer
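Before analyzing a file, it can help to confirm that it carries the core fields from the table. The sketch below is one way to do that; CORE_FIELDS and missing_core_fields are illustrative names, and the check assumes a given file is meant to follow this structure.

```python
# Core fields taken from the table above.
CORE_FIELDS = ["PLAYER_ID", "PLAYER_NAME", "TEAM_ABBREVIATION", "GP", "year"]

def missing_core_fields(columns):
    """Return the core fields absent from an iterable of column names, in table order."""
    present = set(columns)
    return [field for field in CORE_FIELDS if field not in present]

# With pandas, pass df.columns; a plain list works the same way.
print(missing_core_fields(["PLAYER_ID", "PLAYER_NAME", "GP", "DEFLECTIONS"]))
# ['TEAM_ABBREVIATION', 'year']
```

An empty list means all core fields are present.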

Querying the Data

Once you have CSV files, you can query them with pandas:
import pandas as pd

# Load the data
df = pd.read_csv('hustle_2025.csv')

# Find players with most deflections (min 20 games)
top_deflections = df[df.GP >= 20].nlargest(10, 'DEFLECTIONS')
print(top_deflections[['PLAYER_NAME', 'TEAM_ABBREVIATION', 'GP', 'DEFLECTIONS']])
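Because get_hustle requests PerMode=Totals, DEFLECTIONS is a season total, so players with more games played tend to rank higher. A possible refinement, sketched below, ranks by a per-game rate instead. The per_game helper and the derived column name (for example DEFLECTIONS_PER_GAME) are introduced here for illustration and are not part of the collected data.

```python
import pandas as pd

def per_game(df, stat, min_games=20):
    """Add a '<stat>_PER_GAME' column for players with at least min_games, sorted descending."""
    eligible = df[df["GP"] >= min_games].copy()
    rate_column = f"{stat}_PER_GAME"
    eligible[rate_column] = eligible[stat] / eligible["GP"]
    return eligible.sort_values(rate_column, ascending=False)

# Example usage with the file saved earlier:
# df = pd.read_csv('hustle_2025.csv')
# print(per_game(df, 'DEFLECTIONS').head(10))
```

The games-played filter still applies, so small-sample outliers do not dominate the rate ranking.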

Running Other Data Collection Scripts

The platform includes scripts for various data types. Here are common examples:

Defense Statistics

Collect opponent shooting data and rim protection metrics:
python defense.py
This generates:
  • dfg.csv - Overall opponent field goal percentage
  • rimdfg.csv - Rim defense (shots within 6 feet)
  • rim_acc.csv - At-rim accuracy allowed
  • rimfreq.csv - At-rim shot frequency faced

Player Shooting by Defender Distance

Collect shooting splits based on closest defender:
python player_shooting.py
Outputs four CSV files:
  • very_tight.csv - 0-2 feet defender distance
  • tight.csv - 2-4 feet
  • open.csv - 4-6 feet
  • wide_open.csv - 6+ feet
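To compare a player across all four coverage ranges, the files can be stacked into one table with a label column. This is a rough sketch that assumes the four files sit in the same directory and share the same columns; load_defender_splits and the defender_distance column are hypothetical names, not something player_shooting.py produces.

```python
from pathlib import Path

import pandas as pd

# File names and ranges as listed above.
DEFENDER_RANGES = {
    "very_tight.csv": "0-2 ft",
    "tight.csv": "2-4 ft",
    "open.csv": "4-6 ft",
    "wide_open.csv": "6+ ft",
}

def load_defender_splits(directory):
    """Read whichever split files exist and tag each row with its defender-distance range."""
    frames = []
    for filename, label in DEFENDER_RANGES.items():
        path = Path(directory) / filename
        if not path.exists():
            continue  # skip files that have not been generated yet
        frame = pd.read_csv(path)
        frame["defender_distance"] = label
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Example usage, assuming the 2024 layout shown under "Directory Structure":
# splits = load_defender_splits('2024/player_shooting')
```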

Passing & Playmaking

Collect comprehensive passing statistics:
python passing.py
The passing script merges data from the pbpstats.com API and NBA.com tracking endpoints to provide complete playmaking metrics.

Dribble-Based Shooting

Analyze shooting by number of dribbles before the shot:
python dribble.py
Creates dribble shooting splits (0, 1, 2, 3-6, 7+ dribbles) for both:
  • Overall shooting (dribbleshot.csv)
  • Catch & shoot vs. pull-ups (jumpdribble.csv)
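The dribble splits group raw dribble counts into five bins. As an illustration of the grouping only, the function below maps a count to its bin. The name dribble_bin and the exact label strings are assumptions for this example; the labels actually written by dribble.py may differ.

```python
def dribble_bin(dribbles: int) -> str:
    """Map a dribble count to one of the bins 0, 1, 2, 3-6, 7+."""
    if dribbles < 0:
        raise ValueError("dribble count cannot be negative")
    if dribbles <= 2:
        return str(dribbles)
    if dribbles <= 6:
        return "3-6"
    return "7+"

print([dribble_bin(n) for n in (0, 1, 2, 3, 6, 7, 12)])
# ['0', '1', '2', '3-6', '3-6', '7+', '7+']
```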

Working with Master Files

Master CSV files contain all seasons combined for easy multi-year analysis:
import pandas as pd

# Load master file with all seasons
all_passing = pd.read_csv('passing.csv')

# Filter to specific player across all years
luka_passing = all_passing[
    # na=False treats rows with a missing name as non-matches
    all_passing['Name'].str.contains('Luka Doncic', case=False, na=False)
]

print(luka_passing[['Name', 'year', 'Assists', 'Potential Assists', 
                     'High Value Assist %', 'on-ball-time%']])
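If a master file is missing or out of date, one way to rebuild it is to concatenate the per-year files. The sketch below assumes per-year files live at <root>/<year>/<filename>, as in the f'{year}/hustle.csv' pattern used elsewhere in this guide, and that directory names are season-ending years. build_master is a hypothetical helper and is not necessarily how the repository's scripts assemble their master files.

```python
from pathlib import Path

import pandas as pd

def build_master(root, filename):
    """Concatenate <root>/<year>/<filename> across every numeric year directory."""
    frames = []
    for year_dir in sorted(Path(root).iterdir()):
        path = year_dir / filename
        if year_dir.is_dir() and year_dir.name.isdigit() and path.exists():
            frame = pd.read_csv(path)
            # Fill in the season only if the per-year file does not already carry it
            if "year" not in frame.columns:
                frame["year"] = int(year_dir.name)
            frames.append(frame)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Example usage:
# master = build_master('.', 'hustle.csv')
# master.to_csv('hustle.csv', index=False)
```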

Directory Structure

Scripts, per-year output, and master files are laid out as follows:
workspace/source/
├── requirements.txt           # Python dependencies
├── defense.py                # Defense data collection
├── hustle.py                 # Hustle stats collection  
├── passing.py                # Passing data collection
├── player_shooting.py        # Shooting by defender distance
├── dribble.py                # Dribble-based shooting
├── make_index.py             # Player/team ID mapping
│
├── 2024/                     # Year-specific directories
│   ├── defense/
│   │   ├── dfg.csv
│   │   ├── rimdfg.csv
│   │   └── ...
│   ├── player_shooting/
│   │   ├── very_tight.csv
│   │   ├── tight.csv
│   │   └── ...
│   └── playoffs/
│       └── ...
│
├── hustle.csv                # Master files (all years)
├── passing.csv
├── defense_master.csv
└── index_master.csv          # Player ID mappings
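Given this layout, file locations can be derived rather than hard-coded. The sketch below uses pathlib and covers only the regular-season paths shown in the tree; the layout under playoffs/ is not spelled out above, so it is left out. Both function names are illustrative.

```python
from pathlib import Path

def season_file(root, year, category, filename):
    """Build the path <root>/<year>/<category>/<filename> without touching the disk."""
    return Path(root) / str(year) / category / filename

def year_directories(root):
    """Return the four-digit year subdirectories of root, sorted by year."""
    candidates = (
        p for p in Path(root).iterdir()
        if p.is_dir() and p.name.isdigit() and len(p.name) == 4
    )
    return sorted(candidates, key=lambda p: int(p.name))

print(season_file("workspace/source", 2024, "defense", "dfg.csv").as_posix())
# workspace/source/2024/defense/dfg.csv
```

year_directories is handy for discovering which seasons have already been collected before starting a long run.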

Rate Limiting & Best Practices

Always respect API rate limits to avoid being blocked:
  • NBA.com Stats API: Add 1-3 second delays between requests
  • Basketball Reference: Use 2-3 second delays and proper User-Agent headers
  • pbpstats.com: Implement 3-second delays for team-level loops
Example rate limiting:
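import os
import time

for year in range(2014, 2026):
    df = get_hustle(year)

    # Create the year directory if it does not exist yet
    os.makedirs(str(year), exist_ok=True)
    df.to_csv(f'{year}/hustle.csv', index=False)
    print(f'Completed {year}')

    # Delay between requests
    time.sleep(3)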
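A fixed delay does not help when a single request fails partway through a long loop. One common complement is to retry with increasing delays. The following is a minimal sketch of that idea, not something the repository's scripts are known to do; fetch_with_retry and its parameters are hypothetical. It catches any exception for brevity, and in practice you may want to narrow that to requests.RequestException.

```python
import time

def fetch_with_retry(fetch, *args, retries=3, base_delay=3.0, sleep=time.sleep, **kwargs):
    """Call fetch(*args, **kwargs); on failure wait base_delay, then 2x, 4x, ... and retry."""
    for attempt in range(retries + 1):
        try:
            return fetch(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Example usage inside the loop above:
# df = fetch_with_retry(get_hustle, year)
```

With the defaults, a failing call is retried after 3, 6, and 12 seconds before the error is raised. The sleep parameter exists so the delays can be replaced in tests.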

Next Steps

Explore Data Sources

Learn about the three data sources and what each provides

Data Schema Reference

Browse complete field definitions for all datasets

Player Statistics

Explore player-level data collections

API Scripts

Full documentation of all collection scripts
