
Overview

The Make Index scripts create and maintain the master player index, collecting basic scoring statistics and calculating True Shooting Percentage (TS%). Two versions exist:
  • make_index.py - Legacy version with manual scraping
  • make_index2.py - Modern refactored version with improved error handling

Data Sources

  • Basketball Reference: Player statistics (totals and per-possession)
  • NBA API: Player IDs and current roster data
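  • URLs: basketball-reference.com/leagues/NBA_{year}_totals.html and basketball-reference.com/leagues/NBA_{year}_per_poss.html (the leagues segment becomes playoffs when PLAYOFFS_MODE is enabled)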

Core Functions

pull_bref_data()

Pulls player statistics from Basketball Reference.
totals (boolean, default: False)
If True, scrapes totals data. If False, scrapes per-possession data.
Returns: pd.DataFrame with columns:
  • player, url, team, year, G, MP, FGA, FG, 3PA, 3P, FTA, FT, PTS
# From make_index2.py:105
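def pull_bref_data(totals=False):
    # Playoff pages live under /playoffs/, regular-season pages under /leagues/
    leagues = "playoffs" if config.PLAYOFFS_MODE else "leagues"
    # The doubled braces leave a literal {year} placeholder in the pattern
    if totals:
        url_pattern = f"https://www.basketball-reference.com/{leagues}/NBA_{{year}}_totals.html"
    else:
        url_pattern = f"https://www.basketball-reference.com/{leagues}/NBA_{{year}}_per_poss.html"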
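
The URL selection above can be sketched as a standalone function. This is a minimal illustration, not the code from make_index2.py: build_bref_url and BASE_URL are hypothetical names, and the playoffs flag is passed in explicitly instead of being read from config.

```python
BASE_URL = "https://www.basketball-reference.com"


def build_bref_url(year: int, totals: bool = False, playoffs: bool = False) -> str:
    """Hypothetical helper: build the stats page URL for one season."""
    # Same two choices as the excerpt: site section and table type
    section = "playoffs" if playoffs else "leagues"
    table = "totals" if totals else "per_poss"
    return f"{BASE_URL}/{section}/NBA_{year}_{table}.html"


print(build_bref_url(2025, totals=True))
# https://www.basketball-reference.com/leagues/NBA_2025_totals.html
```

Passing the mode as an argument keeps the helper easy to test without touching global configuration.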

process_player_ids()

Matches Basketball Reference IDs to NBA API IDs.
df (pd.DataFrame, required)
DataFrame containing player data with URLs
master_df (pd.DataFrame, required)
Master index DataFrame with existing ID mappings
Returns: DataFrame with added bref_id, nba_id, and team_id columns
# From make_index2.py:204
def process_player_ids(df, master_df):
    # Extract Basketball Reference IDs
    df['bref_id'] = df['url'].str.split('/', expand=True)[5].str.split('.', expand=True)[0]
    
    # Map IDs to dataframe
    match_dict = dict(zip(master_df['bref_id'], master_df['nba_id']))
    df['nba_id'] = df['bref_id'].map(match_dict)
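
To make the string handling above concrete, here is a plain-Python sketch of the same extraction and lookup for a single URL. The names extract_bref_id, lookup_nba_id, and MASTER_IDS are hypothetical, and the dictionary is a one-entry stand-in for the mapping built from master_df.

```python
from typing import Optional

# Illustrative stand-in for dict(zip(master_df['bref_id'], master_df['nba_id']))
MASTER_IDS = {"jamesle01": 2544}


def extract_bref_id(url: str) -> str:
    """Take the 6th '/'-separated piece and drop everything after the first '.'."""
    return url.split("/")[5].split(".")[0]


def lookup_nba_id(url: str) -> Optional[int]:
    """Return the mapped NBA ID, or None when the player is not in the mapping."""
    return MASTER_IDS.get(extract_bref_id(url))


known = "https://www.basketball-reference.com/players/j/jamesle01.html"
unknown = "https://www.basketball-reference.com/players/x/xxxxxxx01.html"
print(extract_bref_id(known))  # jamesle01
print(lookup_nba_id(known))    # 2544
print(lookup_nba_id(unknown))  # None
```

Index 5 only lines up with the player ID when the URL is absolute (scheme and host included), so this sketch assumes the url column stores full URLs. In the pandas version, unmatched players come back as NaN from .map() instead of None.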

calculate_true_shooting()

Calculates True Shooting Percentage using the formula: TS% = PTS / (2 * (FGA + 0.44 * FTA)) * 100
df (pd.DataFrame, required)
DataFrame with PTS, FGA, and FTA columns
Returns: DataFrame with added TS% column
# From make_index2.py:269
df['TS%'] = (df['PTS'] / (2 * (df['FGA'] + 0.44 * df['FTA']))) * 100
df.replace([np.inf, -np.inf], 0, inplace=True)
df.loc[df['TS%'] > 150, 'TS%'] = 0  # Clean extreme values
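
A worked example of the formula and the clean-up rules, written as a scalar function so the arithmetic is easy to follow. true_shooting_pct is a hypothetical name, and returning 0 for a zero denominator is an assumption made for this sketch.

```python
def true_shooting_pct(pts: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)) * 100, with the same clean-up rules."""
    denominator = 2 * (fga + 0.44 * fta)
    if denominator == 0:
        # Assumption: treat "no attempts" as 0 instead of raising ZeroDivisionError
        return 0.0
    ts = pts / denominator * 100
    # Values above 150 are treated as noise and zeroed, as in the excerpt
    return 0.0 if ts > 150 else ts


# 30 points on 20 field goal attempts and 10 free throw attempts:
# 30 / (2 * (20 + 4.4)) * 100 = 30 / 48.8 * 100
print(round(true_shooting_pct(30, 20, 10), 2))  # 61.48
print(true_shooting_pct(4, 1, 0))               # 0.0 (raw value 200 exceeds 150)
```

One difference from the pandas excerpt is worth knowing: in pandas, a row with zero points and zero attempts produces NaN (0/0), not inf, so the replace() call shown above would leave it untouched.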

Configuration

Set at the top of make_index2.py:
PLAYOFFS_MODE (boolean, default: True)
Toggle between playoffs and regular season data
CURRENT_YEAR (integer, default: 2025)
Year to scrape (represents 2024-25 season)
CURRENT_SEASON (string, default: "2024-25")
Season format for NBA API
# From make_index2.py:15-20
class Config:
    PLAYOFFS_MODE = True
    CURRENT_YEAR = 2025
    CURRENT_SEASON = "2024-25"
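
CURRENT_YEAR and CURRENT_SEASON describe the same season in two formats, so they have to be updated together. The sketch below derives one from the other; season_label is a hypothetical helper and is not part of make_index2.py.

```python
def season_label(end_year: int) -> str:
    """Convert a season's ending year into the 'YYYY-YY' season string."""
    # The season starts the year before; the suffix is the last two digits of the end year
    return f"{end_year - 1}-{end_year % 100:02d}"


print(season_label(2025))  # 2024-25
print(season_label(2000))  # 1999-00
```

A check such as season_label(Config.CURRENT_YEAR) == Config.CURRENT_SEASON could catch the case where only one of the two values was updated.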

Output Files

index_master.csv / index_master_ps.csv (CSV)
Master player index with ID mappings.
Columns: player, url, year, team, bref_id, nba_id, team_id
scoring.csv / scoring_ps.csv (CSV)
Per-possession scoring statistics.
Columns: Player, TS%, PTS, MP, Tm, G, year, nba_id
totals.csv / totals_ps.csv (CSV)
Total scoring statistics with shooting attempts.
Columns: Player, TS%, PTS, MP, Tm, G, FTA, FGA, year, nba_id
games.csv / ps_games.csv (CSV)
Games played data exported to other modules.
Columns: nba_id, Player, year, G
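
Each output is listed as a pair of file names. The sketch below assumes the first name is the regular-season file and the second is the playoff file; OUTPUT_FILES and output_path are hypothetical names used only for illustration.

```python
# (regular season, playoffs) file names, copied from the list above
OUTPUT_FILES = {
    "index": ("index_master.csv", "index_master_ps.csv"),
    "scoring": ("scoring.csv", "scoring_ps.csv"),
    "totals": ("totals.csv", "totals_ps.csv"),
    "games": ("games.csv", "ps_games.csv"),
}


def output_path(kind: str, playoffs: bool) -> str:
    """Pick the file name for one output kind in the given mode."""
    regular_name, playoff_name = OUTPUT_FILES[kind]
    return playoff_name if playoffs else regular_name


print(output_path("scoring", playoffs=True))  # scoring_ps.csv
print(output_path("games", playoffs=True))    # ps_games.csv
```

An explicit table is used here because the games file puts ps at the front of the name while the others append _ps, so a single suffix rule would not cover all four.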

Usage Example

# Set configuration (edit the Config class at the top of make_index2.py)
class Config:
    PLAYOFFS_MODE = False  # Regular season
    CURRENT_YEAR = 2025
    CURRENT_SEASON = "2024-25"

config = Config()

# Run the main pipeline
if __name__ == "__main__":
    main()
Output:
Running in REGULAR SEASON mode
Fetching data from: https://www.basketball-reference.com/leagues/NBA_2025_totals.html
Successfully processed 612 players for 2025 (totals)
Found 15 players without NBA IDs
Fetching player data from NBA API...
Found 12 additional IDs from the NBA API
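
The log above suggests that IDs missing from the master index are filled in from further sources. A minimal sketch of that kind of staged lookup follows. It is an assumption about the general approach, not the script's actual logic: fill_missing_ids, the player names, and the ID values are all hypothetical, and matching by name is chosen only to keep the example small.

```python
from typing import Dict, List, Optional, Tuple


def fill_missing_ids(
    players: Dict[str, Optional[int]], *sources: Dict[str, int]
) -> Tuple[Dict[str, Optional[int]], List[str]]:
    """Fill None IDs from each source in order; return (resolved, still_missing)."""
    resolved = dict(players)  # copy so the caller's dict is not modified
    for source in sources:
        for name in list(resolved):
            if resolved[name] is None and name in source:
                resolved[name] = source[name]
    missing = [name for name, nba_id in resolved.items() if nba_id is None]
    return resolved, missing


players = {"Player A": 101, "Player B": None, "Player C": None, "Player D": None}
api_ids = {"Player B": 202}     # stand-in for IDs fetched from the NBA API
manual_ids = {"Player C": 303}  # stand-in for a hardcoded fallback dictionary

resolved, missing = fill_missing_ids(players, api_ids, manual_ids)
print(resolved["Player B"], resolved["Player C"])  # 202 303
print(missing)                                     # ['Player D']
```

Earlier sources take priority, so an ID already present in the master index is never overwritten by a later source.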

Key Features

  • Dynamic header mapping: Automatically detects column positions from Basketball Reference HTML
  • ID reconciliation: Matches players across Basketball Reference and NBA API
  • Playoff/regular season toggle: A single flag (PLAYOFFS_MODE in make_index2.py) controls the data source
  • Hardcoded ID fallbacks: Manual dictionary for players missing from APIs
  • TS% calculation: Industry-standard true shooting percentage formula
  • Data validation: Removes extreme TS% values (>150%) and handles inf/NaN
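
As an illustration of the dynamic header mapping idea, the sketch below looks columns up by header label instead of by fixed position. It assumes the header and row cells have already been extracted from the HTML as lists of strings; build_header_map, pick_columns, and the sample labels and values are hypothetical.

```python
from typing import Dict, List


def build_header_map(header_cells: List[str]) -> Dict[str, int]:
    """Map each column label to its position in the table."""
    return {label: position for position, label in enumerate(header_cells)}


def pick_columns(row_cells: List[str], header_map: Dict[str, int], wanted: List[str]) -> Dict[str, str]:
    """Read only the wanted columns from one row, located by label."""
    return {label: row_cells[header_map[label]] for label in wanted}


# Sample data for illustration only
header = ["Rk", "Player", "Age", "Tm", "G", "MP", "FGA", "FTA", "PTS"]
row = ["1", "Example Player", "25", "TOT", "70", "2100", "1000", "300", "1400"]

header_map = build_header_map(header)
print(pick_columns(row, header_map, ["Player", "G", "PTS"]))
# {'Player': 'Example Player', 'G': '70', 'PTS': '1400'}
```

Because positions are resolved from the header at run time, a column added or reordered on the source page does not shift the values that are read.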
