Processing Overview
Raw data from scraping APIs requires extensive processing before analysis:
Player ID Mapping
Map Basketball Reference IDs to NBA.com IDs for cross-dataset joins
Data Cleaning
Handle missing values, standardize names, convert data types
Dataset Merging
Combine multiple sources (tracking + shooting + defense)
Master File Generation
Append new season data to historical master CSVs
Player ID Mapping
The ID Mapping Challenge
Different sources use different player identifiers:
Basketball Reference: URL-based ID (curryst01)
NBA.com API: Numeric ID (201939)
PBPStats: Name-based lookup
Spotrac/HoopsHype: Player name strings
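To make the mismatch concrete, here is a small illustration using the Stephen Curry identifiers cited above; the dictionary layout itself is hypothetical, not part of the pipeline:

```python
# The same player keyed four different ways across sources.
# IDs are the real Stephen Curry identifiers; the record layout
# is an illustrative assumption.
sources = {
    "basketball_reference": "curryst01",   # URL-based ID
    "nba_api": 201939,                     # numeric PERSON_ID
    "pbpstats": "Stephen Curry",           # name-based lookup
    "spotrac": "Stephen Curry",            # name string
}

# A join across sources only works once every record carries a shared
# key, which is exactly what index_master.csv provides.
shared_key = sources["nba_api"]
print(shared_key)
```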
Building the Master Index
The make_index.py and make_index2.py scripts create index_master.csv:
import pandas as pd
from nba_api.stats.endpoints import commonallplayers
def pull_bref(ps=False, totals=False):
    # Scrape Basketball Reference
    # Extract player URLs
    index_frame['bref_id'] = index_frame['url'].str.split('/', expand=True)[5]
    index_frame['bref_id'] = index_frame['bref_id'].str.split('.', expand=True)[0]
    return index_frame
Mapping bref_id to nba_id
Load Existing Mappings
Read historical mappings from index_master.csv:
master = pd.read_csv('index_master.csv')
match_dict = dict(zip(master['bref_id'], master['nba_id']))
Apply Known Mappings
Map known Basketball Reference IDs to NBA IDs:
index_frame['nba_id'] = index_frame['bref_id'].map(match_dict)
Handle Manual Exceptions
Some players require manual ID mapping:
search_dict = {
    "hollaro01": 1641842,  # Ron Holland
    "sarral01": 1642259,   # Alexandre Sarr
    "dadiepa01": 1642359,  # Pacome Dadiet
    "cuiyo01": 1642385,    # Yongxi Cui
    "shannte01": 1630545   # Terrence Shannon Jr.
}
match_dict.update(search_dict)
index_frame['nba_id'] = index_frame['bref_id'].map(match_dict)
Fetch Missing Players from NBA API
Use nba_api to find new players:
from nba_api.stats.endpoints import commonallplayers

current_season = "2024-25"
players_data = commonallplayers.CommonAllPlayers(
    is_only_current_season=1,
    season=current_season
)
players_list = players_data.get_data_frames()[0]
player_names = dict(zip(
    players_list['DISPLAY_FIRST_LAST'],
    players_list['PERSON_ID']
))

# Map remaining players by name (copy to avoid SettingWithCopyWarning)
notfound = index_frame[index_frame.nba_id.isna()].copy()
notfound['nba_id'] = notfound['player'].map(player_names)
Update Master Index
Append new mappings and save:
index_copy = index_frame[[
    'player', 'url', 'year', 'team', 'bref_id', 'nba_id', 'team_id'
]]
master = master[master.year != 2025]
master = pd.concat([master, index_copy])
master.drop_duplicates(inplace=True)
master.to_csv('index_master.csv', index=False)
Team ID Mapping
Team acronyms are mapped to numeric IDs:
team_dict = dict(zip(master['team'], master['team_id']))
index_frame['team_id'] = index_frame['team'].map(team_dict)
Data Cleaning
Missing Value Handling
Different strategies for different data types:
# Fill numeric columns with 0
df.fillna(0, inplace=True)

# Replace empty strings with 0 before type conversion
df.replace('', 0, inplace=True)

# Drop rows missing critical IDs
df.dropna(subset=['bref_id'], inplace=True)
Type Conversions
Ensure proper data types for calculations:
# Convert string columns to numeric
index_frame['FTA'] = index_frame['FTA'].astype(float)
index_frame['FGA'] = index_frame['FGA'].astype(float)
index_frame['PTS'] = index_frame['PTS'].astype(float)

# Handle integer conversions for counting stats
data_col = ['FGM', 'FGA', '2FGM', '2FGA', '3PM', '3PA']
for col in data_col:
    df[col] = df[col].astype(int)
Calculated Fields
Derive advanced metrics from raw stats:
# True Shooting Percentage
index_frame['TS%'] = (
    index_frame['PTS'] /
    (2 * (index_frame['FGA'] + 0.44 * index_frame['FTA']))
) * 100

# Cap unrealistic percentages (requires numpy imported as np)
df.replace([np.inf, -np.inf], 0, inplace=True)
df.loc[df['TS%'] > 150, 'TS%'] = 0
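As a quick sanity check of the TS% formula, here is a worked example on a made-up stat line (the numbers are illustrative only):

```python
# TS% = PTS / (2 * (FGA + 0.44 * FTA)) * 100
# Example stat line: 30 points on 20 field goal attempts and 8 free
# throw attempts (illustrative numbers, not a real player).
pts, fga, fta = 30, 20, 8

ts_pct = pts / (2 * (fga + 0.44 * fta)) * 100
print(round(ts_pct, 1))  # -> 63.8
```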
Name Standardization
Critical for salary data joins:
import re

def standardize_names(df, name_column):
    """
    Standardizes player names with enhanced handling of special cases.
    """
    def clean_name(name):
        name = str(name).strip()
        # Fix specific formatting issues
        name_fixes = {
            'Bub Carrington': "Ja'Von Carrington",
            'G.G. Jackson': 'Gregory Jackson II',
            "Nah'Shon Hyland": 'Bones Hyland',
            'Sviatoslav Mykhailiuk': 'Svi Mykhailiuk'
        }
        if name in name_fixes:
            return name_fixes[name]
        # Handle suffixes consistently
        suffix_map = {
            r'Jr\.?$': 'Jr.',
            r'Sr\.?$': 'Sr.',
            r'III$': 'III',
            r'II$': 'II'
        }
        cleaned_name = name
        for pattern, replacement in suffix_map.items():
            if re.search(pattern, cleaned_name, flags=re.IGNORECASE):
                base_name = re.sub(pattern, '', cleaned_name).strip()
                cleaned_name = f"{base_name} {replacement}"
        return ' '.join(cleaned_name.split())

    result_df = df.copy()
    result_df['standardized_name'] = result_df[name_column].apply(clean_name)
    return result_df
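The suffix-normalization idea can be exercised in isolation. This standalone sketch reuses the same regex patterns (trimmed to two suffixes); the `canonical_suffix` helper name is ours, not part of the pipeline:

```python
import re

# Standalone sketch of the suffix handling in standardize_names:
# strip a trailing suffix, then re-append it in canonical form.
suffix_map = {
    r'Jr\.?$': 'Jr.',
    r'Sr\.?$': 'Sr.',
}

def canonical_suffix(name):
    for pattern, replacement in suffix_map.items():
        if re.search(pattern, name, flags=re.IGNORECASE):
            base = re.sub(pattern, '', name, flags=re.IGNORECASE).strip()
            return f"{base} {replacement}"
    # No suffix matched: just collapse stray whitespace
    return ' '.join(name.split())

print(canonical_suffix("Tim Hardaway Jr"))    # missing period gets added
print(canonical_suffix("Gary Payton  II "))   # whitespace collapsed
```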
Fuzzy Name Matching
For salary data with inconsistent naming:
from fuzzywuzzy import fuzz
from unidecode import unidecode

def match_names(salary_df, index_df, threshold=85):
    """
    Match names using fuzzy string matching and containment checking.
    """
    def clean_name(name):
        return unidecode(str(name).lower().strip())

    salary_names = salary_df['name'].apply(clean_name)
    index_names = index_df['player'].apply(clean_name)

    matched_ids = []
    for salary_name in salary_names:
        max_score = 0
        best_id = None
        for idx_name, nba_id in zip(index_names, index_df['nba_id']):
            # Check containment
            if salary_name in idx_name:
                length_ratio = len(salary_name) / len(idx_name)
                if length_ratio >= 0.5:
                    score = 100 * length_ratio
                    if score > max_score:
                        max_score = score
                        best_id = nba_id
            # Try fuzzy matching
            score = fuzz.ratio(salary_name, idx_name)
            if score > max_score:
                max_score = score
                best_id = nba_id
        matched_ids.append(best_id if max_score >= threshold else None)

    salary_df['nba_id'] = matched_ids
    return salary_df
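The two-stage scoring (containment first, fuzzy similarity as fallback) can be sketched without any third-party packages. This version substitutes the stdlib `difflib.SequenceMatcher` for fuzzywuzzy's `fuzz.ratio`, so the exact scores differ from the pipeline's:

```python
from difflib import SequenceMatcher

# Sketch of the two-stage matching idea, with difflib standing in
# for fuzzywuzzy. match_score is an illustrative helper name.
def match_score(salary_name, index_name):
    # Stage 1: containment check, scored by length ratio
    if salary_name in index_name:
        length_ratio = len(salary_name) / len(index_name)
        if length_ratio >= 0.5:
            return 100 * length_ratio
    # Stage 2: fuzzy similarity as a fallback
    return 100 * SequenceMatcher(None, salary_name, index_name).ratio()

# Containment: "derrick jones" is a prefix of "derrick jones jr."
containment = match_score("derrick jones", "derrick jones jr.")
# No containment ("steph curry" is not a substring), so fuzzy kicks in
fuzzy = match_score("steph curry", "stephen curry")
print(round(containment, 1), round(fuzzy, 1))
```

Note that a containment score of ~76 would fall below the default threshold of 85, which is why the fuzzy fallback matters even for near-substrings.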
Dataset Merging
Passing + Tracking Integration
Combining PBPStats totals with NBA tracking data:
def passing_data(ps=False, update=True):
    # Fetch PBPStats totals
    url = 'https://api.pbpstats.com/get-totals/nba'
    response = requests.get(url, params=params)
    df = pd.DataFrame(response.json()["multi_row_table_data"])

    # Load tracking data
    df2 = pd.read_csv('tracking/passing.csv')  # NBA passing stats
    df3 = pd.read_csv('tracking/touches.csv')  # NBA touches stats

    # Standardize join key
    df['nba_id'] = df['PLAYER_ID'].astype(int)
    df2['nba_id'] = df2['PLAYER_ID'].astype(int)
    df3['nba_id'] = df3['PLAYER_ID'].astype(int)

    # Merge datasets
    merged = df.merge(df2, on='nba_id', how='left')
    merged = merged.merge(df3, on='nba_id', how='left')

    # Calculate derived metrics
    merged['High Value Assist %'] = (
        100 * (merged['ThreePtAssists'] + merged['AtRimAssists']) /
        merged['Assists']
    )
    merged['on-ball-time%'] = (
        100 * 2 * merged['TIME_OF_POSS'] / merged['Minutes']
    )
    merged['Assist PPP'] = (
        merged['AST_PTS_CREATED'] / merged['POTENTIAL_AST']
    )
    return merged
Defense Data Aggregation
Combining rim protection, DFG%, and frequency stats:
def update_masters(masters, ps=False):
    trail = '_p' if ps else ''
    frames = {master: [] for master in masters}
    for year in range(2014, 2026):
        path = f"{year}/{'playoffs/' if ps else ''}defense/"
        for file in masters:  # ['rimfreq', 'rim_acc', 'dfg', 'rimdfg']
            df = pd.read_csv(path + file + '.csv')
            frames[file].append(df)
    for master in masters:
        masterframe = pd.concat(frames[master])
        masterframe.to_csv(f'{master}{trail}.csv', index=False)
Master File Updates
Append New Season Pattern
Standard workflow for updating master CSVs:
def update_master(master_file, new_data, year):
    # Load existing master
    old = pd.read_csv(master_file)
    # Remove current year (in case of re-run)
    old = old[old.year != year]
    # Add year column to new data
    new_data['year'] = year
    # Concatenate
    updated = pd.concat([old, new_data])
    # Save
    updated.to_csv(master_file, index=False)
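The reason this pattern removes the target year before appending is idempotency: re-running the update must not duplicate rows. This in-memory sketch (DataFrames in place of CSV files, an illustrative helper name) demonstrates that property:

```python
import pandas as pd

# In-memory sketch of the append pattern: dropping the target year
# first makes the update safe to re-run without duplicating rows.
def update_master_frame(old, new_data, year):
    old = old[old.year != year]
    new_data = new_data.assign(year=year)  # non-mutating variant
    return pd.concat([old, new_data], ignore_index=True)

master = pd.DataFrame({'player': ['A', 'B'], 'year': [2023, 2024]})
new = pd.DataFrame({'player': ['C']})

once = update_master_frame(master, new, 2025)
twice = update_master_frame(once, new, 2025)  # simulate a re-run
print(len(once), len(twice))  # same row count both times
```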
Hustle Stats Example
Combining hustle + speed/distance + possessions:
def hustle_master(ps=False):
    trail = '_ps' if ps else ''
    # Load old data (exclude current season)
    old_df = pd.read_csv('hustle.csv')
    old_df = old_df[old_df.year < 2025]
    # Fetch new season data
    df = get_hustle(2025, ps=ps)  # Returns merged hustle + speed + poss
    # Combine
    hustle = pd.concat([old_df, df])
    hustle.to_csv(f'hustle{trail}.csv', index=False)
    return hustle
Shooting Stats Consolidation
Aggregating multiple defender distance categories:
def master_shooting(playoffs=False):
    data = []
    for year in range(2014, 2026):
        path = f"{year}/{'playoffs/' if playoffs else ''}player_shooting/"
        files = ['wide_open', 'open', 'tight', 'very_tight']
        for file in files:
            df = pd.read_csv(path + file + '.csv')
            df['year'] = year
            df['shot_type'] = file
            data.append(df)
    master = pd.concat(data)
    return master
Dribble Shot Aggregation
Combining catch-and-shoot + pull-up by dribble count:
def get_dribbleshots2(years, ps=False):
    dribbles = ['0%20Dribbles', '1%20Dribble', '2%20Dribbles',
                '3-6%20Dribbles', '7%2B%20Dribbles']
    dataframe = []
    for year in years:
        for dribble in dribbles:
            # Fetch catch-and-shoot
            url = build_url(dribble, "Catch%20and%20Shoot")
            cs_df = fetch_shooting_data(url)
            # Fetch pull-ups
            url = build_url(dribble, "Pullups")
            pullup_df = fetch_shooting_data(url)
            # Append both
            dataframe.extend([cs_df, pullup_df])
    # Group by player + dribble count (sum across shot types)
    combined = pd.concat(dataframe)
    result = combined.groupby([
        'PLAYER_ID', 'PLAYER', 'TEAM', 'dribbles', 'year'
    ]).sum(numeric_only=True).reset_index()
    return result
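The final groupby collapses the catch-and-shoot row and the pull-up row for each player/dribble bucket into one summed row. A minimal sketch with made-up numbers (column names follow the snippet above, minus the grouping keys omitted for brevity):

```python
import pandas as pd

# Two rows for the same player and dribble bucket, one per shot type.
# Numbers are illustrative only.
rows = pd.DataFrame({
    'PLAYER_ID': [201939, 201939],
    'dribbles': ['0 Dribbles', '0 Dribbles'],
    'shot_type': ['catch_and_shoot', 'pullup'],
    'FGA': [10, 4],
    'FGM': [6, 1],
})

# sum(numeric_only=True) drops the non-numeric shot_type column and
# adds the counting stats across shot types.
summed = rows.groupby(['PLAYER_ID', 'dribbles']).sum(numeric_only=True).reset_index()
print(summed[['FGA', 'FGM']].iloc[0].tolist())  # -> [14, 7]
```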
Data Validation
Common Checks
# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Verify expected columns exist
required_cols = ['nba_id', 'player', 'year', 'team']
assert all(col in df.columns for col in required_cols)

# Check for negative values in counting stats
assert (df[['FGM', 'FGA', 'PTS']] >= 0).all().all()

# Verify year range
assert df['year'].between(2014, 2026).all()

# Check ID uniqueness per season
assert not df.duplicated(subset=['nba_id', 'year']).any()
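Asserts abort on the first failure; a gentler variant collects every problem into a list so a single run reports them all. A sketch of that idea (the `validate` helper is ours, not part of the pipeline):

```python
import pandas as pd

# Illustrative helper: run the same checks as above but accumulate
# issues instead of raising on the first failed assert.
def validate(df, required_cols=('nba_id', 'player', 'year', 'team')):
    issues = []
    for col in required_cols:
        if col not in df.columns:
            issues.append(f"missing column: {col}")
    if 'year' in df.columns and not df['year'].between(2014, 2026).all():
        issues.append("year out of range")
    if {'nba_id', 'year'} <= set(df.columns) and \
            df.duplicated(subset=['nba_id', 'year']).any():
        issues.append("duplicate nba_id within a season")
    return issues

df = pd.DataFrame({'nba_id': [1, 1], 'player': ['A', 'A'],
                   'year': [2025, 2025], 'team': ['X', 'X']})
print(validate(df))  # flags the duplicated player-season row
```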
Logging Unmatched Players
# Log players without NBA IDs
notfound = index_frame[index_frame.nba_id.isna()]
print(f"Unmatched players: {len(notfound)}")
print(notfound[['player', 'team', 'year']])

# Log fuzzy match quality
low_confidence = [
    name for name, info in mappings.items()
    if info['score'] and info['score'] < 95
]
print(f"Low confidence matches: {len(low_confidence)}")
Column Selection
Final output column standardization:
# Passing data output columns
columns = [
    'nba_id', 'Name', 'Points', 'on-ball-time%', 'on-ball-time',
    'UAPTS', 'TSA', 'OffPoss', 'Potential Assists', 'Travels',
    'TsPct', 'Turnovers', 'Passes', 'PASSES_RECEIVED',
    'PotAss/Passes', 'UAFGM', 'High Value Assist %', 'Assist PPP',
    'TOUCHES', 'AVG_SEC_PER_TOUCH', 'AVG_DRIB_PER_TOUCH',
    'PTS_PER_TOUCH', 'SECONDARY_AST', 'POTENTIAL_AST',
    'AST_PTS_CREATED', 'Assists', 'GP', 'Minutes', 'year'
]
output = merged[columns]
output.to_csv('passing.csv', index=False)
Large datasets (2014-2025 spanning 12 seasons) can be memory-intensive.
Optimization Strategies
Incremental Updates : Only fetch current season, append to historical data
Selective Loading : Use pd.read_csv(usecols=[...]) to load only needed columns
Type Optimization : Convert float64 → float32 for percentage columns
Chunk Processing : For extremely large files, use pd.read_csv(chunksize=10000)
# Example: Memory-efficient update
def efficient_update(master_file, year):
    # Load only the columns actually needed downstream
    old = pd.read_csv(
        master_file,
        usecols=['nba_id', 'player', 'year', 'PTS', 'TS%']
    )
    old = old[old.year != year]  # Drop the current year before re-appending
    # Fetch new data
    new = fetch_current_season(year)
    # Append and save
    pd.concat([old, new]).to_csv(master_file, index=False)
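The float64 → float32 optimization from the list above is easy to verify directly: halving the column width roughly halves its memory footprint, and float32's ~7 significant digits are ample for percentage columns. A minimal sketch:

```python
import numpy as np
import pandas as pd

# A percentage-like column defaults to float64; downcasting to
# float32 roughly halves its memory usage.
pct = pd.Series(np.random.uniform(0, 100, 10_000))  # float64 by default

before = pct.memory_usage(deep=True)
after = pct.astype(np.float32).memory_usage(deep=True)
print(before > after)  # the float32 copy is smaller
```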
Next Steps
Data Collection Learn about the automated collection architecture
Scraping Pipeline Explore API endpoints and web scraping patterns