
Overview

The preprocessing stage cleans and standardizes raw NBA player data. The Preprocessor class handles type conversions, missing value imputation, and outlier detection to prepare data for feature engineering.

Preprocessor Class

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:7

Initialization

class Preprocessor:
    def __init__(self, random_seed: int = 42, missing_strategy: str = 'median')
Parameters:
  • random_seed (int, default 42): Random seed for reproducible preprocessing operations
  • missing_strategy (str, default 'median'): Strategy for handling missing numeric values. Options:
      • 'median': fill with column median (default, robust to outliers)
      • 'mean': fill with column mean
      • any other value: fill with 0

Core Methods

clean()

Performs comprehensive data cleaning and type conversion.
def clean(self, df: pd.DataFrame) -> pd.DataFrame
  • df (pd.DataFrame, required): Raw DataFrame containing NBA player data

Returns: Cleaned DataFrame with standardized types and formats

Transformations Applied:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:12-24

1. Date Parsing

cleaned['b_day'] = pd.to_datetime(cleaned['b_day'], format='%m/%d/%y', errors='coerce')
cleaned['draft_year'] = pd.to_datetime(cleaned['draft_year'], format='%Y', errors='coerce')
  • Converts birth dates from 'MM/DD/YY' format
  • Converts draft years from 'YYYY' format
  • Invalid dates become NaT (coerced)
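The effect of errors='coerce' can be seen on a toy column (illustrative values, not from the dataset):

```python
import pandas as pd

# Toy birth-date column with one malformed entry (illustrative values)
b_day = pd.Series(['06/29/84', '12/30/94', 'not-a-date'])
parsed = pd.to_datetime(b_day, format='%m/%d/%y', errors='coerce')

print(parsed.isna().tolist())  # [False, False, True]: the bad entry became NaT
```

Note that handle_missing() imputes only numeric and object columns, so NaT values in datetime columns pass through unchanged and should be checked separately.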

2. Team Handling

cleaned['team'] = cleaned['team'].fillna('No Team')
  • Players without teams are labeled as 'No Team'

3. Height Conversion

cleaned['height'] = cleaned['height'].astype(str).str.split(' / ').str[1].astype(float)
Example transformation:
Input:  "6-2 / 1.88"
Output: 1.88 (float, meters)
  • Extracts metric height from dual format strings
  • Converts to float for numeric operations
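A runnable sketch of the same extraction on sample strings (illustrative values):

```python
import pandas as pd

# Dual-format height strings: imperial / metric (illustrative values)
height = pd.Series(['6-2 / 1.88', '6-10 / 2.08'])

# Keep only the metric part after ' / ' and cast to float
meters = height.astype(str).str.split(' / ').str[1].astype(float)
print(meters.tolist())  # [1.88, 2.08]
```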

4. Weight Conversion

cleaned['weight'] = cleaned['weight'].astype(str).str.split(' / ').str[1].str.replace(' kg.', '', regex=False).astype(float)
Example transformation:
Input:  "185 lbs. / 83.9 kg."
Output: 83.9 (float, kilograms)
  • Extracts metric weight
  • Removes unit suffix
  • Converts to float

5. Salary Parsing

cleaned['salary'] = cleaned['salary'].astype(str).str.replace('$', '', regex=False).astype(float)
Example transformation:
Input:  "$5000000"
Output: 5000000.0 (float)

6. Country Normalization

cleaned['country'] = np.where(cleaned['country'] == 'USA', 'USA', 'Not-USA')
  • Binary categorization: USA vs international players

7. Draft Round Handling

cleaned['draft_round'] = cleaned['draft_round'].replace('Undrafted', '0')
  • Standardizes undrafted players to round '0'
Example:
from pipeline.preprocessing import Preprocessor
import pandas as pd

preprocessor = Preprocessor(random_seed=42, missing_strategy='median')

raw_df = pd.read_csv('raw_players.csv')
cleaned_df = preprocessor.clean(raw_df)

print(f"Cleaned {len(cleaned_df)} player records")
print(f"Data types: {cleaned_df.dtypes}")
The clean() method automatically calls handle_missing() to impute missing values after transformations.

handle_missing()

Imputes missing values using the configured strategy.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:26-41
def handle_missing(self, df: pd.DataFrame) -> pd.DataFrame
  • df (pd.DataFrame, required): DataFrame potentially containing missing values

Returns: DataFrame with all missing values imputed

Imputation Logic:

Numeric Columns

num_cols = list(df.select_dtypes(include='number').columns)

if self.missing_strategy == 'median':
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
elif self.missing_strategy == 'mean':
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
else:
    out[num_cols] = out[num_cols].fillna(0)
  • Median (default): Robust to outliers, recommended for skewed distributions
  • Mean: Appropriate for normally distributed data
  • Zero: Fallback for any other strategy value
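The practical difference between the strategies shows up on skewed data. In this sketch (illustrative salaries, one superstar contract), the mean-filled value lands far above the median-filled one:

```python
import pandas as pd

# Four observed salaries plus one missing value; the 40M contract skews the mean
salaries = pd.Series([1_000_000, 1_500_000, 2_000_000, 40_000_000, None])

median_filled = salaries.fillna(salaries.median())  # fills with 1,750,000
mean_filled = salaries.fillna(salaries.mean())      # fills with 11,125,000

print(median_filled.iloc[-1], mean_filled.iloc[-1])
```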

Categorical Columns

cat_cols = list(df.select_dtypes(include='object').columns)
for col in cat_cols:
    out[col] = out[col].fillna(f'Unknown_{col}')
Example transformation:
Column: 'team'
Missing value → 'Unknown_team'

Column: 'college'
Missing value → 'Unknown_college'
Example:
preprocessor = Preprocessor(missing_strategy='median')

# DataFrame with missing values
df_with_nulls = pd.DataFrame({
    'salary': [1000000, None, 2000000],
    'height': [1.98, 2.01, None],
    'team': ['Lakers', None, 'Bulls']
})

df_imputed = preprocessor.handle_missing(df_with_nulls)
# salary: None → 1500000.0 (median of 1M and 2M)
# height: None → 1.995 (median of 1.98 and 2.01)
# team: None → 'Unknown_team'
Median and mean imputation can introduce bias. For critical analyses, consider more sophisticated methods like MICE or K-NN imputation.
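As one such alternative, scikit-learn's KNNImputer fills each gap from the most similar rows rather than a global statistic. A minimal sketch, assuming scikit-learn is installed (it is not shown as a dependency of this pipeline):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Missing weight for a 2.05 m player (illustrative values)
df = pd.DataFrame({
    'height': [1.80, 1.85, 2.05, 2.10],
    'weight': [80.0, 84.0, None, 110.0],
})

# The gap is filled with the mean weight of the 2 nearest rows by height
# (here the 2.10 m and 1.85 m players), not the column-wide median
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```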

detect_outliers_iqr()

Detects outliers using the Interquartile Range (IQR) method.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:43-51
def detect_outliers_iqr(
    self, 
    df: pd.DataFrame, 
    multiplier: float = 1.5
) -> pd.Series
  • df (pd.DataFrame, required): DataFrame to check for outliers (only numeric columns are analyzed)
  • multiplier (float, default 1.5): IQR multiplier for the outlier threshold. Common values:
      • 1.5: standard outlier detection (default)
      • 3.0: extreme outlier detection (more conservative)

Returns: Boolean Series indicating outlier rows (True = outlier)

Algorithm:
numeric = df.select_dtypes(include='number')
q1 = numeric.quantile(0.25)  # First quartile
q3 = numeric.quantile(0.75)  # Third quartile
iqr = q3 - q1                # Interquartile range

# Outlier boundaries
lower_bound = q1 - multiplier * iqr
upper_bound = q3 + multiplier * iqr

# Mark outliers
mask = ((numeric < lower_bound) | (numeric > upper_bound)).any(axis=1)
Visual Representation:

  Q3 + 1.5*IQR ───── Upper bound (values above are outliers)
        │
  Q3 ───┤───────────────┐
        │               │
  Q2 ───┤     IQR       │ ← middle 50% of the data
        │               │
  Q1 ───┤───────────────┘
        │
  Q1 - 1.5*IQR ───── Lower bound (values below are outliers)
Example:
from pipeline.preprocessing import Preprocessor
import pandas as pd

preprocessor = Preprocessor()

df = pd.DataFrame({
    'salary': [1000000, 2000000, 3000000, 50000000],  # Last value is outlier
    'height': [1.80, 1.90, 2.00, 2.10]
})

outlier_mask = preprocessor.detect_outliers_iqr(df, multiplier=1.5)
print(f"Outliers detected: {outlier_mask.sum()} rows")
print(df[outlier_mask])  # Show outlier rows
Use multiplier=1.5 for general outlier detection or multiplier=3.0 for extreme outliers only.
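The effect of the multiplier can be reproduced with the same IQR rule the method applies (a standalone sketch on illustrative values, not a call into the class):

```python
import pandas as pd

# One mildly extreme value: flagged at 1.5*IQR, tolerated at 3.0*IQR
df = pd.DataFrame({'salary': [1.0, 2.0, 3.0, 4.0, 10.0]})

numeric = df.select_dtypes(include='number')
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1  # here q1=2.0, q3=4.0, iqr=2.0

for m in (1.5, 3.0):
    mask = ((numeric < q1 - m * iqr) | (numeric > q3 + m * iqr)).any(axis=1)
    print(f"multiplier={m}: {int(mask.sum())} outlier(s)")
```

With multiplier=1.5 the upper bound is 7.0 and the value 10 is flagged; with multiplier=3.0 the bound rises to 10.0 and nothing is flagged.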

Data Flow

The preprocessing stage follows this flow: raw DataFrame → clean() (type conversion and parsing, which internally calls handle_missing() for imputation) → detect_outliers_iqr() (quality flagging) → feature engineering.

Integration with Pipeline

The preprocessing stage integrates with the streaming engine.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:37
class RealTimePipelineRunner:
    def __init__(self, config: PipelineConfig):
        self.preprocessor = Preprocessor(config.random_seed)
Batch Processing:
def _process_df(self, df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    cleaned = self.preprocessor.clean(df)
    # Continue to feature engineering...
Streaming Processing:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:79
def _process_stream_chunk(self, chunk: pd.DataFrame, rolling_state) -> tuple:
    cleaned = self.preprocessor.clean(chunk)
    # Stream-compatible processing
Quality Validation:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:426
cleaned = self.preprocessor.clean(df)
outlier_mask = self.preprocessor.detect_outliers_iqr(
    cleaned.select_dtypes(include='number')
)
quality = self.validator.quality_report(cleaned, outlier_mask)

Performance Considerations

Time Complexity

  • clean(): O(n) where n = number of rows
  • handle_missing(): O(n × m) where m = number of columns
  • detect_outliers_iqr(): O(n × k) where k = number of numeric columns

Memory Efficiency

All methods return copies of the input DataFrame, preserving the original data; for very large datasets, note that this briefly holds two copies in memory.

Streaming Compatibility

✅ All preprocessing methods work seamlessly with chunked data streams.
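A self-contained sketch of chunked processing (toy in-memory CSV; the fillna call stands in for the full clean() pass). One caveat: statistics such as a median computed per chunk differ from dataset-wide values, which is why the streaming engine carries rolling state:

```python
import io
import pandas as pd

# Toy CSV standing in for a player-data stream (illustrative rows)
csv = io.StringIO("salary,team\n1000000,Lakers\n,\n2000000,Bulls\n")

cleaned_chunks = []
for chunk in pd.read_csv(csv, chunksize=2):
    # Same team rule as clean(); each chunk is processed independently
    chunk['team'] = chunk['team'].fillna('No Team')
    cleaned_chunks.append(chunk)

result = pd.concat(cleaned_chunks, ignore_index=True)
print(result['team'].tolist())  # ['Lakers', 'No Team', 'Bulls']
```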

Best Practices

Choose the Right Strategy

Use 'median' for skewed salary distributions, 'mean' for normally distributed physical measurements.

Monitor Outliers

Track outlier rates over time to detect data quality issues early.

Validate Transformations

Always inspect a sample of cleaned data before processing the full dataset.

Document Assumptions

Record why you chose specific cleaning strategies for reproducibility.

Next Steps

Feature Engineering

Build derived features from cleaned data

Validation

Validate data quality and detect drift
