
Overview

The preprocessing stage cleans and standardizes raw NBA player data. The Preprocessor class handles type conversions, missing value imputation, and outlier detection to prepare data for feature engineering.

Preprocessor Class

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:7

Initialization

class Preprocessor:
    def __init__(self, random_seed: int = 42, missing_strategy: str = 'median')
Parameters:
  • random_seed (int, default 42): Random seed for reproducible preprocessing operations
  • missing_strategy (str, default 'median'): Strategy for handling missing numeric values. Options:
      • 'median': fill with column median (default, robust to outliers)
      • 'mean': fill with column mean
      • any other value: fill with 0

Core Methods

clean()

Performs comprehensive data cleaning and type conversion.
def clean(self, df: pd.DataFrame) -> pd.DataFrame
  • df (pd.DataFrame, required): Raw DataFrame containing NBA player data

Returns: Cleaned DataFrame with standardized types and formats

Transformations Applied:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:12-24

1. Date Parsing

cleaned['b_day'] = pd.to_datetime(cleaned['b_day'], format='%m/%d/%y', errors='coerce')
cleaned['draft_year'] = pd.to_datetime(cleaned['draft_year'], format='%Y', errors='coerce')
  • Converts birth dates from 'MM/DD/YY' format
  • Converts draft years from 'YYYY' format
  • Invalid dates become NaT (coerced)
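The effect of errors='coerce' can be seen on a toy column (illustrative values, not from the dataset):

```python
import pandas as pd

# Toy birth-date column with one malformed entry (illustrative values)
b_day = pd.Series(['06/29/84', '12/30/94', 'not-a-date'])
parsed = pd.to_datetime(b_day, format='%m/%d/%y', errors='coerce')

print(parsed.isna().tolist())  # [False, False, True]: the bad entry became NaT
```

Note that handle_missing() imputes only numeric and object columns, so NaT values in datetime columns pass through unchanged and should be checked separately.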

2. Team Handling

cleaned['team'] = cleaned['team'].fillna('No Team')
  • Players without teams are labeled as 'No Team'

3. Height Conversion

cleaned['height'] = cleaned['height'].astype(str).str.split(' / ').str[1].astype(float)
Example transformation:
Input:  "6-2 / 1.88"
Output: 1.88 (float, meters)
  • Extracts metric height from dual format strings
  • Converts to float for numeric operations
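A runnable sketch of the same extraction on sample strings (illustrative values):

```python
import pandas as pd

# Dual-format height strings: imperial / metric (illustrative values)
height = pd.Series(['6-2 / 1.88', '6-10 / 2.08'])

# Keep only the metric part after ' / ' and cast to float
meters = height.astype(str).str.split(' / ').str[1].astype(float)
print(meters.tolist())  # [1.88, 2.08]
```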

4. Weight Conversion

cleaned['weight'] = cleaned['weight'].astype(str).str.split(' / ').str[1].str.replace(' kg.', '', regex=False).astype(float)
Example transformation:
Input:  "185 lbs. / 83.9 kg."
Output: 83.9 (float, kilograms)
  • Extracts metric weight
  • Removes unit suffix
  • Converts to float

5. Salary Parsing

cleaned['salary'] = cleaned['salary'].astype(str).str.replace('$', '', regex=False).astype(float)
Example transformation:
Input:  "$5000000"
Output: 5000000.0 (float)

6. Country Normalization

cleaned['country'] = np.where(cleaned['country'] == 'USA', 'USA', 'Not-USA')
  • Binary categorization: USA vs international players

7. Draft Round Handling

cleaned['draft_round'] = cleaned['draft_round'].replace('Undrafted', '0')
  • Standardizes undrafted players to round '0'
Example:
from pipeline.preprocessing import Preprocessor
import pandas as pd

preprocessor = Preprocessor(random_seed=42, missing_strategy='median')

raw_df = pd.read_csv('raw_players.csv')
cleaned_df = preprocessor.clean(raw_df)

print(f"Cleaned {len(cleaned_df)} player records")
print(f"Data types: {cleaned_df.dtypes}")
The clean() method automatically calls handle_missing() to impute missing values after transformations.

handle_missing()

Imputes missing values using the configured strategy.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:26-41
def handle_missing(self, df: pd.DataFrame) -> pd.DataFrame
  • df (pd.DataFrame, required): DataFrame potentially containing missing values

Returns: DataFrame with all missing values imputed

Imputation Logic:

Numeric Columns

num_cols = list(df.select_dtypes(include='number').columns)

if self.missing_strategy == 'median':
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
elif self.missing_strategy == 'mean':
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
else:
    out[num_cols] = out[num_cols].fillna(0)
  • Median (default): Robust to outliers, recommended for skewed distributions
  • Mean: Appropriate for normally distributed data
  • Zero: Fallback for any other strategy value
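The practical difference between the strategies shows up on skewed data. In this sketch (illustrative salaries, one superstar contract), the mean-filled value lands far above the median-filled one:

```python
import pandas as pd

# Four observed salaries plus one missing value; the 40M contract skews the mean
salaries = pd.Series([1_000_000, 1_500_000, 2_000_000, 40_000_000, None])

median_filled = salaries.fillna(salaries.median())  # fills with 1,750,000
mean_filled = salaries.fillna(salaries.mean())      # fills with 11,125,000

print(median_filled.iloc[-1], mean_filled.iloc[-1])
```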

Categorical Columns

cat_cols = list(df.select_dtypes(include='object').columns)
for col in cat_cols:
    out[col] = out[col].fillna(f'Unknown_{col}')
Example transformation:
Column: 'team'
Missing value → 'Unknown_team'

Column: 'college'
Missing value → 'Unknown_college'
Example:
preprocessor = Preprocessor(missing_strategy='median')

# DataFrame with missing values
df_with_nulls = pd.DataFrame({
    'salary': [1000000, None, 2000000],
    'height': [1.98, 2.01, None],
    'team': ['Lakers', None, 'Bulls']
})

df_imputed = preprocessor.handle_missing(df_with_nulls)
# salary: None → 1500000.0 (median of 1M and 2M)
# height: None → 1.995 (median of 1.98 and 2.01)
# team: None → 'Unknown_team'
Median and mean imputation can introduce bias. For critical analyses, consider more sophisticated methods like MICE or K-NN imputation.
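As one such alternative, scikit-learn's KNNImputer fills each gap from the most similar rows rather than a global statistic. A minimal sketch, assuming scikit-learn is installed (it is not shown as a dependency of this pipeline):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Missing weight for a 2.05 m player (illustrative values)
df = pd.DataFrame({
    'height': [1.80, 1.85, 2.05, 2.10],
    'weight': [80.0, 84.0, None, 110.0],
})

# The gap is filled with the mean weight of the 2 nearest rows by height
# (here the 2.10 m and 1.85 m players), not the column-wide median
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```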

detect_outliers_iqr()

Detects outliers using the Interquartile Range (IQR) method.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:43-51
def detect_outliers_iqr(
    self, 
    df: pd.DataFrame, 
    multiplier: float = 1.5
) -> pd.Series
  • df (pd.DataFrame, required): DataFrame to check for outliers (only numeric columns are analyzed)
  • multiplier (float, default 1.5): IQR multiplier for the outlier threshold. Common values:
      • 1.5: standard outlier detection (default)
      • 3.0: extreme outlier detection (more conservative)

Returns: Boolean Series indicating outlier rows (True = outlier)

Algorithm:
numeric = df.select_dtypes(include='number')
q1 = numeric.quantile(0.25)  # First quartile
q3 = numeric.quantile(0.75)  # Third quartile
iqr = q3 - q1                # Interquartile range

# Outlier boundaries
lower_bound = q1 - multiplier * iqr
upper_bound = q3 + multiplier * iqr

# Mark outliers
mask = ((numeric < lower_bound) | (numeric > upper_bound)).any(axis=1)
Visual Representation:

  Q3 + 1.5*IQR ───── Upper bound (values above are outliers)
        │
  Q3 ───┤───────────────┐
        │               │
  Q2 ───┤     IQR       │ ← middle 50% of the data
        │               │
  Q1 ───┤───────────────┘
        │
  Q1 - 1.5*IQR ───── Lower bound (values below are outliers)
Example:
from pipeline.preprocessing import Preprocessor
import pandas as pd

preprocessor = Preprocessor()

df = pd.DataFrame({
    'salary': [1000000, 2000000, 3000000, 50000000],  # Last value is outlier
    'height': [1.80, 1.90, 2.00, 2.10]
})

outlier_mask = preprocessor.detect_outliers_iqr(df, multiplier=1.5)
print(f"Outliers detected: {outlier_mask.sum()} rows")
print(df[outlier_mask])  # Show outlier rows
Use multiplier=1.5 for general outlier detection or multiplier=3.0 for extreme outliers only.
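The effect of the multiplier can be reproduced with the same IQR rule the method applies (a standalone sketch on illustrative values, not a call into the class):

```python
import pandas as pd

# One mildly extreme value: flagged at 1.5*IQR, tolerated at 3.0*IQR
df = pd.DataFrame({'salary': [1.0, 2.0, 3.0, 4.0, 10.0]})

numeric = df.select_dtypes(include='number')
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1  # here q1=2.0, q3=4.0, iqr=2.0

for m in (1.5, 3.0):
    mask = ((numeric < q1 - m * iqr) | (numeric > q3 + m * iqr)).any(axis=1)
    print(f"multiplier={m}: {int(mask.sum())} outlier(s)")
```

With multiplier=1.5 the upper bound is 7.0 and the value 10 is flagged; with multiplier=3.0 the bound rises to 10.0 and nothing is flagged.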

Data Flow

The preprocessing stage follows this flow: raw DataFrame → clean() (type conversion and parsing, which internally calls handle_missing() for imputation) → detect_outliers_iqr() (quality flagging) → feature engineering.

Integration with Pipeline

The preprocessing stage integrates with the streaming engine.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:37
class RealTimePipelineRunner:
    def __init__(self, config: PipelineConfig):
        self.preprocessor = Preprocessor(config.random_seed)
Batch Processing:
def _process_df(self, df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    cleaned = self.preprocessor.clean(df)
    # Continue to feature engineering...
Streaming Processing:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:79
def _process_stream_chunk(self, chunk: pd.DataFrame, rolling_state) -> tuple:
    cleaned = self.preprocessor.clean(chunk)
    # Stream-compatible processing
Quality Validation:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:426
cleaned = self.preprocessor.clean(df)
outlier_mask = self.preprocessor.detect_outliers_iqr(
    cleaned.select_dtypes(include='number')
)
quality = self.validator.quality_report(cleaned, outlier_mask)

Performance Considerations

Time Complexity

  • clean(): O(n) where n = number of rows
  • handle_missing(): O(n × m) where m = number of columns
  • detect_outliers_iqr(): O(n × k) where k = number of numeric columns

Memory Efficiency

All methods return copies of the input DataFrame, preserving the original data; for very large datasets, note that this briefly holds two copies in memory.

Streaming Compatibility

✅ All preprocessing methods work seamlessly with chunked data streams.
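A self-contained sketch of chunked processing (toy in-memory CSV; the fillna call stands in for the full clean() pass). One caveat: statistics such as a median computed per chunk differ from dataset-wide values, which is why the streaming engine carries rolling state:

```python
import io
import pandas as pd

# Toy CSV standing in for a player-data stream (illustrative rows)
csv = io.StringIO("salary,team\n1000000,Lakers\n,\n2000000,Bulls\n")

cleaned_chunks = []
for chunk in pd.read_csv(csv, chunksize=2):
    # Same team rule as clean(); each chunk is processed independently
    chunk['team'] = chunk['team'].fillna('No Team')
    cleaned_chunks.append(chunk)

result = pd.concat(cleaned_chunks, ignore_index=True)
print(result['team'].tolist())  # ['Lakers', 'No Team', 'Bulls']
```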

Best Practices

Choose the Right Strategy

Use 'median' for skewed salary distributions, 'mean' for normally distributed physical measurements.

Monitor Outliers

Track outlier rates over time to detect data quality issues early.

Validate Transformations

Always inspect a sample of cleaned data before processing the full dataset.

Document Assumptions

Record why you chose specific cleaning strategies for reproducibility.

Next Steps

Feature Engineering

Build derived features from cleaned data

Validation

Validate data quality and detect drift
