Overview
The preprocessing stage cleans and standardizes raw NBA player data. The Preprocessor class handles type conversions, missing value imputation, and outlier detection to prepare data for feature engineering.
Preprocessor Class
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:7
Initialization
class Preprocessor:
    def __init__(self, random_seed: int = 42, missing_strategy: str = 'median')
random_seed: Random seed for reproducible preprocessing operations
missing_strategy: Strategy for handling missing numeric values. Options:
'median': Fill with column median (default, robust to outliers)
'mean': Fill with column mean
Other: Fill with 0
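The effect of each strategy can be sketched with plain pandas (no Preprocessor needed); the series below is a hypothetical, skewed salary column:

```python
import pandas as pd

# Skewed sample: the three strategies give very different fills.
salaries = pd.Series([1_000_000.0, 2_000_000.0, None, 40_000_000.0])

filled_median = salaries.fillna(salaries.median())  # fills with 2_000_000.0
filled_mean = salaries.fillna(salaries.mean())      # fills with ~14_333_333.33
filled_zero = salaries.fillna(0)                    # fills with 0.0

print(filled_median[2], filled_mean[2], filled_zero[2])
```

On skewed data like salaries, the single large value drags the mean far above the typical player, which is why the median is the default.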
Core Methods
clean()
Performs comprehensive data cleaning and type conversion.
def clean(self, df: pd.DataFrame) -> pd.DataFrame
df: Raw DataFrame containing NBA player data
Returns: Cleaned DataFrame with standardized types and formats
Transformations Applied:
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:12-24
1. Date Parsing
cleaned['b_day'] = pd.to_datetime(cleaned['b_day'], format='%m/%d/%y', errors='coerce')
cleaned['draft_year'] = pd.to_datetime(cleaned['draft_year'], format='%Y', errors='coerce')
Converts birth dates from 'MM/DD/YY' format
Converts draft years from 'YYYY' format
Invalid dates become NaT (coerced)
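A minimal pandas sketch of the coercion behaviour (the sample dates are made up):

```python
import pandas as pd

# errors='coerce' turns unparseable entries into NaT instead of raising,
# mirroring the date parsing done in clean().
raw = pd.Series(['06/29/84', '02/12/96', 'not-a-date'])
parsed = pd.to_datetime(raw, format='%m/%d/%y', errors='coerce')

print(parsed.isna())  # only the malformed entry is NaT
```

Note that `%y` is a two-digit year, which pandas resolves using the standard strptime pivot (69-99 → 1900s, 00-68 → 2000s).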
2. Team Handling
cleaned['team'] = cleaned['team'].fillna('No Team')
Players without teams are labeled as 'No Team'
3. Height Conversion
cleaned['height'] = cleaned['height'].astype(str).str.split(' / ').str[1].astype(float)
Example transformation:
Input: "6-2 / 1.88"
Output: 1.88 (float, meters)
Extracts metric height from dual format strings
Converts to float for numeric operations
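The extraction pattern can be tried in isolation with plain pandas (the sample strings are hypothetical):

```python
import pandas as pd

# The dual "imperial / metric" strings split cleanly on ' / ';
# element [1] of each split is the metric part.
heights = pd.Series(['6-2 / 1.88', '7-0 / 2.13'])
metric = heights.astype(str).str.split(' / ').str[1].astype(float)

print(metric.tolist())  # [1.88, 2.13]
```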
4. Weight Conversion
cleaned['weight'] = cleaned['weight'].astype(str).str.split(' / ').str[1].str.replace(' kg.', '', regex=False).astype(float)
Example transformation:
Input: "185 lbs. / 83.9 kg."
Output: 83.9 (float, kilograms)
Extracts metric weight
Removes unit suffix
Converts to float
5. Salary Parsing
cleaned['salary'] = cleaned['salary'].astype(str).str.replace('$', '', regex=False).astype(float)
Example transformation:
Input: "$5000000"
Output: 5000000.0 (float)
6. Country Normalization
cleaned['country'] = np.where(cleaned['country'] == 'USA', 'USA', 'Not-USA')
Binary categorization: USA vs international players
7. Draft Round Handling
cleaned['draft_round'] = cleaned['draft_round'].replace('Undrafted', '0')
Standardizes undrafted players to round '0'
Example:
from pipeline.preprocessing import Preprocessor
import pandas as pd

preprocessor = Preprocessor(random_seed=42, missing_strategy='median')
raw_df = pd.read_csv('raw_players.csv')
cleaned_df = preprocessor.clean(raw_df)
print(f"Cleaned {len(cleaned_df)} player records")
print(f"Data types: {cleaned_df.dtypes}")
The clean() method automatically calls handle_missing() to impute missing values after transformations.
handle_missing()
Imputes missing values using the configured strategy.
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:26-41
def handle_missing(self, df: pd.DataFrame) -> pd.DataFrame
df: DataFrame potentially containing missing values
Returns: DataFrame with all missing values imputed
Imputation Logic:
Numeric Columns
num_cols = list(df.select_dtypes(include='number').columns)
if self.missing_strategy == 'median':
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
elif self.missing_strategy == 'mean':
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
else:
    out[num_cols] = out[num_cols].fillna(0)
Median (default): Robust to outliers, recommended for skewed distributions
Mean: Appropriate for normally distributed data
Zero: Fallback for any other strategy value
Categorical Columns
cat_cols = list(df.select_dtypes(include='object').columns)
for col in cat_cols:
    out[col] = out[col].fillna(f'Unknown_{col}')
Example transformation:
Column: 'team'
Missing value → 'Unknown_team'
Column: 'college'
Missing value → 'Unknown_college'
Example:
preprocessor = Preprocessor(missing_strategy='median')

# DataFrame with missing values
df_with_nulls = pd.DataFrame({
    'salary': [1000000, None, 2000000],
    'height': [1.98, 2.01, None],
    'team': ['Lakers', None, 'Bulls']
})
df_imputed = preprocessor.handle_missing(df_with_nulls)
# salary: None → 1500000.0 (median of 1M and 2M)
# height: None → 1.995 (median of 1.98 and 2.01)
# team: None → 'Unknown_team'
Median and mean imputation can introduce bias. For critical analyses, consider more sophisticated methods like MICE or K-NN imputation.
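As an illustration of the K-NN idea, here is a minimal 1-nearest-neighbour sketch in plain pandas that fills a missing salary from the row with the closest height. This is not the class's method, and the data is made up; scikit-learn's KNNImputer is a production-grade alternative:

```python
import numpy as np
import pandas as pd

# Minimal 1-NN imputation sketch: fill a missing 'salary' from the
# donor row whose 'height' is closest.
df = pd.DataFrame({
    'height': [1.80, 1.98, 2.00, 2.10],
    'salary': [1_000_000.0, 5_000_000.0, np.nan, 9_000_000.0],
})

missing = df['salary'].isna()
donors = df[~missing]
for idx in df.index[missing]:
    # Index of the donor with the smallest height difference
    nearest = (donors['height'] - df.at[idx, 'height']).abs().idxmin()
    df.at[idx, 'salary'] = donors.at[nearest, 'salary']

print(df['salary'].tolist())  # row 2 borrows from its height-neighbour (1.98 m)
```

Unlike a global median, this fill reflects similar players rather than the whole population, which is the intuition behind K-NN imputation.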
detect_outliers_iqr()
Detects outliers using the Interquartile Range (IQR) method.
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:43-51
def detect_outliers_iqr(
    self,
    df: pd.DataFrame,
    multiplier: float = 1.5
) -> pd.Series
df: DataFrame to check for outliers (only numeric columns are analyzed)
multiplier: IQR multiplier for the outlier threshold. Common values:
1.5: Standard outlier detection (default)
3.0: Extreme outlier detection (more conservative)
Returns: Boolean Series indicating outlier rows (True = outlier)
Algorithm:
numeric = df.select_dtypes(include='number')
q1 = numeric.quantile(0.25)  # First quartile
q3 = numeric.quantile(0.75)  # Third quartile
iqr = q3 - q1                # Interquartile range

# Outlier boundaries
lower_bound = q1 - multiplier * iqr
upper_bound = q3 + multiplier * iqr

# Mark outliers
mask = ((numeric < lower_bound) | (numeric > upper_bound)).any(axis=1)
Visual Representation:
│
Q3 ───┤───────────────┐ Upper bound (Q3 + 1.5*IQR)
│ │
Q2 ───┤ IQR │ ← Normal range
│ │
Q1 ───┤───────────────┘ Lower bound (Q1 - 1.5*IQR)
│
Outliers detected beyond these bounds
Example:
from pipeline.preprocessing import Preprocessor
import pandas as pd

preprocessor = Preprocessor()
df = pd.DataFrame({
    'salary': [1000000, 2000000, 3000000, 50000000],  # Last value is an outlier
    'height': [1.80, 1.90, 2.00, 2.10]
})
outlier_mask = preprocessor.detect_outliers_iqr(df, multiplier=1.5)
print(f"Outliers detected: {outlier_mask.sum()} rows")
print(df[outlier_mask])  # Show outlier rows
Use multiplier=1.5 for general outlier detection or multiplier=3.0 for extreme outliers only.
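A quick sketch of how the multiplier changes sensitivity, replicating the IQR logic above on made-up data:

```python
import pandas as pd

# Ten "normal" values plus a moderate (25) and an extreme (100) outlier.
s = pd.Series(list(range(1, 11)) + [25, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

def count_outliers(multiplier: float) -> int:
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return int(((s < lower) | (s > upper)).sum())

print(count_outliers(1.5), count_outliers(3.0))  # 1.5 flags both; 3.0 flags only 100
```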
Data Flow
The preprocessing stage follows this flow: raw DataFrame → clean() (type conversion and standardization) → handle_missing() (imputation, invoked automatically by clean()) → detect_outliers_iqr() (quality checks) → feature engineering.
Integration with Pipeline
The preprocessing stage integrates with the streaming engine:
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:37
class RealTimePipelineRunner:
    def __init__(self, config: PipelineConfig):
        self.preprocessor = Preprocessor(config.random_seed)
Batch Processing:
def _process_df(self, df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    cleaned = self.preprocessor.clean(df)
    # Continue to feature engineering...
Streaming Processing:
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:79
def _process_stream_chunk(self, chunk: pd.DataFrame, rolling_state) -> tuple:
    cleaned = self.preprocessor.clean(chunk)
    # Stream-compatible processing
Quality Validation:
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:426
cleaned = self.preprocessor.clean(df)
outlier_mask = self.preprocessor.detect_outliers_iqr(
    cleaned.select_dtypes(include='number')
)
quality = self.validator.quality_report(cleaned, outlier_mask)
Time Complexity
clean(): O(n), where n = number of rows
handle_missing(): O(n × m), where m = number of columns
detect_outliers_iqr(): O(n × k), where k = number of numeric columns
Memory Efficiency
All methods operate on and return copies of the input DataFrame, so the original data is never mutated (at the cost of one extra copy per call).
Streaming Compatibility
✅ All preprocessing methods work seamlessly with chunked data streams.
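Row-wise transformations (date parsing, string extraction, unit stripping) are what make this safe: applying them per chunk and concatenating matches whole-frame processing. A sketch with a hypothetical clean_salary step standing in for one step of clean():

```python
import pandas as pd

# A row-wise cleaning step: each row's result depends only on that row.
def clean_salary(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out['salary'] = out['salary'].astype(str).str.replace('$', '', regex=False).astype(float)
    return out

df = pd.DataFrame({'salary': ['$1000000', '$2000000', '$3000000', '$4000000']})

# Simulate a stream by slicing the frame into chunks.
chunk_size = 2
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
streamed = pd.concat(clean_salary(c) for c in chunks)

assert streamed.equals(clean_salary(df))  # identical to batch processing
```

Statistics-dependent steps (median imputation, IQR bounds) depend on the full distribution, so per-chunk results can drift from batch results; the rolling_state argument in _process_stream_chunk() suggests the streaming engine maintains such state across chunks.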
Best Practices
Choose the Right Strategy Use 'median' for skewed salary distributions, 'mean' for normally distributed physical measurements.
Monitor Outliers Track outlier rates over time to detect data quality issues early.
Validate Transformations Always inspect a sample of cleaned data before processing the full dataset.
Document Assumptions Record why you chose specific cleaning strategies for reproducibility.
Next Steps
Feature Engineering Build derived features from cleaned data
Validation Validate data quality and detect drift