The feature engineering stage transforms cleaned data into ML-ready features. The FeatureEngineer class creates derived features, computes rolling statistics, handles multicollinearity, and prepares final encoded datasets for training.
Creates a comprehensive feature set for batch processing.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/feature_engineering/features.py:31
Purpose: Capture salary trends over time.

Example with window=3:

```
Salaries:  [1M, 2M, 3M, 4M, 5M]
roll_mean: [1M,    # mean of [1M]
            1.5M,  # mean of [1M, 2M]
            2M,    # mean of [1M, 2M, 3M]
            3M,    # mean of [2M, 3M, 4M]
            4M]    # mean of [3M, 4M, 5M]
```
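The partial-window means at the start of the series imply `min_periods=1` semantics. A minimal sketch reproducing the example with plain pandas (the actual batch implementation in features.py may wrap this differently):

```python
import pandas as pd

salaries = pd.Series([1e6, 2e6, 3e6, 4e6, 5e6])

# min_periods=1 yields partial-window means at the start,
# matching the example output above.
roll_mean = salaries.rolling(window=3, min_periods=1).mean()
print(roll_mean.tolist())
# [1000000.0, 1500000.0, 2000000.0, 3000000.0, 4000000.0]
```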
Stream-compatible feature engineering with stateful rolling statistics.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/feature_engineering/features.py:54
Mutable state object maintaining rolling window history across chunks
Returns: Featured DataFrame chunk

Key Difference from Batch Mode: Instead of pandas .rolling(), uses stateful computation:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/feature_engineering/features.py:20-29
```python
def _add_streaming_rolling_features(
    self,
    featured: pd.DataFrame,
    state: RollingState,
) -> pd.DataFrame:
    means, stds = [], []
    for salary in featured['salary'].astype(float):
        state.salary_window.append(float(salary))  # Update window
        vals = np.array(state.salary_window, dtype=float)
        means.append(float(vals.mean()))
        stds.append(float(vals.std(ddof=0)) if len(vals) > 1 else 0.0)
    featured['salary_roll_mean'] = means
    featured['salary_roll_std'] = stds
    return featured
```
Streaming Example:
```python
engineer = FeatureEngineer()
state = engineer.init_rolling_state(rolling_window=5)

# Process chunks sequentially
for chunk in stream_chunks:
    featured_chunk = engineer.build_features_streaming(chunk, state)
    # State persists across chunks!
```
The state object must be passed to maintain rolling window continuity across chunks.
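Assuming the state's `salary_window` is a `collections.deque(maxlen=rolling_window)` (an assumption; the source only shows `.append()` calls), the chunked updates reproduce the batch rolling mean exactly. A minimal sketch demonstrating the equivalence:

```python
from collections import deque

import numpy as np
import pandas as pd

window = deque(maxlen=3)  # stand-in for state.salary_window
means = []

# Two chunks arriving sequentially, as in the streaming loop above
for chunk in [pd.Series([1e6, 2e6, 3e6]), pd.Series([4e6, 5e6])]:
    for salary in chunk.astype(float):
        window.append(float(salary))          # deque evicts the oldest value
        means.append(float(np.mean(window)))  # mean over the current window

# Matches a single batch rolling mean over the concatenated values
batch = pd.Series([1e6, 2e6, 3e6, 4e6, 5e6]).rolling(3, min_periods=1).mean()
assert means == batch.tolist()
```

Note one subtlety: the streaming code uses a population std (`ddof=0`), while pandas' batch `.rolling().std()` defaults to `ddof=1`, so the two std columns are only comparable if the same `ddof` is used on both sides.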
Removes highly correlated features to reduce redundancy.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/feature_engineering/features.py:75
Correlation threshold. Pairs with |correlation| > threshold are reduced. Range: [0, 1]
Returns: DataFrame with redundant features removed

Algorithm:

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/feature_engineering/features.py:79-93
```python
while len(variable_columns) > 1:
    # Compute correlation matrix
    corr_matrix = transformed[variable_columns].corr().fillna(0.0)
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    pairs = upper_triangle.stack()

    # Find highest correlation
    if pairs.empty or pairs.abs().max() <= threshold:
        break
    first, second = pairs.abs().idxmax()

    # Drop the feature less correlated with target (salary)
    target_corr = transformed[variable_columns].corrwith(
        transformed['salary']
    ).abs().fillna(0.0)
    drop_col = first if target_corr[first] < target_corr[second] else second
    transformed = transformed.drop(columns=drop_col)
```
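To see the loop in action, here is a self-contained version run on synthetic data; the function name and the toy columns (`pts`, `pts_copy`, `age`) are illustrative, not from features.py:

```python
import numpy as np
import pandas as pd

def drop_multicollinear(transformed: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Iteratively drop one feature from each over-threshold pair (sketch of the loop above)."""
    while True:
        variable_columns = [c for c in transformed.select_dtypes('number').columns
                            if c != 'salary']
        if len(variable_columns) < 2:
            break
        corr_matrix = transformed[variable_columns].corr().fillna(0.0)
        upper_triangle = corr_matrix.where(
            np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
        )
        pairs = upper_triangle.stack()
        if pairs.empty or pairs.abs().max() <= threshold:
            break
        first, second = pairs.abs().idxmax()
        target_corr = transformed[variable_columns].corrwith(
            transformed['salary']
        ).abs().fillna(0.0)
        # Keep the feature more correlated with the target
        drop_col = first if target_corr[first] < target_corr[second] else second
        transformed = transformed.drop(columns=drop_col)
    return transformed

rng = np.random.default_rng(0)
salary = rng.normal(size=200)
df = pd.DataFrame({
    'salary': salary,
    'pts': salary + rng.normal(scale=0.1, size=200),  # strongly target-correlated
    'age': rng.normal(size=200),                      # independent feature
})
df['pts_copy'] = df['pts'] + rng.normal(scale=0.01, size=200)  # near-duplicate of pts

pruned = drop_multicollinear(df, threshold=0.9)
# Exactly one of the near-duplicate pair is dropped; 'age' survives.
```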
Prepares final ML-ready dataset with encoding and normalization.

Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/feature_engineering/features.py:97
```python
y = df['salary'].copy()
x = df.drop(columns=[
    'salary',      # Target
    'version',     # Identifier
    'b_day',       # Raw temporal (replaced by birth_month)
    'draft_year',  # Raw temporal (replaced by draft_decade)
    'weight',      # Raw measurement (replaced by bmi)
    'height',      # Raw measurement (replaced by bmi)
], errors='ignore')
```
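The remaining columns still need encoding and normalization before training. A minimal sketch, assuming one-hot encoding for categoricals and z-score scaling; the `team` column and the specific encoder/scaler choices are illustrative, and the actual transforms in features.py may differ:

```python
import pandas as pd

df = pd.DataFrame({
    'salary': [1e6, 2e6, 3e6],
    'version': ['v1', 'v1', 'v2'],
    'team': ['LAL', 'BOS', 'LAL'],  # hypothetical categorical feature
    'bmi': [23.1, 25.4, 24.0],
})

y = df['salary'].copy()
x = df.drop(columns=['salary', 'version', 'b_day', 'draft_year',
                     'weight', 'height'], errors='ignore')

# One-hot encode the remaining categoricals, then z-score the numeric columns.
x = pd.get_dummies(x, columns=['team'])
num_cols = ['bmi']
x[num_cols] = (x[num_cols] - x[num_cols].mean()) / x[num_cols].std(ddof=0)
```

`errors='ignore'` lets the same drop list work even when a column (here `b_day`, `draft_year`, `weight`, `height`) is absent from a given DataFrame.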