Overview
The XGBoostCryptoPredictor class implements a gradient-boosting model optimized for cryptocurrency price prediction. It automatically creates 50+ engineered features, including returns, moving averages, volatility metrics, momentum indicators, and temporal features.
Best for: short- to medium-term predictions (1-168 hours)
Constructor
Parameters
n_estimators
Number of boosting trees to train. Higher values increase model complexity and training time.
Effect: more trees give a better fit, but raise the risk of overfitting.
Recommended range: 100-500

learning_rate
Step-size shrinkage that controls how much each tree contributes, helping prevent overfitting.
Effect: lower values require more trees but often generalize better.
Recommended range: 0.01-0.3

max_depth
Maximum depth of each tree. Deeper trees can model more complex patterns.
Effect: greater depth captures more complex interactions, but raises the risk of overfitting.
Recommended range: 3-10

subsample
Fraction of training samples used for each tree. Helps prevent overfitting.
Effect: lower values add randomness and reduce overfitting.
Recommended range: 0.6-1.0

colsample_bytree
Fraction of features used when building each tree.
Effect: lower values increase diversity between trees.
Recommended range: 0.5-1.0
Methods
create_features()
Creates 50+ engineered features from raw OHLCV data.

Input: DataFrame with columns open, high, low, close, volume (optional), and a datetime index.

Returns: DataFrame with the original columns plus:
- Returns: 1h, 4h, 24h, 7d percentage changes
- Moving Averages: MA(7,14,30,50) and ratios
- Exponential MA: EMA(12,26,50)
- Volatility: Rolling standard deviation (7,14,30 periods)
- Momentum: 7 and 14 period momentum
- Bollinger Bands: Upper, lower, middle, position
- Volume features: Ratios and moving averages (if volume provided)
- OHLC ratios: high/low, close/open
- Temporal: hour, day_of_week, day_of_month, month
- Technical indicators: RSI, MACD features (if present in input)
- Lags: Close price at t-1, t-2, t-3, t-7, t-14
- Target: Next period close price
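To illustrate how a few of the features listed above can be derived, here is a self-contained pandas sketch on synthetic data (this shows the general recipe, not the class's internal implementation):

```python
import numpy as np
import pandas as pd

# Toy hourly close-price series (synthetic data, for illustration only).
idx = pd.date_range("2024-01-01", periods=200, freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame({"close": 100 + rng.normal(0, 1, 200).cumsum()}, index=idx)

# A few of the engineered features described above:
df["return_1h"] = df["close"].pct_change(1)          # 1h percentage change
df["ma_7"] = df["close"].rolling(7).mean()           # 7-period moving average
df["ma_ratio_7"] = df["close"] / df["ma_7"]          # price-to-MA ratio
df["volatility_14"] = df["close"].rolling(14).std()  # rolling standard deviation
mid = df["close"].rolling(20).mean()
band = 2 * df["close"].rolling(20).std()
df["bb_position"] = (df["close"] - (mid - band)) / (2 * band)  # position within Bollinger Bands
df["close_lag_1"] = df["close"].shift(1)             # lagged close (t-1)
df["hour"] = df.index.hour                           # temporal feature
df["target"] = df["close"].shift(-1)                 # next-period close
```

Note that the target is the close shifted by -1, so the last row has no target and would be dropped before training.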
prepare_data()
Prepares data for training with feature engineering and a train/test split.

Input: DataFrame with OHLCV data and a datetime index.

Train fraction: fraction of data to use for training (0.0-1.0); the remainder is used for testing.
Note: uses a temporal split, not a random split, to preserve the time-series structure.

Returns a tuple of (X_train, X_test, y_train, y_test) with:
- Features scaled using MinMaxScaler
- NaN values removed
- A minimum of 100 samples required after feature creation
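The temporal split and scaling can be sketched as follows (a minimal NumPy stand-in; the class uses scikit-learn's MinMaxScaler, but the key idea is the same: fit the scaler on the training portion only, to avoid look-ahead leakage):

```python
import numpy as np

def temporal_split_and_scale(X, y, train_size=0.8):
    """Chronological (not random) split, with min-max scaling fit on
    the training portion only to avoid look-ahead leakage."""
    cut = int(len(X) * train_size)
    X_train, X_test = X[:cut], X[cut:]
    y_train, y_test = y[:cut], y[cut:]
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    X_train = (X_train - lo) / span
    X_test = (X_test - lo) / span           # reuse the train statistics
    return X_train, X_test, y_train, y_test

X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
X_tr, X_te, y_tr, y_te = temporal_split_and_scale(X, y)
```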
train()
Trains the XGBoost model and returns performance metrics.

Input: DataFrame with historical OHLCV data.

Train fraction: fraction of data for training.

Returns a dictionary containing, among its key metrics:
- test_mape: lower is better (good: <5%, acceptable: <10%)
- test_direction_accuracy: higher is better (>50% = better than random)
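These two metrics follow standard definitions, sketched below (the class's exact implementation may differ in details such as how the first step is handled):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (lower is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def direction_accuracy(y_true, y_pred):
    """Share of steps where the predicted move (up/down relative to the
    previous actual price) matches the actual move."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    actual_move = np.sign(np.diff(y_true))
    pred_move = np.sign(y_pred[1:] - y_true[:-1])
    return 100.0 * np.mean(actual_move == pred_move)

y_true = [100.0, 102.0, 101.0, 103.0]
y_pred = [100.0, 101.0, 103.0, 104.0]
```

A direction accuracy of 50% corresponds to random guessing, which is why the documentation treats anything above 50% as signal.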
predict_future()
Generates recursive multi-step forecasts.

Input: DataFrame with historical data used to generate the initial features.

Periods: number of time periods to forecast into the future.
Note: prediction accuracy decreases over longer horizons due to error accumulation.

Returns a DataFrame with:
- Index: timestamp (datetime)
- predicted_price: predicted closing price
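The recursive scheme, and why errors accumulate, can be sketched with a stand-in one-step model (a naive drift rule here, not the actual XGBoost model):

```python
def recursive_forecast(history, one_step_model, periods):
    """Roll a one-step-ahead model forward: each prediction is appended
    to the history and used as input for the next step, which is why
    errors compound over longer horizons."""
    window = list(history)
    preds = []
    for _ in range(periods):
        nxt = one_step_model(window)
        preds.append(nxt)
        window.append(nxt)  # feed the prediction back in as input
    return preds

# Stand-in model: naive drift from the last two observations.
drift_model = lambda w: w[-1] + (w[-1] - w[-2])

print(recursive_forecast([100.0, 101.0], drift_model, 3))  # -> [102.0, 103.0, 104.0]
```

Any bias in the one-step model is baked into every subsequent input, so a small per-step error grows with the horizon.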
get_feature_importance()
Returns feature-importance scores to help interpret the model's decisions.

Returns a DataFrame with columns:
- feature: feature name
- importance: importance score (higher = more important)
Utility Functions
backtest_model()
Performs comprehensive backtesting with a train/test split.

Parameters:
- Historical OHLCV data
- An initialized predictor instance
- The fraction of data used for training
create_prediction_intervals()
Adds confidence intervals to predictions.

Input: DataFrame with a predicted_price column.

Confidence level: 0.0-1.0; common values are 0.90, 0.95, and 0.99.

Returns the original DataFrame with added columns:
- lower_bound: lower confidence bound
- upper_bound: upper confidence bound
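One common way to construct such intervals is a normal approximation over held-out residuals, sketched below (the function name and approach are illustrative assumptions, not necessarily what this library does internally):

```python
import numpy as np
import pandas as pd

def add_prediction_intervals(forecast, residuals, confidence=0.95):
    """Add symmetric bounds using a normal approximation:
    half-width = z * std(residuals), where z is the two-sided
    quantile for the requested confidence level."""
    z_table = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}  # two-sided z-scores
    half = z_table[confidence] * np.std(residuals)
    out = forecast.copy()
    out["lower_bound"] = out["predicted_price"] - half
    out["upper_bound"] = out["predicted_price"] + half
    return out

fc = pd.DataFrame({"predicted_price": [100.0, 101.0]})
resid = [-1.0, 1.0, -1.0, 1.0]  # toy residuals with std = 1.0
out = add_prediction_intervals(fc, resid)
```

Note the normal approximation assumes roughly symmetric, homoscedastic errors; crypto returns often violate this, so treat the bounds as indicative rather than exact.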
Complete Example
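An end-to-end sketch using the methods documented above. The module path, CSV file name, and exact constructor keyword names are assumptions; adjust them to match your project.

```python
# Assumed import path -- adjust to wherever the class lives in your project.
from xgboost_crypto_predictor import XGBoostCryptoPredictor, create_prediction_intervals
import pandas as pd

# Load historical OHLCV data with a datetime index (500+ rows recommended).
# "btc_hourly.csv" is a placeholder file name.
df = pd.read_csv("btc_hourly.csv", index_col="timestamp", parse_dates=True)

# Configure the model (values chosen within the recommended ranges above).
predictor = XGBoostCryptoPredictor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
)

# Train and inspect performance.
metrics = predictor.train(df, 0.8)
print(f"MAPE: {metrics['test_mape']:.2f}%")
print(f"Direction accuracy: {metrics['test_direction_accuracy']:.1f}%")

# Forecast the next 24 hours and attach 95% confidence intervals.
forecast = predictor.predict_future(df, 24)
forecast = create_prediction_intervals(forecast, 0.95)
print(forecast[["predicted_price", "lower_bound", "upper_bound"]].head())

# Which features drive the model?
print(predictor.get_feature_importance().head(10))
```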
Key Characteristics
Strengths:
- Excellent short-term accuracy (1-72 hours)
- Captures complex non-linear patterns
- Automatic feature-importance ranking
- Robust to outliers
- Fast training and prediction

Limitations:
- Accuracy degrades over longer horizons
- Requires significant historical data (500+ points recommended)
- Recursive forecasting accumulates errors
- Less interpretable than linear models

Typical performance:
- MAPE: 2-5% for 24h predictions
- Direction accuracy: 55-65%
- Best for: hourly to 3-day forecasts