Overview

The XGBoostCryptoPredictor class implements a gradient boosting model optimized for cryptocurrency price prediction. It automatically creates 50+ engineered features, including returns, moving averages, volatility metrics, momentum indicators, and temporal features.

Best for: Short- to medium-term predictions (1-168 hours)

Constructor

from models.xgboost_model import XGBoostCryptoPredictor

predictor = XGBoostCryptoPredictor(
    n_estimators=200,
    learning_rate=0.07,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8
)

Parameters

n_estimators
int
default:"200"
Number of boosting trees to train. Higher values increase model complexity and training time.
Effect: More trees = better fit, but higher risk of overfitting
Recommended range: 100-500
learning_rate
float
default:"0.07"
Step size shrinkage to prevent overfitting. Controls how much each tree contributes.
Effect: Lower values require more trees but often produce better generalization
Recommended range: 0.01-0.3
max_depth
int
default:"6"
Maximum depth of each tree. Deeper trees can model more complex patterns.
Effect: Higher depth = more complex interactions, but higher risk of overfitting
Recommended range: 3-10
subsample
float
default:"0.8"
Fraction of training samples used for each tree. Helps prevent overfitting.
Effect: Lower values add randomness and reduce overfitting
Recommended range: 0.6-1.0
colsample_bytree
float
default:"0.8"
Fraction of features to use when building each tree.
Effect: Lower values increase diversity between trees
Recommended range: 0.5-1.0

Methods

create_features()

Creates 50+ engineered features from raw OHLCV data.
df_with_features = predictor.create_features(df)
df
pd.DataFrame
required
DataFrame with columns: open, high, low, close, volume (optional), and datetime index.
return
pd.DataFrame
DataFrame with original columns plus:
  • Returns: 1h, 4h, 24h, 7d percentage changes
  • Moving Averages: MA(7,14,30,50) and ratios
  • Exponential MA: EMA(12,26,50)
  • Volatility: Rolling standard deviation (7,14,30 periods)
  • Momentum: 7 and 14 period momentum
  • Bollinger Bands: Upper, lower, middle, position
  • Volume features: Ratios and moving averages (if volume provided)
  • OHLC ratios: high/low, close/open
  • Temporal: hour, day_of_week, day_of_month, month
  • Technical indicators: RSI, MACD features (if present in input)
  • Lags: Close price at t-1, t-2, t-3, t-7, t-14
  • Target: Next period close price
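Several of the feature families listed above can be sketched in a few lines of plain pandas. This is an illustration of the technique, not the library's actual implementation; the synthetic price series and the column names (return_1h, ma_7, etc.) are stand-ins.

```python
import numpy as np
import pandas as pd

# Synthetic hourly close series (stand-in for real OHLCV data)
idx = pd.date_range("2026-01-01", periods=200, freq="h")
rng = np.random.default_rng(0)
close = 42000 + np.cumsum(rng.normal(0, 50, len(idx)))
df = pd.DataFrame({"close": close}, index=idx)

# A few of the feature families from the list above, in plain pandas
df["return_1h"] = df["close"].pct_change(1)        # 1-period return
df["ma_7"] = df["close"].rolling(7).mean()         # 7-period moving average
df["volatility_7"] = df["close"].rolling(7).std()  # rolling std dev
df["close_lag_1"] = df["close"].shift(1)           # close at t-1
df["hour"] = df.index.hour                         # temporal feature
df["target"] = df["close"].shift(-1)               # next-period close

# Rolling windows and shifts introduce NaNs at the edges,
# which is why the library drops rows before training
print(df.dropna().shape)
```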

prepare_data()

Prepares data for training with feature engineering and train/test split.
X_train, X_test, y_train, y_test = predictor.prepare_data(
    df, 
    train_size=0.8
)
df
pd.DataFrame
required
DataFrame with OHLCV data and datetime index.
train_size
float
default:"0.8"
Fraction of data to use for training (0.0-1.0). Remaining data is used for testing.
Note: Uses a temporal split, not a random split, to preserve time series structure.
return
Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
Returns tuple of (X_train, X_test, y_train, y_test) with:
  • Features scaled using MinMaxScaler
  • NaN values removed
  • Minimum 100 samples required after feature creation
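The two preparation steps noted above can be illustrated with a minimal sketch: a temporal 80/20 split followed by min-max scaling fitted on the training slice only. The toy arrays are stand-ins for the engineered feature matrix, and the scaling is written out by hand to show the principle behind MinMaxScaler.

```python
import numpy as np

# Toy feature matrix / target (stand-ins for the engineered features)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)

# Temporal split: the first 80% of rows train, the last 20% test.
# A random split would leak future information into training.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Min-max scaling fitted on the training slice only,
# then applied to both slices
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = (X_train - lo) / (hi - lo)
X_test_scaled = (X_test - lo) / (hi - lo)

# Test values outside the training range scale beyond [0, 1],
# which is expected for trending series
print(X_train_scaled.max(), X_test_scaled.max())
```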

train()

Trains the XGBoost model and returns performance metrics.
metrics = predictor.train(df, train_size=0.8)
print(f"Test MAPE: {metrics['test_mape']:.2f}%")
print(f"Direction Accuracy: {metrics['test_direction_accuracy']:.2f}%")
df
pd.DataFrame
required
DataFrame with historical OHLCV data.
train_size
float
default:"0.8"
Fraction of data for training.
return
Dict
Dictionary containing:
{
    'train_mae': float,          # Mean Absolute Error on training set
    'test_mae': float,           # Mean Absolute Error on test set
    'train_rmse': float,         # Root Mean Squared Error on training set
    'test_rmse': float,          # Root Mean Squared Error on test set
    'train_mape': float,         # Mean Absolute Percentage Error (%) - train
    'test_mape': float,          # Mean Absolute Percentage Error (%) - test
    'train_direction_accuracy': float,  # Directional accuracy (%) - train
    'test_direction_accuracy': float    # Directional accuracy (%) - test
}
Key Metrics:
  • test_mape: Lower is better (good: <5%, acceptable: <10%)
  • test_direction_accuracy: Higher is better (>50% = better than random)
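The two key metrics can be computed from raw arrays as follows. These are the standard definitions; the library's internals may differ in edge-case handling (e.g. zero actual values in MAPE).

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def direction_accuracy(actual, predicted):
    """Share of steps where the predicted and actual moves
    have the same sign, in percent. 50% = coin flip."""
    actual_dir = np.sign(np.diff(np.asarray(actual, dtype=float)))
    predicted_dir = np.sign(np.diff(np.asarray(predicted, dtype=float)))
    return float(np.mean(actual_dir == predicted_dir) * 100)

actual    = [100.0, 102.0, 101.0, 103.0, 104.0]
predicted = [100.0, 101.0, 102.0, 103.5, 103.0]

print(f"MAPE: {mape(actual, predicted):.2f}%")
print(f"Direction accuracy: {direction_accuracy(actual, predicted):.2f}%")
```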

predict_future()

Generates recursive multi-step forecasts.
# Predict next 24 hours
predictions = predictor.predict_future(df, periods=24)

print(predictions)
#                      predicted_price
# timestamp                           
# 2026-03-08 01:00:00      42500.32
# 2026-03-08 02:00:00      42520.15
# ...
df
pd.DataFrame
required
DataFrame with historical data used to generate initial features.
periods
int
default:"24"
Number of time periods to forecast into the future.
Note: Prediction accuracy decreases with longer horizons due to error accumulation.
return
pd.DataFrame
DataFrame with columns:
  • Index: timestamp (datetime)
  • predicted_price: Predicted closing price
Note: Each prediction uses previous predictions as features (recursive forecasting).
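The recursive scheme described in the note can be sketched with a stand-in model: each prediction is appended to the window that feeds the next step, which is also why errors accumulate over the horizon. DummyModel and its fixed drift are hypothetical; the real predictor rebuilds its full feature set at every step.

```python
class DummyModel:
    """Stand-in for a trained regressor: predicts a small
    fixed drift over the most recent value."""
    def predict(self, window):
        return window[-1] * 1.001  # +0.1% per step

history = [42000.0, 42100.0, 42050.0]  # last observed closes
model = DummyModel()

predictions = []
window = list(history)
for _ in range(5):
    next_price = model.predict(window)
    predictions.append(next_price)
    window.append(next_price)  # the prediction feeds the next step's features

print([round(p, 2) for p in predictions])
```

Because step k consumes the (possibly wrong) output of step k-1, any bias in the model compounds, which matches the accuracy warning above.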

get_feature_importance()

Returns feature importance scores to understand model decisions.
importance = predictor.get_feature_importance()
print(importance.head(10))  # Top 10 most important features
return
pd.DataFrame
DataFrame with columns:
  • feature: Feature name
  • importance: Importance score (higher = more important)
Sorted by importance in descending order.

Utility Functions

backtest_model()

Performs comprehensive backtesting with train/test split.
from models.xgboost_model import backtest_model, XGBoostCryptoPredictor

predictor = XGBoostCryptoPredictor(
    n_estimators=300,
    learning_rate=0.05
)

results = backtest_model(df, predictor, train_size=0.8)

print("Metrics:", results['metrics'])
print("Top Features:", results['feature_importance'].head())
df
pd.DataFrame
required
Historical OHLCV data.
predictor
XGBoostCryptoPredictor
required
Initialized predictor instance.
train_size
float
default:"0.8"
Fraction for training.
return
Dict
{
    'metrics': Dict,                    # All training metrics
    'train_actual': np.ndarray,         # Actual training values
    'train_predicted': np.ndarray,      # Predicted training values
    'test_actual': np.ndarray,          # Actual test values
    'test_predicted': np.ndarray,       # Predicted test values
    'feature_importance': pd.DataFrame  # Feature importance ranking
}

create_prediction_intervals()

Adds confidence intervals to predictions.
from models.xgboost_model import create_prediction_intervals

predictions = predictor.predict_future(df, periods=48)
predictions_with_intervals = create_prediction_intervals(
    predictions, 
    confidence=0.95
)

print(predictions_with_intervals)
#                      predicted_price  lower_bound  upper_bound
# timestamp                           
# 2026-03-08 01:00:00      42500.32    41200.15    43800.49
predictions
pd.DataFrame
required
DataFrame with predicted_price column.
confidence
float
default:"0.95"
Confidence level (0.0-1.0). Common values: 0.90, 0.95, 0.99.
return
pd.DataFrame
Original DataFrame with added columns:
  • lower_bound: Lower confidence interval
  • upper_bound: Upper confidence interval
Method: Uses standard deviation of predictions with z-score for specified confidence level.
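One plausible reading of the method note, assuming a symmetric interval built from the standard deviation of the predictions and the normal z-score for the requested confidence level. The library's exact computation may differ; add_intervals here is an illustrative helper, not the library function.

```python
import pandas as pd
from statistics import NormalDist

def add_intervals(predictions: pd.DataFrame, confidence: float = 0.95) -> pd.DataFrame:
    """Add symmetric bounds derived from the spread of the
    predictions themselves (one reading of the method note)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for 95%
    spread = predictions["predicted_price"].std()
    out = predictions.copy()
    out["lower_bound"] = out["predicted_price"] - z * spread
    out["upper_bound"] = out["predicted_price"] + z * spread
    return out

preds = pd.DataFrame({"predicted_price": [42500.0, 42520.0, 42480.0, 42550.0]})
print(add_intervals(preds, confidence=0.95))
```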

Complete Example

import pandas as pd
from models.xgboost_model import XGBoostCryptoPredictor, create_prediction_intervals

# Load your data
df = pd.read_csv('btc_hourly.csv', index_col='timestamp', parse_dates=True)

# Initialize predictor with custom hyperparameters
predictor = XGBoostCryptoPredictor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=7,
    subsample=0.85,
    colsample_bytree=0.85
)

# Train the model
print("Training model...")
metrics = predictor.train(df, train_size=0.85)

print(f"Test MAPE: {metrics['test_mape']:.2f}%")
print(f"Test Direction Accuracy: {metrics['test_direction_accuracy']:.2f}%")
print(f"Test RMSE: ${metrics['test_rmse']:.2f}")

# Get feature importance
importance = predictor.get_feature_importance()
print("\nTop 5 Features:")
print(importance.head())

# Make predictions for next 48 hours
predictions = predictor.predict_future(df, periods=48)
predictions = create_prediction_intervals(predictions, confidence=0.95)

print("\nPredictions:")
print(predictions.head())

# Save predictions
predictions.to_csv('xgboost_predictions.csv')

Key Characteristics

Strengths:
  • Excellent short-term accuracy (1-72 hours)
  • Captures complex non-linear patterns
  • Automatic feature importance ranking
  • Robust to outliers
  • Fast training and prediction
Limitations:
  • Accuracy degrades with longer horizons
  • Requires significant historical data (500+ points recommended)
  • Recursive forecasting accumulates errors
  • Less interpretable than linear models
Typical Performance:
  • MAPE: 2-5% for 24h predictions
  • Direction Accuracy: 55-65%
  • Best for: Hourly to 3-day forecasts
