All model classes derive from H2OEstimator. Every estimator follows the same pattern: instantiate it with hyperparameters, then call train() with a dataset.
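The "configure at construction, fit with train()" convention can be sketched in pure Python. ToyEstimator below is a hypothetical stand-in, not h2o code; it only shows the shape of the interface shared by all estimators.

```python
class ToyEstimator:
    """Hypothetical stand-in illustrating the estimator calling convention."""

    def __init__(self, ntrees=50, max_depth=5):
        # Hyperparameters are fixed at construction time.
        self.ntrees = ntrees
        self.max_depth = max_depth
        self.trained = False

    def train(self, x, y, training_frame):
        # A real estimator would launch a distributed training job here.
        self.trained = True
        return self


model = ToyEstimator(ntrees=100, max_depth=4)
model.train(x=["sepal_len", "sepal_wid"], y="species", training_frame=None)
print(model.trained)  # True
```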
Available estimators
H2OGradientBoostingEstimator
Gradient Boosting Machine (GBM). Supports regression and classification.
H2OXGBoostEstimator
XGBoost integration. Requires the XGBoost extension.
H2ORandomForestEstimator
Distributed Random Forest (DRF).
H2ODeepLearningEstimator
Fully-connected deep neural network.
H2OGeneralizedLinearEstimator
Generalized Linear Model (GLM). Supports Gaussian, binomial, Poisson, gamma, and Tweedie distributions.
H2OGeneralizedAdditiveEstimator
Generalized Additive Model (GAM).
H2OKMeansEstimator
K-Means clustering (unsupervised).
H2OPrincipalComponentAnalysisEstimator
Principal Component Analysis (PCA).
H2OStackedEnsembleEstimator
Stacked Ensemble. Combines multiple base models into a metalearner.
H2ONaiveBayesEstimator
Naive Bayes classifier.
Common methods
All estimators inherit these methods from H2OEstimator.
train
x: Column names or indices to use as predictors. When None, all columns except y are used.
y: Column name or index of the response variable.
training_frame: The frame containing the training data.
offset_column: Column to use as an offset (added to the linear predictor before applying the link function).
fold_column: Column containing per-row cross-validation fold assignments.
weights_column: Column containing per-row observation weights. Rows with weight 0 are excluded from training.
validation_frame: Optional frame to score against during training.
max_runtime_secs: Maximum training time in seconds. 0 disables the limit.
ignored_columns: Additional column names to exclude from training.
model_id: Custom model ID. Auto-generated if not specified.
verbose: Print scoring history to stdout during training.
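The interaction between x, y, and ignored_columns can be illustrated with a small helper. This is a hypothetical re-implementation of the documented default, not h2o internals:

```python
def resolve_predictors(columns, y, x=None, ignored_columns=None):
    """Mimic the documented default: when x is None, use every column
    except the response and any explicitly ignored columns."""
    ignored = set(ignored_columns or [])
    if x is None:
        return [c for c in columns if c != y and c not in ignored]
    return list(x)


cols = ["a", "b", "c", "label"]
print(resolve_predictors(cols, y="label", ignored_columns=["c"]))  # ['a', 'b']
```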
predict
Returns an H2OFrame with prediction columns.
test_data: The frame to score. Must contain the same predictor columns used during training.
For classification models, the output includes a predict column (the predicted class) and probability columns for each class (p0, p1, etc.).
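The relationship between the probability columns and the predict column can be sketched as an argmax over per-class probabilities. This is a simplification: for binomial models H2O selects the class with a tuned decision threshold rather than plain argmax, and the row layout below is illustrative, not h2o's internal representation.

```python
def to_prediction_row(probs, labels):
    # probs: one probability per class, aligned with labels.
    # Returns the contents of the predict, p0, p1, ... columns for one row.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"predict": labels[best],
            **{f"p{i}": p for i, p in enumerate(probs)}}


row = to_prediction_row([0.1, 0.7, 0.2], ["setosa", "versicolor", "virginica"])
print(row["predict"])  # versicolor
```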
model_performance
If test_data is provided, the metrics are computed on that set; otherwise metrics from the training, validation, or cross-validation data are returned.
test_data: Dataset to evaluate. Takes precedence over the train, valid, and xval flags.
train: Return training metrics.
valid: Return validation metrics.
xval: Return cross-validation metrics.
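The documented precedence can be made concrete with a hypothetical selector function (a sketch of the rule, not h2o's implementation; the fallback to training metrics when no flag is set is an assumption):

```python
def pick_metric_source(test_data=None, train=False, valid=False, xval=False):
    """Sketch of the documented precedence: an explicit test_data frame
    wins over the train/valid/xval flags."""
    if test_data is not None:
        return "test_data"
    if train:
        return "train"
    if valid:
        return "valid"
    if xval:
        return "xval"
    return "train"  # assumption: default to training metrics


print(pick_metric_source(test_data=object(), valid=True))  # test_data
print(pick_metric_source(xval=True))  # xval
```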
GBM hyperparameters
The following parameters are specific to H2OGradientBoostingEstimator and illustrate the depth of configuration available.
Tree structure
ntrees: Number of trees to build.
max_depth: Maximum depth of each tree. 0 for unlimited.
min_rows: Minimum number of (weighted) observations in a leaf node.
nbins: Number of histogram bins for numeric columns.
nbins_cats: Number of histogram bins for categorical columns.
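Why nbins matters can be shown with a toy split-candidate generator. This uses simple equal-width binning, which is a deliberate simplification of H2O's histogram-based split search, not its actual algorithm:

```python
def candidate_thresholds(values, nbins):
    # Equal-width binning: the split candidates for a numeric column are
    # the nbins - 1 interior bin edges, so larger nbins means a finer
    # (and more expensive) search over possible split points.
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    return [lo + width * i for i in range(1, nbins)]


print(candidate_thresholds([0.0, 10.0], nbins=4))  # [2.5, 5.0, 7.5]
```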
Learning rate and regularization
learn_rate: Shrinkage factor applied to each tree's contribution. Lower values require more trees.
learn_rate_annealing: Reduce learn_rate by this factor after each tree. Values less than 1.0 slow the rate over time.
sample_rate: Row sampling rate per tree (stochastic GBM). Values between 0.0 and 1.0.
col_sample_rate: Column sampling rate per split level.
col_sample_rate_per_tree: Column sampling rate per tree.
min_split_improvement: Minimum relative improvement in squared error needed to split.
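The effect of learn_rate_annealing is multiplicative per tree, so the rate applied to tree k is learn_rate * annealing**k. A quick illustration of that documented behaviour (not h2o internals):

```python
def effective_learn_rate(learn_rate, annealing, tree_index):
    # Each tree's contribution is shrunk by learn_rate, which itself
    # decays geometrically when learn_rate_annealing < 1.0.
    return learn_rate * annealing ** tree_index


rates = [effective_learn_rate(0.1, 0.5, k) for k in range(4)]
print(rates)  # [0.1, 0.05, 0.025, 0.0125]
```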
Early stopping
stopping_rounds: Number of scoring rounds with no improvement before stopping. 0 disables early stopping.
stopping_metric: Metric used for early stopping. One of: "auto", "deviance", "logloss", "mse", "rmse", "mae", "auc", "misclassification".
stopping_tolerance: Relative improvement threshold required to avoid early stopping.
Cross-validation
nfolds: Number of folds for k-fold cross-validation. 0 disables CV; the minimum useful value is 2.
fold_assignment: Cross-validation fold assignment scheme. One of: "auto", "random", "modulo", "stratified".
keep_cross_validation_models: Retain cross-validation sub-models after training.
keep_cross_validation_predictions: Retain cross-validation holdout predictions.
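The "modulo" assignment scheme is simple enough to sketch directly: row i goes to fold i % nfolds, giving a deterministic round-robin split (an illustration of the documented option, not h2o's code):

```python
def assign_folds_modulo(n_rows, nfolds):
    # Deterministic round-robin: reproducible across runs, unlike the
    # "random" scheme, and independent of the response distribution,
    # unlike "stratified".
    return [i % nfolds for i in range(n_rows)]


print(assign_folds_modulo(7, 3))  # [0, 1, 2, 0, 1, 2, 0]
```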
Distribution and response