All H2O supervised and unsupervised algorithms are implemented as estimator classes that extend H2OEstimator. Every estimator follows the same pattern: instantiate with hyperparameters, then call train() with a dataset.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1)
model.train(x=predictors, y=response, training_frame=train)

Available estimators

H2OGradientBoostingEstimator

Gradient Boosting Machine (GBM). Supports regression and classification.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

H2OXGBoostEstimator

XGBoost integration. Requires the XGBoost extension.
from h2o.estimators.xgboost import H2OXGBoostEstimator

H2ORandomForestEstimator

Distributed Random Forest (DRF).
from h2o.estimators.random_forest import H2ORandomForestEstimator

H2ODeepLearningEstimator

Fully-connected deep neural network.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

H2OGeneralizedLinearEstimator

Generalized Linear Model (GLM). Supports Gaussian, binomial, Poisson, gamma, and Tweedie distributions.
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

H2OGeneralizedAdditiveEstimator

Generalized Additive Model (GAM).
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator

H2OKMeansEstimator

K-Means clustering (unsupervised).
from h2o.estimators.kmeans import H2OKMeansEstimator

H2OPrincipalComponentAnalysisEstimator

Principal Component Analysis (PCA).
from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator

H2OStackedEnsembleEstimator

Stacked Ensemble. Trains a metalearner on the predictions of multiple base models.
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

H2ONaiveBayesEstimator

Naive Bayes classifier.
from h2o.estimators.naive_bayes import H2ONaiveBayesEstimator

Common methods

All estimators inherit these methods from H2OEstimator.

train

estimator.train(
    x=None, y=None,
    training_frame=None,
    offset_column=None,
    fold_column=None,
    weights_column=None,
    validation_frame=None,
    max_runtime_secs=None,
    ignored_columns=None,
    model_id=None,
    verbose=False
)
Train the model on the provided dataset.
x (string[] | integer[]): Column names or indices to use as predictors. When None, all columns except y are used.
y (string | integer, required): Column name or index of the response variable.
training_frame (H2OFrame, required): The frame containing training data.
offset_column (string): Column to use as an offset (added to the linear predictor before applying the link function).
fold_column (string): Column containing per-row cross-validation fold assignments.
weights_column (string): Column containing per-row observation weights. Rows with weight 0 are excluded from training.
validation_frame (H2OFrame): Optional frame to score against during training.
max_runtime_secs (float): Maximum training time in seconds. 0 disables the limit.
ignored_columns (string[]): Additional column names to exclude from training.
model_id (string): Custom model ID. Auto-generated if not specified.
verbose (boolean, default False): Print scoring history to stdout during training.
model.train(x=["age", "salary"], y="target", training_frame=train)

predict

predictions = estimator.predict(test_data)
Generate predictions for a new dataset. Returns an H2OFrame with prediction columns.
test_data (H2OFrame, required): The frame to score. Must contain the same predictor columns used during training.
pred = model.predict(test)
pred.head()
For classification models the returned frame includes a predict column (the predicted class) and probability columns for each class (p0, p1, etc.).

model_performance

perf = estimator.model_performance(test_data=None, train=False, valid=False, xval=False)
Return model metrics. When test_data is provided the metrics are computed on that set; otherwise metrics from training, validation, or cross-validation data are returned.
test_data (H2OFrame): Dataset to evaluate. Takes precedence over the train, valid, and xval flags.
train (boolean, default False): Return training metrics.
valid (boolean, default False): Return validation metrics.
xval (boolean, default False): Return cross-validation metrics.
perf = model.model_performance(test_data=test)
print(perf.auc())
print(perf.mse())

GBM hyperparameters

The following parameters are specific to H2OGradientBoostingEstimator and illustrate the depth of configuration available.
ntrees (integer, default 50): Number of trees to build.
max_depth (integer, default 5): Maximum depth of each tree. 0 for unlimited.
min_rows (float, default 10.0): Minimum number of (weighted) observations in a leaf node.
nbins (integer, default 20): Number of histogram bins for numeric columns.
nbins_cats (integer, default 1024): Number of histogram bins for categorical columns.
learn_rate (float, default 0.1): Shrinkage factor applied to each tree's contribution. Lower values require more trees.
learn_rate_annealing (float, default 1.0): Multiply learn_rate by this factor after each tree. Values less than 1.0 reduce the rate over time.
sample_rate (float, default 1.0): Row sampling rate per tree (stochastic GBM). Values between 0.0 and 1.0.
col_sample_rate (float, default 1.0): Column sampling rate per split level.
col_sample_rate_per_tree (float, default 1.0): Column sampling rate per tree.
min_split_improvement (float, default 0.00001): Minimum relative improvement in squared error needed to split.
stopping_rounds (integer, default 0): Number of scoring rounds with no improvement before stopping. 0 disables early stopping.
stopping_metric (string, default "auto"): Metric used for early stopping. One of: "auto", "deviance", "logloss", "mse", "rmse", "mae", "auc", "misclassification".
stopping_tolerance (float, default 0.001): Relative improvement threshold required to avoid early stopping.
nfolds (integer, default 0): Number of folds for k-fold cross-validation. 0 disables CV; minimum useful value is 2.
fold_assignment (string, default "auto"): Cross-validation fold assignment scheme. One of: "auto", "random", "modulo", "stratified".
keep_cross_validation_models (boolean, default True): Retain cross-validation sub-models after training.
keep_cross_validation_predictions (boolean, default False): Retain cross-validation holdout predictions.
distribution (string, default "auto"): Distribution family. One of: "auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber".
balance_classes (boolean, default False): Balance the class distribution by over/under-sampling for imbalanced classification problems.

Full example

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Load data
titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
titanic["survived"] = titanic["survived"].asfactor()

predictors = ["pclass", "sex", "age", "sibsp", "parch", "fare", "cabin"]
response = "survived"

train, test = titanic.split_frame(ratios=[0.8], seed=42)

# Train
model = H2OGradientBoostingEstimator(
    ntrees=100,
    max_depth=5,
    learn_rate=0.05,
    nfolds=5,
    seed=42,
)
model.train(x=predictors, y=response, training_frame=train)

# Evaluate
perf = model.model_performance(test)
print(perf.auc())

# Predict
pred = model.predict(test)
pred.head()
