All H2O supervised and unsupervised algorithms are implemented as estimator classes that extend H2OEstimator. Every estimator follows the same pattern: instantiate with hyperparameters, then call train() with a dataset.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1)
model.train(x=predictors, y=response, training_frame=train)

Available estimators

H2OGradientBoostingEstimator

Gradient Boosting Machine (GBM). Supports regression and classification.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

H2OXGBoostEstimator

XGBoost integration. Requires the XGBoost extension.
from h2o.estimators.xgboost import H2OXGBoostEstimator

H2ORandomForestEstimator

Distributed Random Forest (DRF).
from h2o.estimators.random_forest import H2ORandomForestEstimator

H2ODeepLearningEstimator

Fully-connected deep neural network.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

H2OGeneralizedLinearEstimator

Generalized Linear Model (GLM). Supports Gaussian, binomial, Poisson, gamma, and Tweedie distributions.
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

H2OGeneralizedAdditiveEstimator

Generalized Additive Model (GAM).
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator

H2OKMeansEstimator

K-Means clustering (unsupervised).
from h2o.estimators.kmeans import H2OKMeansEstimator

H2OPrincipalComponentAnalysisEstimator

Principal Component Analysis (PCA).
from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator

H2OStackedEnsembleEstimator

Stacked Ensemble. Trains a metalearner on the predictions of multiple base models.
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

H2ONaiveBayesEstimator

Naive Bayes classifier.
from h2o.estimators.naive_bayes import H2ONaiveBayesEstimator

Common methods

All estimators inherit these methods from H2OEstimator.

train

estimator.train(
    x=None, y=None,
    training_frame=None,
    offset_column=None,
    fold_column=None,
    weights_column=None,
    validation_frame=None,
    max_runtime_secs=None,
    ignored_columns=None,
    model_id=None,
    verbose=False
)
Train the model on the provided dataset.
x (string[] | integer[]): Column names or indices to use as predictors. When None, all columns except y are used.
y (string | integer, required): Column name or index of the response variable.
training_frame (H2OFrame, required): The frame containing training data.
offset_column (string): Column to use as an offset (added to the linear predictor before applying the link function).
fold_column (string): Column containing per-row cross-validation fold assignments.
weights_column (string): Column containing per-row observation weights. Rows with weight 0 are excluded from training.
validation_frame (H2OFrame): Optional frame to score against during training.
max_runtime_secs (float): Maximum training time in seconds. 0 disables the limit.
ignored_columns (string[]): Additional column names to exclude from training.
model_id (string): Custom model ID. Auto-generated if not specified.
verbose (boolean, default False): Print scoring history to stdout during training.
model.train(x=["age", "salary"], y="target", training_frame=train)

predict

predictions = estimator.predict(test_data)
Generate predictions for a new dataset. Returns an H2OFrame with prediction columns.
test_data (H2OFrame, required): The frame to score. Must contain the same predictor columns used during training.
pred = model.predict(test)
pred.head()
For classification models the returned frame includes a predict column (the predicted class) and probability columns for each class (p0, p1, etc.).

model_performance

perf = estimator.model_performance(test_data=None, train=False, valid=False, xval=False)
Return model metrics. When test_data is provided the metrics are computed on that set; otherwise metrics from training, validation, or cross-validation data are returned.
test_data (H2OFrame): Dataset to evaluate. Takes precedence over the train, valid, and xval flags.
train (boolean, default False): Return training metrics.
valid (boolean, default False): Return validation metrics.
xval (boolean, default False): Return cross-validation metrics.
perf = model.model_performance(test_data=test)
print(perf.auc())
print(perf.mse())

GBM hyperparameters

The following parameters are specific to H2OGradientBoostingEstimator and illustrate the depth of configuration available.
ntrees (integer, default 50): Number of trees to build.
max_depth (integer, default 5): Maximum depth of each tree. 0 for unlimited.
min_rows (float, default 10.0): Minimum number of (weighted) observations in a leaf node.
nbins (integer, default 20): Number of histogram bins for numeric columns.
nbins_cats (integer, default 1024): Number of histogram bins for categorical columns.
learn_rate (float, default 0.1): Shrinkage factor applied to each tree's contribution. Lower values require more trees.
learn_rate_annealing (float, default 1.0): Multiply learn_rate by this factor after each tree. Values less than 1.0 reduce the rate over time.
sample_rate (float, default 1.0): Row sampling rate per tree (stochastic GBM). Values between 0.0 and 1.0.
col_sample_rate (float, default 1.0): Column sampling rate per split level.
col_sample_rate_per_tree (float, default 1.0): Column sampling rate per tree.
min_split_improvement (float, default 0.00001): Minimum relative improvement in squared error needed to split.
stopping_rounds (integer, default 0): Number of scoring rounds with no improvement before stopping. 0 disables early stopping.
stopping_metric (string, default "auto"): Metric used for early stopping. One of: "auto", "deviance", "logloss", "mse", "rmse", "mae", "auc", "misclassification".
stopping_tolerance (float, default 0.001): Relative improvement threshold required to avoid early stopping.
nfolds (integer, default 0): Number of folds for k-fold cross-validation. 0 disables CV; minimum useful value is 2.
fold_assignment (string, default "auto"): Cross-validation fold assignment scheme. One of: "auto", "random", "modulo", "stratified".
keep_cross_validation_models (boolean, default True): Retain cross-validation sub-models after training.
keep_cross_validation_predictions (boolean, default False): Retain cross-validation holdout predictions.
distribution (string, default "auto"): Distribution family. One of: "auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber".
balance_classes (boolean, default False): Balance the class distribution by over/under-sampling for imbalanced classification problems.

Full example

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Load data
titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
titanic["survived"] = titanic["survived"].asfactor()

predictors = ["pclass", "sex", "age", "sibsp", "parch", "fare", "cabin"]
response = "survived"

train, test = titanic.split_frame(ratios=[0.8], seed=42)

# Train
model = H2OGradientBoostingEstimator(
    ntrees=100,
    max_depth=5,
    learn_rate=0.05,
    nfolds=5,
    seed=42,
)
model.train(x=predictors, y=response, training_frame=train)

# Evaluate
perf = model.model_performance(test)
print(perf.auc())

# Predict
pred = model.predict(test)
pred.head()
