H2O-3 estimator functions follow a consistent interface. All supervised algorithms accept x (predictor column names or indices), y (response column), and training_frame. Unsupervised algorithms omit y.
All estimator functions return an H2O model object. Pass it to h2o.predict() for inference or h2o.performance() for evaluation metrics.
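The full train-predict-evaluate workflow is the same for every estimator. A minimal sketch, assuming a local CSV copy of the prostate dataset used in the examples below (the file path is illustrative):

```r
library(h2o)
h2o.init()

# Illustrative path; any copy of the prostate data works
prostate <- h2o.importFile("prostate.csv")
prostate$CAPSULE <- as.factor(prostate$CAPSULE)  # factor response => classification

splits <- h2o.splitFrame(prostate, ratios = c(0.7, 0.15), seed = 42)
train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

model <- h2o.gbm(y = "CAPSULE", training_frame = train, validation_frame = valid)
preds <- h2o.predict(model, newdata = test)
perf  <- h2o.performance(model, newdata = test)
h2o.auc(perf)
```

Omitting x uses every column except the response as a predictor.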

h2o.gbm()

Gradient Boosting Machine — builds an ensemble of shallow decision trees sequentially, where each new tree fits the residual errors of the ensemble built so far.
model <- h2o.gbm(
  x                = c("AGE", "RACE", "PSA", "VOL", "GLEASON"),
  y                = "CAPSULE",
  training_frame   = train,
  validation_frame = valid,
  ntrees           = 100,
  max_depth        = 5,
  learn_rate       = 0.05,
  sample_rate      = 0.8,
  col_sample_rate  = 0.8,
  nfolds           = 5,
  seed             = 42
)
x
character[]
Predictor column names or indices. If omitted, all columns except y are used.
y
string
required
Response column name or index. Numeric response trains regression; factor response trains classification.
training_frame
H2OFrame
required
Training dataset.
ntrees
number
default:"50"
Number of trees to build.
max_depth
number
default:"5"
Maximum tree depth. Use 0 for unlimited.
learn_rate
number
default:"0.1"
Learning rate (shrinkage). Range: 0.0 to 1.0. Lower values require more trees but often generalize better.
sample_rate
number
default:"1.0"
Row sample rate per tree. Range: 0.0 to 1.0.
col_sample_rate
number
default:"1.0"
Column sample rate per split. Range: 0.0 to 1.0.
nfolds
number
default:"0"
Number of cross-validation folds. 0 disables cross-validation.
distribution
string
default:"AUTO"
Loss distribution. Options: AUTO, bernoulli, multinomial, gaussian, poisson, gamma, tweedie, laplace, quantile, huber.
stopping_rounds
number
default:"0"
Early stopping: stop if the metric does not improve for this many scoring rounds.
min_rows
number
default:"10"
Minimum number of observations in a leaf node.
seed
number
default:"-1"
Random seed for reproducibility. -1 uses a time-based seed.
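Early stopping pairs naturally with a deliberately large ntrees: overbuild and let the validation metric decide the effective ensemble size. A sketch, assuming predictors, train, and valid as in the example above:

```r
model <- h2o.gbm(
  x = predictors, y = "CAPSULE",
  training_frame = train, validation_frame = valid,
  ntrees = 1000, learn_rate = 0.05,
  stopping_rounds    = 5,       # stop after 5 scoring rounds without improvement
  stopping_metric    = "AUC",
  stopping_tolerance = 1e-3,
  score_tree_interval = 10,     # score every 10 trees so stopping has data points
  seed = 42
)
```

Without score_tree_interval, scoring happens on a time-based schedule, which makes early stopping less predictable.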

h2o.xgboost()

XGBoost — uses the native XGBoost backend for gradient boosted trees. Generally faster than h2o.gbm() for single-node workloads.
model <- h2o.xgboost(
  x              = c("AGE", "RACE", "PSA", "VOL", "GLEASON"),
  y              = "CAPSULE",
  training_frame = train,
  ntrees         = 100,
  max_depth      = 6,
  learn_rate     = 0.1,
  sample_rate    = 0.8,
  seed           = 42
)
ntrees
number
default:"50"
Number of trees (also referred to as n_estimators).
max_depth
number
default:"6"
Maximum tree depth.
learn_rate
number
default:"0.1"
Step size shrinkage applied after each boosting step.
sample_rate
number
default:"1.0"
Subsample ratio of the training data for each tree.
col_sample_rate
number
default:"1.0"
Subsample ratio of columns for each tree.
min_rows
number
default:"1"
Minimum number of observations in a leaf (also referred to as min_child_weight).
distribution
string
default:"AUTO"
Loss distribution. Options: AUTO, bernoulli, multinomial, gaussian, poisson, gamma, tweedie, laplace, quantile, huber.
reg_lambda
number
default:"1"
L2 regularization term on leaf weights.
reg_alpha
number
default:"0"
L1 regularization term on leaf weights.

h2o.randomForest()

Distributed Random Forest (DRF) — builds an ensemble of deep, independently trained decision trees, each fit on a bootstrap sample of the rows.
model <- h2o.randomForest(
  x              = c("AGE", "RACE", "PSA", "VOL", "GLEASON"),
  y              = "CAPSULE",
  training_frame = train,
  ntrees         = 100,
  max_depth      = 20,
  mtries         = -1,
  sample_rate    = 0.632,
  nfolds         = 5,
  seed           = 42
)
ntrees
number
default:"50"
Number of trees.
max_depth
number
default:"20"
Maximum tree depth. Use 0 for unlimited.
mtries
number
default:"-1"
Number of columns randomly sampled at each split. -1 defaults to sqrt(p) for classification and p/3 for regression, where p is the number of predictors.
sample_rate
number
default:"0.632"
Row sample rate per tree. The default 0.632 matches the classic bootstrap fraction.
binomial_double_trees
boolean
default:"FALSE"
Build twice as many trees for binary classification (one per class). Can improve accuracy at the cost of training time.
min_rows
number
default:"1"
Minimum observations in a leaf node.
nfolds
number
default:"0"
Number of cross-validation folds.
seed
number
default:"-1"
Random seed.

h2o.deeplearning()

Deep Learning (Neural Network) — feed-forward multilayer neural network with adaptive learning rate (ADADELTA by default).
model <- h2o.deeplearning(
  x              = c("AGE", "RACE", "PSA", "VOL", "GLEASON"),
  y              = "CAPSULE",
  training_frame = train,
  hidden         = c(200, 200),
  epochs         = 50,
  activation     = "RectifierWithDropout",
  hidden_dropout_ratios = c(0.2, 0.2),
  l2             = 1e-5,
  seed           = 42
)
hidden
number[]
default:"c(200, 200)"
Hidden layer sizes. Each element specifies the number of neurons in that layer. Example: c(128, 64, 32) builds a 3-hidden-layer network.
epochs
number
default:"10"
Number of passes over the training data. Can be fractional.
activation
string
default:"Rectifier"
Activation function. Options: Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout, MaxoutWithDropout.
hidden_dropout_ratios
number[]
Dropout rates per hidden layer. Must have the same length as hidden. Example: c(0.2, 0.2).
input_dropout_ratio
number
default:"0"
Dropout ratio for the input layer.
l1
number
default:"0"
L1 regularization. Induces sparsity.
l2
number
default:"0"
L2 regularization. Reduces weight magnitude.
adaptive_rate
boolean
default:"TRUE"
Use ADADELTA adaptive learning rate. Set to FALSE to use a fixed learning rate.
rate
number
default:"0.005"
Learning rate when adaptive_rate = FALSE.
standardize
boolean
default:"TRUE"
Standardize numeric inputs to zero mean and unit variance.
overwrite_with_best_model
boolean
default:"TRUE"
Replace the final model with the best-scoring checkpoint found during training.

h2o.glm()

Generalized Linear Model — fits regularized linear models (Lasso, Ridge, Elastic Net) for regression and classification.
model <- h2o.glm(
  x              = c("AGE", "RACE", "PSA", "VOL", "GLEASON"),
  y              = "CAPSULE",
  training_frame = train,
  family         = "binomial",
  alpha          = 0.5,
  lambda_search  = TRUE,
  nfolds         = 5,
  seed           = 42
)

# View model coefficients
h2o.coef(model)
family
string
default:"AUTO"
Response distribution family. Options: AUTO, gaussian, binomial, multinomial, poisson, gamma, tweedie, negativebinomial, ordinal, quasibinomial, fractionalbinomial.
alpha
number
Elastic net mixing: 0 = Ridge (L2 only), 1 = Lasso (L1 only). Default is 0 for L-BFGS solver, 0.5 otherwise.
lambda
number
Regularization strength. Larger values produce more regularization.
lambda_search
boolean
default:"FALSE"
Search a decreasing sequence of lambda values starting from lambda_max. Recommended for finding a good regularization strength.
standardize
boolean
default:"TRUE"
Standardize numeric predictors to zero mean and unit variance before fitting.
solver
string
default:"AUTO"
Optimization algorithm. Options: AUTO, IRLSM, L_BFGS, COORDINATE_DESCENT, COORDINATE_DESCENT_NAIVE.
compute_p_values
boolean
default:"FALSE"
Compute p-values for coefficients. Only works with the IRLSM solver.
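Because compute_p_values only works without regularization, lambda must be set to 0 explicitly. A sketch, assuming predictors and train as above:

```r
model <- h2o.glm(
  x = predictors, y = "CAPSULE", training_frame = train,
  family = "binomial",
  solver = "IRLSM",                 # p-values require the IRLSM solver
  lambda = 0,                       # and no regularization
  compute_p_values = TRUE,
  remove_collinear_columns = TRUE   # advisable when lambda = 0
)
summary(model)  # the printed coefficients table now includes p-values
```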

h2o.gam()

Generalized Additive Model — extends GLM with smooth spline terms for non-linear effects.
model <- h2o.gam(
  x            = c("AGE", "RACE"),
  y            = "CAPSULE",
  training_frame = train,
  gam_columns  = list(c("PSA"), c("VOL"), c("GLEASON")),
  family       = "binomial",
  seed         = 42
)
gam_columns
list
required
A list of column name vectors specifying which columns to apply GAM smoothers to. Each element can be a single column c("col1") or multiple columns for interaction splines c("col1", "col2").
family
string
default:"AUTO"
Response distribution family. Same options as h2o.glm().
bs
number[]
Spline basis type for each GAM column. 0 = cubic regression spline, 1 = cyclic cubic regression spline.
num_knots
number[]
Number of knots for each GAM column smoother.
alpha
number
Elastic net mixing parameter (same as GLM).
lambda_search
boolean
default:"FALSE"
Perform a lambda search (same as GLM).
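bs and num_knots line up positionally with gam_columns: the i-th entry configures the i-th smoother. A sketch, reusing the frames from the earlier examples:

```r
model <- h2o.gam(
  x = c("AGE", "RACE"), y = "CAPSULE", training_frame = train,
  gam_columns = list(c("PSA"), c("VOL"), c("GLEASON")),
  bs          = c(0, 0, 0),   # cubic regression splines for all three smoothers
  num_knots   = c(5, 5, 5),   # one knot count per gam_columns entry
  family      = "binomial",
  seed        = 42
)
```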

h2o.automl()

AutoML — automatically trains and tunes multiple models, then ranks them on a leaderboard.
aml <- h2o.automl(
  x                  = c("AGE", "RACE", "PSA", "VOL", "GLEASON"),
  y                  = "CAPSULE",
  training_frame     = train,
  leaderboard_frame  = test,
  max_models         = 20,
  max_runtime_secs   = 3600,
  seed               = 42
)

# Leaderboard
lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
print(lb, n = 20)

# Best model
best_model <- aml@leader
preds <- h2o.predict(best_model, newdata = test)
max_models
number
Maximum number of individual models to train (excluding Stacked Ensembles). Setting max_models together with seed, rather than relying on a time budget, makes runs reproducible.
max_runtime_secs
number
default:"3600"
Maximum wall-clock time for the entire AutoML run in seconds.
max_runtime_secs_per_model
number
default:"0"
Maximum time per individual model. 0 disables the per-model limit.
leaderboard_frame
H2OFrame
Separate holdout frame for leaderboard scoring. If not provided, cross-validation metrics are used.
nfolds
number
default:"5"
Cross-validation folds for individual models. Set to 0 to disable (and use validation_frame instead).
exclude_algos
character[]
Algorithms to skip. Options: "DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble".
include_algos
character[]
Restrict to only these algorithms. Cannot be used with exclude_algos.
sort_metric
string
default:"AUTO"
Metric used to rank the leaderboard. Defaults to AUC for binary classification, mean_per_class_error for multinomial, and mean_residual_deviance for regression.
project_name
string
Name for this AutoML run. Models from multiple runs with the same project name are combined into one leaderboard.
seed
number
default:"-1"
Random seed. Set max_models (not max_runtime_secs) for fully reproducible runs.
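A constrained run is often more useful than the defaults. A sketch of a tree-only, reproducible AutoML run, assuming train as above:

```r
aml <- h2o.automl(
  y = "CAPSULE", training_frame = train,
  exclude_algos = c("DeepLearning", "GLM", "StackedEnsemble"),
  max_models    = 10,   # fixed model count (with seed) => reproducible
  seed          = 42
)
aml@leaderboard
```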

h2o.kmeans()

K-Means clustering — partitions data into k clusters by minimizing within-cluster sum of squares.
model <- h2o.kmeans(
  training_frame = prostate,
  x              = c("AGE", "RACE", "VOL", "GLEASON"),
  k              = 5,
  max_iterations = 100,
  standardize    = TRUE,
  seed           = 42
)

# Cluster assignments
assignments <- h2o.predict(model, newdata = prostate)

# Cluster centers
h2o.centers(model)
h2o.withinss(model)
k
number
default:"1"
Number of clusters. When estimate_k = TRUE, this is treated as the maximum.
max_iterations
number
default:"10"
Maximum number of Lloyd’s iterations.
standardize
boolean
default:"TRUE"
Standardize columns before computing distances.
init
string
default:"Furthest"
Initialization strategy. Options: Random, PlusPlus, Furthest, User.
estimate_k
boolean
default:"FALSE"
Automatically estimate the number of clusters up to k.
user_points
H2OFrame
A frame with one row per cluster specifying initial centroid positions. Requires init = "User".
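With estimate_k, the k argument becomes an upper bound rather than a fixed cluster count. A sketch, assuming the prostate frame from the example above:

```r
model <- h2o.kmeans(
  training_frame = prostate,
  x = c("AGE", "RACE", "VOL", "GLEASON"),
  k = 10, estimate_k = TRUE,   # let H2O choose the cluster count, up to 10
  standardize = TRUE,
  seed = 42
)
h2o.centers(model)  # one row per estimated cluster
```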

h2o.prcomp()

Principal Component Analysis — reduces dimensionality by projecting data onto principal components.
model <- h2o.prcomp(
  training_frame = australia,
  k              = 4,
  transform      = "STANDARDIZE",
  pca_method     = "GramSVD"
)

# Explained variance
summary(model)

# Project data onto components
projected <- h2o.predict(model, newdata = australia)
k
number
default:"1"
Number of principal components to compute.
transform
string
default:"NONE"
Pre-processing transformation. Options: NONE, STANDARDIZE, NORMALIZE, DEMEAN, DESCALE.
pca_method
string
default:"GramSVD"
Algorithm for PCA computation. Options: GramSVD, Power, Randomized, GLRM.
use_all_factor_levels
boolean
default:"FALSE"
Include all levels of categorical columns (no reference level dropped).
impute_missing
boolean
default:"FALSE"
Impute missing values with column mean before PCA.

h2o.stackedEnsemble()

Stacked Ensemble (Super Learner) — combines predictions from multiple base models using a metalearner.
# Train base models with cross-validation
gbm_base <- h2o.gbm(
  x = predictors, y = response, training_frame = train,
  nfolds = 5, keep_cross_validation_predictions = TRUE, seed = 1
)
rf_base <- h2o.randomForest(
  x = predictors, y = response, training_frame = train,
  nfolds = 5, keep_cross_validation_predictions = TRUE, seed = 1
)

# Build stacked ensemble
ensemble <- h2o.stackedEnsemble(
  x              = predictors,
  y              = response,
  training_frame = train,
  base_models    = list(gbm_base, rf_base),
  metalearner_algorithm = "glm"
)

h2o.auc(h2o.performance(ensemble, newdata = test))
base_models
list
required
List of trained H2O model objects or model IDs. Each base model must have been trained with nfolds >= 2 and keep_cross_validation_predictions = TRUE.
metalearner_algorithm
string
default:"AUTO"
Algorithm for the metalearner. Options: AUTO, glm, gbm, drf, deeplearning, naivebayes, xgboost.
metalearner_nfolds
number
default:"0"
Cross-validation folds for the metalearner.
blending_frame
H2OFrame
Optional holdout frame used to train the metalearner instead of cross-validated predictions.
keep_levelone_frame
boolean
default:"FALSE"
Retain the level-one frame (metalearner training data) in the cluster.
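blending_frame switches the ensemble into blending (holdout stacking) mode: the metalearner trains on base-model predictions over a holdout, so the base models no longer need nfolds or keep_cross_validation_predictions. A sketch, assuming a blend frame split off from the training data:

```r
gbm_base <- h2o.gbm(x = predictors, y = response, training_frame = train, seed = 1)
rf_base  <- h2o.randomForest(x = predictors, y = response, training_frame = train, seed = 1)

ensemble <- h2o.stackedEnsemble(
  x              = predictors,
  y              = response,
  training_frame = train,
  base_models    = list(gbm_base, rf_base),
  blending_frame = blend   # metalearner trains on this holdout
)
```

Blending is faster than cross-validated stacking but spends data on the holdout.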

Common parameters across estimators

These parameters are available on most supervised estimators.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| validation_frame | H2OFrame | — | Frame for computing validation metrics during training |
| nfolds | integer | 0 | K-fold cross-validation folds (0 disables) |
| fold_assignment | string | AUTO | Fold assignment scheme: AUTO, Random, Modulo, Stratified |
| weights_column | string | — | Column of per-row observation weights |
| offset_column | string | — | Column of per-row offsets added to the linear predictor |
| balance_classes | logical | FALSE | Over/under-sample to balance the class distribution |
| stopping_rounds | integer | 0 | Early stopping patience in scoring rounds; 0 disables |
| stopping_metric | string | AUTO | Metric for early stopping: AUC, logloss, RMSE, MSE, etc. |
| stopping_tolerance | number | 0.001 | Minimum relative improvement to continue training |
| max_runtime_secs | number | 0 | Hard time limit for training in seconds; 0 disables |
| seed | integer | -1 | Random seed; -1 uses a time-based seed |
| model_id | string | — | Custom key under which the model is stored in the DKV |
| export_checkpoints_dir | string | — | Directory to save model checkpoints during training |
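These knobs compose freely on any supervised estimator. A sketch combining several of them, assuming predictors, train, and valid as above and a hypothetical weights column "WT":

```r
model <- h2o.gbm(
  x = predictors, y = "CAPSULE",
  training_frame = train, validation_frame = valid,
  weights_column   = "WT",          # hypothetical per-row weight column
  balance_classes  = TRUE,
  nfolds           = 5,
  fold_assignment  = "Stratified",  # classification only
  max_runtime_secs = 600,
  model_id         = "gbm_capsule_v1",
  seed             = 42
)
```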
