H2O-3 supports training supervised models (classification and regression) and unsupervised models (clustering, anomaly detection, dimensionality reduction). All algorithms share a common training interface through the train() method in Python and named function calls in R.

The train() method

The train() method in Python and the algorithm-specific functions in R, such as h2o.glm() and h2o.gbm(), accept a consistent set of parameters across all supervised algorithms.

Key parameters

| Parameter | Description |
| --- | --- |
| x | A list of column names or indices identifying the predictor columns. |
| y | The name or index of the response column. |
| training_frame | The H2OFrame to train on. |
| validation_frame | An optional H2OFrame used for validation scoring during training. |
| weights_column | Name of the column containing per-row observation weights. |
| fold_column | Name of the column containing fold assignments for cross-validation. |
| nfolds | Number of cross-validation folds. Default is 0 (disabled). |
| seed | Seed for random number generation, to ensure reproducibility. |
| score_each_iteration | Whether to score on a validation set at each iteration of training. |
| max_runtime_secs | Maximum allowed runtime in seconds. Use 0 to disable the limit. |

Splitting datasets

Use split_frame() (Python) or h2o.splitFrame() (R) to divide an H2OFrame into train and test sets before training.
import h2o
h2o.init()

df = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")

# Split into approximately 80% train, 20% test (splits are random, so
# the resulting row counts are approximate)
train, test = df.split_frame(ratios=[0.8], seed=1234)
For a three-way split (train/validation/test), pass two ratios: ratios=[0.6, 0.2] in Python or ratios=c(0.6, 0.2) in R.

Classification example

H2O determines whether to perform classification or regression based on the type of the response column. For classification, the column must be a factor (categorical) type. For regression, it must be numeric.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

# Import the prostate dataset
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")

# Convert response to factor for classification
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()

# Define predictors and response
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response = "CAPSULE"

# Split into train and test sets
train, test = prostate.split_frame(ratios=[0.8], seed=1234)

# Train a GLM model
glm_model = H2OGeneralizedLinearEstimator(family="binomial", lambda_=0, compute_p_values=True)
glm_model.train(x=predictors, y=response, training_frame=train)

# Predict on the test set
predict = glm_model.predict(test)
predict.head()

Regression example

For regression, the response column must be numeric. The following example uses the Boston Housing dataset to predict median home prices.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

# Import the Boston Housing dataset
boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# Set predictors and response
predictors = boston.columns[:-1]
response = "medv"

# Convert chas to a factor (it's a binary indicator)
boston['chas'] = boston['chas'].asfactor()

# Split into train and test sets
train, test = boston.split_frame(ratios=[0.8], seed=1234)

# Train the model
boston_glm = H2OGeneralizedLinearEstimator(alpha=0.25)
boston_glm.train(x=predictors, y=response, training_frame=train)

# Predict on the test set
predict = boston_glm.predict(test)
predict.head()

Common training parameters

These parameters are shared across most H2O-3 algorithms.

Controlling runtime

Use max_runtime_secs to cap training time. This is especially useful in automated pipelines or when exploring many models.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    max_runtime_secs=60,   # stop after 60 seconds
    seed=42
)
gbm.train(x=predictors, y=response, training_frame=train)

Enabling cross-validation during training

Pass nfolds to perform K-fold cross-validation automatically as part of training. See Cross-Validation for details.
gbm = H2OGradientBoostingEstimator(nfolds=5, seed=42)
gbm.train(x=predictors, y=response, training_frame=train)

Accessing model metrics after training

After training, you can retrieve performance metrics on training, validation, and cross-validation data.
# Training metrics
perf_train = gbm.model_performance(train=True)
print(perf_train)

# Validation metrics (available if a validation_frame was supplied)
perf_valid = gbm.model_performance(valid=True)
print(perf_valid)

# Cross-validation metrics (if nfolds > 0)
perf_xval = gbm.model_performance(xval=True)
print(perf_xval)

# Access a specific metric (e.g., AUC for binary classification)
print(gbm.auc(train=True))
print(gbm.auc(xval=True))

Training on segments

H2O-3 can train a separate model for each subpopulation (segment) of a dataset using train_segments(). This is useful for building per-group models, such as per-region or per-category models.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
titanic["survived"] = titanic["survived"].asfactor()

predictors = ["name", "sex", "age", "sibsp", "parch", "ticket", "fare", "cabin"]
response = "survived"

train, valid = titanic.split_frame(ratios=[.8], seed=1234)

# Train one GBM per passenger class
titanic_gbm = H2OGradientBoostingEstimator(seed=1234)
titanic_models = titanic_gbm.train_segments(
    segments=["pclass"],
    x=predictors,
    y=response,
    training_frame=train,
    validation_frame=valid
)

# View results as a frame
titanic_models.as_frame()
The train_segments() function trains one model per segment group. The parallelism parameter controls how many models are built simultaneously on each H2O node.
