H2O-3 supports training supervised models (classification and regression) and unsupervised models (clustering, anomaly detection, dimensionality reduction). All algorithms share a common training interface: the train() method in Python and algorithm-specific functions in R.
The train() method
The train() method (Python) and the algorithm-specific functions such as h2o.glm() and h2o.gbm() (R) accept a consistent set of parameters across all supervised algorithms; a sketch that combines several of these parameters appears after the table below.
Key parameters
| Parameter | Description |
|---|---|
| x | A list of column names or indices of the predictor columns. |
| y | The name or index of the response column. |
| training_frame | The H2OFrame to train on. |
| validation_frame | An optional H2OFrame used for validation scoring during training. |
| weights_column | The name of the column containing per-row observation weights. |
| fold_column | The name of the column containing fold assignments for cross-validation. |
| nfolds | The number of cross-validation folds. Default is 0 (disabled). |
| seed | The seed for random number generation, to ensure reproducibility. |
| score_each_iteration | Whether to score on the validation set at each iteration. |
| max_runtime_secs | The maximum allowed runtime in seconds. Use 0 to disable the limit. |
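The sketch below is a minimal illustration of several of these parameters in one training call. It uses the prostate dataset from the examples further down; the weight column (all 1.0) and the specific parameter values are arbitrary choices made for illustration only.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
# Hypothetical per-row weight column (all 1.0), added only to illustrate weights_column
prostate["weight"] = 1
# Arbitrary illustrative settings: 3-fold cross-validation, a fixed seed, a 2-minute runtime cap
model = H2OGradientBoostingEstimator(nfolds=3, seed=42, max_runtime_secs=120)
model.train(x=["AGE", "RACE", "VOL", "GLEASON"],
            y="CAPSULE",
            training_frame=prostate,
            weights_column="weight")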
Splitting datasets
Use split_frame() (Python) or h2o.splitFrame() (R) to divide an H2OFrame into train and test sets before training.
import h2o
h2o.init()
df = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# Split into roughly 80% train and 20% test (split ratios are approximate)
train, test = df.split_frame(ratios=[0.8], seed=1234)
For a three-way split (train/validation/test), pass two ratios: ratios=[0.6, 0.2] in Python or ratios=c(0.6, 0.2) in R.
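For instance, a minimal three-way split in Python, continuing from the frame imported above:
# ~60% train, ~20% validation, remaining ~20% test
train, valid, test = df.split_frame(ratios=[0.6, 0.2], seed=1234)
print(train.nrow, valid.nrow, test.nrow)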
Classification example
H2O determines whether to perform classification or regression based on the type of the response column. For classification, the column must be a factor (categorical) type. For regression, it must be numeric.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# Import the prostate dataset
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# Convert response to factor for classification
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
# Define predictors and response
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response = "CAPSULE"
# Split into train and test sets
train, test = prostate.split_frame(ratios=[0.8], seed=1234)
# Train a GLM model
glm_model = H2OGeneralizedLinearEstimator(family="binomial", lambda_=0, compute_p_values=True)
glm_model.train(x=predictors, y=response, training_frame=train)
# Predict on the test set
predict = glm_model.predict(test)
predict.head()
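A brief follow-up sketch: score the trained GLM against the held-out test set and inspect the fitted coefficients.
# Performance metrics computed on the test frame
perf = glm_model.model_performance(test)
print(perf.auc())
# Fitted coefficients as a {name: value} dictionary
print(glm_model.coef())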
Regression example
For regression, the response column must be numeric. The following example uses the Boston Housing dataset to predict median home prices.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# Import the Boston Housing dataset
boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
# Set predictors and response
predictors = boston.columns[:-1]
response = "medv"
# Convert chas to a factor (it's a binary indicator)
boston['chas'] = boston['chas'].asfactor()
# Split into train and test sets
train, test = boston.split_frame(ratios=[0.8], seed=1234)
# Train the model
boston_glm = H2OGeneralizedLinearEstimator(alpha=0.25)
boston_glm.train(x=predictors, y=response, training_frame=train)
# Predict on the test set
predict = boston_glm.predict(test)
predict.head()
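As in the classification example, the regression model can be scored against the held-out test frame; a brief sketch:
# Regression metrics computed on the test frame
perf = boston_glm.model_performance(test)
print(perf.rmse())
print(perf.r2())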
Common training parameters
These parameters are shared across most H2O-3 algorithms.
Controlling runtime
Use max_runtime_secs to cap training time. This is especially useful in automated pipelines or when exploring many models.
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator(
    max_runtime_secs=60,  # stop training after 60 seconds
    seed=42
)
gbm.train(x=predictors, y=response, training_frame=train)
Enabling cross-validation during training
Pass nfolds to perform K-fold cross-validation automatically as part of training. See Cross-Validation for details.
gbm = H2OGradientBoostingEstimator(nfolds=5, seed=42)
gbm.train(x=predictors, y=response, training_frame=train)
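Once cross-validation finishes, the per-fold results can be retrieved from the main model. The sketch below assumes the default keep_cross_validation_models=True, so the fold models are retained.
# Summary of metrics across the cross-validation folds
print(gbm.cross_validation_metrics_summary())
# The individual fold models (one per fold)
cv_models = gbm.cross_validation_models()
print(len(cv_models))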
Accessing model metrics after training
After training, you can retrieve performance metrics on training, validation, and cross-validation data.
# Training metrics
perf_train = gbm.model_performance(train=True)
print(perf_train)
# Validation metrics (available only if a validation_frame was passed to train())
perf_valid = gbm.model_performance(valid=True)
print(perf_valid)
# Cross-validation metrics (if nfolds > 0)
perf_xval = gbm.model_performance(xval=True)
print(perf_xval)
# Access a specific metric (e.g., AUC for binary classification)
print(gbm.auc(train=True))
print(gbm.auc(xval=True))
Training on segments
H2O-3 can train a separate model for each subpopulation (segment) of a dataset using train_segments(). This is useful for building per-group models, such as per-region or per-category models.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
titanic["survived"] = titanic["survived"].asfactor()
predictors = ["name", "sex", "age", "sibsp", "parch", "ticket", "fare", "cabin"]
response = "survived"
train, valid = titanic.split_frame(ratios=[.8], seed=1234)
# Train one GBM per passenger class
titanic_gbm = H2OGradientBoostingEstimator(seed=1234)
titanic_models = titanic_gbm.train_segments(
    segments=["pclass"],
    x=predictors,
    y=response,
    training_frame=train,
    validation_frame=valid
)
# View results as a frame
titanic_models.as_frame()
The train_segments() function trains one model per unique value (or combination of values) of the segment columns. The parallelism parameter controls how many segment models each H2O node builds simultaneously.
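For example, a minimal sketch of the same call with parallelism set so that each node builds up to two segment models at a time (the value 2 is an arbitrary choice for illustration):
titanic_models = titanic_gbm.train_segments(
    segments=["pclass"],
    x=predictors,
    y=response,
    training_frame=train,
    validation_frame=valid,
    parallelism=2  # build up to 2 segment models concurrently per node
)
# One row per segment, including the model id and build status
print(titanic_models.as_frame())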