H2O-3 supports training supervised models (classification and regression) and unsupervised models (clustering, anomaly detection, dimensionality reduction). All algorithms share a common training interface: the train() method in Python and algorithm-specific functions in R.
The train() method
The train() method (Python) and the algorithm-specific functions such as h2o.glm() and h2o.gbm() (R) accept a consistent set of parameters across all supervised algorithms; a sketch that combines several of these parameters appears after the table below.
Key parameters
| Parameter | Description |
|---|---|
| x | A list of column names or indices of the predictor columns. |
| y | The name or index of the response column. |
| training_frame | The H2OFrame to train on. |
| validation_frame | An optional H2OFrame used for validation scoring during training. |
| weights_column | The name of the column containing per-row observation weights. |
| fold_column | The name of the column containing fold assignments for cross-validation. |
| nfolds | The number of cross-validation folds. Default is 0 (disabled). |
| seed | The seed for random number generation, to ensure reproducibility. |
| score_each_iteration | Whether to score on the validation set at each iteration. |
| max_runtime_secs | The maximum allowed runtime in seconds. Use 0 to disable the limit. |
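The sketch below is a minimal illustration of several of these parameters in one training call. It uses the prostate dataset from the examples further down; the weight column (all 1.0) and the specific parameter values are arbitrary choices made for illustration only.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
# Hypothetical per-row weight column (all 1.0), added only to illustrate weights_column
prostate["weight"] = 1
# Arbitrary illustrative settings: 3-fold cross-validation, a fixed seed, a 2-minute runtime cap
model = H2OGradientBoostingEstimator(nfolds=3, seed=42, max_runtime_secs=120)
model.train(x=["AGE", "RACE", "VOL", "GLEASON"],
            y="CAPSULE",
            training_frame=prostate,
            weights_column="weight")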
Splitting datasets
Use split_frame() (Python) or h2o.splitFrame() (R) to divide an H2OFrame into train and test sets before training.
import h2o
h2o.init()
df = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# Split into roughly 80% train and 20% test (split ratios are approximate)
train, test = df.split_frame(ratios=[0.8], seed=1234)
For a three-way split (train/validation/test), pass two ratios: ratios=[0.6, 0.2] in Python or ratios=c(0.6, 0.2) in R.
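For instance, a minimal three-way split in Python, continuing from the frame imported above:
# ~60% train, ~20% validation, remaining ~20% test
train, valid, test = df.split_frame(ratios=[0.6, 0.2], seed=1234)
print(train.nrow, valid.nrow, test.nrow)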
Classification example
H2O determines whether to perform classification or regression based on the type of the response column. For classification, the column must be a factor (categorical) type. For regression, it must be numeric.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# Import the prostate dataset
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# Convert response to factor for classification
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
# Define predictors and response
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response = "CAPSULE"
# Split into train and test sets
train, test = prostate.split_frame(ratios=[0.8], seed=1234)
# Train a GLM model
glm_model = H2OGeneralizedLinearEstimator(family="binomial", lambda_=0, compute_p_values=True)
glm_model.train(x=predictors, y=response, training_frame=train)
# Predict on the test set
predict = glm_model.predict(test)
predict.head()
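A brief follow-up sketch: score the trained GLM against the held-out test set and inspect the fitted coefficients.
# Performance metrics computed on the test frame
perf = glm_model.model_performance(test)
print(perf.auc())
# Fitted coefficients as a {name: value} dictionary
print(glm_model.coef())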
Regression example
For regression, the response column must be numeric. The following example uses the Boston Housing dataset to predict median home prices.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# Import the Boston Housing dataset
boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
# Set predictors and response
predictors = boston.columns[:-1]
response = "medv"
# Convert chas to a factor (it's a binary indicator)
boston['chas'] = boston['chas'].asfactor()
# Split into train and test sets
train, test = boston.split_frame(ratios=[0.8], seed=1234)
# Train the model
boston_glm = H2OGeneralizedLinearEstimator(alpha=0.25)
boston_glm.train(x=predictors, y=response, training_frame=train)
# Predict on the test set
predict = boston_glm.predict(test)
predict.head()
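As in the classification example, the regression model can be scored against the held-out test frame; a brief sketch:
# Regression metrics computed on the test frame
perf = boston_glm.model_performance(test)
print(perf.rmse())
print(perf.r2())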
Common training parameters
These parameters are shared across most H2O-3 algorithms.
Controlling runtime
Use max_runtime_secs to cap training time. This is especially useful in automated pipelines or when exploring many models.
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator(
    max_runtime_secs=60,  # stop training after 60 seconds
    seed=42
)
gbm.train(x=predictors, y=response, training_frame=train)
Enabling cross-validation during training
Pass nfolds to perform K-fold cross-validation automatically as part of training. See Cross-Validation for details.
gbm = H2OGradientBoostingEstimator(nfolds=5, seed=42)
gbm.train(x=predictors, y=response, training_frame=train)
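Once cross-validation finishes, the per-fold results can be retrieved from the main model. The sketch below assumes the default keep_cross_validation_models=True, so the fold models are retained.
# Summary of metrics across the cross-validation folds
print(gbm.cross_validation_metrics_summary())
# The individual fold models (one per fold)
cv_models = gbm.cross_validation_models()
print(len(cv_models))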
Accessing model metrics after training
After training, you can retrieve performance metrics on training, validation, and cross-validation data.
# Training metrics
perf_train = gbm.model_performance(train=True)
print(perf_train)
# Validation metrics (available only if a validation_frame was passed to train())
perf_valid = gbm.model_performance(valid=True)
print(perf_valid)
# Cross-validation metrics (if nfolds > 0)
perf_xval = gbm.model_performance(xval=True)
print(perf_xval)
# Access a specific metric (e.g., AUC for binary classification)
print(gbm.auc(train=True))
print(gbm.auc(xval=True))
Training on segments
H2O-3 can train a separate model for each subpopulation (segment) of a dataset using train_segments(). This is useful for building per-group models, such as per-region or per-category models.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
titanic["survived"] = titanic["survived"].asfactor()
predictors = ["name", "sex", "age", "sibsp", "parch", "ticket", "fare", "cabin"]
response = "survived"
train, valid = titanic.split_frame(ratios=[.8], seed=1234)
# Train one GBM per passenger class
titanic_gbm = H2OGradientBoostingEstimator(seed=1234)
titanic_models = titanic_gbm.train_segments(
    segments=["pclass"],
    x=predictors,
    y=response,
    training_frame=train,
    validation_frame=valid
)
# View results as a frame
titanic_models.as_frame()
The train_segments() function trains one model per unique value (or combination of values) of the segment columns. The parallelism parameter controls how many segment models each H2O node builds simultaneously.
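For example, a minimal sketch of the same call with parallelism set so that each node builds up to two segment models at a time (the value 2 is an arbitrary choice for illustration):
titanic_models = titanic_gbm.train_segments(
    segments=["pclass"],
    x=predictors,
    y=response,
    training_frame=train,
    validation_frame=valid,
    parallelism=2  # build up to 2 segment models concurrently per node
)
# One row per segment, including the model id and build status
print(titanic_models.as_frame())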