K-fold cross-validation estimates model performance without requiring a separate validation split. H2O-3 integrates cross-validation directly into the training process via the nfolds parameter.
## How cross-validation works in H2O-3
When you set nfolds=5, H2O-3 builds 6 models:
- 5 cross-validation models, each trained on 80% of the data with a different 20% held out.
- 1 main model, trained on 100% of the training data.
The main model is what you get back from H2O-3. Its cross-validation metrics are computed by combining the 5 holdout prediction sets into a single prediction for the full training dataset — where the model making a prediction for a given row never saw that row during training.
Combining holdout predictions this way can yield slightly different results than averaging the 5 individual validation metrics, especially when fold sizes differ or when models converge to different local minima (e.g., small Deep Learning models).
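A small self-contained sketch (plain NumPy, not H2O) of why the two aggregation strategies can disagree when fold sizes differ: averaging per-fold RMSEs weights each fold equally, while pooling the holdout predictions first weights each row equally. The fold sizes and predictions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two holdout folds of unequal size (hypothetical targets/predictions).
folds = [
    (rng.normal(size=30), rng.normal(size=30)),  # fold 1: 30 rows
    (rng.normal(size=70), rng.normal(size=70)),  # fold 2: 70 rows
]

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

# Strategy A: average the per-fold metrics.
mean_of_fold_rmses = np.mean([rmse(y, p) for y, p in folds])

# Strategy B (what H2O-3 reports): pool all holdout predictions,
# then compute one metric over the combined frame.
y_all = np.concatenate([y for y, _ in folds])
p_all = np.concatenate([p for _, p in folds])
pooled_rmse = rmse(y_all, p_all)

print(mean_of_fold_rmses, pooled_rmse)  # close, but not identical
```

The two numbers converge as fold sizes equalize and the per-fold models behave similarly, which is why the discrepancy is most visible with uneven folds or unstable models.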
## Basic example

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Import the prostate dataset
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")

# Convert the response to a factor for classification
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

# Set response and predictors
response = "CAPSULE"
predictors = prostate.names[3:8]

# Train a GBM with 5-fold cross-validation
prostate_gbm = H2OGradientBoostingEstimator(nfolds=5, seed=1)
prostate_gbm.train(x=predictors, y=response, training_frame=prostate)

# AUC of the combined cross-validated holdout predictions
prostate_gbm.auc(xval=True)
```
## Custom fold assignment with fold_column

For grouped or structured data (time series, geographic regions, repeated measurements per subject), random fold splitting can leak information: rows from the same group land in both the training and holdout folds, and the resulting cross-validation metrics are overly optimistic. Use fold_column to specify an explicit per-row fold assignment so that each group stays within a single fold.
```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Assume df already has a "city" column to segment by
gbm = H2OGradientBoostingEstimator(seed=42)
gbm.train(
    x=predictors,
    y=response,
    training_frame=df,
    fold_column="city",  # each city forms one fold
)
```
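Each distinct value in the fold column becomes its own fold, so a column with many groups produces many folds. One common pattern is to hash each group id down to a fixed number of folds before handing the column to fold_column. A minimal sketch in plain Python; the helper name and the sample data are hypothetical:

```python
import zlib

def group_to_fold(group_id: str, nfolds: int = 5) -> int:
    # Deterministic hash so the same group always lands in the same fold.
    return zlib.crc32(group_id.encode("utf-8")) % nfolds

cities = ["Austin", "Boston", "Austin", "Denver", "Boston"]
folds = [group_to_fold(c) for c in cities]

# Rows from the same city always share a fold, so no city is ever
# split between a fold model's training and holdout data.
```

You would then attach this fold index to the H2O frame as a new column and pass that column's name via fold_column.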
## Keeping cross-validation predictions
By default, holdout predictions and fold assignments are deleted from memory after training completes. Set keep_cross_validation_predictions=True to retain the per-fold prediction frames.
```python
import h2o
from h2o.estimators.kmeans import H2OKMeansEstimator

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
predictors = prostate.names[2:9]

prostate_kmeans = H2OKMeansEstimator(
    k=10,
    nfolds=5,
    keep_cross_validation_predictions=True,
)
# K-means is unsupervised, so no response column (y) is needed
prostate_kmeans.train(x=predictors, training_frame=prostate)

# List of per-fold prediction frames (one per fold)
prostate_kmeans.cross_validation_predictions()

# Single combined holdout prediction for the full training frame
prostate_kmeans.cross_validation_holdout_predictions()
```
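Conceptually, the combined holdout frame is just the per-fold prediction frames scattered back into the original row order, so every row carries the prediction of the one fold model that never trained on it. A NumPy sketch of that relationship (not H2O's internal code; the fold assignment and predictions are synthetic):

```python
import numpy as np

n_rows, nfolds = 10, 5
rng = np.random.default_rng(1)

# Hypothetical fold assignment and per-fold holdout predictions.
fold_of_row = rng.integers(0, nfolds, size=n_rows)
per_fold_preds = [rng.normal(size=(fold_of_row == k).sum()) for k in range(nfolds)]

# Scatter each fold's predictions back to the rows it held out.
combined = np.empty(n_rows)
for k in range(nfolds):
    combined[fold_of_row == k] = per_fold_preds[k]

# Every row now has exactly one prediction, made by the model
# that never saw that row during training.
```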
## Keeping fold assignments
To retain the fold assignment column used during cross-validation, set keep_cross_validation_fold_assignment=True.
```python
gbm = H2OGradientBoostingEstimator(
    nfolds=5,
    keep_cross_validation_fold_assignment=True,
    seed=42,
)
gbm.train(x=predictors, y=response, training_frame=train)

# Retrieve the fold assignment frame
fold_assignment = gbm.cross_validation_fold_assignment()
```
## Accessing cross-validation metrics

### Train with nfolds

Set nfolds to a value greater than 1 when building the model (in the estimator constructor in Python, or in the training function in R). H2O-3 automatically builds the K cross-validation models along with the main model.
### Retrieve cross-validation performance

Access the cross-validated metrics from the main model object using xval=True (Python) or xval = TRUE (R).

```python
# Overall cross-validation metrics
xval_perf = gbm.model_performance(xval=True)
print(xval_perf)

# Specific metrics
print(gbm.auc(xval=True))
print(gbm.rmse(xval=True))
```
### Inspect individual CV models

Retrieve each cross-validation model to inspect per-fold metrics and diagnose variance.

```python
# Get the individual cross-validation models
cv_models = gbm.cross_validation_models()

# Inspect validation metrics for each fold model
for m in cv_models:
    print(m.model_performance(valid=True).auc())
```
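Once you have extracted the per-fold metrics, a quick summary in plain Python shows how stable the model is across folds (the AUC values below are hypothetical stand-ins for what the loop above would print):

```python
import statistics

# Hypothetical per-fold AUCs collected from the CV models
fold_aucs = [0.78, 0.81, 0.76, 0.80, 0.79]

mean_auc = statistics.mean(fold_aucs)
std_auc = statistics.stdev(fold_aucs)

print(f"AUC: {mean_auc:.3f} +/- {std_auc:.3f}")
# A large spread relative to the mean suggests the model is
# sensitive to which rows it trains on.
```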
## Cross-validation cleanup
By default, when the main model finishes training, H2O-3 automatically removes these objects from memory:
- Cross-validation models
- Cross-validation metrics
- Holdout predictions
- Fold assignments
Use the following flags to prevent deletion:
| Flag | Purpose |
|---|---|
| keep_cross_validation_models | Retain the K fold models |
| keep_cross_validation_predictions | Retain the per-fold and combined holdout prediction frames |
| keep_cross_validation_fold_assignment | Retain the fold assignment column |
If training is interrupted by a timeout or manual cancellation, H2O-3 also attempts to remove the partially built CV models and their associated frames from memory.