Connect to H2O-3, import data, train models, and run predictions from R using the h2o package.
The H2O R package exposes H2O-3’s distributed machine learning capabilities through a REST-backed R API. All operations run on the H2O server; R acts as a thin client that sends commands and receives results.
h2o.init() connects to a running H2O instance or starts a new local one. It checks that the R package version matches the server version.
# Start or connect to a local instance with defaults (localhost:54321)
h2o.init()

# Allocate more memory for larger datasets
h2o.init(max_mem_size = "8g")

# Connect to a remote H2O cluster
h2o.init(ip = "192.168.1.100", port = 54321, startH2O = FALSE)

# Disable automatic startup (fail if no instance is found)
h2o.init(startH2O = FALSE)
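After connecting, it can help to confirm that the cluster is reachable before importing data. The following is a minimal sketch, assuming a local instance can be started with default settings; `h2o.clusterIsUp()` and `h2o.clusterInfo()` are not mentioned above, and the variable name `is_up` is illustrative.

```r
library(h2o)

# Start or attach to a local instance (assumes Java and the h2o package are installed)
h2o.init()

# TRUE when the server behind the current connection responds
is_up <- h2o.clusterIsUp()

# Print cluster name, version, node count, and memory for the current connection
h2o.clusterInfo()
```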
h2o.importFile() reads data from a path on the H2O server’s filesystem. This is the recommended method for large datasets because the data is read in parallel by the server and never passes through the R client.
# Import a CSV from the server filesystem
train <- h2o.importFile("/data/train.csv")

# Import with explicit column types
train <- h2o.importFile(
  path = "/data/train.csv",
  header = TRUE,
  sep = ",",
  col.types = list(by.col.name = c("label"), types = c("Enum"))
)

# Import all CSV files in a directory
train <- h2o.importFile("/data/train_parts/")
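When the data lives on the client machine rather than on the server, it has to be pushed through the R session instead. The sketch below contrasts two client-side routes, `as.h2o()` for an in-memory data frame and `h2o.uploadFile()` for a local file. Neither function is covered above; the built-in `iris` data and the temporary file are stand-ins chosen only to keep the example self-contained.

```r
library(h2o)
h2o.init()

# Push an in-memory R data frame to the cluster
iris_hf <- as.h2o(iris)

# Write a CSV on the client, then upload it through the R session
tmp_csv <- tempfile(fileext = ".csv")   # illustrative temporary path
write.csv(iris, tmp_csv, row.names = FALSE)
iris_up <- h2o.uploadFile(tmp_csv)

# Both frames now live on the H2O server
h2o.dim(iris_hf)
h2o.dim(iris_up)
```

Both routes send the data through the client, so they suit small or medium datasets; server-side import remains the better fit for large files.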
h2o.nrow(train)      # Number of rows
h2o.ncol(train)      # Number of columns
h2o.dim(train)       # c(nrow, ncol)
h2o.colnames(train)  # Column names
h2o.summary(train)   # Summary statistics per column
h2o.describe(train)  # Detailed type and missing-value info
head(train, n = 10)  # First 10 rows as an R data frame
# Create a new column
train["log_psa"] <- log(train["PSA"])

# Rename columns
names(train)[1] <- "id"

# Check and set column types
h2o.getTypes(train)
train["CAPSULE"] <- as.factor(train["CAPSULE"])
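The evaluation calls that follow assume a trained `gbm_model` and a held-out `test` frame. One way these might be produced is sketched here, assuming the `prostate.csv` sample file bundled with the h2o package (it contains the `PSA` and `CAPSULE` columns used above). The 80/20 split, predictor list, `ntrees`, and seed are illustrative choices, not values taken from this guide.

```r
library(h2o)
h2o.init()

# Load the sample dataset shipped with the h2o package
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.uploadFile(prostate_path)

# Treat the response as categorical so a classifier is trained
prostate$CAPSULE <- as.factor(prostate$CAPSULE)

# Split into training and test frames (ratio and seed are assumptions)
splits <- h2o.splitFrame(prostate, ratios = 0.8, seed = 42)
train <- splits[[1]]
test <- splits[[2]]

# Train a gradient boosting model on the training frame
gbm_model <- h2o.gbm(
  x = c("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"),
  y = "CAPSULE",
  training_frame = train,
  ntrees = 50,
  seed = 42
)
```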
# Performance on training data
train_perf <- h2o.performance(gbm_model, train = TRUE)

# Performance on a test set
test_perf <- h2o.performance(gbm_model, newdata = test)

# Key metrics
h2o.auc(test_perf)
h2o.logloss(test_perf)
h2o.mse(test_perf)
h2o.rmse(test_perf)
h2o.r2(test_perf)

# Confusion matrix (classification)
h2o.confusionMatrix(test_perf)
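Scoring new rows follows the same client/server pattern: the prediction is computed on the server and returned as an H2OFrame. The sketch below uses `h2o.predict()` on a small model trained on the bundled `prostate.csv` sample; the dataset, the three predictors, and the model settings are assumptions made only to keep the example runnable on its own.

```r
library(h2o)
h2o.init()

# Sample data shipped with the h2o package
prostate <- h2o.uploadFile(system.file("extdata", "prostate.csv", package = "h2o"))
prostate$CAPSULE <- as.factor(prostate$CAPSULE)

# Small illustrative model (settings are assumptions)
gbm_model <- h2o.gbm(
  x = c("AGE", "PSA", "GLEASON"),
  y = "CAPSULE",
  training_frame = prostate,
  ntrees = 20,
  seed = 1
)

# Predictions stay on the server as an H2OFrame
preds <- h2o.predict(gbm_model, newdata = prostate)

# Pull the results into R only when needed; for a classifier the frame
# holds a `predict` column plus per-class probability columns
preds_df <- as.data.frame(preds)
head(preds_df)
```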
# Save to a directory on the H2O server filesystem
model_path <- h2o.saveModel(
  object = gbm_model,
  path = "/models/",
  force = TRUE
)
print(model_path)
# e.g., "/models/GBM_model_R_1234567890"
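A saved model can be read back with `h2o.loadModel()`, which takes the path returned by `h2o.saveModel()`. The round trip below is a minimal sketch that assumes a local H2O instance, so that R's `tempdir()` is also visible to the server; the dataset and model settings are illustrative.

```r
library(h2o)
h2o.init()

# Train a small illustrative model on the bundled sample data
prostate <- h2o.uploadFile(system.file("extdata", "prostate.csv", package = "h2o"))
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
gbm_model <- h2o.gbm(
  x = c("AGE", "PSA", "GLEASON"),
  y = "CAPSULE",
  training_frame = prostate,
  ntrees = 20,
  seed = 1
)

# Save to a temporary directory (local instance assumed, so client and
# server share the same filesystem)
model_path <- h2o.saveModel(gbm_model, path = tempdir(), force = TRUE)

# Restore the model from the returned path and use it as usual
restored <- h2o.loadModel(model_path)
h2o.performance(restored, newdata = prostate)
```

Against a remote cluster the path refers to the server's filesystem, so a client-side `tempdir()` would not be appropriate there.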