This guide takes you from installation to a trained, scored model using H2O-3’s Python and R APIs. You will load a real dataset, train a GBM model, make predictions, and run AutoML.
H2O-3 requires Java 8 or later on the machine where the cluster runs. See Installation for full prerequisites.

Train a GBM model

1

Install H2O-3

Install the H2O-3 client for your language. In Python:
pip install h2o
In R, install the package from CRAN with install.packages("h2o").
2

Initialize the H2O cluster

h2o.init() starts a local H2O-3 server if one is not already running, then connects to it. With no arguments, it starts on localhost:54321 and uses all available CPU cores.
import h2o
h2o.init()
You will see a cluster status table printed on success, including the cluster name, number of nodes, available memory, and cores.
You can limit memory with h2o.init(max_mem_size="4G") in Python or h2o.init(max_mem_size = "4g") in R. The recommended rule of thumb is to allocate at least 4× the size of your dataset.
3

Load data

Import the iris dataset from the H2O public test data bucket. h2o.import_file reads data in parallel and distributes it across the cluster.
df = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv"
)
df["class"] = df["class"].asfactor()
df.describe()
4

Split into training and test sets

Use split_frame to create an approximately 80/20 train-test split with a fixed seed for reproducibility. H2O assigns rows to splits probabilistically, so the resulting proportions are close to, but not exactly, the requested ratios.
splits = df.split_frame(ratios=[0.8], seed=1234)
train = splits[0]
test  = splits[1]
5

Train a GBM model

Train a Gradient Boosting Machine for multiclass classification on the class column.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(ntrees=50, max_depth=4, seed=1234)
model.train(
    x=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
    y="class",
    training_frame=train,
    validation_frame=test
)
print(model)
6

Make predictions

Score the test set and inspect model performance metrics.
predictions = model.predict(test)
predictions.head()

# Model performance on the test set
perf = model.model_performance(test)
print(perf)

AutoML quickstart

H2O AutoML automatically trains and tunes many models, including GBMs, XGBoost, Random Forests, Deep Learning, GLMs, and Stacked Ensembles, and ranks them on a leaderboard. Use it when you want the best model without manually tuning hyperparameters.
1

Load data and define the target

Use the prostate dataset for a binary classification task.
import h2o
from h2o.automl import H2OAutoML

h2o.init()

df = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
)
df["CAPSULE"] = df["CAPSULE"].asfactor()

splits = df.split_frame(ratios=[0.8], seed=42)
train  = splits[0]
test   = splits[1]
2

Run AutoML

Set max_models to control how many models to train. AutoML builds base models and stacked ensembles, then ranks them on the leaderboard by AUC, the default leaderboard metric for binary classification.
aml = H2OAutoML(max_models=10, seed=42, project_name="prostate_aml")
aml.train(y="CAPSULE", training_frame=train)

# View the leaderboard
print(aml.leaderboard)
3

Score with the best model

aml.leader (Python) and aml@leader (R) hold the best model from the leaderboard.
predictions = aml.leader.predict(test)
predictions.head()
AutoML stops when max_models is reached or when the optional max_runtime_secs wall-clock limit expires, whichever comes first.

Next steps

Installation

Detailed install instructions for Python, R, conda, and the standalone jar.

Introduction

Architecture overview, supported algorithms, and multi-language API.

Algorithm reference

Deep dive into every algorithm, its parameters, and when to use it.

AutoML

Customize AutoML: exclude algorithms, set stopping criteria, add preprocessing.
