K-Means Clustering
K-Means falls in the general category of clustering algorithms. It partitions a set of observations into k groups such that observations within a group are more similar to each other than to observations in other groups. H2O's K-Means runs in a distributed, parallel fashion across the cluster.
MOJO Support: K-Means supports exporting MOJOs (export only, not import).
Key Parameters
k: Number of clusters. This is the most important parameter. Use domain knowledge or the elbow method to choose an appropriate value.
init: Cluster center initialization strategy:
- "Furthest" (default) — choose each subsequent center to be the point farthest from the existing centers (Euclidean distance)
- "PlusPlus" — weighted random selection where points farther from existing centers are more likely to be chosen (the K-Means++ algorithm)
- "Random" — a uniformly random sample of k rows as initial centers
- "User" — provide explicit initial centers via user_points
max_iterations: Maximum number of iterations for the K-Means algorithm to converge.
estimate_k: Automatically estimate the number of clusters (up to k) by iteratively trying k = 1, 2, 3, .... When enabled, the init parameter is ignored.
standardize: Standardize numeric columns before clustering. Strongly recommended so that features with larger numeric ranges don't dominate the distance calculation.
seed: Random seed for reproducibility.
user_points: (Only when init="User") A data frame where each row is an initial cluster center. Must have the same number of columns as the training frame.
Code Examples
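As a sketch of the parameters above, the following trains K-Means from explicit initial centers with H2O's Python client. The kmeans_with_user_centers helper name is ours, and the snippet assumes a running H2O cluster (h2o.init() already called):

```python
def kmeans_with_user_centers(frame, centers):
    """Train K-Means from explicit initial centers (init="User").

    `centers` is a list of rows, each with the same columns as `frame`.
    """
    import h2o
    from h2o.estimators import H2OKMeansEstimator
    user_points = h2o.H2OFrame(centers)  # one row per initial cluster center
    model = H2OKMeansEstimator(
        k=len(centers),          # k matches the number of supplied centers
        init="User",
        user_points=user_points,
        standardize=True,
        seed=42,
    )
    model.train(training_frame=frame)
    return model
```

Because the centers fully determine the starting configuration, runs with the same user_points and seed are reproducible.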
Choosing k — Elbow Method
Python
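A minimal sketch of the elbow method with H2O's Python client. The elbow_k helper and its 10% drop threshold are our own illustrative choices, not part of the H2O API; the sweep assumes a running cluster and uses the model's total within-cluster sum of squares:

```python
def kmeans_wcss_sweep(frame, k_values, seed=42):
    """Train one K-Means model per candidate k and collect the
    total within-cluster sum of squares for each."""
    from h2o.estimators import H2OKMeansEstimator  # assumes a running H2O cluster
    wcss = []
    for k in k_values:
        model = H2OKMeansEstimator(k=k, standardize=True, seed=seed)
        model.train(training_frame=frame)
        wcss.append(model.tot_withinss())
    return wcss

def elbow_k(wcss, threshold=0.10):
    """Pick the elbow: the first k after which the relative drop in WCSS
    falls below `threshold`. wcss[i] holds the value for k = i + 1."""
    for i in range(1, len(wcss)):
        drop = (wcss[i - 1] - wcss[i]) / wcss[i - 1]
        if drop < threshold:
            return i  # wcss[i - 1] corresponds to k = i
    return len(wcss)
```

Typical usage after h2o.init() and h2o.import_file(...): wcss = kmeans_wcss_sweep(frame, range(1, 11)), then k = elbow_k(wcss).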
Principal Component Analysis (PCA)
PCA performs a transformation on possibly collinear features to produce a new set of uncorrelated (orthogonal) features called principal components. The components are ordered by the amount of variance they explain, so the first few components capture the most information. Common uses:
- Dimensionality reduction — reduce a high-dimensional dataset to k principal components before feeding it into a supervised model.
- Preprocessing — orthogonalize features before distance-based algorithms (like K-Means).
- Visualization — project data onto 2–3 components for plotting.
Key Parameters
k: Number of principal components (rank of the matrix approximation). Must be between 1 and min(nrows, ncols).
transform: Numeric column transformation before computing PCA:
- "none" (default) — no transformation
- "standardize" — subtract the mean and divide by the standard deviation (recommended in most cases)
- "normalize" — scale to [0, 1]
- "demean" — subtract the mean only
- "descale" — divide by the standard deviation only
pca_method: Algorithm for computing the principal components:
- "gram_s_v_d" (default) — distributed Gram matrix computation plus a local SVD via JAMA
- "power" — power iteration method (experimental)
- "randomized" — randomized subspace iteration (fast for very large matrices)
- "glrm" — generalized low-rank model approach (experimental)
pca_impl: Implementation for the EVD/SVD computation:
- "mtj_evd_symmmatrix" (default) — EVD of a symmetric matrix via MTJ
- "mtj_evd_densematrix" — EVD of a dense matrix via MTJ
- "mtj_svd_densematrix" — SVD of a dense matrix via MTJ
- "jama" — EVD of a dense matrix via JAMA
max_iterations: Maximum number of iterations for the iterative PCA methods (power, randomized, glrm).
use_all_factor_levels: Whether to use all factor levels when expanding categorical columns into indicator columns. By default the first level is skipped to avoid collinearity.
Code Examples
PCA as a Preprocessing Step
Python
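A sketch of the PCA-then-K-Means pipeline with H2O's Python client, under the assumption of a running cluster. The components_for_variance helper is a local utility of ours for choosing k from the proportion-of-variance figures PCA reports, not part of H2O:

```python
def components_for_variance(var_ratios, target=0.95):
    """Smallest number of components whose cumulative proportion of
    variance explained reaches `target` (local helper, not H2O API)."""
    total = 0.0
    for i, v in enumerate(var_ratios, start=1):
        total += v
        if total >= target:
            return i
    return len(var_ratios)

def pca_then_kmeans(frame, n_components=3, n_clusters=4):
    """Project `frame` onto its principal components, then cluster the scores."""
    from h2o.estimators import (
        H2OPrincipalComponentAnalysisEstimator,
        H2OKMeansEstimator,
    )
    pca = H2OPrincipalComponentAnalysisEstimator(
        k=n_components, transform="standardize", pca_method="gram_s_v_d"
    )
    pca.train(training_frame=frame)
    scores = pca.predict(frame)  # rows projected onto the components
    # Components are already on comparable scales, so skip standardization.
    km = H2OKMeansEstimator(k=n_clusters, standardize=False, seed=1)
    km.train(training_frame=scores)
    return pca, km
```

Because the component scores are uncorrelated, the Euclidean distances K-Means computes on them are not distorted by collinear input features.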
Word2Vec
H2O-3 includes a Word2Vec implementation for generating dense word embeddings from text data. Word2Vec learns vector representations of words such that words appearing in similar contexts have similar vectors.
Python
Word2Vec in H2O-3 operates on pre-tokenized data stored as a single-column H2OFrame of words. Use h2o.H2OFrame.tokenize() to split raw text into tokens before training.
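The tokenize-then-train flow can be sketched as follows with H2O's Python client. The column name, regex, and vector size are illustrative choices; simple_tokens is a local stand-in showing what the "\\W+" split does, not an H2O function; training assumes a running cluster:

```python
import re

def simple_tokens(text):
    """Local illustration of tokenize("\\W+"): split on runs of
    non-word characters and drop empty tokens."""
    return [t for t in re.split(r"\W+", text) if t]

def train_word2vec(frame, text_col="text", vec_size=100):
    """Tokenize a text column and fit a Word2Vec model on the tokens."""
    from h2o.estimators import H2OWord2vecEstimator  # assumes a running H2O cluster
    words = frame[text_col].tokenize("\\W+")  # single-column frame of words
    w2v = H2OWord2vecEstimator(vec_size=vec_size, epochs=10)
    w2v.train(training_frame=words)
    return w2v
```

After training, w2v.find_synonyms(word, count=5) looks up a word's nearest neighbors in the embedding space, and w2v.transform(words, aggregate_method="AVERAGE") averages token vectors to produce one embedding per document.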