K-Means Clustering
K-Means falls in the general category of clustering algorithms. It partitions a set of observations into k groups such that observations within a group are more similar to each other than to observations in other groups. H2O's K-Means runs in a distributed, parallel fashion across the cluster.
MOJO Support: K-Means supports exporting MOJOs (export only, not import).
Key Parameters
k: Number of clusters. This is the most important parameter. Use domain knowledge or the elbow method to choose an appropriate value.
init: Cluster center initialization strategy:
- "Furthest" (default) — choose each subsequent center to be the point farthest from the existing centers (Euclidean distance)
- "PlusPlus" — weighted random selection where points farther from existing centers are more likely to be chosen (the K-Means++ algorithm)
- "Random" — a uniformly random sample of k rows as initial centers
- "User" — provide explicit initial centers via user_points
max_iterations: Maximum number of iterations for the K-Means algorithm to converge.
estimate_k: Automatically estimate the number of clusters (up to k) by iteratively trying k = 1, 2, 3, .... When enabled, the init parameter is ignored.
standardize: Standardize numeric columns before clustering. Strongly recommended so that features with larger numeric ranges don't dominate the distance calculation.
seed: Random seed for reproducibility.
user_points: (Only when init="User") A data frame where each row is an initial cluster center. Must have the same number of columns as the training frame.
Code Examples
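As a sketch of the parameters above, the following trains K-Means from explicit initial centers with H2O's Python client. The kmeans_with_user_centers helper name is ours, and the snippet assumes a running H2O cluster (h2o.init() already called):

```python
def kmeans_with_user_centers(frame, centers):
    """Train K-Means from explicit initial centers (init="User").

    `centers` is a list of rows, each with the same columns as `frame`.
    """
    import h2o
    from h2o.estimators import H2OKMeansEstimator
    user_points = h2o.H2OFrame(centers)  # one row per initial cluster center
    model = H2OKMeansEstimator(
        k=len(centers),          # k matches the number of supplied centers
        init="User",
        user_points=user_points,
        standardize=True,
        seed=42,
    )
    model.train(training_frame=frame)
    return model
```

Because the centers fully determine the starting configuration, runs with the same user_points and seed are reproducible.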
Choosing k — Elbow Method
Python
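A minimal sketch of the elbow method with H2O's Python client. The elbow_k helper and its 10% drop threshold are our own illustrative choices, not part of the H2O API; the sweep assumes a running cluster and uses the model's total within-cluster sum of squares:

```python
def kmeans_wcss_sweep(frame, k_values, seed=42):
    """Train one K-Means model per candidate k and collect the
    total within-cluster sum of squares for each."""
    from h2o.estimators import H2OKMeansEstimator  # assumes a running H2O cluster
    wcss = []
    for k in k_values:
        model = H2OKMeansEstimator(k=k, standardize=True, seed=seed)
        model.train(training_frame=frame)
        wcss.append(model.tot_withinss())
    return wcss

def elbow_k(wcss, threshold=0.10):
    """Pick the elbow: the first k after which the relative drop in WCSS
    falls below `threshold`. wcss[i] holds the value for k = i + 1."""
    for i in range(1, len(wcss)):
        drop = (wcss[i - 1] - wcss[i]) / wcss[i - 1]
        if drop < threshold:
            return i  # wcss[i - 1] corresponds to k = i
    return len(wcss)
```

Typical usage after h2o.init() and h2o.import_file(...): wcss = kmeans_wcss_sweep(frame, range(1, 11)), then k = elbow_k(wcss).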
Principal Component Analysis (PCA)
PCA performs a transformation on possibly collinear features to produce a new set of uncorrelated (orthogonal) features called principal components. The components are ordered by the amount of variance they explain, so the first few components capture the most information. Common uses:
- Dimensionality reduction — reduce a high-dimensional dataset to k principal components before feeding it into a supervised model.
- Preprocessing — orthogonalize features before distance-based algorithms (like K-Means).
- Visualization — project data onto 2–3 components for plotting.
Key Parameters
k: Number of principal components (rank of the matrix approximation). Must be between 1 and min(nrows, ncols).
transform: Numeric column transformation before computing PCA:
- "none" (default) — no transformation
- "standardize" — subtract the mean and divide by the standard deviation (recommended in most cases)
- "normalize" — scale to [0, 1]
- "demean" — subtract the mean only
- "descale" — divide by the standard deviation only
pca_method: Algorithm for computing the principal components:
- "gram_s_v_d" (default) — distributed Gram matrix computation plus a local SVD via JAMA
- "power" — power iteration method (experimental)
- "randomized" — randomized subspace iteration (fast for very large matrices)
- "glrm" — generalized low-rank model approach (experimental)
pca_impl: Implementation for the EVD/SVD computation:
- "mtj_evd_symmmatrix" (default) — EVD of a symmetric matrix via MTJ
- "mtj_evd_densematrix" — EVD of a dense matrix via MTJ
- "mtj_svd_densematrix" — SVD of a dense matrix via MTJ
- "jama" — EVD of a dense matrix via JAMA
max_iterations: Maximum number of iterations for the iterative PCA methods (power, randomized, glrm).
use_all_factor_levels: Whether to use all factor levels when expanding categorical columns into indicator columns. By default the first level is skipped to avoid collinearity.
Code Examples
PCA as a Preprocessing Step
Python
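A sketch of the PCA-then-K-Means pipeline with H2O's Python client, under the assumption of a running cluster. The components_for_variance helper is a local utility of ours for choosing k from the proportion-of-variance figures PCA reports, not part of H2O:

```python
def components_for_variance(var_ratios, target=0.95):
    """Smallest number of components whose cumulative proportion of
    variance explained reaches `target` (local helper, not H2O API)."""
    total = 0.0
    for i, v in enumerate(var_ratios, start=1):
        total += v
        if total >= target:
            return i
    return len(var_ratios)

def pca_then_kmeans(frame, n_components=3, n_clusters=4):
    """Project `frame` onto its principal components, then cluster the scores."""
    from h2o.estimators import (
        H2OPrincipalComponentAnalysisEstimator,
        H2OKMeansEstimator,
    )
    pca = H2OPrincipalComponentAnalysisEstimator(
        k=n_components, transform="standardize", pca_method="gram_s_v_d"
    )
    pca.train(training_frame=frame)
    scores = pca.predict(frame)  # rows projected onto the components
    # Components are already on comparable scales, so skip standardization.
    km = H2OKMeansEstimator(k=n_clusters, standardize=False, seed=1)
    km.train(training_frame=scores)
    return pca, km
```

Because the component scores are uncorrelated, the Euclidean distances K-Means computes on them are not distorted by collinear input features.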
Word2Vec
H2O-3 includes a Word2Vec implementation for generating dense word embeddings from text data. Word2Vec learns vector representations of words such that words appearing in similar contexts have similar vectors.
Python
Word2Vec in H2O-3 operates on pre-tokenized data stored as a single-column H2OFrame of words. Use h2o.H2OFrame.tokenize() to split raw text into tokens before training.
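The tokenize-then-train flow can be sketched as follows with H2O's Python client. The column name, regex, and vector size are illustrative choices; simple_tokens is a local stand-in showing what the "\\W+" split does, not an H2O function; training assumes a running cluster:

```python
import re

def simple_tokens(text):
    """Local illustration of tokenize("\\W+"): split on runs of
    non-word characters and drop empty tokens."""
    return [t for t in re.split(r"\W+", text) if t]

def train_word2vec(frame, text_col="text", vec_size=100):
    """Tokenize a text column and fit a Word2Vec model on the tokens."""
    from h2o.estimators import H2OWord2vecEstimator  # assumes a running H2O cluster
    words = frame[text_col].tokenize("\\W+")  # single-column frame of words
    w2v = H2OWord2vecEstimator(vec_size=vec_size, epochs=10)
    w2v.train(training_frame=words)
    return w2v
```

After training, w2v.find_synonyms(word, count=5) looks up a word's nearest neighbors in the embedding space, and w2v.transform(words, aggregate_method="AVERAGE") averages token vectors to produce one embedding per document.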