
Generalized Linear Model (GLM)

GLM estimates regression models for outcomes following exponential family distributions. In addition to Gaussian (normal) regression, GLM covers binomial (logistic), multinomial, Poisson, Gamma, Tweedie, and other distributions. H2O’s GLM includes elastic net regularization (a combination of L1/lasso and L2/ridge penalties), making it effective for high-dimensional, sparse data. MOJO Support: GLM supports importing and exporting MOJOs.

Supported Distribution Families

| Family | Response Type | Typical Use Case |
| --- | --- | --- |
| gaussian | Numeric | Standard regression |
| binomial | Binary (0/1) | Logistic regression / binary classification |
| quasibinomial | Proportion [0,1] | Rates and proportions |
| fractional_binomial | Fraction [0,1] | Bounded continuous outcomes |
| multinomial | Multi-class | Multiclass classification |
| poisson | Non-negative integer | Count data |
| gamma | Positive numeric | Insurance claim severity, duration |
| tweedie | Non-negative numeric | Mixed zero/positive outcomes (insurance) |
| negative_binomial | Non-negative integer | Overdispersed count data |
| ordinal | Ordered categories | Ordinal classification |

Generalized Additive Model (GAM)

GAM is a type of GLM where the linear predictor includes smooth functions of one or more predictor variables. H2O’s GAM implementation is based on Simon N. Wood’s “Generalized Additive Models: An Introduction with R.” GAM is useful when the relationship between a predictor and the response is non-linear but you want interpretable, smooth curves rather than a black-box model.
GAM models are currently experimental in H2O-3. GAM inherits all GLM parameters and adds spline-specific controls.
MOJO Support: GAM supports importing and exporting MOJOs.

Key Parameters

GLM Parameters

family
str
default:"auto"
Distribution family. One of: "gaussian", "binomial", "quasibinomial", "fractional_binomial", "multinomial", "poisson", "gamma", "tweedie", "negative_binomial", "ordinal". Set to "auto" to infer from the response column type.
alpha
List[float]
default:"[0.5]"
Elastic net mixing parameter array. alpha=0 gives ridge (L2-only), alpha=1 gives lasso (L1-only). Values between 0 and 1 blend both penalties. Pass a list of values to perform a regularization path search.
lambda_
List[float]
default:"(computed)"
Regularization strength. Larger values produce more regularization. If not specified, H2O computes a regularization path automatically.
lambda_search
bool
default:"False"
When True, H2O performs a full regularization path search from lambda_max to lambda_min. Highly recommended when you don’t know the right regularization strength.
solver
str
default:"AUTO"
Optimization algorithm: "AUTO", "IRLSM" (Iteratively Reweighted Least Squares, good for small/medium wide data), "L_BFGS" (large sparse), "COORDINATE_DESCENT" (multi-threaded CD, best for large data), "COORDINATE_DESCENT_NAIVE", "GRADIENT_DESCENT_LH", "GRADIENT_DESCENT_SQERR".
standardize
bool
default:"True"
Standardize numeric columns before fitting. Strongly recommended when using regularization so that coefficients are on a comparable scale.
remove_collinear_columns
bool
default:"False"
Automatically drop collinear columns. Useful when multicollinearity would otherwise prevent convergence.
compute_p_values
bool
default:"False"
Compute p-values and standard errors for coefficients. Only available for models without regularization (lambda_=0).
tweedie_variance_power
float
default:"0.0"
(Only for family="tweedie") The variance power p. Common values: 1.0 = Poisson, 1.5 = compound Poisson-Gamma, 2.0 = Gamma.

GAM Parameters

gam_columns
List[str]
required
Column names to use as smoothing terms. GAM builds a spline smoother for each column listed. Required for GAM models.
bs
List[int]
default:"[0]"
Spline type per GAM column: 0 = cubic regression spline (default), 1 = thin plate regression with knots, 2 = monotone I-splines, 3 = NBSplineTypeI M-splines.
num_knots
List[int]
Number of knots for each GAM predictor listed in gam_columns. One value per column.
scale
List[float]
Smoothing parameter for each GAM predictor. Must be the same length as gam_columns.

Code Examples

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_5k.csv")
test  = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

y = "response"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()
test[y]  = test[y].asfactor()

glm = H2OGeneralizedLinearEstimator(
    family="binomial",
    alpha=0.5,           # elastic net: mix of L1 and L2
    lambda_search=True,  # search for best lambda automatically
    standardize=True,
    seed=42
)
glm.train(x=x, y=y, training_frame=train, validation_frame=test)

# Coefficients
print(glm.coef())
print(glm.auc(valid=True))

Regularization Paths

GLM supports full elastic net regularization paths. The path runs from a fully regularized model (all coefficients zero) to the least regularized model, selecting the optimal lambda via cross-validation or a validation frame.
# Enable lambda search to automatically find optimal regularization
glm = H2OGeneralizedLinearEstimator(
    family="binomial",
    alpha=0.5,
    lambda_search=True,
    nlambdas=100,          # number of lambda values to try
    lambda_min_ratio=1e-4, # ratio of smallest to largest lambda
)
glm.train(x=x, y=y, training_frame=train)

# Optimal lambda selected
print("Best lambda:", glm.actual_params["lambda"])

# Full scoring history across the regularization path
sh = glm.scoring_history()

For high-dimensional data (many predictors), use alpha=1.0 (lasso) with lambda_search=True for automatic feature selection: coefficients for irrelevant features are driven to exactly zero. For highly correlated features, alpha=0.0 (ridge) or an intermediate alpha value often works better.
