H2O’s Deep Learning is based on a multi-layer feedforward artificial neural network (also called a deep neural network, DNN, or multi-layer perceptron, MLP) trained with stochastic gradient descent using backpropagation. The network supports large numbers of hidden layers with tanh, rectifier, and maxout activation functions. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network.
H2O-3’s Deep Learning is optimized for tabular (transactional) data. For image data, consider CNNs; for sequential data such as text or time series, consider RNNs. H2O Deep Learning is a good choice when GBM and DRF underperform due to highly non-linear feature interactions or very large datasets.
MOJO Support: Deep Learning supports importing and exporting MOJOs.
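The periodic model-averaging step mentioned above combines each node's parameter copy by elementwise averaging. A minimal sketch of the idea in plain Python (conceptual only, not H2O's internal implementation):

```python
def average_models(local_weight_sets):
    """Combine per-node parameter copies into a new global model by
    averaging each weight elementwise across nodes."""
    n = len(local_weight_sets)
    return [sum(ws) / n for ws in zip(*local_weight_sets)]

# Three compute nodes, each holding its own copy of two weights
# after a round of local training:
global_weights = average_models([[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]])
# global_weights is approximately [0.4, 0.6]
```

In the real cluster this happens asynchronously and repeatedly: each node trains on its local data, contributes to the average, and continues from the refreshed global model.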

Key Parameters

hidden
List[int]
default:"[200, 200]"
The architecture of hidden layers. Each value specifies the number of neurons in that layer. For example, [200, 200] creates a 2-layer network with 200 neurons each. Deeper networks (more layers) can capture more complex patterns.
epochs
float
default:"10.0"
Number of passes through the training data. Can be fractional. More epochs improve accuracy but risk overfitting. Use overwrite_with_best_model=True (default) to keep the best checkpoint.
activation
str
default:"rectifier"
Activation function for hidden layers:
  • "tanh" — hyperbolic tangent
  • "tanh_with_dropout" — tanh with dropout regularization
  • "rectifier" (default) — ReLU; fast and effective for most tasks
  • "rectifier_with_dropout" — ReLU with dropout
  • "maxout" — max over linear functions; powerful but slow
  • "maxout_with_dropout" — maxout with dropout
adaptive_rate
bool
default:"True"
Enable ADADELTA adaptive learning rate. When True (default), the rate, rate_annealing, and rate_decay parameters are ignored. Recommended for most use cases.
rate
float
default:"0.005"
Manual learning rate (only used when adaptive_rate=False). Higher values converge faster but can be unstable.
rho
float
default:"0.99"
ADADELTA time decay factor (only used when adaptive_rate=True). Controls how much past gradient information is retained.
epsilon
float
default:"1e-08"
ADADELTA smoothing factor (only used when adaptive_rate=True). Prevents division by zero.
input_dropout_ratio
float
default:"0.0"
Fraction of input features to randomly drop during training. Suggested values: 0.1 or 0.2. Helps prevent overfitting on noisy datasets.
hidden_dropout_ratios
List[float]
default:"0.5 per layer"
Per-layer hidden dropout ratios. Only applicable when using a *_with_dropout activation. Must have one value per hidden layer. Range: [0, 1).
l1
float
default:"0.0"
L1 regularization. Sets many weights to exactly zero, producing sparse networks. Increase to reduce overfitting.
l2
float
default:"0.0"
L2 regularization. Shrinks weights toward zero without setting them exactly to zero. Increase to reduce overfitting.
loss
str
default:"automatic"
Loss function: "automatic" infers from the response type. Options: "cross_entropy" (classification), "quadratic", "huber", "absolute", "quantile".
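To make rho and epsilon concrete, here is a single-weight sketch of the ADADELTA update rule the adaptive-rate parameters control (a conceptual illustration; H2O's multi-threaded implementation differs in detail):

```python
import math

def adadelta_step(g, state, rho=0.99, epsilon=1e-8):
    """One ADADELTA update for a single weight.

    g     -- current gradient
    state -- running averages: 'Eg2' (squared gradients, decayed by rho)
             and 'Edx2' (squared updates)
    Returns the weight delta to apply; epsilon keeps the square roots
    well-defined when the running averages are near zero.
    """
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * g * g
    dx = -math.sqrt(state["Edx2"] + epsilon) / math.sqrt(state["Eg2"] + epsilon) * g
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * dx * dx
    return dx

# Minimize f(w) = w**2 from w = 1.0 for a few steps:
state = {"Eg2": 0.0, "Edx2": 0.0}
w = 1.0
for _ in range(3):
    grad = 2 * w              # gradient of w**2
    w += adadelta_step(grad, state)
```

A larger rho retains more gradient history (smoother, slower adaptation); a larger epsilon makes early steps more conservative.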

Code Examples

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_5k.csv")
test  = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

y = "response"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()
test[y]  = test[y].asfactor()

dl = H2ODeepLearningEstimator(
    hidden=[200, 200],
    epochs=20,
    activation="rectifier_with_dropout",
    hidden_dropout_ratios=[0.2, 0.2],
    input_dropout_ratio=0.1,
    l1=1e-5,
    l2=1e-5,
    adaptive_rate=True,
    stopping_rounds=5,
    stopping_metric="AUC",
    overwrite_with_best_model=True,
    seed=42
)
dl.train(x=x, y=y, training_frame=train, validation_frame=test)
print(dl.auc(valid=True))

Regularization

Deep Learning models are prone to overfitting, especially on small datasets. H2O-3 provides multiple complementary regularization techniques:
Dropout randomly sets a fraction of neuron activations to zero during each training pass, forcing the network to learn redundant representations.
dl = H2ODeepLearningEstimator(
    hidden=[500, 500],
    activation="rectifier_with_dropout",
    input_dropout_ratio=0.1,          # drop 10% of inputs
    hidden_dropout_ratios=[0.5, 0.5], # drop 50% in each hidden layer
)
Dropout ratios apply only when using "tanh_with_dropout", "rectifier_with_dropout", or "maxout_with_dropout" activation. Setting hidden_dropout_ratios with a non-dropout activation has no effect.
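Mechanically, dropout can be pictured like this (a simplified sketch using the common "inverted dropout" scaling; not H2O's internal code):

```python
import random

def apply_dropout(activations, ratio, training=True, rng=random):
    """During training, zero each activation with probability `ratio` and
    scale survivors by 1/(1 - ratio) so the expected magnitude is preserved.
    At prediction time, activations pass through unchanged."""
    if not training or ratio == 0.0:
        return list(activations)
    keep = 1.0 - ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
hidden = [0.5] * 10
dropped = apply_dropout(hidden, ratio=0.5, rng=rng)           # roughly half become 0.0
passthrough = apply_dropout(hidden, ratio=0.5, training=False)  # unchanged at scoring time
```

Because a different random mask is drawn on every training pass, no single neuron can be relied upon, which is what forces the redundant representations described above.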

Tips for Deep Learning in H2O-3

Deep Learning models in H2O-3 are non-deterministic by default for performance reasons (asynchronous multi-threaded gradient updates). To force reproducibility, set reproducible=True — but note this forces single-threaded training and will be significantly slower.
  • Start with the default architecture hidden=[200, 200] and activation="rectifier". Tune from there.
  • Use adaptive_rate=True (default) unless you want manual control over the learning schedule.
  • Grid search over hidden layer sizes and l1/l2 regularization for best results.
  • For large datasets, increase train_samples_per_iteration or set it to -2 (auto-tuning).
  • Export weights and biases for analysis via export_weights_and_biases=True.
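When grid-searching l1 and l2 (third tip above), it helps to recall what each penalty adds to the training loss. A minimal sketch with illustrative weight values (not from the source):

```python
def regularization_penalty(weights, l1=0.0, l2=0.0):
    """Extra term added to the training loss:
    l1 * sum(|w|) drives weights to exact zeros (sparse networks);
    l2 * sum(w^2) shrinks weights smoothly toward zero."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)

weights = [0.5, -0.25, 0.0, 1.0]
p = regularization_penalty(weights, l1=1e-5, l2=1e-5)
# sum(|w|) = 1.75 and sum(w^2) = 1.3125, so p is about 3.06e-5
```

Larger l1/l2 values penalize large weights more heavily, trading some training fit for better generalization.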
