tanh, rectifier, and maxout activation functions.
Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network.
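The model-averaging step described above can be sketched conceptually (this is an illustration of the idea, not H2O's actual implementation):

```python
# Conceptual sketch of distributed model averaging: each node trains a
# local copy of the weights, and the global model is their element-wise
# mean, which is then broadcast back to the nodes.
import numpy as np

node_weights = [
    np.array([0.2, 0.4]),   # node 1's locally trained copy
    np.array([0.4, 0.6]),   # node 2's copy
    np.array([0.6, 0.8]),   # node 3's copy
]

# Average across nodes to form the new global model
global_weights = np.mean(node_weights, axis=0)
```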
H2O-3’s Deep Learning is optimized for tabular (transactional) data. For image data, consider CNNs; for sequential data (text, time-series), consider RNNs. H2O Deep Learning works well when GBM and DRF are under-performing due to highly non-linear feature interactions or very large datasets.
Key Parameters
- hidden: The architecture of the hidden layers. Each value specifies the number of neurons in that layer. For example, [200, 200] creates a 2-layer network with 200 neurons each. Deeper networks (more layers) can capture more complex patterns.
- epochs: Number of passes through the training data. Can be fractional. More epochs improve accuracy but risk overfitting. Use overwrite_with_best_model=True (default) to keep the best checkpoint.
- activation: Activation function for hidden layers:
  - "tanh" — hyperbolic tangent
  - "tanh_with_dropout" — tanh with dropout regularization
  - "rectifier" (default) — ReLU; fast and effective for most tasks
  - "rectifier_with_dropout" — ReLU with dropout
  - "maxout" — max over linear functions; powerful but slow
  - "maxout_with_dropout" — maxout with dropout
- adaptive_rate: Enable the ADADELTA adaptive learning rate. When True (default), the rate, rate_annealing, and rate_decay parameters are ignored. Recommended for most use cases.
- rate: Manual learning rate (only used when adaptive_rate=False). Higher values converge faster but can be unstable.
- rho: ADADELTA time decay factor (only used when adaptive_rate=True). Controls how much past gradient information is retained.
- epsilon: ADADELTA smoothing factor (only used when adaptive_rate=True). Prevents division by zero.
- input_dropout_ratio: Fraction of input features to randomly drop during training. Suggested values: 0.1 or 0.2. Helps prevent overfitting on noisy datasets.
- hidden_dropout_ratios: Per-layer hidden dropout ratios. Only applicable when using a *_with_dropout activation. Must have one value per hidden layer. Range: [0, 1).
- l1: L1 regularization. Sets many weights to exactly zero, producing sparse networks. Increase to reduce overfitting.
- l2: L2 regularization. Shrinks weights toward zero without setting them exactly to zero. Increase to reduce overfitting.
- loss: Loss function. "automatic" infers from the response type. Options: "cross_entropy" (classification), "quadratic", "huber", "absolute", "quantile".

Code Examples
Regularization
Deep Learning models are prone to overfitting, especially on small datasets. H2O-3 provides multiple complementary regularization techniques:
- Dropout
- L1 / L2 Regularization
- Early Stopping
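As a rough sketch of how the l1 and l2 penalties enter the training objective (a conceptual illustration, not H2O's actual implementation):

```python
# Conceptual illustration of L1 and L2 penalties on a weight vector.
import numpy as np

weights = np.array([0.5, -1.2, 0.0, 3.0])
l1, l2 = 1e-4, 1e-3   # same names as the H2O parameters

l1_penalty = l1 * np.sum(np.abs(weights))      # pushes weights to exactly 0
l2_penalty = l2 * 0.5 * np.sum(weights ** 2)   # shrinks weights toward 0

# Both terms are added to the training loss being minimized
total_penalty = l1_penalty + l2_penalty
```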
Dropout randomly sets a fraction of neuron activations to zero during each training pass, forcing the network to learn redundant representations.
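The mechanism can be sketched in plain numpy (an illustrative "inverted dropout" example of the idea above, not H2O internals):

```python
# Inverted dropout on one layer's activations: zero out a random
# fraction of neurons and rescale the survivors so the expected
# activation magnitude is unchanged.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random((4, 8))   # batch of 4 rows, 8 hidden neurons
ratio = 0.5                        # hidden dropout ratio

mask = rng.random(activations.shape) >= ratio    # keep ~50% of neurons
dropped = activations * mask / (1.0 - ratio)     # rescale the survivors

# At prediction time dropout is disabled and activations pass through as-is.
```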
Dropout ratios apply only when using "tanh_with_dropout", "rectifier_with_dropout", or "maxout_with_dropout" activation. Setting hidden_dropout_ratios with a non-dropout activation has no effect.

Tips for Deep Learning in H2O-3
- Start with the default architecture hidden=[200, 200] and activation="rectifier". Tune from there.
- Use adaptive_rate=True (default) unless you want manual control over the learning schedule.
- Grid search over hidden layer sizes and l1/l2 regularization for best results.
- For large datasets, increase train_samples_per_iteration or set it to -2 (auto-tuning).
- Export weights and biases for analysis via export_weights_and_biases=True.