Prerequisites
- At least 6 GB of memory allocated per H2O node
- Open communication on the H2O ports (default 54321 and 54322 TCP)
- The h2odriver.jar for your Hadoop distribution
Running H2O on YARN/Hadoop
Download the H2O Hadoop distribution
Download the H2O build for your Hadoop version from h2o.ai/download. Supported distributions include CDH 5.x, HDP 2.x, and MapR 4.x/5.x, as well as AWS EMR.
Launch H2O on the cluster
- -nodes — number of H2O nodes (mappers) to launch
- -mapperXmx — heap size per node (minimum 6g; recommended: 4× your dataset size)
- -output — HDFS output directory name (must be unique per run)
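A representative launch using these parameters might look like the following; the node count, heap size, and HDFS path are illustrative placeholders for your environment:

```shell
# Launch a 3-node H2O cluster on YARN.
# Adjust -nodes and -mapperXmx to your data size; the output
# directory must not already exist and must be unique per run.
hadoop jar h2odriver.jar \
    -nodes 3 \
    -mapperXmx 6g \
    -output /user/example/h2o_out_1
```

The driver waits for the cluster to form (see -timeout below) and then prints the address of a node to connect to.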
Hadoop launch parameters reference
Cluster sizing
| Parameter | Description |
|---|---|
| -nodes <N> | Number of H2O nodes to launch |
| -mapperXmx <Xg> | Heap size per mapper (minimum 6g) |
| -extramempercent <0-20> | Extra JVM memory as a percentage of -mapperXmx |
| -nthreads <N> | CPUs per node; -1 uses all available |
Networking
| Parameter | Description |
|---|---|
| -baseport <port> | Starting port for H2O nodes (default 54321) |
| -driverport <port> | Port for the mapper-to-driver callback |
| -driverportrange <portX-portY> | Allowed range for the driver callback port |
| -network <CIDR> | Bind H2O to a specific IPv4 network, e.g. 10.1.2.0/24 |
Security and identity
| Parameter | Description |
|---|---|
| -principal <kerberos principal> | Kerberos principal for authentication |
| -keytab <path> | Path to the Kerberos keytab |
| -run_as_user <username> | Start the cluster on behalf of a Hadoop user (non-Kerberos clusters) |
| -jks <filename> | Java keystore for TLS |
| -jks_pass <password> | Keystore password (default h2oh2o) |
Output and flow
| Parameter | Description |
|---|---|
| -output <HDFS dir> | Required. HDFS output directory (must be unique per run) |
| -flow_dir <dir> | Directory for saving H2O Flow notebooks on HDFS |
| -notify <file> | Write the cluster’s IP:port to a file when ready |
| -disown | Exit the driver after the cluster forms |
| -timeout <seconds> | Time to wait for cluster formation (default 120) |
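The parameters in these tables compose into a single launch command. A sketch of a launch on a Kerberized cluster, where the principal, keytab path, network CIDR, and paths are illustrative placeholders:

```shell
# Kerberized launch with a pinned network and a detached driver.
# Principal, keytab path, CIDR, and paths below are placeholders.
hadoop jar h2odriver.jar \
    -nodes 3 -mapperXmx 8g \
    -principal h2o/cluster@EXAMPLE.COM \
    -keytab /etc/security/keytabs/h2o.keytab \
    -network 10.1.2.0/24 \
    -baseport 54321 \
    -notify h2o_cluster_notify.txt \
    -disown \
    -output /user/example/h2o_out_2
```

With -notify and -disown, the driver writes the cluster’s IP:port to the notify file and exits once the cluster has formed, which suits scripted deployments.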
Accessing S3 data from Hadoop
H2O launched on YARN can read from S3 as well as HDFS. Configure S3 access in Hadoop’s core-site.xml and set HADOOP_CONF_DIR to point to that directory before launching H2O.
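Such a configuration, using the standard Hadoop S3A connector keys with placeholder credentials, might look like:

```xml
<!-- core-site.xml fragment; replace the placeholder credentials.
     fs.s3a.* are the standard Hadoop S3A connector properties. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```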
No changes to H2O’s launch command are needed — S3 access flows through Hadoop’s standard credential chain once core-site.xml is configured.

Sparkling Water: H2O on Apache Spark
Sparkling Water runs an H2O cluster inside a Spark application. H2O and Spark share the same JVMs, enabling zero-copy data exchange between H2O frames and Spark DataFrames.

Starting Sparkling Water
Scala
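A minimal start in spark-shell, assuming the Sparkling Water assembly jar is on the Spark classpath (package and method names follow the ai.h2o.sparkling Scala API; verify against your Sparkling Water version):

```scala
// Minimal Sparkling Water start (assumes a running spark-shell with the
// Sparkling Water assembly on the classpath).
import ai.h2o.sparkling._

// Starts an H2O cluster inside the running Spark application.
val hc = H2OContext.getOrCreate()

// Prints the URL of the embedded H2O Flow UI.
println(hc.flowURL())
```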
Data exchange between H2O and Spark
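A sketch of the round trip, assuming an active H2OContext named hc and a SparkSession named spark (the conversion method names asH2OFrame/asSparkFrame are taken from the Sparkling Water Scala API and should be checked against your version):

```scala
// Convert a Spark DataFrame to an H2OFrame and back.
val df = spark.range(100).toDF("x")

val h2oFrame = hc.asH2OFrame(df)     // Spark -> H2O
val df2 = hc.asSparkFrame(h2oFrame)  // H2O -> Spark
```

Because the frames live in the same JVMs, these conversions avoid serializing data out of the application.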
TLS with Sparkling Water
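By analogy with the driver’s -jks and -jks_pass flags above, keystore settings can be passed to the embedded H2O nodes as Spark properties at submit time. The property names below (spark.ext.h2o.jks, spark.ext.h2o.jks.pass) and paths are assumptions to verify against your Sparkling Water version’s configuration reference:

```shell
# Pass a Java keystore to the embedded H2O nodes at spark-submit time.
# Property names are assumed to mirror the hadoop driver's -jks/-jks_pass;
# the keystore path, password, and jar name are placeholders.
spark-submit \
  --conf spark.ext.h2o.jks=/path/to/h2o.jks \
  --conf spark.ext.h2o.jks.pass=changeit \
  sparkling-water-app.jar
```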
Cloud integrations
AWS EMR
H2O provides a pre-built distribution for Amazon EMR. Download the EMR-specific build from the H2O download page, then launch with the standard hadoop jar h2odriver.jar command from the EMR master node.
For S3-backed data, assign an IAM role with S3 read permissions to the EMR cluster — no additional configuration is needed.
Azure HDInsight
Use the HDInsight-compatible H2O distribution. Launch from the HDInsight head node with the hadoop jar command. Azure Blob Storage (WASB) can be accessed the same way as S3 once the cluster’s core-site.xml is configured with the WASB connector.
GCP Dataproc
H2O runs on Dataproc as a YARN application. Use a Dataproc initialization action script to install H2O on all nodes, then submit the hadoop jar h2odriver.jar command from the master.
Dataproc init action
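A minimal sketch of such an initialization action; the download URL and install path are placeholders, and the build must match your cluster’s Hadoop version (see h2o.ai/download):

```shell
#!/usr/bin/env bash
# Dataproc initialization action: install H2O on every node.
# H2O_URL is a placeholder; substitute the h2o.ai/download build
# matching your Hadoop version.
set -euo pipefail

H2O_URL="https://example.com/h2o-hadoop.zip"  # placeholder URL
INSTALL_DIR=/opt/h2o

mkdir -p "$INSTALL_DIR"
curl -fsSL "$H2O_URL" -o /tmp/h2o.zip
unzip -q /tmp/h2o.zip -d "$INSTALL_DIR"
```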
GCS data access uses the Cloud Storage connector bundled with Dataproc — no extra H2O configuration needed beyond standard Hadoop GCS setup.