H2O can run as a YARN application on Hadoop clusters, giving it access to HDFS data and cluster resources. For Spark environments, Sparkling Water embeds H2O inside a Spark context.

Prerequisites

  • At least 6 GB of memory allocated per H2O node
  • Open communication on the H2O ports (default 54321 and 54322 TCP)
  • The h2odriver.jar for your Hadoop distribution
Each H2O node runs as a single YARN mapper. Aim for one mapper per physical host; when resources are tight, YARN may still place two mappers on the same host, so open a range of at least 20 ports starting at 54321 to let each node select an available port.
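Before launching, it can help to confirm that the required port range is actually free on each host. A minimal sketch in plain Python (the helper name is illustrative, not part of H2O):

```python
import socket

def free_ports(base=54321, count=20, host="0.0.0.0"):
    """Return the ports in [base, base+count) that can currently be bound."""
    available = []
    for port in range(base, base + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind((host, port))
                available.append(port)
            except OSError:
                pass  # port already in use by another process
    return available

if __name__ == "__main__":
    ports = free_ports()
    print(f"{len(ports)}/20 ports free in 54321-54340")
```

Run this on every candidate host; any port missing from the result is held by another process and would force H2O to skip it.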

Running H2O on YARN/Hadoop

1. Download the H2O Hadoop distribution

Download the H2O build for your Hadoop version from h2o.ai/download. Supported distributions include CDH 5.x, HDP 2.x, and MapR 4.x/5.x, as well as AWS EMR.
unzip h2o-<version>-*.zip
cd h2o-<version>-*

2. Launch H2O on the cluster

hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output hdfsOutputDirName
Key flags:
  • -nodes — number of H2O nodes (mappers) to launch
  • -mapperXmx — heap size per node (minimum 6g; recommended: 4× your dataset size)
  • -output — unique HDFS output directory name (must be unique per run)
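The `-mapperXmx` guidance above (at least 6g, roughly 4× the dataset size) can be expressed as a small sizing helper; the function name and defaults are illustrative, not part of H2O:

```python
import math

def recommended_mapper_xmx_gb(dataset_gb: float, min_gb: int = 6, factor: int = 4) -> int:
    """Heap per H2O node: at least min_gb, roughly factor x the on-disk dataset size."""
    return max(min_gb, math.ceil(dataset_gb * factor))

# e.g. a 10 GB dataset
print(f"-mapperXmx {recommended_mapper_xmx_gb(10)}g")  # -mapperXmx 40g
```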
When the cluster forms, you will see:
H2O cluster (3 nodes) is up
Open H2O Flow in your web browser: http://172.16.2.184:54321
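When scripting a launch, the Flow address can be scraped from the driver's output rather than copied by hand. A sketch (the regex and function name are illustrative):

```python
import re

def parse_flow_address(driver_output: str):
    """Extract (ip, port) from the h2odriver 'Open H2O Flow' line."""
    m = re.search(r"http://(\d{1,3}(?:\.\d{1,3}){3}):(\d+)", driver_output)
    if m is None:
        raise ValueError("no Flow URL found in driver output")
    return m.group(1), int(m.group(2))

line = "Open H2O Flow in your web browser: http://172.16.2.184:54321"
ip, port = parse_flow_address(line)
print(ip, port)  # 172.16.2.184 54321
```

The returned pair can be passed straight to `h2o.init(ip=..., port=...)` in the next step.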

3. Connect from Python or R

import h2o
h2o.init(ip="172.16.2.184", port=54321)

Hadoop launch parameters reference

Cluster sizing

| Parameter | Description |
| --- | --- |
| `-nodes <N>` | Number of H2O nodes (mappers) to launch |
| `-mapperXmx <Xg>` | Heap size per mapper (minimum 6g) |
| `-extramempercent <0-20>` | Extra JVM memory as a percentage of `-mapperXmx` |
| `-nthreads <N>` | CPUs per node; `-1` uses all available |

Networking

| Parameter | Description |
| --- | --- |
| `-baseport <port>` | Starting port for H2O nodes (default 54321) |
| `-driverport <port>` | Port for the mapper-to-driver callback |
| `-driverportrange <portX-portY>` | Allowed range for the driver callback port |
| `-network <CIDR>` | Bind H2O to a specific IPv4 network, e.g. `10.1.2.0/24` |

Security

| Parameter | Description |
| --- | --- |
| `-principal <kerberos principal>` | Kerberos principal for authentication |
| `-keytab <path>` | Path to the Kerberos keytab |
| `-run_as_user <username>` | Start the cluster on behalf of a Hadoop user (non-Kerberos clusters) |
| `-jks <filename>` | Java keystore for TLS |
| `-jks_pass <password>` | Keystore password (default h2oh2o) |

Output and lifecycle

| Parameter | Description |
| --- | --- |
| `-output <HDFS dir>` | Required. HDFS output directory (must be unique per run) |
| `-flow_dir <dir>` | Directory for saving H2O Flow notebooks on HDFS |
| `-notify <file>` | Write the cluster's IP:port to a file when ready |
| `-disown` | Exit the driver after the cluster forms |
| `-timeout <seconds>` | Time to wait for cluster formation (default 120) |
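On multi-homed hosts, the `-network` flag restricts which interface H2O binds to. Python's standard ipaddress module can preview which local addresses a given CIDR would match (an illustrative check, not part of H2O):

```python
import ipaddress

def matches_network(ip: str, cidr: str) -> bool:
    """Would an interface with this address fall inside -network <cidr>?"""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=True)

print(matches_network("10.1.2.40", "10.1.2.0/24"))    # True
print(matches_network("192.168.0.5", "10.1.2.0/24"))  # False
```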

Accessing S3 data from Hadoop

H2O launched on YARN can read from S3 as well as HDFS. Configure S3 access in Hadoop’s core-site.xml and set HADOOP_CONF_DIR to point to that directory before launching H2O.
No changes to H2O’s launch command are needed — S3 access flows through Hadoop’s standard credential chain once core-site.xml is configured.
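With Hadoop's S3A connector, for example, static credentials can be supplied directly in core-site.xml (the property names below are Hadoop S3A properties; on clusters with IAM instance profiles no keys are needed at all):

```xml
<!-- core-site.xml: minimal static-credential example for S3A -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```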

Sparkling Water: H2O on Apache Spark

Sparkling Water runs an H2O cluster inside a Spark application. H2O and Spark share the same JVMs, enabling zero-copy data exchange between H2O frames and Spark DataFrames.

Starting Sparkling Water

$SPARK_HOME/bin/sparkling-shell \
    --conf spark.executor.memory=4g \
    --conf spark.driver.memory=2g
From the Spark shell:
Scala
import org.apache.spark.h2o._
val hc = H2OContext.getOrCreate()
println(hc.flowURL())  // Prints the H2O Flow UI URL

Data exchange between H2O and Spark

// Convert Spark DataFrame to H2O Frame
val sparkDF = spark.read.parquet("hdfs:///data/train.parquet")
val h2oFrame = hc.asH2OFrame(sparkDF)
Train with H2O for its algorithm performance (especially GBM and AutoML), then convert predictions back to Spark DataFrames to integrate with your existing Spark pipelines.

TLS with Sparkling Water

$SPARK_HOME/bin/spark-submit \
    --class water.SparklingWaterDriver \
    --conf spark.ext.h2o.jks=/path/to/h2o.jks \
    --conf spark.ext.h2o.jks.pass=mypassword \
    sparkling-water-assembly-all.jar

Cloud integrations

AWS EMR

H2O provides a pre-built distribution for Amazon EMR. Download the EMR-specific build from the H2O download page, then launch with the standard hadoop jar h2odriver.jar command from the EMR master node. For S3-backed data, assign an IAM role with S3 read permissions to the EMR cluster — no additional configuration is needed.

Azure HDInsight

Use the HDInsight-compatible H2O distribution. Launch from the HDInsight head node with the hadoop jar command. Azure Blob Storage (WASB) can be accessed the same way as S3 once the cluster’s core-site.xml is configured with the WASB connector.

GCP Dataproc

H2O runs on Dataproc as a YARN application. Use a Dataproc initialization action script to install H2O on all nodes, then submit the hadoop jar h2odriver.jar command from the master.
Dataproc init action
#!/bin/bash
set -euo pipefail
# Record the cluster's Hadoop version for logging; the download below fetches
# the latest stable H2O-on-Hadoop build.
HADOOP_VERSION=$(hadoop version | head -1 | awk '{print $2}')
echo "Installing H2O for Hadoop ${HADOOP_VERSION}"
wget -q "https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Hadoop.zip" -O /tmp/h2o-hadoop.zip
unzip -q /tmp/h2o-hadoop.zip -d /opt/h2o-hadoop
GCS data access uses the Cloud Storage connector bundled with Dataproc — no extra H2O configuration needed beyond standard Hadoop GCS setup.
