## Module structure
H2O-3 is built from layered modules; each layer depends only on the ones below it. The h2o-genmodel module has no dependency on the H2O runtime, which makes it suitable for embedding POJO/MOJO models in production systems without running a cluster.

| Module | Responsibility |
|---|---|
| h2o-core | DKV, REST API infrastructure, Frame/Vec/Chunk data structures, MRTask framework |
| h2o-algos | All ML algorithms (each extends hex.ModelBuilder) |
| h2o-web | Flow web UI (Node.js, compiled into resources) |
| h2o-genmodel | Standalone model scoring — no H2O runtime dependencies |
| h2o-bindings | Generates Python and R client code from REST schemas |
| h2o-persist-{hdfs,s3,gcs} | Storage backends for distributed file systems |
## Distributed Key-Value store (DKV)

Every object in H2O-3 — frames, models, jobs — lives in the DKV, a distributed in-memory key-value store spread across all cluster nodes.

- Each object has a home node determined by consistent hashing of its Key.
- Reads and writes use `DKV.get(key)` and `DKV.put(key, value)`.
- The cluster locks via Paxos before the first DKV write, preventing node joins mid-computation.
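The key-homing rule above can be sketched in a few lines of Python (H2O itself is Java, and its real hash function and node ordering differ; the node names and hash choice here are purely illustrative):

```python
import hashlib

# Toy sketch of DKV-style key homing: a key's bytes are hashed and mapped
# onto the cluster's node list, so every node computes the same home for
# the same key without consulting a coordinator or lookup table.
NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster members

def home_node(key: str, nodes=NODES) -> str:
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[index]

# Deterministic: repeated lookups of the same key land on the same node.
frame_home = home_node("my_frame.hex")
assert frame_home in NODES
assert home_node("my_frame.hex") == frame_home
```

Because the mapping is a pure function of the key and the membership list, it only stays consistent while membership is frozen, which is one reason the cluster locks before the first write.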
## Vec / Chunk / Frame data model

H2O-3 stores tabular data using a three-level hierarchy:

- A Frame is the user-visible table (rows × columns).
- A Vec is a single distributed column, analogous to a database column.
- A Chunk is a contiguous block of rows within a Vec, stored on a single node.
All Vecs in a Frame share a VectorGroup, which guarantees chunk alignment: chunk i of column A covers exactly the same row range as chunk i of column B. This makes row-wise iteration across columns efficient without any shuffling.
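The alignment guarantee can be illustrated with a toy Python model (the chunk size and data are made up; H2O's real chunks are compressed and typically hold many thousands of rows):

```python
# Toy model of aligned chunking: two columns split at the same row
# boundaries, so chunk i of each column covers the same rows and can be
# scanned together on one node.
CHUNK_ROWS = 4  # hypothetical chunk size

def chunk(column, size=CHUNK_ROWS):
    return [column[i:i + size] for i in range(0, len(column), size)]

col_a = list(range(10))
col_b = [x * x for x in range(10)]
chunks_a, chunks_b = chunk(col_a), chunk(col_b)

# Alignment: same number of chunks, same per-chunk row counts.
assert [len(c) for c in chunks_a] == [len(c) for c in chunks_b]

# Row-wise work proceeds chunk pair by chunk pair, with no cross-node
# shuffle needed to line rows up.
row_sums = [a + b for ca, cb in zip(chunks_a, chunks_b) for a, b in zip(ca, cb)]
assert row_sums == [x + x * x for x in range(10)]
```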
## MRTask map-reduce framework
MRTask is H2O-3’s in-memory map-reduce framework. It is distinct from Hadoop MapReduce — it operates entirely within the JVM heap across cluster nodes.
To write a distributed computation:
- Extend `MRTask` and override `map(Chunk c)`.
- Optionally override `reduce(MRTask mrt)` to aggregate results.
- Call `.doAll(frame)` (blocking) or `.dfork(frame)` (non-blocking) to execute.
Computation moves to the data. Each chunk is processed on the node where it lives, and partial results reduce up a binary tree back to the calling node.
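The map-then-tree-reduce flow can be sketched in plain Python (a conceptual stand-in for the Java MRTask API described above; `map_chunk` and `tree_reduce` are hypothetical names, not H2O methods):

```python
# Per-chunk partial result, computed where the chunk lives (the "map").
def map_chunk(chunk):
    return {"sum": sum(chunk), "rows": len(chunk)}

# Combine two partials; H2O reduces partials up a binary tree so no single
# node has to merge all of them sequentially.
def reduce_pair(left, right):
    return {"sum": left["sum"] + right["sum"],
            "rows": left["rows"] + right["rows"]}

def tree_reduce(partials):
    while len(partials) > 1:
        paired = [reduce_pair(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:      # odd partial rides along to the next level
            paired.append(partials[-1])
        partials = paired
    return partials[0]

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # stand-ins for distributed chunks
result = tree_reduce([map_chunk(c) for c in chunks])
assert result == {"sum": 45, "rows": 9}
```

The key property mirrored here is that `reduce_pair` is associative, which is what lets partial results combine in any tree shape and still yield the same answer.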
## Node communication

H2O-3 nodes communicate over two channels:

| Channel | Used for |
|---|---|
| UDP | Heartbeats, small control messages, cluster membership |
| TCP | Bulk data transfer (frame data, model serialization) |
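A loopback sketch of the two-channel split, using Python's standard sockets (illustrative only; it mimics the pattern of small UDP control messages versus bulk TCP streams, not H2O's actual wire protocol):

```python
import socket

# Control path: a small fire-and-forget datagram over UDP.
udp_rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_rx.bind(("127.0.0.1", 0))
udp_tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_tx.sendto(b"heartbeat", udp_rx.getsockname())
msg, _ = udp_rx.recvfrom(64)
assert msg == b"heartbeat"

# Bulk path: a reliable TCP stream for large payloads (frame data stand-in).
tcp_srv = socket.socket()
tcp_srv.bind(("127.0.0.1", 0))
tcp_srv.listen(1)
tcp_cli = socket.create_connection(tcp_srv.getsockname())
conn, _ = tcp_srv.accept()
payload = b"x" * 65536
tcp_cli.sendall(payload)
tcp_cli.close()
received = b""
while chunk := conn.recv(8192):    # read until EOF
    received += chunk
assert received == payload
for s in (udp_rx, udp_tx, tcp_srv, conn):
    s.close()
```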
## REST API structure

All client interactions (Python, R, Flow, Excel) go through H2O-3’s versioned REST API. The server follows a Handler → Route → Schema pattern.

### Handler

A Handler processes the request. Handler methods have the signature `(int version, SchemaType schema)`.

### Iced serialization

All distributed objects extend `Iced<T>` for auto-generated Java serialization. `Keyed<T>` extends `Iced` and adds DKV key management. Schemas also extend `Iced` and serve as versioned REST API data transfer objects.
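The Handler → Route → Schema pattern can be mimicked in a small Python sketch (class and method names here are hypothetical stand-ins for H2O's Java classes; the real handler behind `/3/Frames` is far richer):

```python
# Versioned DTO: the wire shape of a hypothetical /3/Frames response.
class FramesSchemaV3:
    def __init__(self, frame_ids):
        self.version = 3
        self.frame_ids = frame_ids

# Handler: processes the request. H2O handler methods take
# (int version, Schema schema); the same shape is echoed here.
class FramesHandler:
    def list(self, version, schema):
        schema.frame_ids = ["my_frame.hex"]   # in H2O, filled from the DKV
        return schema

# Route table: (method, path) -> (handler method, version, schema class).
ROUTES = {("GET", "/3/Frames"): (FramesHandler().list, 3, FramesSchemaV3)}

def serve(method, path):
    handler, version, schema_cls = ROUTES[(method, path)]
    return handler(version, schema_cls([]))

resp = serve("GET", "/3/Frames")
assert resp.version == 3 and resp.frame_ids == ["my_frame.hex"]
```

The point of the split is that schemas are versioned independently of handlers, so old clients keep working against `/3/...` while internals evolve.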
## Flow web UI

Flow is H2O-3’s notebook-style web interface. It is a JavaScript application bundled with the H2O JAR and served at `http://<host>:54321/flow/index.html`.
From Flow you can:
- Import and inspect data
- Build and tune models interactively
- Monitor running jobs and cluster health
- Visualize model metrics and predictions
## How clients interact with H2O-3

The Python and R packages are thin REST clients. Data never flows through the client. An `H2OFrame` object in Python or R is a handle — a reference to data that lives in the cluster.
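The handle idea can be illustrated with a toy Python proxy (all names here are hypothetical; the real `h2o` package issues REST calls against the cluster in the same spirit):

```python
# Stands in for the remote H2O cluster; it holds the actual rows.
class FakeCluster:
    def __init__(self):
        self.store = {"my_frame.hex": [[1, 2], [3, 4], [5, 6]]}
    def rest_call(self, key, op):
        if op == "nrow":
            return len(self.store[key])

# The client-side "frame" is just a key plus a way to ask the server.
class H2OFrameHandle:
    def __init__(self, cluster, key):
        self.cluster, self.key = cluster, key   # no row data held locally
    @property
    def nrow(self):
        # Every property access is a REST round-trip; rows never move.
        return self.cluster.rest_call(self.key, "nrow")

frame = H2OFrameHandle(FakeCluster(), "my_frame.hex")
assert frame.nrow == 3    # computed server-side
```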
## Deployment targets

H2O-3 runs in several environments. All use the same `h2o.jar` artifact.

### Standalone

Single-node or multi-node flat network. Launch with `java -jar h2o.jar`. Nodes discover each other via multicast or `-flatfile`.

### Hadoop (YARN)
Launch on an existing Hadoop cluster. H2O nodes run as YARN containers. Supports CDH, HDP, MapR, and EMR.
### Spark (Sparkling Water)
Embed H2O inside a Spark application. H2O nodes run as Spark executors, enabling data sharing between Spark DataFrames and H2O Frames.
### Kubernetes

Deploy using the `h2o-open-source-k8s` Docker image. Nodes use the H2O Kubernetes operator or a headless service for discovery.