H2O-3 organizes tabular data into three nested structures: Frame, Vec, and Chunk. Understanding this hierarchy helps you write efficient code and reason about how data is distributed across a cluster.

The Frame / Vec / Chunk hierarchy

Frame  (table: rows × columns)
 └── Vec  (one per column — distributed across nodes)
      └── Chunk  (contiguous row block — lives on a single node)

H2OFrame

H2OFrame is the primary 2D data structure in H2O-3. It is analogous to a pandas DataFrame or an R data.frame, but the data lives in the H2O cluster, not in client memory. An H2OFrame object in Python or R is a lightweight handle to that remote data.

Python:

import h2o
h2o.init()

# From a Python dict (keys become column names)
frame = h2o.H2OFrame({"x": [1, 2, 3], "label": ["a", "b", "a"]})

# From a pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
h2o_frame = h2o.H2OFrame(pandas_df)

R:

library(h2o)
h2o.init()

# From an R data.frame
r_df <- data.frame(x = c(1, 2, 3), label = c("a", "b", "a"))
h2o_frame <- as.h2o(r_df)

Data is not held in the Python or R process. The H2OFrame object contains a reference (key) to the data stored in the cluster’s distributed key-value store (DKV).

Vec

A Vec is a single distributed column. Conceptually it is a database column, but the data is split into Chunks and spread across nodes. All Vecs in a Frame share a VectorGroup, which guarantees that same-numbered chunks across all columns cover the same row ranges. You generally do not interact with Vec directly in Python or R — it is an internal Java class (water.fvec.Vec).

Chunk

A Chunk is a contiguous block of rows within a single Vec, stored entirely on one node. Chunks typically hold between 1,000 and 1,000,000 rows. MRTask computations operate on one Chunk at a time, on the node where the Chunk lives, avoiding network data movement.
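
The relationship between the three structures can be sketched with a toy, single-process model in plain Python. This is not H2O-3 code: `ToyVec`, `ToyFrame`, `chunk_starts`, and `map_reduce_sum` are hypothetical names, and the chunk boundaries are supplied by hand rather than chosen by a cluster. The sketch only illustrates two ideas from the text: every column shares the same row boundaries, and a computation runs once per chunk before the partial results are combined.

```python
from bisect import bisect_right


class ToyVec:
    """One column, stored as chunks that follow shared row boundaries (toy model)."""

    def __init__(self, values, chunk_starts):
        self.chunk_starts = list(chunk_starts)
        ends = self.chunk_starts[1:] + [len(values)]
        # Each chunk is a contiguous block of rows from this column.
        self.chunks = [values[s:e] for s, e in zip(self.chunk_starts, ends)]

    def chunk_index(self, row):
        # Find the chunk whose starting row is the largest one <= row.
        return bisect_right(self.chunk_starts, row) - 1

    def at(self, row):
        c = self.chunk_index(row)
        return self.chunks[c][row - self.chunk_starts[c]]


class ToyFrame:
    """A table whose columns all use the same chunk boundaries (toy model)."""

    def __init__(self, columns, chunk_starts):
        if len({len(v) for v in columns.values()}) != 1:
            raise ValueError("all columns must have the same number of rows")
        self.vecs = {name: ToyVec(vals, chunk_starts) for name, vals in columns.items()}

    def row(self, i):
        # Because boundaries are shared, every column resolves row i
        # to the same chunk index.
        return {name: vec.at(i) for name, vec in self.vecs.items()}


def map_reduce_sum(vec):
    partials = [sum(chunk) for chunk in vec.chunks]  # "map": one partial per chunk
    return sum(partials)                             # "reduce": combine partials


frame = ToyFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]},
                 chunk_starts=[0, 2, 4])
print(frame.row(3))
print(map_reduce_sum(frame.vecs["x"]))
```

In the real system the per-chunk step runs on the node that holds the chunk; the list comprehension here stands in for that distributed step.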

Supported column types

H2O-3 supports the following column types:
| Type | Description | Python alias | R alias |
|---|---|---|---|
| numeric | 64-bit floating point (covers int and real) | "numeric" | numeric |
| categorical / enum | Factor with an internal integer encoding | "enum" or "factor" | factor |
| string | Variable-length text | "string" | character |
| time | Unix timestamp in milliseconds | "time" | POSIXct |
| uuid | 128-bit identifier | "uuid" | |
Algorithms treat enum/factor columns as classification targets and numeric columns as regression targets. If your target column is stored as numeric but represents classes, convert it with asfactor() / as.factor() before training.
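
The target conversion described above can be wrapped in a small helper. This is a minimal sketch: `ensure_categorical_target` is a hypothetical name, not part of the h2o package, and it relies only on `isfactor()`, `asfactor()`, and column indexing on the frame it is given. It does not import h2o itself, so it assumes you pass in an H2OFrame from an already running cluster.

```python
def ensure_categorical_target(frame, target):
    """Convert `target` to categorical in place unless it already is one.

    `frame` is expected to behave like an H2OFrame; `target` is a column name.
    """
    # isfactor() returns one boolean per column, so a single column yields [bool].
    if not frame[target].isfactor()[0]:
        frame[target] = frame[target].asfactor()
    return frame


# Usage against a running cluster (not executed here):
#   import h2o
#   h2o.init()
#   df = h2o.H2OFrame({"x": [0.1, 0.4, 0.9], "y": [0, 1, 1]})
#   ensure_categorical_target(df, "y")   # "y" is now treated as a class label
```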

How data is distributed

When you import a file or create a Frame, H2O-3 divides the data into Chunks and distributes them across nodes. The distribution is determined by consistent hashing of the Vec’s Key. The chunk size is chosen automatically based on available memory and cluster size. Each node holds a roughly equal share of each column. Because chunk alignment is guaranteed by the VectorGroup, processing a row requires reading the same-indexed chunk from each Vec — all on the same node.
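
The co-location property can be illustrated with a toy placement rule in plain Python. This is not H2O-3's actual hashing scheme, which is internal to the Java runtime: `home_node`, `placement`, and the `"vg-0"` group id are hypothetical, and a simple hash-modulo rule stands in for consistent hashing. The point of the sketch is only that when placement depends on the group and the chunk index, and not on the column, same-indexed chunks of every column end up on the same node.

```python
import hashlib


def home_node(group_id, chunk_idx, num_nodes):
    """Toy rule: hash the (group, chunk index) pair, ignoring the column."""
    digest = hashlib.sha256(f"{group_id}:{chunk_idx}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def placement(columns, num_chunks, num_nodes, group_id="vg-0"):
    """Return {column: [node for chunk 0, node for chunk 1, ...]}."""
    return {
        col: [home_node(group_id, c, num_nodes) for c in range(num_chunks)]
        for col in columns
    }


layout = placement(["age", "salary", "dept"], num_chunks=6, num_nodes=3)
for col, nodes in layout.items():
    print(col, nodes)

# Every column reports the same node for a given chunk index,
# so all values of a row can be read without leaving that node.
aligned = all(nodes == layout["age"] for nodes in layout.values())
print(aligned)
```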

Key operations

Dimensions and structure

df.nrow         # number of rows
df.ncol         # number of columns
df.shape        # (nrow, ncol) tuple
df.columns      # list of column names
df.types        # dict of {column: type}

Summary statistics

df.describe()       # prints dimensions, a per-column summary (type, min, mean, max, sigma, zeros, missing), and the first rows
df.summary()        # per-column summary only
df.head(10)         # first 10 rows
df.tail(5)          # last 5 rows

Column types

df.type("col")          # column type string, e.g. "int", "real", "enum", "string"
df["col"].isfactor()    # [True] if categorical (one boolean per column)
df["col"].isnumeric()   # [True] if numeric (one boolean per column)

# Convert column type
df["col"] = df["col"].asfactor()    # to categorical
df["col"] = df["col"].asnumeric()   # to numeric

Converting to and from pandas / R data.frame

import h2o
import pandas as pd
h2o.init()

# pandas → H2OFrame
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
h2o_frame = h2o.H2OFrame(pandas_df)

# H2OFrame → pandas
# Pulls all data from the cluster into local memory — use with care on large frames
back_to_pandas = h2o_frame.as_data_frame()

as_data_frame() transfers the entire frame from the cluster to your local Python process. Only use this for small frames or subsets.
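
One way to follow that advice is to cap the number of rows before pulling data locally. The helper below is a sketch under stated assumptions: `to_pandas_capped` and `max_rows` are hypothetical names, and the function expects an H2OFrame from a running cluster rather than importing h2o itself. It uses only `nrow`, row slicing, and `as_data_frame()`.

```python
def to_pandas_capped(frame, max_rows=10_000):
    """Pull at most `max_rows` rows of an H2OFrame into local memory."""
    if frame.nrow > max_rows:
        # Slicing produces another cluster-side frame; nothing is transferred yet.
        frame = frame[0:max_rows, :]
    # Only the (possibly reduced) frame is downloaded to the client.
    return frame.as_data_frame()


# Usage against a running cluster (not executed here):
#   preview = to_pandas_capped(h2o_frame, max_rows=1000)
```

The cap applies to rows only; a very wide frame can still be large, so selecting the columns you need before the call is a reasonable extra step.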

Creating frames directly

import h2o
h2o.init()

# From a list (single column)
frame = h2o.H2OFrame([1, 2, 2.5, -100.9, 0])

# From a dict (multiple columns)
frame = h2o.H2OFrame({
    "age":    [25, 30, 35],
    "salary": [50000, 60000, 75000],
    "dept":   ["eng", "sales", "eng"]
})

# From a list of lists (rows)
frame = h2o.H2OFrame([[1, "a"], [2, "b"], [3, "c"]],
                     column_names=["id", "label"])
