H2OFrame is the core data container in H2O. Data is stored on the H2O cluster (which may be remote), and the Python object is a lightweight handle. Operations on an H2OFrame are executed lazily and distributed across the cluster.
from h2o.frame import H2OFrame
import h2o
H2OFrame is also accessible directly as h2o.H2OFrame for convenience.

Construction

From a Python object

H2OFrame(
    python_obj=None, destination_frame=None,
    header=0, separator=",",
    column_names=None, column_types=None,
    na_strings=None, skipped_columns=None,
    force_col_types=False
)
python_obj
list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse
The source object to convert. Accepted types:
  • None — creates an empty frame
  • A flat list or tuple — creates a single-column frame
  • A {name: list} dictionary — creates a multi-column frame
  • A list of lists — rows of a rectangular table
  • A Pandas DataFrame or NumPy ndarray
  • A SciPy sparse matrix
destination_frame
string
Key to assign the frame in H2O’s distributed key-value store. Auto-generated if not provided.
header
integer
default:"0"
Header detection when python_obj is a list of lists. -1 = first row is data, 1 = first row is header, 0 = guess.
column_names
string[]
Explicit column names. Overrides any names derived from the source data.
column_types
string[] | object
Explicit column types. Valid values: "unknown", "uuid", "string", "float", "real", "double", "int", "long", "numeric", "categorical", "factor", "enum", "time".
na_strings
string[] | string[][] | object
Strings to interpret as missing values. Can be specified globally, per-column as a list-of-lists, or as a {column: list} dict.
frame = h2o.H2OFrame([1, 2, 2.5, -100.9, 0])
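As a rough sketch of the list-of-lists and dictionary forms described above, the snippet below builds the same small table both ways. The helper rows_to_columns, the demo function, and the sample data are illustrative and not part of H2O. demo() assumes the h2o package is installed and that h2o.init() can start or reach a cluster.

```python
def rows_to_columns(rows):
    """Turn [header, row1, row2, ...] into a {name: list} dictionary."""
    header, *data = rows
    return {name: [row[i] for row in data] for i, name in enumerate(header)}


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()

    rows = [
        ["name", "age", "city"],
        ["Ann", 34, "Oslo"],
        ["Bob", 51, "NA"],
        ["Cy", 29, "Lima"],
    ]

    # List of lists: header=1 marks the first row as column names.
    by_rows = h2o.H2OFrame(rows, header=1, na_strings=["NA"])

    # The same data as a {name: list} dictionary, with explicit column types.
    by_cols = h2o.H2OFrame(
        rows_to_columns(rows),
        column_types={"name": "string", "age": "int", "city": "enum"},
        na_strings=["NA"],
    )

    print(by_rows.shape)
    print(by_cols.types)
```

Call demo() from a session where a cluster is available. Passing column_types as a {column: type} dictionary pins the types instead of relying on the parser's guess.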

From an imported file

frame = h2o.import_file("s3://bucket/data.csv")
See h2o.import_file and h2o.upload_file for file-based construction.

Properties

nrows / nrow

frame.nrows  # int
Number of rows in the frame.

ncols / ncol

frame.ncols  # int
Number of columns in the frame.

shape

frame.shape  # (nrows, ncols)
Tuple of (nrows, ncols).

columns / names

frame.columns   # list[str]
frame.names     # list[str]  (alias)
List of column names. Assigning to frame.columns renames all columns.

types

frame.types  # dict[str, str]
Dictionary mapping column names to their H2O type strings (e.g., "int", "real", "enum", "string", "time").

frame_id

frame.frame_id  # str
Internal key identifying this frame on the H2O cluster.
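The properties above combine naturally. The following sketch selects column names by H2O type using only frame.columns and frame.types. The helper columns_of_type, the demo function, and the sample column names are hypothetical; demo() assumes a cluster that h2o.init() can start or reach.

```python
def columns_of_type(frame, h2o_type):
    """Return the names of columns whose H2O type string equals h2o_type."""
    return [name for name in frame.columns if frame.types[name] == h2o_type]


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({
        "age": [25, 32, 47],
        "salary": [50000.5, 64000.0, 81250.75],
        "dept": ["ops", "eng", "ops"],
    })

    print(frame.shape)                     # (nrows, ncols)
    print(frame.types)                     # column name -> H2O type string
    print(columns_of_type(frame, "enum"))  # categorical columns only

    # Assigning a full list of names renames every column.
    frame.columns = ["years", "pay", "team"]
    print(frame.names)
```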

Inspection

head / tail

frame.head(rows=10, cols=200)
frame.tail(rows=10, cols=200)
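Returns the first (head) or last (tail) rows of the frame as a new H2OFrame. The rows argument sets how many rows to return, and cols caps the number of columns.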

describe

frame.describe()
Print the frame's dimensions, the name, type, and summary statistics of each column, and the first ten rows.

summary

frame.summary()
Print statistical summaries (mean, min, max, quantiles, missing counts) for each column.

as_data_frame

frame.as_data_frame(use_pandas=True)
Convert the H2OFrame to a local Pandas DataFrame (or a Python list of lists when use_pandas=False).
use_pandas
boolean
default:"True"
Return a Pandas DataFrame when True. Returns a list of lists otherwise.
df = frame.as_data_frame()
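Because as_data_frame copies data from the cluster to the client, it can help to limit the rows first. The sketch below chains head with as_data_frame(use_pandas=False). The helper preview_rows and the demo function are illustrative, and the code assumes the returned list of lists starts with a row of column names.

```python
def preview_rows(frame, n=5):
    """Download only the first n rows and return (column_names, rows)."""
    table = frame.head(rows=n).as_data_frame(use_pandas=False)
    # Assumption: the first inner list holds the column names.
    header, rows = table[0], table[1:]
    return header, rows


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({"x": list(range(1000)), "y": list(range(1000, 2000))})

    header, rows = preview_rows(frame, n=3)
    print(header)
    for row in rows:
        print(row)  # values may arrive as strings when pandas is not used
```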

Indexing and slicing

H2OFrame supports NumPy-style indexing using frame[row_selector, col_selector].
# Single column by name
col = frame["age"]

# Multiple columns
subset = frame[["age", "salary"]]

# Row slice
first_100 = frame[:100, :]

# Row and column slice
block = frame[10:20, 0:3]

# Boolean row filter
filtered = frame[frame["salary"] > 60000, :]
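Boolean filters can be combined. Python's and/or keywords cannot be overloaded, so conditions are joined with & and |, each wrapped in parentheses. A minimal sketch follows; rows_between, demo, and the column names are hypothetical.

```python
def rows_between(frame, column, low, high):
    """Keep the rows where low <= frame[column] <= high."""
    mask = (frame[column] >= low) & (frame[column] <= high)
    return frame[mask, :]


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({
        "age": [25, 32, 47, 51],
        "salary": [48000, 64000, 81000, 59000],
    })

    mid_band = rows_between(frame, "salary", 50000, 70000)
    print(mid_band)

    # "|" expresses OR in the same way.
    edges = frame[(frame["age"] < 30) | (frame["age"] > 50), :]
    print(edges)
```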

Column operations

cbind

combined = frame1.cbind(frame2)
Append the columns of frame2 to frame1. Both frames must have the same number of rows.

rbind

stacked = frame1.rbind(frame2)
Append the rows of frame2 below frame1. Both frames must have the same column structure.

merge

result = frame1.merge(frame2, by_x=None, by_y=None, all_x=False, all_y=False)
Merge (join) two H2OFrames.
by_x
string | string[]
Column name(s) in frame1 to join on.
by_y
string | string[]
Column name(s) in frame2 to join on. Defaults to by_x.
all_x
boolean
default:"False"
Perform a left outer join when True.
all_y
boolean
default:"False"
Perform a right outer join when True.
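A left outer join on a shared key might look like the sketch below. The helper left_join, the demo function, and the sample frames are illustrative. Integer keys are used so that the example does not depend on how string columns are handled.

```python
def left_join(left, right, key):
    """Keep every row of left and attach matching columns from right."""
    return left.merge(right, by_x=[key], by_y=[key], all_x=True)


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    employees = h2o.H2OFrame({"dept_id": [1, 2, 2, 3], "salary": [50, 60, 65, 70]})
    departments = h2o.H2OFrame({"dept_id": [1, 2], "floor": [4, 7]})

    # dept_id 3 has no match in departments, so its floor value is missing.
    joined = left_join(employees, departments, "dept_id")
    print(joined)
```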

Statistical operations

mean

frame.mean(skipna=True, axis=0)
Compute the column-wise (or row-wise when axis=1) mean.

var

frame.var(y=None, na_rm=True, use="everything", weights_column=None)
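Compute the variance of a single column, or the variance-covariance matrix of the frame's columns. When y is given, the covariance is computed between the columns of this frame and the columns of y.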

sd

frame.sd(na_rm=True)
Compute the standard deviation of each column.

cor

frame.cor(y=None, na_rm=True, use="everything", weights_column=None)
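Compute the correlation matrix of the frame's columns. When y is given, the correlation is computed between the columns of this frame and the columns of y.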
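The column-wise statistics can be gathered into a per-column profile, as sketched below. This assumes that frame.mean() and frame.sd() each return a list with one value per column, in column order; the helper numeric_profile, the demo function, and the sample data are hypothetical.

```python
def numeric_profile(frame):
    """Map each column name to its mean and standard deviation."""
    # Assumption: both calls return one value per column, in column order.
    means = frame.mean(skipna=True)
    sds = frame.sd(na_rm=True)
    return {
        name: {"mean": mean, "sd": sd}
        for name, mean, sd in zip(frame.columns, means, sds)
    }


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({
        "height": [1.62, 1.75, 1.80, 1.68],
        "weight": [55.0, 72.5, 80.1, 63.3],
    })

    for name, stats in numeric_profile(frame).items():
        print(name, stats)
```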

Type casting

asfactor

frame["species"] = frame["species"].asfactor()
Convert a column to a categorical (factor/enum) type. Required for classification response columns.

asnumeric

frame["flag"] = frame["flag"].asnumeric()
Convert a column to a numeric type.

ascharacter

frame["code"] = frame["code"].ascharacter()
Convert a column to a string type.
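A common use of asfactor is preparing a 0/1 response column for classification. The sketch below converts a column only when frame.types does not already report it as "enum". The helper ensure_categorical, the demo function, and the column names are illustrative.

```python
def ensure_categorical(frame, column):
    """Convert frame[column] to a categorical column if it is not one already."""
    if frame.types[column] != "enum":
        frame[column] = frame[column].asfactor()
    return frame


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({
        "feature": [0.2, 1.4, 3.1, 0.7],
        "churned": [0, 1, 1, 0],
    })

    print(frame.types)  # "churned" is parsed as a numeric column
    ensure_categorical(frame, "churned")
    print(frame.types)  # "churned" should now be reported as "enum"
```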

Splitting frames

train, valid, test = frame.split_frame(ratios=[0.7, 0.15], seed=42)
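Split the frame into len(ratios) + 1 parts; the last part receives the remaining rows. Rows are assigned probabilistically, so the part sizes only approximate the requested ratios. Pass seed to make the split reproducible.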
ratios
float[]
Fractions for each split. Must sum to less than 1.0.
seed
integer
Random seed for reproducibility.
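Since the split is probabilistic, it can be useful to check the share of rows each part actually received. The sketch below validates the ratios and reports those shares. The helper split_with_report, the demo function, and the sample data are hypothetical.

```python
def split_with_report(frame, ratios=(0.7, 0.15), seed=42):
    """Split frame and return (parts, shares), where shares are observed row fractions."""
    if sum(ratios) >= 1.0:
        raise ValueError("ratios must sum to less than 1.0")
    parts = frame.split_frame(ratios=list(ratios), seed=seed)
    total = frame.nrows
    shares = [part.nrows / total for part in parts]
    return parts, shares


def demo():
    # Imported here so the helper above can be used without the h2o package.
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({"x": list(range(1000))})

    (train, valid, test), shares = split_with_report(frame)
    # Expect values near 0.70, 0.15, and 0.15, not exact matches.
    print(shares)
```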
