H2OFrame

H2OFrame is the core data container in H2O. Data is stored on the H2O cluster (which may be remote), and the Python object is a lightweight handle. Operations on an H2OFrame are executed lazily and distributed across the cluster.

from h2o.frame import H2OFrame
import h2o

H2OFrame is also accessible directly as h2o.H2OFrame for convenience.

Construction

From a Python object

H2OFrame(
    python_obj=None, destination_frame=None,
    header=0, separator=",",
    column_names=None, column_types=None,
    na_strings=None, skipped_columns=None,
    force_col_types=False
)

python_obj

list | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse

The source object to convert. Accepted types:

None — creates an empty frame
A flat list or tuple — creates a single-column frame
A {name: list} dictionary — creates a multi-column frame
A list of lists — rows of a rectangular table
A Pandas DataFrame or NumPy ndarray
A SciPy sparse matrix
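
Each plain-Python input shape listed above implies different frame dimensions. The sketch below does not call H2O; the helper name expected_shape is hypothetical. It computes the (nrows, ncols) each input would be expected to produce, assuming no row is consumed as a header:

```python
def expected_shape(obj):
    """Return the (nrows, ncols) implied by a plain-Python input.

    Hypothetical helper for illustration only; it mirrors the rules listed
    above and assumes every row of a list of lists is treated as data.
    """
    if isinstance(obj, dict):
        # {name: list}: one column per key, all columns the same length
        lengths = {len(col) for col in obj.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        return (lengths.pop() if lengths else 0, len(obj))
    if isinstance(obj, (list, tuple)):
        if obj and all(isinstance(row, (list, tuple)) for row in obj):
            # list of lists: each inner list is one row
            widths = {len(row) for row in obj}
            if len(widths) != 1:
                raise ValueError("rows must form a rectangular table")
            return (len(obj), widths.pop())
        return (len(obj), 1)  # flat list or tuple: a single column
    raise TypeError("this sketch only covers lists, tuples, and dicts")


print(expected_shape([1, 2, 2.5, -100.9, 0]))                 # (5, 1)
print(expected_shape({"age": [31, 45], "salary": [50, 72]}))  # (2, 2)
print(expected_shape([[1, "a"], [2, "b"], [3, "c"]]))         # (3, 2)
```

On a running cluster, the shape property of the corresponding H2OFrame should agree with these values.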

destination_frame

string

Key under which the frame is stored in H2O’s distributed key-value store. Auto-generated if not provided.

header

integer

default:"0"

Header detection when python_obj is a list of lists. -1 = first row is data, 1 = first row is header, 0 = guess.
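
The two explicit header settings can be illustrated as follows. This plain-Python sketch does not call H2O; the helper name split_header is hypothetical, and the default names C1, C2, ... are an assumption about how unnamed columns are labeled. It separates column names from data rows for header=1 and header=-1 and leaves header=0 (guess) to H2O:

```python
def split_header(rows, header):
    """Separate column names from data rows for header=1 or header=-1.

    Illustrative only: header=0 asks H2O to guess, which this sketch
    does not attempt to reproduce.
    """
    if header == 1:
        # first row supplies the column names, the rest is data
        return list(rows[0]), [list(r) for r in rows[1:]]
    if header == -1:
        # every row is data; assumed default names C1, C2, ...
        names = ["C" + str(i) for i in range(1, len(rows[0]) + 1)]
        return names, [list(r) for r in rows]
    raise ValueError("header=0 (guess) is decided by H2O, not by this sketch")


table = [["name", "age"], ["Ann", 31], ["Bob", 45]]
print(split_header(table, 1))
# (['name', 'age'], [['Ann', 31], ['Bob', 45]])
print(split_header(table, -1))
# (['C1', 'C2'], [['name', 'age'], ['Ann', 31], ['Bob', 45]])
```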

column_names

string[]

Explicit column names. Overrides any names derived from the source data.

column_types

string[] | object

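Explicit column types, given as a list with one entry per column or as a {column: type} dict. Valid values: "unknown", "uuid", "string", "float", "real", "double", "int", "long", "numeric", "categorical", "factor", "enum", "time". The values "float", "real", "double", "int", "long", and "numeric" are all handled as numeric; "categorical", "factor", and "enum" are all handled as categorical.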

na_strings

string[] | string[][] | object

Strings to interpret as missing values. Can be specified globally, per-column as a list-of-lists, or as a {column: list} dict.
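
The three accepted shapes of na_strings can be compared by expanding each one into a per-column mapping. This is a plain-Python sketch with a hypothetical helper name (na_strings_by_column); it shows what each shape means, not how H2O processes the argument internally:

```python
def na_strings_by_column(na_strings, column_names):
    """Expand any of the three na_strings shapes into {column: [strings]}.

    Hypothetical helper for illustration; H2O performs its own handling.
    """
    if na_strings is None:
        return {name: [] for name in column_names}
    if isinstance(na_strings, dict):
        # {column: list}: columns that are not mentioned get no NA strings
        return {name: list(na_strings.get(name, [])) for name in column_names}
    if na_strings and all(isinstance(item, (list, tuple)) for item in na_strings):
        # list of lists: one inner list per column, in column order
        if len(na_strings) != len(column_names):
            raise ValueError("need one list of NA strings per column")
        return {name: list(item) for name, item in zip(column_names, na_strings)}
    # flat list: the same strings apply to every column
    return {name: list(na_strings) for name in column_names}


cols = ["age", "city"]
print(na_strings_by_column(["NA", "?"], cols))
# {'age': ['NA', '?'], 'city': ['NA', '?']}
print(na_strings_by_column([["-1"], ["unknown"]], cols))
# {'age': ['-1'], 'city': ['unknown']}
print(na_strings_by_column({"city": ["unknown"]}, cols))
# {'age': [], 'city': ['unknown']}
```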

frame = h2o.H2OFrame([1, 2, 2.5, -100.9, 0])

From an imported file

frame = h2o.import_file("s3://bucket/data.csv")

See h2o.import_file and h2o.upload_file for file-based construction.

Properties

nrows / nrow

frame.nrows  # int

Number of rows in the frame.

ncols / ncol

frame.ncols  # int

Number of columns in the frame.

shape

frame.shape  # (nrows, ncols)

Tuple of (nrows, ncols).

columns / names

frame.columns   # list[str]
frame.names     # list[str]  (alias)

List of column names. Assigning to frame.columns renames all columns.

types

frame.types  # dict[str, str]

Dictionary mapping column names to their H2O type strings (e.g., "int", "real", "enum", "string", "time").

frame_id

frame.frame_id  # str

Internal key identifying this frame on the H2O cluster.

Inspection

head / tail

frame.head(rows=10, cols=200)
frame.tail(rows=10, cols=200)

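Return the first (head) or last (tail) rows of the frame as a new H2OFrame. The rows argument sets how many rows to return, and cols caps the number of columns.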

describe

frame.describe()

Print a summary of column types and the first ten rows.

summary

frame.summary()

Print per-column summary statistics, such as min, mean, max, standard deviation, and missing-value counts.

as_data_frame

frame.as_data_frame(use_pandas=True)

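Convert the H2OFrame to a local Pandas DataFrame (or a Python list of lists when use_pandas=False). The whole frame is downloaded from the cluster into client memory, so use this only on data that fits locally.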

use_pandas

boolean

default:"True"

Return a Pandas DataFrame when True; return a list of lists otherwise.

df = frame.as_data_frame()
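
When use_pandas=False, the list-of-lists result is often easier to work with as one record per row. The sketch below uses a hypothetical helper (rows_to_records) and made-up sample data. It assumes the first inner list holds the column names, which should be checked against the installed H2O version:

```python
def rows_to_records(table):
    """Turn [[names...], [row...], ...] into a list of {name: value} dicts."""
    if not table:
        return []
    names, *rows = table
    return [dict(zip(names, row)) for row in rows]


# Made-up sample shaped like the assumed output of
# frame.as_data_frame(use_pandas=False); values may arrive as strings.
sample = [["age", "salary"], ["31", "50000"], ["45", "72000"]]
records = rows_to_records(sample)
print(records[0])    # {'age': '31', 'salary': '50000'}
print(len(records))  # 2
```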

Indexing and slicing

H2OFrame supports NumPy-style indexing using frame[row_selector, col_selector].

# Single column by name
col = frame["age"]

# Multiple columns
subset = frame[["age", "salary"]]

# Row slice
first_100 = frame[:100, :]

# Row and column slice
block = frame[10:20, 0:3]

# Boolean row filter
filtered = frame[frame["salary"] > 60000, :]

Column operations

cbind

combined = frame1.cbind(frame2)

Append the columns of frame2 to frame1. Both frames must have the same number of rows.

rbind

stacked = frame1.rbind(frame2)

Append the rows of frame2 below frame1. Both frames must have the same column structure.

merge

result = frame1.merge(frame2, by_x=None, by_y=None, all_x=False, all_y=False)

Merge (join) two H2OFrames. With the defaults all_x=False and all_y=False, only rows whose keys match in both frames are kept.

by_x

string | string[]

Column name(s) in frame1 to join on. When by_x and by_y are both omitted, the join uses the column names the two frames have in common.

by_y

string | string[]

Column name(s) in frame2 to join on. Should correspond one-to-one with by_x.

all_x

boolean

default:"False"

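Perform a left outer join when True: every row of frame1 is kept, with missing values where frame2 has no match.

all_y

boolean

default:"False"

Perform a right outer join when True: every row of frame2 is kept, with missing values where frame1 has no match.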
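
The effect of all_x can be seen with a plain-Python analogy on lists of dicts. This is not H2O's implementation; the function merge_rows and the sample data are hypothetical, and None stands in for a missing value:

```python
def merge_rows(left, right, key, all_x=False):
    """Join two lists of dicts on `key`; keep unmatched left rows when all_x is True."""
    # index the right-hand rows by key value for quick lookup
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)
    right_cols = sorted({c for row in right for c in row if c != key})

    result = []
    for row in left:
        matches = right_index.get(row[key], [])
        for match in matches:
            result.append({**row, **match})
        if not matches and all_x:
            # left outer join: pad the right-hand columns with None (missing)
            result.append({**row, **{c: None for c in right_cols}})
    return result


people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
salaries = [{"id": 1, "salary": 50000}]

print(merge_rows(people, salaries, "id"))
# [{'id': 1, 'name': 'Ann', 'salary': 50000}]
print(merge_rows(people, salaries, "id", all_x=True))
# [{'id': 1, 'name': 'Ann', 'salary': 50000}, {'id': 2, 'name': 'Bob', 'salary': None}]
```

With H2OFrames, the corresponding call would be frame1.merge(frame2, by_x=["id"], by_y=["id"], all_x=True), where "id" is a placeholder column name.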

Statistical operations

mean

frame.mean(skipna=True, axis=0)

Compute the column-wise (or row-wise when axis=1) mean.

var

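frame.var(y=None, na_rm=True, use="everything")

Compute the variance of a single-column frame, or the variance-covariance matrix of the frame's columns. When y is given, compute the covariances between this frame's columns and those of y.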

sd

frame.sd(na_rm=True)

Compute the standard deviation of each column.

cor

frame.cor(y=None, na_rm=True, use="everything")

Compute the correlation between columns.
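
As a reminder of what these four statistics measure, here is a plain-Python sketch on two small made-up columns. The functions are local stand-ins that happen to share the method names; they do not call H2O. The n - 1 (sample) denominator is an assumption about the convention used:

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def var(xs, ys=None):
    """Sample variance of xs, or sample covariance of xs and ys (n - 1 denominator)."""
    ys = xs if ys is None else ys
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def sd(xs):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(var(xs))


def cor(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return var(xs, ys) / (sd(xs) * sd(ys))


age = [20.0, 30.0, 40.0]
salary = [40000.0, 50000.0, 90000.0]

print(mean(age))  # 30.0
print(var(age))   # 100.0
print(sd(age))    # 10.0
print(round(cor(age, salary), 4))
```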

Type casting

asfactor

frame["species"] = frame["species"].asfactor()

Convert a column to a categorical (factor/enum) type. Required for classification response columns.

asnumeric

frame["flag"] = frame["flag"].asnumeric()

Convert a column to a numeric type.

ascharacter

frame["code"] = frame["code"].ascharacter()

Convert a column to a string type.

Splitting frames

train, valid, test = frame.split_frame(ratios=[0.7, 0.15], seed=42)

Split the frame into len(ratios) + 1 parts. Each entry in ratios sets the fraction of rows for one part, and the last part receives the remaining rows. Pass a seed to make the split reproducible.

ratios

float[]

Fractions for each split. Must sum to less than 1.0.

seed

integer

Random seed for reproducibility.
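
The relationship between ratios and the number of returned parts can be worked through in plain Python. The helper split_fractions is hypothetical and does not call H2O; it only reproduces the arithmetic described above:

```python
def split_fractions(ratios):
    """Return the fraction of rows each part targets, including the remainder."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive fractions")
    total = sum(ratios)
    if total >= 1.0:
        raise ValueError("ratios must sum to less than 1.0")
    # the final part takes whatever the explicit ratios leave over
    return list(ratios) + [1.0 - total]


parts = split_fractions([0.7, 0.15])
print(len(parts))                    # 3
print([round(p, 2) for p in parts])  # [0.7, 0.15, 0.15]
```

So ratios=[0.7, 0.15] targets roughly 70% of rows for train and 15% each for valid and test. Actual row counts may differ slightly from these targets if rows are assigned to parts at random.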
