H2O-3 provides a rich set of data manipulation operations that run distributed across the cluster. Most operations are lazy — they build an expression tree that is evaluated only when a result is needed.
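The lazy model can be pictured with a toy expression tree in plain Python. This is a conceptual sketch of deferred evaluation, not H2O's actual Rapids implementation:

```python
# Conceptual sketch of lazy evaluation: each operation builds a tree node,
# and nothing is computed until eval() is called on the root.
class Expr:
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def __add__(self, other):
        return Expr("+", self, other)   # builds a node; computes nothing yet

    def eval(self):
        if self.op == "const":
            return self.children[0]
        left, right = (c.eval() for c in self.children)
        return left + right

def const(v):
    return Expr("const", v)

tree = const(2) + const(3) + const(4)   # no arithmetic has happened yet
print(tree.eval())                      # 9, evaluated only on demand
```

H2O frames behave analogously: slicing, arithmetic, and filtering accumulate as an expression, and the cluster materializes a result only when one is requested (for example, when printing or training a model).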

Importing data

import_file vs. upload_file

import_file

A parallelized server-side read. The path is resolved by the H2O cluster, not the client. Use this for production workloads — it is fast, scalable, and does not route data through the client.

upload_file

A client-to-server push. The path is resolved on the machine running Python or R. Use this only for small local files during development.
import h2o
h2o.init()

# Import from S3 (server-side read)
airlines = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
)

# Import from HDFS (include the node name)
df = h2o.import_file("hdfs://node-1:/user/smalldata/airlines/allyears2k_headers.zip")

# Upload a local file (client-side push — small files only)
iris = h2o.upload_file("../smalldata/iris/iris_wheader.csv")
When parsing files that contain timestamps without a timezone, H2O-3 interprets them as UTC. To override, set the cluster timezone before parsing:
  • Python: h2o.cluster().timezone = "America/Los_Angeles"
  • R: h2o.setTimezone("America/Los_Angeles")
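The effect of that default can be illustrated with the standard library alone. This is a sketch of the semantics using datetime and zoneinfo, not H2O code:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timestamp string with no timezone information, as it might appear in a CSV.
raw = "2020-06-01 12:00:00"
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# H2O's default: treat the naive timestamp as UTC.
as_utc = parsed.replace(tzinfo=timezone.utc)

# After overriding the cluster timezone, the same string would instead be
# read as local time in that zone (here, Pacific Daylight Time, UTC-7).
as_la = parsed.replace(tzinfo=ZoneInfo("America/Los_Angeles"))

print(as_utc.isoformat())  # 2020-06-01T12:00:00+00:00
print(as_la.isoformat())   # 2020-06-01T12:00:00-07:00
```

The wall-clock digits are identical; only the instant they denote changes, which is why the setting must be applied before the file is parsed.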

Supported file formats

H2O-3 can parse all of the following formats:
  • CSV / TSV: auto-detects delimiter, quoting, and header
  • Parquet: columnar; preferred for large datasets
  • ORC: columnar; common in Hadoop ecosystems
  • Avro: schema-embedded binary format
  • ARFF: Weka attribute-relation format
  • XLS / XLSX: Excel spreadsheets
  • JSON: newline-delimited JSON records
  • SVMLight: sparse feature format

Importing from a SQL table

# Python — import from a relational database
connection_url = "jdbc:mysql://localhost:3306/mydb?useSSL=false"
username = "user"
password = "pass"

df = h2o.import_sql_table(
    connection_url=connection_url,
    table="customers",
    username=username,
    password=password
)
# R — import from a relational database
df <- h2o.import_sql_table(
  connection_url = "jdbc:mysql://localhost:3306/mydb?useSSL=false",
  table = "customers",
  username = "user",
  password = "pass"
)
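The connection URL is an ordinary JDBC string. If you assemble it from parts, plain string formatting is enough; the helper below is hypothetical, not part of the h2o API:

```python
def mysql_jdbc_url(host, port, database, **params):
    """Assemble a JDBC MySQL connection URL (hypothetical helper)."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    if params:
        # Query parameters follow a single "?" and are joined with "&".
        query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
        url += "?" + query
    return url

print(mysql_jdbc_url("localhost", 3306, "mydb", useSSL="false"))
# jdbc:mysql://localhost:3306/mydb?useSSL=false
```

Note that the first parameter follows "?" directly; a stray "&" immediately after "?" is a common copy-paste artifact.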

Slicing rows and columns

H2O-3 uses lazy slicing — the slice is only materialized when results are needed.

Slicing rows

import h2o
h2o.init()

path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# Single row by index
row = df[15, :]

# Range of rows
rows = df[range(25, 50, 1), :]

# Boolean mask — rows where sepal_len < 4.6
mask = df["sepal_len"] < 4.6
filtered = df[mask, :]

# Filter out rows with missing values
mask = df["sepal_len"].isna()
no_missing = df[~mask, :]

Slicing columns

# By index
c1 = df[:, 0]

# By name
c1 = df[:, "sepal_len"]

# By range of indexes
cols = df[:, range(3)]

# By list of names
cols = df[:, ["sepal_wid", "petal_len", "petal_wid"]]

Merging and joining frames

Use merge to join two frames on a common column. By default all shared column names are used as the merge key. In a multi-node cluster, one of the frames must be small enough to fit in memory on every node.
import h2o
import numpy as np
h2o.init()

df1 = h2o.H2OFrame.from_python({
    "A": ["Hello", "World", "Welcome", "To", "H2O", "World"],
    "n": [0, 1, 2, 3, 4, 5]
})

df2 = h2o.H2OFrame.from_python(
    [[x] for x in np.random.randint(0, 10, size=20).tolist()],
    column_names=["n"]
)

# Inner join (default) — only rows with matching keys in both frames
df3 = df2.merge(df1)

# Left join — all rows from df2, NaN for non-matching keys
df4 = df2.merge(df1, all_x=True)
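The inner vs. left join semantics can be sketched in pure Python. This is a conceptual illustration of the row matching, not the distributed implementation:

```python
# Conceptual sketch: merge rows of `left` with rows of `right` on key "n".
left  = [{"n": 0, "x": 10}, {"n": 1, "x": 11}, {"n": 7, "x": 17}]
right = [{"n": 0, "A": "Hello"}, {"n": 1, "A": "World"}]
lookup = {row["n"]: row["A"] for row in right}

# Inner join: keep only left rows whose key also appears in right.
inner = [{**row, "A": lookup[row["n"]]} for row in left if row["n"] in lookup]

# Left join (all_x=True): keep every left row; unmatched keys get a missing value.
left_join = [{**row, "A": lookup.get(row["n"])} for row in left]

print(len(inner))      # 2
print(left_join[2])    # {'n': 7, 'x': 17, 'A': None}
```

The memory constraint above follows from this picture: one side of the join is turned into a lookup structure that must be replicated on every node.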

Group-by operations

group_by splits a frame into groups by one or more columns, applies an aggregation function, and returns a new frame. Results are sorted by the natural group-by column sort. Available aggregations: count, sum, mean, min, max, sd, var, ss, mode.
import h2o
h2o.init()

air = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k.zip"
)

# Count flights by origin airport
origin_flights = air.group_by("Origin").count().get_frame()

# Count flights per origin and month
flights_by_origin_month = (
    air.group_by(by=["Origin", "Month"])
       .count(na="all")
       .get_frame()
)

# Sum cancellations per month
cancellations = (
    air.group_by(by="Month")
       .sum("Cancelled", na="all")
       .get_frame()
)

# Multiple aggregations in one call
summary = (
    air[["Origin", "Dest", "IsArrDelayed", "IsDepDelayed"]]
       .group_by(by="Origin")
       .sum(["Dest", "IsArrDelayed", "IsDepDelayed"], na="ignore")
       .get_frame()
)
NA handling (na parameter):
  • "all" (default) — NA values propagate to the result.
  • "ignore" — NAs excluded from calculation; denominator is the full row count.
  • "rm" — NAs skipped; denominator is the non-NA row count.
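The difference between "ignore" and "rm" is the denominator used when averaging. A pure-Python sketch of a group mean under each mode, following the semantics described above:

```python
values = [2.0, 4.0, None, 6.0]          # one NA among four rows
present = [v for v in values if v is not None]

mean_all    = None                           # "all": any NA propagates
mean_ignore = sum(present) / len(values)     # "ignore": denominator = all rows
mean_rm     = sum(present) / len(present)    # "rm": denominator = non-NA rows

print(mean_ignore)  # 3.0
print(mean_rm)      # 4.0
```

With even one NA in a group, the three modes can give three different answers, so choose the mode deliberately rather than relying on the default.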
A GroupBy object can only be used once. To apply different aggregations, create a new group_by call.

Handling missing values

fillna — sequential fill

Fill NA values forward or backward along rows or columns, up to a maximum run length.
import h2o
h2o.init()

df = h2o.create_frame(rows=10, cols=3,
                      real_fraction=1.0, real_range=100,
                      missing_fraction=0.2, seed=123)

# Forward-fill row-wise (axis=0), one consecutive NA at a time
filled = df.fillna(method="forward", axis=0, maxlen=1)

# Backward-fill column-wise (axis=1), up to 3 consecutive NAs
filled = df.fillna(method="backward", axis=1, maxlen=3)
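The maxlen behavior (fill at most N consecutive NAs per run) can be sketched on a plain Python list. This is a conceptual model of the forward-fill rule, not H2O's implementation:

```python
def forward_fill(values, maxlen):
    """Carry the last seen value forward, filling at most `maxlen`
    consecutive missing entries per run (conceptual sketch)."""
    out, last, run = [], None, 0
    for v in values:
        if v is None and last is not None and run < maxlen:
            out.append(last)       # fill from the last observed value
            run += 1
        else:
            out.append(v)          # keep the value (or leave the NA unfilled)
            if v is not None:
                last, run = v, 0   # a real value resets the fill budget
    return out

print(forward_fill([1, None, None, 2, None], maxlen=1))
# [1, 1, None, 2, 2]
```

With maxlen=1, only the first NA after each observed value is filled; the second NA in a run stays missing until a new value resets the budget.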

impute — aggregate-based imputation

Replace NAs with the column mean, median, or mode — optionally computed within groups.
import h2o
h2o.init()

air = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k.zip"
)

# Mean-impute DepTime using Origin and Distance as grouping columns
air.impute("DepTime", method="mean", by=["Origin", "Distance"])

# Mode-impute a categorical column
air.impute("TailNum", method="mode")

# Mode-impute grouped by Month and Year
air.impute("TailNum", method="mode", by=["Month", "Year"])
impute() modifies the frame in place. Re-import the data to revert.
Available imputation methods:
  • mean (numeric): replace NAs with the column mean
  • median (numeric): replace NAs with the column median
  • mode (categorical enum/factor): replace NAs with the most frequent value
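Grouped imputation replaces each NA with the aggregate computed within that row's group rather than the global aggregate. A pure-Python sketch of group-wise mean imputation:

```python
from collections import defaultdict

# Conceptual sketch: NAs in "val" are replaced by the mean of "val"
# within the row's group, not by the global mean.
rows = [
    {"grp": "a", "val": 1.0},
    {"grp": "a", "val": 3.0},
    {"grp": "a", "val": None},
    {"grp": "b", "val": 10.0},
    {"grp": "b", "val": None},
]

sums, counts = defaultdict(float), defaultdict(int)
for r in rows:
    if r["val"] is not None:
        sums[r["grp"]] += r["val"]
        counts[r["grp"]] += 1
group_mean = {g: sums[g] / counts[g] for g in sums}

for r in rows:
    if r["val"] is None:
        r["val"] = group_mean[r["grp"]]   # in place, like h2o's impute()

print(rows[2]["val"])  # 2.0  (mean of group "a")
print(rows[4]["val"])  # 10.0 (mean of group "b")
```

Mutating `rows` in place mirrors the in-place behavior noted above: once imputed, the original missing values are gone.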

Type conversion

H2O-3 algorithms treat enum/factor columns as categorical (classification) and numeric columns as continuous (regression). Converting types is a common preprocessing step.
import h2o
h2o.init()

cars = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
)

# Numeric → factor (categorical)
cars["cylinders"] = cars["cylinders"].asfactor()
print(cars["cylinders"].isfactor())  # [True]

# Factor → numeric
cars["cylinders"] = cars["cylinders"].asnumeric()
print(cars["cylinders"].isnumeric())  # [True]

# Convert multiple columns at once
cars[["cylinders", "economy_20mpg"]] = (
    cars[["cylinders", "economy_20mpg"]].asfactor()
)

# enum → numeric: go via character to preserve mapped values
cars["name"] = cars["name"].ascharacter().asnumeric()

# Parse date strings
hdf = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/jira/v-11-eurodate.csv"
)
hdf["ds5"] = hdf["ds5"].as_date("%d.%m.%y %H:%M")

# Extract date parts
hdf["ds3"].year()
hdf["ds3"].month()
hdf["ds3"].dayOfWeek()
hdf["ds3"].hour()
When converting an enum (factor) column to numeric, always go through ascharacter() / as.character() first. Direct conversion maps to underlying factor integer codes, not the original values.
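The pitfall can be reproduced with a plain Python model of a factor column, stored as integer codes plus a level table:

```python
# A factor column stores integer codes plus a table of levels.
levels = ["4", "6", "8"]    # the original string values
codes  = [2, 0, 1, 0]       # per-row indices into `levels`

# Direct "numeric" conversion yields the underlying codes, not the values.
direct = codes
print(direct)               # [2, 0, 1, 0]  -- wrong values

# Going through character first recovers the mapped values, then parses them.
via_character = [float(levels[c]) for c in codes]
print(via_character)        # [8.0, 4.0, 6.0, 4.0]
```

This is why the cylinders-style conversion must route through ascharacter(): the character step materializes the level strings before the numeric parse.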
