Import, slice, filter, join, aggregate, and transform data in H2O-3.
H2O-3 provides a rich set of data manipulation operations that run distributed across the cluster. Most operations are lazy — they build an expression tree that is evaluated only when a result is needed.
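The lazy-evaluation model can be sketched in plain Python (a toy illustration of the idea, not H2O-3's actual implementation): each operation appends a node to an expression tree, and no computation happens until a concrete result is requested.

```python
# Toy sketch of lazy evaluation: build an expression tree, evaluate on demand.
class Expr:
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def __add__(self, other):
        return Expr("+", self, other)   # no work happens here

    def __mul__(self, other):
        return Expr("*", self, other)   # just records the operation

    def evaluate(self):
        """Walk the tree only when a concrete result is needed."""
        if self.op == "lit":
            return self.children[0]
        left, right = (c.evaluate() for c in self.children)
        return left + right if self.op == "+" else left * right

def lit(v):
    return Expr("lit", v)

# Building the expression is cheap; evaluate() does the actual computation.
tree = (lit(2) + lit(3)) * lit(4)
print(tree.evaluate())  # 20
```

H2O-3 applies the same principle at cluster scale: chained frame operations are recorded and only materialized when you ask for data back.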
import_file
A parallelized server-side read. The path is resolved by the H2O cluster, not the client. Use this for production workloads — it is fast, scalable, and does not route data through the client.
upload_file
A client-to-server push. The path is resolved on the machine running Python or R. Use this only for small local files during development.
Python
R
import h2o

h2o.init()

# Import from S3 (server-side read)
airlines = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
)

# Import from HDFS (include the node name)
df = h2o.import_file("hdfs://node-1:/user/smalldata/airlines/allyears2k_headers.zip")

# Upload a local file (client-side push — small files only)
iris = h2o.upload_file("../smalldata/iris/iris_wheader.csv")
library(h2o)
h2o.init()

# Import from a local path bundled with the H2O package
iris_path <- system.file("extdata", "iris.csv", package = "h2o")
iris <- h2o.importFile(path = iris_path)

# Import from S3
airlines <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k.zip"
)

# Import from HDFS
df <- h2o.importFile("hdfs://node-1:/user/smalldata/airlines/allyears2k_headers.zip")

# Upload a local file
iris <- h2o.uploadFile(path = "../smalldata/iris/iris_wheader.csv")
When parsing files that contain timestamps without a timezone, H2O-3 interprets them as UTC. To override: Python — h2o.cluster().timezone = "America/Los_Angeles" / R — h2o.setTimezone("America/Los_Angeles").
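The effect of treating a zone-less timestamp as UTC can be shown with the Python standard library (plain `datetime`, not the H2O-3 parser; the -8 hour offset stands in for America/Los_Angeles in winter):

```python
from datetime import datetime, timezone, timedelta

# A timestamp string with no timezone information.
raw = "2024-03-15 12:00:00"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Interpreting it as UTC, as H2O-3 does by default:
as_utc = naive.replace(tzinfo=timezone.utc)

# The same instant viewed from UTC-8:
pacific = as_utc.astimezone(timezone(timedelta(hours=-8)))

print(as_utc.isoformat())   # 2024-03-15T12:00:00+00:00
print(pacific.isoformat())  # 2024-03-15T04:00:00-08:00
```

The wall-clock hour shifts from 12 to 4, which is why overriding the cluster timezone matters when your source timestamps were recorded in local time.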
# By index
c1 = df[:, 0]

# By name
c1 = df[:, "sepal_len"]

# By range of indexes
cols = df[:, range(3)]

# By list of names
cols = df[:, ["sepal_wid", "petal_len", "petal_wid"]]
Use merge to join two frames on a common column. By default all shared column names are used as the merge key. In a multi-node cluster, one of the frames must be small enough to fit in memory on every node.
Python
R
import h2o
import numpy as np

h2o.init()

df1 = h2o.H2OFrame.from_python({
    "A": ["Hello", "World", "Welcome", "To", "H2O", "World"],
    "n": [0, 1, 2, 3, 4, 5]
})
df2 = h2o.H2OFrame.from_python(
    [[x] for x in np.random.randint(0, 10, size=20).tolist()],
    column_names=["n"]
)

# Inner join (default) — only rows with matching keys in both frames
df3 = df2.merge(df1)

# Left join — all rows from df2, NaN for non-matching keys
df4 = df2.merge(df1, all_x=True)
library(h2o)
h2o.init()

left <- data.frame(
  fruit = c("apple", "orange", "banana", "lemon", "strawberry", "blueberry"),
  color = c("red", "orange", "yellow", "yellow", "red", "blue")
)
right <- data.frame(
  fruit = c("apple", "orange", "banana", "lemon", "strawberry", "watermelon"),
  citrus = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)
left_frame <- as.h2o(left)
right_frame <- as.h2o(right)

# Left join — all rows from left_frame, NA for non-matching keys in right_frame
merged <- h2o.merge(left_frame, right_frame, all.x = TRUE)
print(merged)
#        fruit  color citrus
# 1  blueberry   blue   <NA>
# 2      apple    red  FALSE
# ...
In multi-node clusters, one frame must fit in memory on every node for the merge to work correctly.
group_by splits a frame into groups by one or more columns, applies an aggregation function to each group, and returns a new frame. Results are sorted by the natural order of the group-by columns.

Available aggregations: count, sum, mean, min, max, sd, var, ss, mode.
Python
R
import h2o

h2o.init()

air = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k.zip"
)

# Count flights by origin airport
origin_flights = air.group_by("Origin").count().get_frame()

# Count flights per origin and month
flights_by_origin_month = (
    air.group_by(by=["Origin", "Month"])
    .count(na="all")
    .get_frame()
)

# Sum cancellations per month
cancellations = (
    air.group_by(by="Month")
    .sum("Cancelled", na="all")
    .get_frame()
)

# Multiple aggregations in one call
summary = (
    air[["Origin", "Dest", "IsArrDelayed", "IsDepDelayed"]]
    .group_by(by="Origin")
    .sum(["Dest", "IsArrDelayed", "IsDepDelayed"], na="ignore")
    .get_frame()
)
NA handling (na parameter):
"all" (default) — NA values propagate to the result.
"ignore" — NAs excluded from calculation; denominator is the full row count.
"rm" — NAs skipped; denominator is the non-NA row count.
library(h2o)
h2o.init()

airlines <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k.zip"
)

# Count flights by origin airport
origin_flights <- h2o.group_by(
  data = airlines, by = "Origin", nrow("Origin"),
  gb.control = list(na.methods = "rm")
)

# Count flights per month
flights_by_month <- h2o.group_by(
  data = airlines, by = "Month", nrow("Month"),
  gb.control = list(na.methods = "rm")
)

# Sum cancellations per month
cancellations_by_month <- h2o.group_by(
  data = airlines, by = "Month", sum("Cancelled"),
  gb.control = list(na.methods = "rm")
)

# Cancellation rate per month
rate <- cancellations_by_month$sum_Cancelled / flights_by_month$nrow
rates_table <- h2o.cbind(flights_by_month$Month, rate)
A GroupBy object can only be used once. To apply different aggregations, create a new group_by call.
H2O-3 algorithms treat enum/factor columns as categorical (classification) and numeric columns as continuous (regression). Converting types is a common preprocessing step.
Python
R
import h2o

h2o.init()

cars = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
)

# Numeric → factor (categorical)
cars["cylinders"] = cars["cylinders"].asfactor()
print(cars["cylinders"].isfactor())  # [True]

# Factor → numeric
cars["cylinders"] = cars["cylinders"].asnumeric()
print(cars["cylinders"].isnumeric())  # [True]

# Convert multiple columns at once
cars[["cylinders", "economy_20mpg"]] = (
    cars[["cylinders", "economy_20mpg"]].asfactor()
)

# enum → numeric: go via character to preserve mapped values
cars["name"] = cars["name"].ascharacter().asnumeric()

# Parse date strings
hdf = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/jira/v-11-eurodate.csv"
)
hdf["ds5"].as_date("%d.%m.%y %H:%M")

# Extract date parts
hdf["ds3"].year()
hdf["ds3"].month()
hdf["ds3"].dayOfWeek()
hdf["ds3"].hour()
library(h2o)
h2o.init()

cars <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
)

# Numeric → factor (categorical)
cars["cylinders"] <- as.factor(cars["cylinders"])
print(h2o.isfactor(cars["cylinders"]))  # TRUE

# Factor → numeric
cars["cylinders"] <- as.numeric(cars["cylinders"])
print(h2o.isnumeric(cars["cylinders"]))  # TRUE

# Convert multiple columns
cars[c("cylinders", "economy_20mpg")] <- as.factor(
  cars[c("cylinders", "economy_20mpg")]
)

# enum → numeric: go via character to preserve values
cars["name"] <- as.character(cars["name"])
cars["name"] <- as.numeric(cars["name"])

# Parse date strings
hdf <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/jira/v-11-eurodate.csv"
)
h2o.as_date(hdf["ds5"], c("%d.%m.%y %H:%M"))

# Extract date parts
h2o.year(hdf["ds3"])
h2o.month(hdf["ds3"])
h2o.dayOfWeek(hdf["ds3"])
h2o.hour(hdf["ds3"])
When converting an enum (factor) column to numeric, always go through ascharacter() / as.character() first. Direct conversion maps to underlying factor integer codes, not the original values.
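The pitfall can be sketched with a toy factor representation in plain Python (levels plus per-row integer codes; this mimics the idea, not H2O-3's internal storage): direct conversion yields the internal codes, while round-tripping through strings recovers the original values.

```python
# Toy model of a factor column: levels plus per-row integer codes.
levels = ["10", "20", "30"]   # the original (numeric-looking) labels
codes = [2, 0, 1, 0]          # what is stored internally per row

# Direct "as numeric": you get the factor codes, not the labels.
direct = list(codes)                               # [2, 0, 1, 0]

# Via character first: map codes back to labels, then parse as numbers.
via_character = [float(levels[c]) for c in codes]  # [30.0, 10.0, 20.0, 10.0]

print(direct)
print(via_character)
```

Only the second result matches the values a user actually saw in the column, which is why the character round-trip is the recommended path.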