The arrow package provides a dplyr backend that allows you to manipulate Arrow data using familiar tidyverse syntax. This works with both in-memory Table objects and on-disk Dataset objects, enabling analysis of data that’s larger than memory.
Getting Started
Load both arrow and dplyr to enable the integration:
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
# Create an Arrow Table
sw <- arrow_table(starwars)
One-Table Verbs
Arrow supports most single-table dplyr verbs:
result <- sw |>
  filter(homeworld == "Tatooine") |>
  rename(height_cm = height, mass_kg = mass) |>
  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) |>
  arrange(desc(birth_year)) |>
  select(name, height_in, mass_lbs)
Lazy Evaluation
Arrow uses lazy evaluation: computations aren’t performed until explicitly requested.
result
#> Table (query)
#> name: string
#> height_in: double
#> mass_lbs: double
#>
#> See $.data for the source Arrow object
The result is an arrow_dplyr_query object representing the computation, not the actual data.
Executing Queries
Use compute() or collect() to materialize results:
# Returns an Arrow Table
compute(result)
# Returns an R data frame
collect(result)
#> name height_in mass_lbs
#> 1 C-3PO 65.35433 165.3467
#> 2 R5-D4 38.18898 NA
#> 3 Biggs Darklighter 72.04724 185.1876
Use compute() to keep data in Arrow format for further Arrow/dplyr operations. Use collect() when you need an R data frame for base R functions or visualization.
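As a quick sketch of the difference (reusing the sw table from above): compute() keeps the result as an Arrow Table that further dplyr verbs can query lazily, while collect() materializes an R data frame.

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

sw <- arrow_table(starwars)
tall <- sw |> filter(height > 200)

tbl <- compute(tall) # stays in Arrow; subsequent dplyr verbs remain lazy
df <- collect(tall)  # materializes as an R data frame (tibble)

class(tbl)[1]              # "Table"
inherits(df, "data.frame") # TRUE
```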
Aggregation and Grouping
Arrow supports group_by() and summarize() for aggregations:
sw |>
  group_by(species) |>
  summarize(mean_height = mean(height, na.rm = TRUE)) |>
  collect()
#> # A tibble: 38 × 2
#> species mean_height
#> <chr> <dbl>
#> 1 Human 177.
#> 2 Droid 140.
#> 3 Wookiee 231
#> 4 Rodian 173
#> 5 Hutt 175
Convenience Functions
Common dplyr helpers work as expected:
sw |>
  count(gender) |>
  collect()
#> # A tibble: 3 × 2
#> gender n
#> <chr> <int>
#> 1 masculine 66
#> 2 feminine 17
#> 3 NA 4
Window functions like ntile() are not yet supported in Arrow queries. These require collecting data to R first.
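For example, one workaround (a sketch, assuming the sw table from above) is to reduce the data in Arrow, collect(), and then apply ntile() in R:

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

sw <- arrow_table(starwars)

# ntile() has no Arrow translation, so materialize the reduced data first
sw |>
  filter(!is.na(height)) |>
  select(name, height) |>
  collect() |>
  mutate(height_quartile = ntile(height, 4)) |>
  head()
```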
Two-Table Verbs (Joins)
Arrow supports equality joins:
jedi <- data.frame(
  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
  jedi = c(FALSE, TRUE, TRUE)
)

sw |>
  select(name, height, mass) |>
  right_join(jedi, by = "name") |>
  collect()
#> name height mass jedi
#> 1 C-3PO 167 75 FALSE
#> 2 Luke Skywalker 172 77 TRUE
#> 3 Obi-Wan Kenobi 182 77 TRUE
Supported join types:
left_join()
right_join()
inner_join()
full_join()
semi_join()
anti_join()
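The filtering joins work the same way. A brief sketch (redefining a small jedi lookup table for self-containment): semi_join() keeps only matching rows without adding columns, and anti_join() keeps the non-matching rows.

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

sw <- arrow_table(starwars)
jedi <- data.frame(name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"))

# Keep only characters that appear in jedi (no jedi columns are added)
sw |>
  semi_join(jedi, by = "name") |>
  select(name, height) |>
  collect()

# Keep characters that do NOT appear in jedi
sw |>
  anti_join(jedi, by = "name") |>
  collect() |>
  nrow()
#> [1] 84
```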
Supported Functions
Arrow provides support for many R functions within dplyr verbs:
String Functions
sw |>
  mutate(
    name_upper = toupper(name),
    name_length = nchar(name),
    has_skywalker = grepl("Skywalker", name)
  ) |>
  select(name, name_upper, name_length, has_skywalker) |>
  head() |>
  collect()
Math Functions
sw |>
  filter(!is.na(height), !is.na(mass)) |>
  mutate(
    bmi = mass / (height / 100)^2,
    log_mass = log(mass),
    rounded_height = round(height, -1)
  ) |>
  select(name, bmi, log_mass, rounded_height) |>
  head() |>
  collect()
Date/Time Functions
flights <- arrow_table(
  date = as.Date(c("2024-01-15", "2024-02-20", "2024-03-10")),
  timestamp = as.POSIXct(c("2024-01-15 14:30:00",
                           "2024-02-20 09:15:00",
                           "2024-03-10 18:45:00"))
)

flights |>
  mutate(
    year = year(date),
    month = month(date),
    hour = hour(timestamp),
    weekday = wday(date, label = TRUE)
  ) |>
  collect()
Type Conversions
sw |>
  mutate(
    height_chr = as.character(height),
    mass_int = as.integer(mass),
    is_tall = as.logical(height > 180)
  ) |>
  select(name, height_chr, mass_int, is_tall) |>
  head() |>
  collect()
Custom Function Bindings
Register custom R functions for use in Arrow queries:
# Define a custom function
to_snake_name <- function(context, string) {
  replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
  string |>
    stringr::str_replace_all(replace) |>
    stringr::str_to_lower() |>
    stringi::stri_trans_general(id = "Latin-ASCII")
}

# Register the function
register_scalar_function(
  name = "to_snake_name",
  fun = to_snake_name,
  in_type = utf8(),
  out_type = utf8(),
  auto_convert = TRUE
)

# Use in queries
sw |>
  mutate(snake_name = to_snake_name(name)) |>
  select(name, snake_name) |>
  head() |>
  collect()
#> name snake_name
#> 1 Luke Skywalker luke_skywalker
#> 2 C-3PO c_3po
#> 3 R2-D2 r2_d2
Handling Unsupported Expressions
With Tables (in-memory)
For Arrow Tables, unsupported functions trigger an automatic collect() (with a warning):
sw |>
  filter(!is.na(height), !is.na(mass)) |>
  transmute(name, height, mass, res = residuals(lm(mass ~ height)))
#> # A tibble: 59 × 4
#> name height mass res
#> <chr> <int> <dbl> <dbl>
#> 1 Luke Skywalker 172 77 -6.95
#> 2 C-3PO 167 75 -5.07
This automatically converts to a data frame before applying the unsupported operation.
With Datasets (larger-than-memory)
For Datasets, unsupported expressions raise an error to prevent accidentally loading huge data:
# Write dataset
dataset_path <- tempfile()
write_dataset(starwars, dataset_path)
sw2 <- open_dataset(dataset_path)

# This raises an error
sw2 |>
  filter(!is.na(height), !is.na(mass)) |>
  transmute(name, height, mass, res = residuals(lm(mass ~ height)))
#> Error: Expression residuals(lm(mass ~ height)) not supported
Explicitly call collect() when you need unsupported operations:
sw2 |>
  filter(!is.na(height), !is.na(mass)) |>
  collect() |> # Bring data into R
  transmute(name, height, mass, res = residuals(lm(mass ~ height)))
Be cautious with collect() on Datasets. Filter and select columns first to reduce the amount of data loaded into memory.
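A sketch of that advice (recreating the sw2 dataset for self-containment): project the columns and push the filters down before collect(), so only the reduced subset ever enters R.

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

dataset_path <- tempfile()
write_dataset(starwars, dataset_path)
sw2 <- open_dataset(dataset_path)

sw2 |>
  select(name, height, mass) |>           # read only these columns from disk
  filter(!is.na(height), !is.na(mass)) |> # filters are pushed down to the scan
  collect() |>                            # only the reduced subset enters R
  transmute(name, height, mass, res = residuals(lm(mass ~ height)))
```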
Integration with DuckDB
For operations not supported by Arrow, use DuckDB which has native Arrow support:
library(duckdb)

sw |>
  select(name, height, hair_color, eye_color) |>
  filter(!is.na(hair_color)) |>
  to_duckdb() |>
  # DuckDB operations here
  group_by(hair_color) |>
  filter(height < mean(height, na.rm = TRUE)) |>
  to_arrow() |>
  # Back to Arrow for more operations
  arrange(height) |>
  collect()
The to_duckdb() and to_arrow() functions provide zero-copy data transfer between Arrow and DuckDB.
Best Practices
1. Filter Early
Apply filters as early as possible to reduce data volume:
# Good: Filter first
ds |>
  filter(year == 2020, passenger_count > 0) |>
  group_by(month) |>
  summarize(avg_fare = mean(fare_amount)) |>
  collect()
2. Select Only Needed Columns
Read only the columns you need:
# Good: Select specific columns
ds |>
  select(passenger_count, fare_amount, tip_amount) |>
  filter(fare_amount > 100) |>
  collect()
3. Use Lazy Evaluation
Build up queries without intermediate collections:
# Good: Single pipeline with final collect
result <- ds |>
  filter(condition1) |>
  mutate(new_col = expr) |>
  group_by(group_var) |>
  summarize(stat = mean(value)) |>
  collect()
4. Leverage Partitioning
Filter on partition columns to skip entire files:
# If data is partitioned by year and month
ds |>
  filter(year == 2020, month %in% c(1, 2, 3)) |> # Reads only Q1 files
  summarize(total = sum(amount)) |>
  collect()
Real-World Example: NYC Taxi Data
Analyze a large dataset efficiently:
# Open multi-file dataset (1.7 billion rows, 70GB)
nyc_taxi <- open_dataset("nyc-taxi/")

# Complex query on large data
system.time(
  result <- nyc_taxi |>
    filter(total_amount > 100, year == 2015) |>
    select(tip_amount, total_amount, passenger_count) |>
    mutate(tip_pct = 100 * tip_amount / total_amount) |>
    group_by(passenger_count) |>
    summarise(
      median_tip_pct = median(tip_pct),
      n = n()
    ) |>
    arrange(desc(median_tip_pct)) |>
    collect()
)
#> user system elapsed
#> 4.436 1.012 1.402
result
#> # A tibble: 10 × 3
#> passenger_count median_tip_pct n
#> <int> <dbl> <int>
#> 1 1 16.6 143087
#> 2 2 16.2 34418
#> 3 5 16.7 5806
This query processes billions of rows in seconds thanks to:
Lazy evaluation (optimized query plan)
Predicate pushdown (filters during file reading)
Partition pruning (skips files from other years)
Column projection (reads only needed columns)
Next Steps
Datasets: work with multi-file and larger-than-memory data
Reading Files: read and write various file formats efficiently
Additional Resources