The arrow package provides high-performance functions for reading and writing data files in multiple formats. By default, these functions return R data frames, but they can also return Arrow Tables, which is faster when further Arrow operations follow before converting to a data frame.
Arrow supports reading and writing several file formats:
Format         Read Function        Write Function       Best For
Parquet        read_parquet()       write_parquet()      Analytics, long-term storage
Feather/Arrow  read_feather()       write_feather()      Fast I/O, R/Python interchange
CSV            read_csv_arrow()     write_csv_arrow()    Interoperability, human-readable
TSV            read_tsv_arrow()     -                    Tab-separated data
Delimited      read_delim_arrow()   -                    Custom delimiters
JSON           read_json_arrow()    -                    Line-delimited JSON
Parquet is a columnar storage format optimized for analytics workloads. It offers excellent compression and fast read performance.
Writing Parquet Files
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
file_path <- tempfile(fileext = ".parquet")
write_parquet(starwars, file_path)
Compression Options:
# Snappy compression (default) - balanced speed and size
write_parquet(starwars, "data.parquet", compression = "snappy")
# Gzip compression - smaller files, slower
write_parquet(starwars, "data.parquet", compression = "gzip")
# No compression - fastest write, largest files
write_parquet(starwars, "data.parquet", compression = "uncompressed")
# Zstd compression - excellent compression ratio
write_parquet(starwars, "data.parquet", compression = "zstd")
Reading Parquet Files
# Read as data frame (default)
df <- read_parquet(file_path)
df
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
# Read as Arrow Table
tbl <- read_parquet(file_path, as_data_frame = FALSE)
tbl
#> Table
#> 87 rows x 14 columns
Reading as an Arrow Table (as_data_frame = FALSE) is faster and uses less memory if you plan to do further Arrow/dplyr operations before converting to a data frame.
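For example, a minimal sketch (assuming dplyr is loaded) that filters on the Arrow Table and only materializes the matching rows:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

file_path <- tempfile(fileext = ".parquet")
write_parquet(starwars, file_path)

# Filtering and column selection run in Arrow; collect() converts
# only the result to an R data frame
tall <- read_parquet(file_path, as_data_frame = FALSE) |>
  filter(height > 200) |>
  select(name, height) |>
  collect()
```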
Selective Column Reading
Parquet files store metadata that allows reading only specific columns:
# Read only selected columns
read_parquet(file_path, col_select = c("name", "height", "mass"))
#> # A tibble: 87 × 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
# Use tidyselect helpers
read_parquet(file_path, col_select = starts_with("h"))
This is much faster than reading the full file and selecting columns afterwards.
Advanced Parquet Options
Fine-tune the Parquet reader with props:
props <- ParquetArrowReaderProperties$create(
  use_threads = TRUE,
  cache_options = CacheOptions$default()
)
read_parquet(file_path, props = props)
The Arrow IPC file format (also called Feather v2) is designed for fast reading and writing, especially between Arrow-compatible systems.
Writing Feather Files
file_path <- tempfile(fileext = ".arrow")
write_feather(starwars, file_path)
# With compression
write_feather(starwars, file_path, compression = "lz4")
write_feather(starwars, file_path, compression = "zstd")
Reading Feather Files
# Read as data frame
df <- read_feather(file_path)
# Read as Arrow Table with selected columns
tbl <- read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
When to Use Feather
Use Feather when:
You need maximum I/O speed
Exchanging data between R and Python
Working with temporary files in a pipeline
You have fast storage and don’t need compression
Use Parquet when:
Long-term data storage
Minimizing storage costs
Sharing data with other analytics tools
Working with very large datasets
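A rough way to see the storage tradeoff is to write the same data in both formats and compare file sizes on disk (a sketch; exact sizes depend on the data and compression settings):

```r
library(arrow, warn.conflicts = FALSE)

df <- data.frame(id = 1:1e5, value = rnorm(1e5))

pq <- tempfile(fileext = ".parquet")
ft <- tempfile(fileext = ".arrow")

write_parquet(df, pq, compression = "zstd")  # optimized for size
write_feather(df, ft, compression = "lz4")   # optimized for speed

# Compare on-disk sizes in bytes
file.size(c(pq, ft))
```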
Arrow provides fast CSV reading and writing using the Arrow C++ CSV parser.
Writing CSV Files
file_path <- tempfile(fileext = ".csv")
write_csv_arrow(mtcars, file_path)
Reading CSV Files
# Basic reading
df <- read_csv_arrow(file_path)
# Select specific columns
df <- read_csv_arrow(file_path, col_select = starts_with("d"))
#> # A tibble: 32 × 2
#> disp drat
#> <dbl> <dbl>
#> 1 160 3.90
#> 2 160 3.90
CSV Parsing Options
The CSV reader supports many options similar to readr:
read_csv_arrow(
  file_path,
  delim = ",",
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  col_names = TRUE,
  skip = 0,
  skip_empty_rows = TRUE,
  na = c("", "NA")
)
Specifying Column Types
Use col_types or schema to specify data types:
# Using col_types (readr style)
read_csv_arrow(
  file_path,
  col_types = "dddcc", # double, double, double, character, character
  col_names = c("col1", "col2", "col3", "col4", "col5"),
  skip = 1 # skip the header row, since col_names supplies the names
)
# Using an Arrow schema
read_csv_arrow(
  file_path,
  schema = schema(
    name = utf8(),
    age = int32(),
    height = float64(),
    active = boolean()
  ),
  skip = 1 # skip the header row, since the schema supplies the names
)
Advanced CSV Options
For fine-grained control, use the options objects:
read_csv_arrow(
  file_path,
  parse_options = CsvParseOptions$create(
    delimiter = "|",
    quote_char = "'",
    newlines_in_values = TRUE
  ),
  convert_options = CsvConvertOptions$create(
    check_utf8 = TRUE,
    strings_can_be_null = TRUE
  ),
  read_options = CsvReadOptions$create(
    use_threads = TRUE,
    block_size = 1048576
  )
)
TSV and Custom Delimiters
# Tab-separated values
df <- read_tsv_arrow("data.tsv")
# Custom delimiter
df <- read_delim_arrow("data.txt", delim = "|")
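A self-contained sketch of the custom-delimiter case, writing a small pipe-delimited file first so the example runs as-is:

```r
library(arrow, warn.conflicts = FALSE)

# Create a small pipe-delimited file
file_path <- tempfile(fileext = ".txt")
writeLines(c("name|score", "alice|90", "bob|85"), file_path)

# Read it back with the custom delimiter
df <- read_delim_arrow(file_path, delim = "|")
df
```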
Arrow can read line-delimited JSON files:
file_path <- tempfile(fileext = ".json")
writeLines('
{ "hello": 3.5, "world": false, "yo": "thing" }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "yo": null }
', file_path, useBytes = TRUE)
read_json_arrow(file_path)
#> # A tibble: 3 × 3
#> hello world yo
#> <dbl> <lgl> <chr>
#> 1 3.5 FALSE thing
#> 2 3.25 NA <NA>
#> 3 0 TRUE <NA>
Arrow only supports line-delimited JSON (NDJSON), where each line is a complete JSON object. Standard JSON arrays are not supported.
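If your data arrives as a standard JSON array, one approach (a sketch, assuming the jsonlite package is available) is to rewrite it as NDJSON before reading it with arrow:

```r
library(arrow, warn.conflicts = FALSE)
library(jsonlite)

# A standard JSON array, which read_json_arrow() cannot parse directly
json_array <- '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'

# Parse into a data frame, then stream it back out as one object per line
df <- jsonlite::fromJSON(json_array)
ndjson_path <- tempfile(fileext = ".json")
jsonlite::stream_out(df, file(ndjson_path), verbose = FALSE)

read_json_arrow(ndjson_path)
```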
Cloud Storage
All read/write functions work with cloud storage paths:
Amazon S3
library(arrow)
# Configure S3 credentials (set these before reading or writing)
Sys.setenv(
  AWS_ACCESS_KEY_ID = "your-access-key",
  AWS_SECRET_ACCESS_KEY = "your-secret-key",
  AWS_REGION = "us-east-1"
)
# Write to S3
write_parquet(mtcars, "s3://my-bucket/data/mtcars.parquet")
# Read from S3
df <- read_parquet("s3://my-bucket/data/mtcars.parquet")
Google Cloud Storage
# Write to GCS
write_parquet(mtcars, "gs://my-bucket/data/mtcars.parquet")
# Read from GCS
df <- read_parquet("gs://my-bucket/data/mtcars.parquet")
Arrow preserves R object attributes when writing to Parquet or Feather:
# Data with attributes
df <- data.frame(x = 1:3, y = 4:6)
attr(df, "description") <- "My dataset"
attr(df$x, "label") <- "X variable"
# Write and read back
file_path <- tempfile(fileext = ".parquet")
write_parquet(df, file_path)
df2 <- read_parquet(file_path)
# Attributes are preserved
attr(df2, "description")
#> [1] "My dataset"
attr(df2$x, "label")
#> [1] "X variable"
This enables round-trip preservation of:
sf spatial objects
haven::labelled columns
Custom attributes
Factor levels
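For instance, factor levels survive the round trip (Arrow stores factors as dictionary-encoded columns):

```r
library(arrow, warn.conflicts = FALSE)

# A data frame with a factor column
df <- data.frame(
  grade = factor(c("low", "high", "low"), levels = c("low", "high"))
)

file_path <- tempfile(fileext = ".parquet")
write_parquet(df, file_path)
df2 <- read_parquet(file_path)

class(df2$grade)   # still a factor
levels(df2$grade)  # levels preserved in order
```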
Comparing read/write performance:
library(bench)
# Create test data
large_df <- data.frame(
  id = 1:1e6,
  value = rnorm(1e6),
  category = sample(letters[1:10], 1e6, replace = TRUE),
  timestamp = Sys.time() + 1:1e6
)
# Benchmark writes
bench::mark(
  parquet = write_parquet(large_df, tempfile()),
  feather = write_feather(large_df, tempfile()),
  csv = write_csv_arrow(large_df, tempfile()),
  check = FALSE,
  iterations = 5
)
#>   expression   min median itr/sec mem_alloc
#> 1 parquet    250ms  260ms    3.85    45.2MB
#> 2 feather     85ms   90ms   11.1     38.1MB
#> 3 csv        180ms  190ms    5.26    61.3MB
Typical performance characteristics:
Feather : Fastest I/O, moderate file size
Parquet : Good I/O speed, smallest file size
CSV : Slowest I/O, human-readable, largest file size
Best Practices
1. Choose the Right Format
# For long-term storage
write_parquet(data, "archive/data.parquet", compression = "zstd")
# For temporary files in a pipeline
write_feather(data, "temp/data.arrow", compression = "lz4")
# For sharing with non-Arrow tools
write_csv_arrow(data, "export/data.csv")
2. Use Arrow Tables for Pipelines
# Good: stay in Arrow format until the end
result <- read_parquet("input.parquet", as_data_frame = FALSE) |>
  filter(value > 100) |>
  mutate(new_col = value * 2) |>
  collect()
# Less efficient: convert to a data frame first
result <- read_parquet("input.parquet") |> # converts to data frame
  filter(value > 100) |>                   # works on data frame
  mutate(new_col = value * 2)
3. Read Only What You Need
# Efficient: read only the columns you need
df <- read_parquet("large_file.parquet", col_select = c("id", "value"))
# Inefficient: read everything, then select
df <- read_parquet("large_file.parquet") |>
  select(id, value)
4. Handle Large Files
# For files too large for memory, use datasets
ds <- open_dataset("large_file.parquet")
result <- ds |>
  filter(date >= "2024-01-01") |>
  select(id, value, date) |>
  group_by(date) |>
  summarize(total = sum(value)) |>
  collect()
Real-World Example
Complete workflow with multiple formats:
library(arrow)
library(dplyr)
# 1. Read raw CSV data
raw_data <- read_csv_arrow(
  "raw_data.csv",
  col_select = c("date", "user_id", "amount", "category")
)
# 2. Process with Arrow
processed <- arrow_table(raw_data) |>
  filter(!is.na(amount), amount > 0) |>
  mutate(
    year = year(date),
    month = month(date)
  ) |>
  collect()
# 3. Save processed data in Parquet for long-term storage
write_parquet(
  processed,
  "processed_data.parquet",
  compression = "zstd",
  compression_level = 5
)
# 4. Create a fast-access Feather copy for daily use
write_feather(
  processed,
  "processed_data.arrow",
  compression = "lz4"
)
# 5. Export a summary to CSV for sharing
summary_data <- processed |>
  group_by(category, year) |>
  summarize(total_amount = sum(amount), n = n())
write_csv_arrow(summary_data, "summary.csv")
Next Steps
Datasets Work with multi-file and larger-than-memory data
Data Wrangling Analyze data with dplyr syntax
Additional Resources