This guide will get you up and running with Red Arrow, the Ruby bindings for Apache Arrow. You’ll learn how to create arrays, build tables, and read/write data files.
Prerequisites
You’ll need:
Ruby 2.5 or higher
Bundler for dependency management
Basic familiarity with Ruby
Red Arrow uses GObject Introspection and requires Apache Arrow GLib to be installed on your system.
Install Red Arrow
Using RubyGems with Automatic System Dependencies
The easiest way is to use rubygems-requirements-system, which installs Arrow GLib automatically:
gem install rubygems-requirements-system red-arrow
Using Bundler
Add to your Gemfile:
plugin "rubygems-requirements-system"
gem "red-arrow"
Then install:
bundle install
Manual Installation of Arrow GLib
If you prefer to install Arrow GLib manually, on macOS with Homebrew:
brew install apache-arrow-glib
gem install red-arrow
On Ubuntu/Debian:
sudo apt update
sudo apt install -y libarrow-glib-dev
gem install red-arrow
Verify the installation:
require "arrow"
puts Arrow::VERSION
Create Your First Arrays
Arrays are the fundamental building blocks in Arrow: typed, homogeneous collections of data.
require "arrow"
# Create arrays of different types
uint8_array = Arrow::UInt8Array.new([1, 2, 4, 8])
int32_array = Arrow::Int32Array.new([1, -2, 4, -8])
float_array = Arrow::FloatArray.new([1.1, -2.2, 4.4, -8.8])
string_array = Arrow::StringArray.new(["Alice", "Bob", "Carol"])
puts uint8_array
# Output: [1, 2, 4, 8]
puts string_array
# Output: ["Alice", "Bob", "Carol"]
Create arrays with null values:
# nil values become nulls
array_with_nulls = Arrow::Int32Array.new([1, nil, 3, nil, 5])
puts array_with_nulls
# Output: [1, (null), 3, (null), 5]
puts "Length: #{array_with_nulls.length}"
puts "Null count: #{array_with_nulls.n_nulls}"
Access array elements:
array = Arrow::StringArray.new(["Alice", "Bob", "Carol"])
# Access by index
puts array[0]  # => "Alice"
puts array[1]  # => "Bob"
# Slice arrays
slice = array.slice(1, 2)  # Start at index 1, length 2
puts slice  # => ["Bob", "Carol"]
Build Tables with Schema
Tables combine multiple arrays into named columns with a defined schema.
require "arrow"
# Define schema with field names and types
fields = [
  Arrow::Field.new("day", :uint8),
  Arrow::Field.new("month", :uint8),
  Arrow::Field.new("year", :uint16)
]
schema = Arrow::Schema.new(fields)
# Create arrays for each column
days = Arrow::UInt8Array.new([1, 12, 17, 23, 28])
months = Arrow::UInt8Array.new([1, 3, 5, 7, 1])
years = Arrow::UInt16Array.new([1990, 2000, 1995, 2000, 1995])
# Build table
birthdays_table = Arrow::Table.new(schema, [days, months, years])
puts "Table with #{birthdays_table.n_rows} rows"
puts "Columns: #{birthdays_table.column_names.join(', ')}"
# Access columns
puts birthdays_table["year"]
Alternative: create a table from a hash:
# More convenient for simple cases
table = Arrow::Table.new(
  "name" => Arrow::StringArray.new(["Alice", "Bob", "Carol"]),
  "age" => Arrow::UInt8Array.new([30, 25, 35]),
  "city" => Arrow::StringArray.new(["NYC", "SF", "LA"])
)
puts table.n_rows     # => 3
puts table.n_columns  # => 3
Write and Read Arrow IPC Files
Arrow’s IPC format provides fast, zero-copy serialization. Write to a file:
require "arrow"
# Create schema and data
fields = [
  Arrow::Field.new("uint8", :uint8),
  Arrow::Field.new("int32", :int32),
  Arrow::Field.new("float", :float)
]
schema = Arrow::Schema.new(fields)
# Write to file
Arrow::FileOutputStream.open("/tmp/data.arrow", false) do |output|
  Arrow::RecordBatchFileWriter.open(output, schema) do |writer|
    # Create record batches
    columns = [
      Arrow::UInt8Array.new([1, 2, 4, 8]),
      Arrow::Int32Array.new([1, -2, 4, -8]),
      Arrow::FloatArray.new([1.1, -2.2, 4.4, -8.8])
    ]
    record_batch = Arrow::RecordBatch.new(schema, 4, columns)
    writer.write_record_batch(record_batch)
    # Can write multiple batches
    sliced_columns = columns.map { |col| col.slice(1, 2) }
    record_batch2 = Arrow::RecordBatch.new(schema, 2, sliced_columns)
    writer.write_record_batch(record_batch2)
  end
end
puts "Wrote data to /tmp/data.arrow"
Read from the file:
require "arrow"
Arrow::MemoryMappedInputStream.open("/tmp/data.arrow") do |input|
  reader = Arrow::RecordBatchFileReader.new(input)
  puts "Schema: #{reader.schema}"
  puts "Number of record batches: #{reader.n_record_batches}"
  reader.each_with_index do |record_batch, i|
    puts "\n=== Record Batch #{i} ==="
    puts "Rows: #{record_batch.n_rows}"
    reader.schema.fields.each do |field|
      values = record_batch[field.name].to_a
      puts "#{field.name}: #{values.inspect}"
    end
  end
end
Read and Write Parquet Files
Parquet support requires the red-parquet gem (gem install red-parquet). Write Parquet:
require "arrow"
require "parquet"
# Create table
table = Arrow::Table.new(
  "name" => Arrow::StringArray.new(["Alice", "Bob", "Carol"]),
  "age" => Arrow::UInt8Array.new([30, 25, 35]),
  "city" => Arrow::StringArray.new(["NYC", "SF", "LA"])
)
# Write to Parquet
Arrow::FileOutputStream.open("people.parquet", false) do |output|
  Parquet::ArrowFileWriter.open(table.schema, output) do |writer|
    writer.write_table(table)
  end
end
puts "Wrote people.parquet"
Read Parquet:
require "parquet"
Arrow::MemoryMappedInputStream.open("people.parquet") do |input|
  reader = Parquet::ArrowFileReader.new(input)
  table = reader.read_table
  puts "Read table with #{table.n_rows} rows"
  puts "Columns: #{table.column_names.join(', ')}"
  # Access data
  table.each_record_batch do |batch|
    batch.n_rows.times do |i|
      row = batch.schema.fields.map do |field|
        batch[field.name][i]
      end
      puts row.join(", ")
    end
  end
end
Work with Streaming Data
Arrow supports streaming for large datasets that don’t fit in memory. Write a stream:
require "arrow"
fields = [
  Arrow::Field.new("id", :int32),
  Arrow::Field.new("value", :float)
]
schema = Arrow::Schema.new(fields)
Arrow::FileOutputStream.open("/tmp/stream.arrow", false) do |output|
  Arrow::RecordBatchStreamWriter.open(output, schema) do |writer|
    # Write multiple batches
    3.times do |batch_num|
      ids = Arrow::Int32Array.new([batch_num * 10, batch_num * 10 + 1])
      values = Arrow::FloatArray.new([rand * 100, rand * 100])
      batch = Arrow::RecordBatch.new(schema, 2, [ids, values])
      writer.write_record_batch(batch)
    end
  end
end
puts "Wrote streaming data"
Read the stream:
Arrow::MemoryMappedInputStream.open("/tmp/stream.arrow") do |input|
  reader = Arrow::RecordBatchStreamReader.new(input)
  total_rows = 0
  reader.each do |record_batch|
    puts "Batch: #{record_batch.n_rows} rows"
    total_rows += record_batch.n_rows
  end
  puts "Total rows: #{total_rows}"
end
Complete Example
Here’s a complete working example that demonstrates common Red Arrow operations:
require "arrow"
# 1. Create structured data
fields = [
  Arrow::Field.new("product", :string),
  Arrow::Field.new("quantity", :int32),
  Arrow::Field.new("price", :float)
]
schema = Arrow::Schema.new(fields)
products = Arrow::StringArray.new(["A", "B", "A", "C", "B"])
quantities = Arrow::Int32Array.new([10, 20, 15, 5, 25])
prices = Arrow::FloatArray.new([100.0, 200.0, 100.0, 150.0, 200.0])
table = Arrow::Table.new(schema, [products, quantities, prices])
puts "=== Original Table ==="
puts "Rows: #{table.n_rows}, Columns: #{table.n_columns}"
# 2. Write to Arrow IPC file
Arrow::FileOutputStream.open("/tmp/sales.arrow", false) do |output|
  Arrow::RecordBatchFileWriter.open(output, schema) do |writer|
    table.each_record_batch do |batch|
      writer.write_record_batch(batch)
    end
  end
end
puts "\nWrote sales.arrow"
# 3. Read back and process
Arrow::MemoryMappedInputStream.open("/tmp/sales.arrow") do |input|
  reader = Arrow::RecordBatchFileReader.new(input)
  loaded_table = reader.read_table
  puts "\n=== Loaded Table ==="
  # Calculate total revenue
  quantities = loaded_table["quantity"].to_a
  prices = loaded_table["price"].to_a
  total_revenue = quantities.zip(prices).sum { |q, p| q * p }
  puts "Total revenue: $#{total_revenue}"
  # Group by product
  products_hash = Hash.new { |h, k| h[k] = { qty: 0, revenue: 0 } }
  loaded_table["product"].to_a.each_with_index do |product, i|
    qty = quantities[i]
    price = prices[i]
    products_hash[product][:qty] += qty
    products_hash[product][:revenue] += qty * price
  end
  puts "\n=== By Product ==="
  products_hash.each do |product, stats|
    puts "#{product}: Quantity=#{stats[:qty]}, Revenue=$#{stats[:revenue]}"
  end
end
Next Steps
Red Parquet: work with Parquet files in Ruby
Red Arrow Dataset: handle multi-file datasets
Red Arrow Flight: build Arrow Flight RPC services
API Reference: complete Ruby API documentation
Common Patterns
Convert Arrays to Ruby Arrays
arrow_array = Arrow::Int32Array.new([1, 2, 3, 4, 5])
ruby_array = arrow_array.to_a
puts ruby_array.inspect  # => [1, 2, 3, 4, 5]
Iterate Over Records
table.each_record_batch do |batch|
  batch.n_rows.times do |i|
    # Access each field
    row_data = batch.schema.fields.map do |field|
      batch[field.name][i]
    end
    puts row_data.join(", ")
  end
end
Filter Data
# Convert to Ruby array, filter, convert back
ages = Arrow::UInt8Array.new([20, 30, 40, 50])
filtered = ages.to_a.select { |age| age > 25 }
filtered_array = Arrow::UInt8Array.new(filtered)
puts filtered_array  # => [30, 40, 50]
Handle Large Files with Streaming
# Read large files in batches
Arrow::MemoryMappedInputStream.open("large_file.arrow") do |input|
  reader = Arrow::RecordBatchStreamReader.new(input)
  reader.each do |batch|
    # Process batch without loading entire file
    process_batch(batch)
  end
end
Use memory-mapped files for large data
Memory-mapped I/O is more efficient for large files:
# Preferred for large files
Arrow::MemoryMappedInputStream.open("large.arrow") do |input|
  reader = Arrow::RecordBatchFileReader.new(input)
  table = reader.read_table
end
For very large data, process in batches instead of loading entire tables:
reader.each_with_index do |batch, i|
  # Process one batch at a time
  process_batch(batch)
end
Use appropriate data types
Choose the smallest data type that fits your data:
# Use uint8 for small integers (0-255)
Arrow::UInt8Array.new([1, 2, 3])
# Use int32 for larger integers
Arrow::Int32Array.new([1000, 2000, 3000])
Create the schema once and reuse it for multiple batches:
schema = Arrow::Schema.new(fields)
Arrow::RecordBatchFileWriter.open(output, schema) do |writer|
  batches.each do |columns|
    # columns is an array of equal-length Arrow arrays
    batch = Arrow::RecordBatch.new(schema, columns.first.length, columns)
    writer.write_record_batch(batch)
  end
end
Troubleshooting
Installation fails with GLib errors
Make sure Arrow GLib is installed:
# macOS
brew install apache-arrow-glib
# Ubuntu/Debian
sudo apt-get install libarrow-glib-dev
# Then install the gem
gem install red-arrow
Ensure GObject Introspection is working:
require "gobject-introspection"
# If this fails, install the gobject-introspection gem
gem install gobject-introspection
Ensure array lengths match when creating tables:
# All arrays must have the same length
arr1 = Arrow::Int32Array.new([1, 2, 3])
arr2 = Arrow::StringArray.new(["a", "b", "c"])  # Same length
table = Arrow::Table.new("col1" => arr1, "col2" => arr2)
Make sure you have write permissions:
# The second argument (false) means "don't append": the file is created or overwritten
Arrow::FileOutputStream.open("/tmp/data.arrow", false) do |output|
  # Write data
end