Arrays and tables are the fundamental data structures in Apache Arrow. Arrays represent one-dimensional sequences of values, while tables organize multiple arrays into two-dimensional datasets.
Arrays
The arrow::Array class represents a known-length sequence of values with a single data type. Arrays are immutable after construction.
Array Structure
Internally, arrays consist of:
- Type: Logical data type (int32, utf8, list, etc.)
- Length: Number of values
- Null count: Number of null values
- Buffers: One or more buffers containing:
  - Validity bitmap (null indicators)
  - Value data
  - Offsets (for variable-length types)
```cpp
#include <arrow/api.h>

// Given an existing std::shared_ptr<arrow::Array> known to hold int64 values:
auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array);

// Get the validity (null) bitmap pointer; may be null if no values are null
const uint8_t* null_bitmap = int64_array->null_bitmap_data();

// Get the raw value data
const int64_t* data = int64_array->raw_values();

// Check for null before reading a value
if (!int64_array->IsNull(index)) {
  int64_t value = int64_array->Value(index);
}
```
Building Arrays
Use ArrayBuilder classes to construct arrays incrementally:
Value by Value

```cpp
#include <arrow/api.h>

arrow::Int64Builder builder;
builder.Append(1);
builder.Append(2);
builder.Append(3);
builder.AppendNull();  // Add a null value
builder.Append(5);

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(builder.Finish(&array));
```

Bulk Append

```cpp
#include <arrow/api.h>

arrow::Int64Builder builder;
builder.Reserve(5);  // Pre-allocate space

std::vector<int64_t> values = {1, 2, 3, 4, 5};
std::vector<bool> validity = {true, true, false, true, true};
builder.AppendValues(values, validity);

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(builder.Finish(&array));
```

Unsafe (Fast)

```cpp
#include <arrow/api.h>

arrow::Int64Builder builder;
builder.Reserve(5);  // MUST reserve first!

// Unsafe methods skip capacity checks
builder.UnsafeAppend(1);
builder.UnsafeAppend(2);
builder.UnsafeAppend(3);
builder.UnsafeAppendNull();
builder.UnsafeAppend(5);

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(builder.Finish(&array));
```
For best performance, call Reserve() or Resize() before building, and use bulk AppendValues() methods when possible.
String and Binary Arrays
String and binary types use offset buffers for variable-length data:
```cpp
#include <arrow/api.h>
#include <iostream>

arrow::StringBuilder builder;
builder.Append("Hello");
builder.Append("World");
builder.AppendNull();
builder.Append("Arrow");

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(builder.Finish(&array));

auto string_array = std::static_pointer_cast<arrow::StringArray>(array);
std::cout << string_array->GetString(0) << std::endl;  // "Hello"
```
List Arrays
List arrays contain nested sequences:
```cpp
#include <arrow/api.h>

// Build list<int32>
auto value_builder = std::make_shared<arrow::Int32Builder>();
arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);

// Append [1, 2, 3]
list_builder.Append();
value_builder->Append(1);
value_builder->Append(2);
value_builder->Append(3);

// Append [4, 5]
list_builder.Append();
value_builder->Append(4);
value_builder->Append(5);

// Append null
list_builder.AppendNull();

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(list_builder.Finish(&array));
```
Struct Arrays
Struct arrays bundle multiple fields together:
```cpp
#include <arrow/api.h>

// Create builders for each field
arrow::Int32Builder id_builder;
arrow::StringBuilder name_builder;

// Append values
id_builder.Append(1);
name_builder.Append("Alice");
id_builder.Append(2);
name_builder.Append("Bob");

// Finish the individual child arrays
std::shared_ptr<arrow::Array> id_array, name_array;
ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

// Create the struct array
std::vector<std::shared_ptr<arrow::Field>> fields = {
    arrow::field("id", arrow::int32()),
    arrow::field("name", arrow::utf8())
};
ARROW_ASSIGN_OR_RAISE(auto struct_array,
    arrow::StructArray::Make({id_array, name_array}, fields));
```
Creating Arrays from JSON
For testing and prototyping, use the JSON helpers:
```cpp
#include <arrow/json/from_string.h>

using arrow::json::ArrayFromJSONString;

// Simple types
ARROW_ASSIGN_OR_RAISE(auto int_array,
    ArrayFromJSONString(arrow::int32(), "[1, 2, 3, null, 5]"));

ARROW_ASSIGN_OR_RAISE(auto str_array,
    ArrayFromJSONString(arrow::utf8(),
        R"(["Hello", "World", null, "Arrow"])"));

// Complex types
ARROW_ASSIGN_OR_RAISE(auto list_array,
    ArrayFromJSONString(arrow::list(arrow::int64()),
        "[[1, 2], null, [3, 4, 5], []]"));

ARROW_ASSIGN_OR_RAISE(auto struct_array,
    ArrayFromJSONString(
        arrow::struct_({
            arrow::field("x", arrow::int32()),
            arrow::field("y", arrow::int32())
        }),
        "[[1, 2], [3, 4], null]"));
```
The JSON helpers are convenient but not optimized for performance. For production workloads, use the appropriate builders or read from CSV/Parquet.
Chunked Arrays
ChunkedArray represents a logical sequence of values split across multiple array chunks:
```cpp
#include <arrow/api.h>
#include <iostream>

std::vector<std::shared_ptr<arrow::Array>> chunks;

// Build first chunk
arrow::Int64Builder builder;
builder.AppendValues({1, 2, 3});
std::shared_ptr<arrow::Array> chunk1;
ARROW_RETURN_NOT_OK(builder.Finish(&chunk1));
chunks.push_back(chunk1);

// Build second chunk
builder.Reset();
builder.AppendValues({4, 5, 6, 7});
std::shared_ptr<arrow::Array> chunk2;
ARROW_RETURN_NOT_OK(builder.Finish(&chunk2));
chunks.push_back(chunk2);

// Create the chunked array
auto chunked = std::make_shared<arrow::ChunkedArray>(chunks);

std::cout << "Chunks: " << chunked->num_chunks() << std::endl;     // 2
std::cout << "Length: " << chunked->length() << std::endl;         // 7
std::cout << "Null count: " << chunked->null_count() << std::endl; // 0
```
When to Use Chunked Arrays
Chunked arrays are useful for:
- Large datasets that exceed 2^31 elements (array size limit)
- Incremental construction without pre-knowing total size
- Memory efficiency when working with multiple sources
- Tables, which always store their columns as chunked arrays internally
Schemas
Schemas describe the structure of tabular data:
```cpp
#include <arrow/api.h>
#include <iostream>

// Create fields
auto field_a = arrow::field("id", arrow::int32());
auto field_b = arrow::field("name", arrow::utf8());
auto field_c = arrow::field("score", arrow::float64());

// Create schema
auto schema = arrow::schema({field_a, field_b, field_c});

// Access schema properties
std::cout << "Fields: " << schema->num_fields() << std::endl;
std::cout << "Field 0: " << schema->field(0)->name() << std::endl;

// Add metadata
auto metadata = arrow::key_value_metadata({
    {"source", "database"},
    {"version", "1.0"}
});
auto schema_with_meta = schema->WithMetadata(metadata);
```
Tables
Table is a two-dimensional dataset with chunked arrays as columns:
```cpp
#include <arrow/api.h>
#include <iostream>

// Build arrays
arrow::Int32Builder id_builder;
arrow::StringBuilder name_builder;
id_builder.AppendValues({1, 2, 3, 4});
name_builder.AppendValues({"Alice", "Bob", "Charlie", "David"});

std::shared_ptr<arrow::Array> id_array, name_array;
ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

// Create schema
auto schema = arrow::schema({
    arrow::field("id", arrow::int32()),
    arrow::field("name", arrow::utf8())
});

// Create table
auto table = arrow::Table::Make(schema, {id_array, name_array});

std::cout << "Rows: " << table->num_rows() << std::endl;       // 4
std::cout << "Columns: " << table->num_columns() << std::endl; // 2

// Access columns
std::shared_ptr<arrow::ChunkedArray> col0 = table->column(0);
std::cout << "Column 0 name: " << table->schema()->field(0)->name() << std::endl;
```
Table Operations
Slicing

```cpp
// Slice rows: Slice(offset, length)
std::shared_ptr<arrow::Table> slice = table->Slice(1, 2);
// Returns rows 1-2 (length 2)

// Slice from offset to end
std::shared_ptr<arrow::Table> tail = table->Slice(2);
```

Column Selection

```cpp
// Select specific columns by index
ARROW_ASSIGN_OR_RAISE(auto subset, table->SelectColumns({0, 2}));

// Remove a column
ARROW_ASSIGN_OR_RAISE(auto removed, table->RemoveColumn(1));

// Add a column
std::shared_ptr<arrow::ChunkedArray> new_col = /* create chunked array */;
ARROW_ASSIGN_OR_RAISE(auto expanded,
    table->AddColumn(2, arrow::field("new", arrow::int64()), new_col));
```

Concatenation

```cpp
// Concatenate tables vertically (append rows); schemas must match
std::vector<std::shared_ptr<arrow::Table>> tables = {table1, table2};
ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables(tables));
```
Record Batches
RecordBatch is similar to a table but with contiguous (non-chunked) arrays:
```cpp
#include <arrow/api.h>

// Create arrays (same as the table example)
arrow::Int32Builder id_builder;
arrow::StringBuilder name_builder;
id_builder.AppendValues({1, 2, 3});
name_builder.AppendValues({"Alice", "Bob", "Charlie"});

std::shared_ptr<arrow::Array> id_array, name_array;
ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

// Create schema
auto schema = arrow::schema({
    arrow::field("id", arrow::int32()),
    arrow::field("name", arrow::utf8())
});

// Create record batch
auto batch = arrow::RecordBatch::Make(
    schema,
    3,  // num_rows
    {id_array, name_array});
```
Table vs Record Batch
| Feature | Table | Record Batch |
|---|---|---|
| Columns | Chunked arrays | Contiguous arrays |
| Size | Can be very large | Limited by chunk size |
| Use case | Working with large datasets | IPC, serialization, incremental processing |
| Portability | C++ concept only | Part of Arrow format (IPC) |
Converting Between Tables and Batches
```cpp
// Table to record batches
auto reader = std::make_shared<arrow::TableBatchReader>(*table);
reader->set_chunksize(1000);  // Batch size

std::shared_ptr<arrow::RecordBatch> batch;
while (true) {
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  if (batch == nullptr) break;
  // Process batch
}

// Record batches to table
std::vector<std::shared_ptr<arrow::RecordBatch>> batches = {batch1, batch2};
ARROW_ASSIGN_OR_RAISE(auto table,
    arrow::Table::FromRecordBatches(batches));
```
Slicing
Zero-copy slicing creates views into existing data:
```cpp
// Slice an array: Slice(offset, length)
std::shared_ptr<arrow::Array> slice = array->Slice(2, 5);

// Slice from offset to end
std::shared_ptr<arrow::Array> tail = array->Slice(10);

// Slice a chunked array
std::shared_ptr<arrow::ChunkedArray> chunked_slice =
    chunked_array->Slice(100, 50);

// Slice a table
std::shared_ptr<arrow::Table> table_slice = table->Slice(0, 1000);

// Slice a record batch
std::shared_ptr<arrow::RecordBatch> batch_slice = batch->Slice(10, 90);
```
Slicing is a zero-copy operation. The sliced object shares memory with the original, making it very efficient.
Size Limitations
Be aware of Arrow’s size constraints:
- Arrays: Most implementations limit arrays to 2^31 elements
- String/Binary arrays: Limited to 2GB of data (use LargeString/LargeBinary for more)
- List arrays: Limited to 2^31 elements (use LargeList for more)
For large datasets, use chunked arrays or tables to work around single-array size limits.
Best Practices
- Pre-allocate: Call Reserve() before building arrays
- Use bulk methods: AppendValues() is faster than individual Append() calls
- Prefer chunked arrays: For large or growing datasets
- Zero-copy when possible: Use slicing instead of copying
- Check status: Always check Status and Result<T> return values
- Immutability: Remember arrays are immutable after construction