Arrays and tables are the fundamental data structures in Apache Arrow. Arrays represent one-dimensional sequences of values, while tables organize multiple arrays into two-dimensional datasets.

Arrays

The arrow::Array class represents a known-length sequence of values with a single data type. Arrays are immutable after construction.

Array Structure

Internally, arrays consist of:
  • Type: Logical data type (int32, utf8, list, etc.)
  • Length: Number of values
  • Null count: Number of null values
  • Buffers: One or more buffers containing:
    • Validity bitmap (null indicators)
    • Value data
    • Offsets (for variable-length types)
#include <arrow/api.h>

// Assuming `array` is a std::shared_ptr<arrow::Array> known to hold int64 values
auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array);

// Get null bitmap pointer
const uint8_t* null_bitmap = int64_array->null_bitmap_data();

// Get raw value data
const int64_t* data = int64_array->raw_values();

// Check if value is null
if (!int64_array->IsNull(index)) {
    int64_t value = int64_array->Value(index);
}

Building Arrays

Use ArrayBuilder classes to construct arrays incrementally:
#include <arrow/api.h>

arrow::Int64Builder builder;
builder.Append(1);
builder.Append(2);
builder.Append(3);
builder.AppendNull();  // Add null value
builder.Append(5);

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(builder.Finish(&array));
For best performance, call Reserve() or Resize() before building, and use bulk AppendValues() methods when possible. Note that Append() also returns a Status; the examples here leave it unchecked for brevity, but production code should handle it (for example with ARROW_RETURN_NOT_OK).

String and Binary Arrays

String and binary types use offset buffers for variable-length data:
#include <arrow/api.h>
#include <iostream>

arrow::StringBuilder builder;
builder.Append("Hello");
builder.Append("World");
builder.AppendNull();
builder.Append("Arrow");

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(builder.Finish(&array));

auto string_array = std::static_pointer_cast<arrow::StringArray>(array);
std::cout << string_array->GetString(0) << std::endl;  // "Hello"

List Arrays

List arrays contain nested sequences:
#include <arrow/api.h>

// Build list<int32>
auto value_builder = std::make_shared<arrow::Int32Builder>();
arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);

// Append [1, 2, 3]
list_builder.Append();
value_builder->Append(1);
value_builder->Append(2);
value_builder->Append(3);

// Append [4, 5]
list_builder.Append();
value_builder->Append(4);
value_builder->Append(5);

// Append null
list_builder.AppendNull();

std::shared_ptr<arrow::Array> array;
ARROW_RETURN_NOT_OK(list_builder.Finish(&array));

Struct Arrays

Struct arrays bundle multiple fields together:
#include <arrow/api.h>

// Create builders for each field
arrow::Int32Builder id_builder;
arrow::StringBuilder name_builder;

// Append values
id_builder.Append(1);
name_builder.Append("Alice");

id_builder.Append(2);
name_builder.Append("Bob");

// Finish individual arrays
std::shared_ptr<arrow::Array> id_array, name_array;
ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

// Create struct array
std::vector<std::shared_ptr<arrow::Field>> fields = {
    arrow::field("id", arrow::int32()),
    arrow::field("name", arrow::utf8())
};

ARROW_ASSIGN_OR_RAISE(auto struct_array,
    arrow::StructArray::Make({id_array, name_array}, fields));

Creating Arrays from JSON

For testing and prototyping, use the JSON helpers:
#include <arrow/json/from_string.h>

using arrow::json::ArrayFromJSONString;

// Simple types
ARROW_ASSIGN_OR_RAISE(auto int_array,
    ArrayFromJSONString(arrow::int32(), "[1, 2, 3, null, 5]"));

ARROW_ASSIGN_OR_RAISE(auto str_array,
    ArrayFromJSONString(arrow::utf8(), 
        R"(["Hello", "World", null, "Arrow"])"));

// Complex types
ARROW_ASSIGN_OR_RAISE(auto list_array,
    ArrayFromJSONString(arrow::list(arrow::int64()),
        "[[1, 2], null, [3, 4, 5], []]"));

ARROW_ASSIGN_OR_RAISE(auto struct_array,
    ArrayFromJSONString(
        arrow::struct_({
            arrow::field("x", arrow::int32()),
            arrow::field("y", arrow::int32())
        }),
        "[[1, 2], [3, 4], null]"));
The JSON helpers are convenient but not optimized for performance. For production workloads, use the appropriate builders or read from CSV/Parquet.

Chunked Arrays

ChunkedArray represents a logical sequence of values split across multiple array chunks:
#include <arrow/api.h>
#include <iostream>

std::vector<std::shared_ptr<arrow::Array>> chunks;

// Build first chunk
arrow::Int64Builder builder;
builder.AppendValues({1, 2, 3});
std::shared_ptr<arrow::Array> chunk1;
ARROW_RETURN_NOT_OK(builder.Finish(&chunk1));
chunks.push_back(chunk1);

// Build second chunk
builder.Reset();
builder.AppendValues({4, 5, 6, 7});
std::shared_ptr<arrow::Array> chunk2;
ARROW_RETURN_NOT_OK(builder.Finish(&chunk2));
chunks.push_back(chunk2);

// Create chunked array
auto chunked = std::make_shared<arrow::ChunkedArray>(chunks);

std::cout << "Chunks: " << chunked->num_chunks() << std::endl;      // 2
std::cout << "Length: " << chunked->length() << std::endl;          // 7
std::cout << "Null count: " << chunked->null_count() << std::endl;  // 0

When to Use Chunked Arrays

Chunked arrays are useful for:
  • Datasets larger than the single-array limit of 2^31 - 1 elements
  • Incremental construction without pre-knowing total size
  • Memory efficiency when working with multiple sources
  • Tables which always use chunked arrays internally

Schemas

Schemas describe the structure of tabular data:
#include <arrow/api.h>
#include <iostream>

// Create fields
auto field_a = arrow::field("id", arrow::int32());
auto field_b = arrow::field("name", arrow::utf8());
auto field_c = arrow::field("score", arrow::float64());

// Create schema
auto schema = arrow::schema({field_a, field_b, field_c});

// Access schema properties
std::cout << "Fields: " << schema->num_fields() << std::endl;
std::cout << "Field 0: " << schema->field(0)->name() << std::endl;

// Add metadata
auto metadata = arrow::key_value_metadata({
    {"source", "database"},
    {"version", "1.0"}
});
auto schema_with_meta = schema->WithMetadata(metadata);

Tables

Table is a two-dimensional dataset with chunked arrays as columns:
#include <arrow/api.h>
#include <iostream>

// Build arrays
arrow::Int32Builder id_builder;
arrow::StringBuilder name_builder;

id_builder.AppendValues({1, 2, 3, 4});
name_builder.AppendValues({"Alice", "Bob", "Charlie", "David"});

std::shared_ptr<arrow::Array> id_array, name_array;
ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

// Create schema
auto schema = arrow::schema({
    arrow::field("id", arrow::int32()),
    arrow::field("name", arrow::utf8())
});

// Create table
auto table = arrow::Table::Make(schema, {id_array, name_array});

std::cout << "Rows: " << table->num_rows() << std::endl;       // 4
std::cout << "Columns: " << table->num_columns() << std::endl;  // 2

// Access columns
std::shared_ptr<arrow::ChunkedArray> col0 = table->column(0);
std::cout << "Column 0 name: " << table->schema()->field(0)->name() << std::endl;

Table Operations

// Slice rows
std::shared_ptr<arrow::Table> slice = table->Slice(1, 2);
// Returns rows 1-2 (length 2)

// Slice from offset to end
std::shared_ptr<arrow::Table> tail = table->Slice(2);

Record Batches

RecordBatch is similar to a table but with contiguous (non-chunked) arrays:
#include <arrow/api.h>

// Create arrays (same as table example)
arrow::Int32Builder id_builder;
arrow::StringBuilder name_builder;

id_builder.AppendValues({1, 2, 3});
name_builder.AppendValues({"Alice", "Bob", "Charlie"});

std::shared_ptr<arrow::Array> id_array, name_array;
ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

// Create schema
auto schema = arrow::schema({
    arrow::field("id", arrow::int32()),
    arrow::field("name", arrow::utf8())
});

// Create record batch
auto batch = arrow::RecordBatch::Make(
    schema, 
    3,  // num_rows
    {id_array, name_array}
);

Table vs Record Batch

Feature     | Table                        | Record Batch
------------|------------------------------|---------------------------
Columns     | Chunked arrays               | Contiguous arrays
Size        | Can be very large            | Limited by chunk size
Use case    | Working with large datasets  | IPC, serialization, incremental processing
Portability | C++ concept only             | Part of the Arrow format (IPC)

Converting Between Tables and Batches

// Table to record batches
auto reader = std::make_shared<arrow::TableBatchReader>(*table);
reader->set_chunksize(1000);  // Batch size

std::shared_ptr<arrow::RecordBatch> batch;
while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;
    // Process batch
}

// Record batches to table
std::vector<std::shared_ptr<arrow::RecordBatch>> batches = {batch1, batch2};
ARROW_ASSIGN_OR_RAISE(auto table,
    arrow::Table::FromRecordBatches(batches));

Slicing

Zero-copy slicing creates views into existing data:
// Slice array (offset, length)
std::shared_ptr<arrow::Array> slice = array->Slice(2, 5);

// Slice from offset to end
std::shared_ptr<arrow::Array> tail = array->Slice(10);

// Slice chunked array
std::shared_ptr<arrow::ChunkedArray> chunked_slice = 
    chunked_array->Slice(100, 50);

// Slice table
std::shared_ptr<arrow::Table> table_slice = table->Slice(0, 1000);

// Slice record batch
std::shared_ptr<arrow::RecordBatch> batch_slice = batch->Slice(10, 90);
Slicing is a zero-copy operation. The sliced object shares memory with the original, making it very efficient.

Size Limitations

Be aware of Arrow’s size constraints:
  • Arrays: Most implementations limit arrays to 2^31 - 1 elements
  • String/Binary arrays: Limited to 2 GB of data per array by their 32-bit offsets (use LargeString/LargeBinary for more)
  • List arrays: Limited to 2^31 - 1 child elements (use LargeList for more)
For large datasets, use chunked arrays or tables to work around single-array size limits.

Best Practices

  1. Pre-allocate: Call Reserve() before building arrays
  2. Use bulk methods: AppendValues() is faster than individual Append() calls
  3. Prefer chunked arrays: For large or growing datasets
  4. Zero-copy when possible: Use slicing instead of copying
  5. Check status: Always check Status and Result<T> return values
  6. Immutability: Remember arrays are immutable after construction
