
Table

A Table is a two-dimensional dataset whose columns are stored as ChunkedArray objects.
import pyarrow as pa

data = {
    'n_legs': [2, 4, 5, 100],
    'animals': ['Flamingo', 'Horse', 'Brittle stars', 'Centipede']
}
table = pa.table(data)
print(table)

table()

Create a pyarrow.Table from Python data.
pa.table(data, schema=None, metadata=None)
data
dict, list, pandas.DataFrame
Dictionary of column names to arrays/lists, list of arrays, or pandas DataFrame.
schema
pyarrow.Schema
default:"None"
If provided, dictates the schema of the table. Otherwise inferred.
metadata
dict
default:"None"
Optional metadata for the schema.
table
pyarrow.Table
A Table constructed from the input data.

Properties

schema

Return the table’s schema.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
print(table.schema)
# a: int64
# b: int64
schema
pyarrow.Schema
The table’s schema.

num_rows

Number of rows in the table.
table = pa.table({'a': [1, 2, 3]})
print(table.num_rows)  # 3
num_rows
int
The number of rows.

num_columns

Number of columns in the table.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
print(table.num_columns)  # 2
num_columns
int
The number of columns.

column_names

List of column names.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
print(table.column_names)  # ['a', 'b']
column_names
list of str
The column names as a list.

columns

List of all columns as ChunkedArray objects.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
for col in table.columns:
    print(col)
columns
list of ChunkedArray
The columns as ChunkedArray objects.

Methods

column()

Select a column by name or index.
table.column(i)
i
int or str
Column index (int) or name (str).
column
pyarrow.ChunkedArray
The selected column.

select()

Select columns by names.
table.select(columns)
columns
list of str
List of column names to select.
table
pyarrow.Table
A new table with only the selected columns.

slice()

Compute a zero-copy slice of this table.
table.slice(offset=0, length=None)
offset
int
default:"0"
Offset from start of table to slice.
length
int
default:"None"
Length of slice (default is until end of table).
table
pyarrow.Table
A zero-copy slice of the table.

filter()

Filter rows of the table using a boolean selection filter.
table.filter(mask, null_selection_behavior='drop')
mask
Array or compute.Expression
Boolean array or expression to filter rows.
null_selection_behavior
str
default:"'drop'"
How to handle null values in the mask: 'drop' (rows with a null mask value are filtered out) or 'emit_null' (those rows become null).
table
pyarrow.Table
A filtered table.

take()

Select rows by indices.
table.take(indices)
indices
Array or list
Indices of rows to select.
table
pyarrow.Table
A table with selected rows.

sort_by()

Sort table by one or more columns.
table.sort_by(sorting, **kwargs)
sorting
str, list of (name, order) tuples
Column name or list of (name, order) tuples where order is 'ascending' or 'descending'.
table
pyarrow.Table
A sorted table.

group_by()

Group table by one or more columns.
table.group_by(keys)
keys
str or list of str
Column name(s) to group by.
grouped
pyarrow.TableGroupBy
A grouped table object that can be aggregated.

to_pydict()

Convert table to a Python dictionary.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
py_dict = table.to_pydict()
print(py_dict)  # {'a': [1, 2], 'b': [3, 4]}
dict
dict
A dictionary with column names as keys and Python lists as values.

to_pandas()

Convert table to a pandas DataFrame.
table.to_pandas(self_destruct=False, **kwargs)
self_destruct
bool
default:"False"
If True, attempt to deallocate the source Arrow memory while converting, reducing peak memory use (experimental).
dataframe
pandas.DataFrame
A pandas DataFrame with the table data.

add_column()

Add a column to the table.
table.add_column(i, field_, column)
i
int
Index where to insert the column.
field_
str or Field
Column name or Field object.
column
Array or ChunkedArray
Column data.
table
pyarrow.Table
A new table with the added column.

remove_column()

Remove a column from the table.
table.remove_column(i)
i
int
Index of column to remove.
table
pyarrow.Table
A new table without the specified column.

rename_columns()

Rename columns in the table.
table.rename_columns(names)
names
list of str
New column names.
table
pyarrow.Table
A new table with renamed columns.

RecordBatch

A RecordBatch is a collection of equal-length arrays.
import pyarrow as pa

data = [
    pa.array([1, 2, 3]),
    pa.array(['a', 'b', 'c'])
]
schema = pa.schema([('nums', pa.int64()), ('letters', pa.string())])
batch = pa.record_batch(data, schema=schema)
print(batch)

record_batch()

Create a RecordBatch from arrays or a dictionary.
pa.record_batch(data, schema=None, metadata=None)
data
list of Array, dict
List of arrays or dictionary of column names to arrays.
schema
pyarrow.Schema
default:"None"
Schema defining the batch structure.
metadata
dict
default:"None"
Optional metadata.
batch
pyarrow.RecordBatch
A RecordBatch with the specified data.

Properties

schema

The schema of the record batch.
schema
pyarrow.Schema
The batch’s schema.

num_rows

Number of rows in the batch.
num_rows
int
The number of rows.

num_columns

Number of columns in the batch.
num_columns
int
The number of columns.

Methods

to_pydict()

Convert to a Python dictionary.
dict
dict
Dictionary with column names as keys.

to_pandas()

Convert to a pandas DataFrame.
dataframe
pandas.DataFrame
A pandas DataFrame.

Utility Functions

concat_tables()

Concatenate multiple tables into one.
pa.concat_tables(tables, promote=False, memory_pool=None)
tables
list of Table
Tables to concatenate.
promote
bool
default:"False"
If True, unify differing schemas: missing columns are filled with nulls and null-typed columns are promoted, so tables with different schemas can be concatenated.
memory_pool
pyarrow.MemoryPool
default:"None"
Memory pool for allocation.
table
pyarrow.Table
Concatenated table.

concat_batches()

Concatenate multiple record batches.
pa.concat_batches(batches, memory_pool=None)
batches
list of RecordBatch
Record batches to concatenate.
memory_pool
pyarrow.MemoryPool
default:"None"
Memory pool for allocation.
batch
pyarrow.RecordBatch
Concatenated record batch.

RecordBatchReader

Streaming reader for record batches.
import pyarrow as pa

# Create a reader from batches
batches = [pa.record_batch([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b'])]
reader = pa.RecordBatchReader.from_batches(batches[0].schema, batches)

for batch in reader:
    print(batch)

from_batches()

Create a reader from a list of record batches.
RecordBatchReader.from_batches(schema, batches)
schema
pyarrow.Schema
Schema of the batches.
batches
iterable of RecordBatch
Batches to read.
reader
pyarrow.RecordBatchReader
A batch reader.

read_all()

Read all batches and return as a Table.
reader.read_all()
table
pyarrow.Table
All batches combined into a table.

read_next_batch()

Read the next batch.
reader.read_next_batch()
batch
pyarrow.RecordBatch
The next batch, or raises StopIteration if exhausted.
