
Table

A Table is a two-dimensional dataset whose columns are stored as ChunkedArray objects.
import pyarrow as pa

data = {
    'n_legs': [2, 4, 5, 100],
    'animals': ['Flamingo', 'Horse', 'Brittle stars', 'Centipede']
}
table = pa.table(data)
print(table)

table()

Create a pyarrow.Table from Python data.
pa.table(data, schema=None, metadata=None)
data
dict, list, pandas.DataFrame
Dictionary of column names to arrays/lists, list of arrays, or pandas DataFrame.
schema
pyarrow.Schema
default:"None"
If provided, dictates the schema of the table. Otherwise inferred.
metadata
dict
default:"None"
Optional metadata for the schema.
table
pyarrow.Table
A Table constructed from the input data.

Properties

schema

Return the table’s schema.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
print(table.schema)
# a: int64
# b: int64
schema
pyarrow.Schema
The table’s schema.

num_rows

Number of rows in the table.
table = pa.table({'a': [1, 2, 3]})
print(table.num_rows)  # 3
num_rows
int
The number of rows.

num_columns

Number of columns in the table.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
print(table.num_columns)  # 2
num_columns
int
The number of columns.

column_names

List of column names.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
print(table.column_names)  # ['a', 'b']
column_names
list of str
The column names as a list.

columns

List of all columns as ChunkedArray objects.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
for col in table.columns:
    print(col)
columns
list of ChunkedArray
The columns as ChunkedArray objects.

Methods

column()

Select a column by name or index.
table.column(i)
i
int or str
Column index (int) or name (str).
column
pyarrow.ChunkedArray
The selected column.

select()

Select columns by names.
table.select(columns)
columns
list of str
List of column names to select.
table
pyarrow.Table
A new table with only the selected columns.

slice()

Compute a zero-copy slice of this table.
table.slice(offset=0, length=None)
offset
int
default:"0"
Offset from start of table to slice.
length
int
default:"None"
Length of slice (default is until end of table).
table
pyarrow.Table
A zero-copy slice of the table.

filter()

Filter rows of the table using a boolean selection filter.
table.filter(mask, null_selection_behavior='drop')
mask
Array or compute.Expression
Boolean array or expression to filter rows.
null_selection_behavior
str
default:"'drop'"
How to handle null values in the mask: 'drop' (rows with a null mask value are filtered out) or 'emit_null' (those rows become null).
table
pyarrow.Table
A filtered table.

take()

Select rows by indices.
table.take(indices)
indices
Array or list
Indices of rows to select.
table
pyarrow.Table
A table with selected rows.

sort_by()

Sort table by one or more columns.
table.sort_by(sorting, **kwargs)
sorting
str, list of (name, order) tuples
Column name or list of (name, order) tuples where order is 'ascending' or 'descending'.
table
pyarrow.Table
A sorted table.

group_by()

Group table by one or more columns.
table.group_by(keys)
keys
str or list of str
Column name(s) to group by.
grouped
pyarrow.TableGroupBy
A grouped table object that can be aggregated.

to_pydict()

Convert table to a Python dictionary.
table = pa.table({'a': [1, 2], 'b': [3, 4]})
py_dict = table.to_pydict()
print(py_dict)  # {'a': [1, 2], 'b': [3, 4]}
dict
dict
A dictionary with column names as keys and Python lists as values.

to_pandas()

Convert table to a pandas DataFrame.
table.to_pandas(self_destruct=False, **kwargs)
self_destruct
bool
default:"False"
If True, attempt to deallocate the source Arrow memory while converting, reducing peak memory use (experimental).
dataframe
pandas.DataFrame
A pandas DataFrame with the table data.

add_column()

Add a column to the table.
table.add_column(i, field_, column)
i
int
Index where to insert the column.
field_
str or Field
Column name or Field object.
column
Array or ChunkedArray
Column data.
table
pyarrow.Table
A new table with the added column.

remove_column()

Remove a column from the table.
table.remove_column(i)
i
int
Index of column to remove.
table
pyarrow.Table
A new table without the specified column.

rename_columns()

Rename columns in the table.
table.rename_columns(names)
names
list of str
New column names.
table
pyarrow.Table
A new table with renamed columns.

RecordBatch

A RecordBatch is a collection of equal-length arrays.
import pyarrow as pa

data = [
    pa.array([1, 2, 3]),
    pa.array(['a', 'b', 'c'])
]
schema = pa.schema([('nums', pa.int64()), ('letters', pa.string())])
batch = pa.record_batch(data, schema=schema)
print(batch)

record_batch()

Create a RecordBatch from arrays or a dictionary.
pa.record_batch(data, schema=None, metadata=None)
data
list of Array, dict
List of arrays or dictionary of column names to arrays.
schema
pyarrow.Schema
default:"None"
Schema defining the batch structure.
metadata
dict
default:"None"
Optional metadata.
batch
pyarrow.RecordBatch
A RecordBatch with the specified data.

Properties

schema

The schema of the record batch.
schema
pyarrow.Schema
The batch’s schema.

num_rows

Number of rows in the batch.
num_rows
int
The number of rows.

num_columns

Number of columns in the batch.
num_columns
int
The number of columns.

Methods

to_pydict()

Convert to a Python dictionary.
dict
dict
Dictionary with column names as keys.

to_pandas()

Convert to a pandas DataFrame.
dataframe
pandas.DataFrame
A pandas DataFrame.

Utility Functions

concat_tables()

Concatenate multiple tables into one.
pa.concat_tables(tables, promote=False, memory_pool=None)
tables
list of Table
Tables to concatenate.
promote
bool
default:"False"
If True, unify differing schemas: missing columns are filled with nulls and null-typed columns are promoted, so tables with different schemas can be concatenated.
memory_pool
pyarrow.MemoryPool
default:"None"
Memory pool for allocation.
table
pyarrow.Table
Concatenated table.

concat_batches()

Concatenate multiple record batches.
pa.concat_batches(batches, memory_pool=None)
batches
list of RecordBatch
Record batches to concatenate.
memory_pool
pyarrow.MemoryPool
default:"None"
Memory pool for allocation.
batch
pyarrow.RecordBatch
Concatenated record batch.

RecordBatchReader

Streaming reader for record batches.
import pyarrow as pa

# Create a reader from batches
batches = [pa.record_batch([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b'])]
reader = pa.RecordBatchReader.from_batches(batches[0].schema, batches)

for batch in reader:
    print(batch)

from_batches()

Create a reader from a list of record batches.
RecordBatchReader.from_batches(schema, batches)
schema
pyarrow.Schema
Schema of the batches.
batches
iterable of RecordBatch
Batches to read.
reader
pyarrow.RecordBatchReader
A batch reader.

read_all()

Read all batches and return as a Table.
reader.read_all()
table
pyarrow.Table
All batches combined into a table.

read_next_batch()

Read the next batch.
reader.read_next_batch()
batch
pyarrow.RecordBatch
The next batch, or raises StopIteration if exhausted.
