What is PyArrow?
PyArrow is the Python implementation of Apache Arrow. It provides a cross-language development platform for in-memory data with a standardized columnar memory format for efficient analytic operations on modern hardware.
import pyarrow as pa
# Display version information
print(pa.__version__)
Key Features
PyArrow provides:
- Columnar Memory Format: Efficient in-memory representation of tabular data
- Zero-Copy Reads: Share data between processes without serialization overhead
- Compute Functions: Vectorized operations on arrays and tables
- File Formats: Read/write Parquet, CSV, JSON, Feather, ORC
- Dataset API: Work with multi-file, partitioned datasets
- Pandas Integration: Seamless conversion to/from pandas DataFrames
Core Data Types
Arrays
Arrays are the fundamental data structure in PyArrow:
import pyarrow as pa
# Create an array from Python list
array = pa.array([1, 2, 3, 4, 5])
print(array)
# <pyarrow.lib.Int64Array object at ...>
# [
# 1,
# 2,
# 3,
# 4,
# 5
# ]
# Specify data type explicitly
array = pa.array([1, 2, 3], type=pa.int8())
print(array.type) # int8
Tables
Tables are collections of named columns:
import pyarrow as pa
# Create a table from Python dictionaries
table = pa.table({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [95.5, 87.3, 92.1]
})
print(table)
print(f"Columns: {table.column_names}")
print(f"Schema: {table.schema}")
Type System
PyArrow provides a rich type system:
import pyarrow as pa
# Numeric types
pa.int8(), pa.int16(), pa.int32(), pa.int64()
pa.uint8(), pa.uint16(), pa.uint32(), pa.uint64()
pa.float16(), pa.float32(), pa.float64()
# String and binary
pa.string(), pa.utf8() # UTF-8 encoded strings
pa.binary() # Variable-length binary
pa.large_string(), pa.large_binary() # 64-bit offsets, for more than 2 GiB per array
# Temporal types
pa.timestamp('ms') # Millisecond timestamp
pa.timestamp('us', tz='UTC') # With timezone
pa.date32(), pa.date64()
pa.time32('s'), pa.time64('us')
# Nested types
pa.list_(pa.int32()) # List of integers
pa.struct([('x', pa.int32()), ('y', pa.float64())]) # Struct
pa.map_(pa.string(), pa.int32()) # Map from string keys to int32 values
# Decimal types
pa.decimal128(10, 2) # Precision 10, scale 2
Schemas
Schemas define the structure of tables:
import pyarrow as pa
# Create a schema
schema = pa.schema([
    ('id', pa.int32()),
    ('name', pa.string()),
    ('timestamp', pa.timestamp('ms')),
    pa.field('metadata', pa.map_(pa.string(), pa.string()), nullable=True)
])
print(schema)
# Access field information
for field in schema:
    print(f"{field.name}: {field.type}, nullable={field.nullable}")
Pandas Integration
Arrow to Pandas
import pyarrow as pa
import pandas as pd
# Create Arrow table
table = pa.table({
    'a': [1, 2, 3],
    'b': ['x', 'y', 'z']
})
# Convert to pandas DataFrame
df = table.to_pandas()
print(type(df)) # <class 'pandas.core.frame.DataFrame'>
Pandas to Arrow
import pyarrow as pa
import pandas as pd
# Create pandas DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': ['x', 'y', 'z']
})
# Convert to Arrow table
table = pa.Table.from_pandas(df)
print(table.schema)
Zero-Copy
import pyarrow as pa
import pandas as pd
# Conversion can avoid copies for numeric columns without nulls
df = pd.DataFrame({'a': [1, 2, 3]})
table = pa.Table.from_pandas(df, preserve_index=False)
# Access a column as Arrow data (a ChunkedArray) without materializing a copy
arrow_array = table['a']
Memory Management
PyArrow provides fine-grained memory control:
import pyarrow as pa
# Get default memory pool
pool = pa.default_memory_pool()
print(f"Bytes allocated: {pool.bytes_allocated()}")
print(f"Backend: {pool.backend_name}")
# Switch to the jemalloc-backed pool when this build provides it
try:
    pa.set_memory_pool(pa.jemalloc_memory_pool())
except NotImplementedError:
    pass  # PyArrow was built without jemalloc support
PyArrow uses reference counting for memory management. Memory is automatically freed when objects go out of scope.
Verifying the Installation
Check your PyArrow installation:
import pyarrow as pa
# Show detailed version and build information
pa.show_versions()
# Show full system information
pa.show_info()
Output includes:
- PyArrow and Arrow C++ versions
- Compiler information
- Available modules (parquet, dataset, etc.)
- Filesystem support (S3, GCS, HDFS)
- Compression codecs
When to Use PyArrow
Use PyArrow for:
- Large datasets that don’t fit in memory (with Dataset API)
- Zero-copy data sharing between processes
- Efficient columnar operations
- Fast serialization/deserialization
- Integration with pandas, NumPy, and other libraries
Common Patterns
Creating Sample Data
import pyarrow as pa
from datetime import datetime
# Create a table with various types
table = pa.table({
    'id': pa.array([1, 2, 3], type=pa.int32()),
    'timestamp': pa.array([datetime.now()] * 3),
    'values': pa.array([[1, 2], [3, 4], [5, 6]], type=pa.list_(pa.int64())),
    # Map values are given as lists of key/value tuples
    'metadata': pa.array([[('key', 'value')]] * 3, type=pa.map_(pa.string(), pa.string()))
})
Handling Nulls
import pyarrow as pa
# Arrays can contain null values
array = pa.array([1, None, 3, None, 5])
print(f"Null count: {array.null_count}")
print(array.is_valid()) # BooleanArray: True where the slot is non-null
# Check individual elements
for i in range(len(array)):
    if array[i].is_valid:
        print(f"Index {i}: {array[i].as_py()}")
    else:
        print(f"Index {i}: NULL")