Installation and Setup

Installation

PyArrow is available via multiple package managers:

pip

# Install latest stable version
pip install pyarrow

# Install specific version
pip install pyarrow==15.0.0

# Parquet support is bundled with the official wheels, so no extras are required

conda

# Install from conda-forge channel (recommended)
conda install -c conda-forge pyarrow

# Install specific version
conda install -c conda-forge pyarrow=15.0.0

Development

# Install development version from source
pip install git+https://github.com/apache/arrow.git@main#subdirectory=python

# Or clone and build locally
git clone https://github.com/apache/arrow.git
cd arrow/python
pip install -e .

Verifying Installation

Check that PyArrow is installed correctly:

import pyarrow as pa

# Display version
print(f"PyArrow version: {pa.__version__}")

# Check C++ library version
print(f"Arrow C++ version: {pa.cpp_version}")

# Print detailed version and build information
pa.show_versions()
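
Printing versions only proves that the package imports. A short round trip through an in-memory table also exercises the native library. The following is a minimal sketch; the helper name `smoke_test` and the sample data are illustrative and not part of PyArrow.

```python
def smoke_test():
    """Build a tiny table and return (row count, column names).

    Returns None when PyArrow cannot be imported. The function name and
    the sample data are illustrative, not part of the PyArrow API.
    """
    try:
        import pyarrow as pa
    except ImportError:
        return None

    # pa.table() accepts a dict mapping column names to Python lists.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    return table.num_rows, table.column_names


if __name__ == "__main__":
    print(smoke_test())
```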

System Requirements

Python Version

PyArrow supports:

Python 3.8 or later
CPython implementation (PyPy not officially supported)
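
The two requirements above can be checked from inside the interpreter that will run PyArrow. This sketch uses only the standard library; the function name `interpreter_problems` and its parameters are hypothetical, and the minimum version is simply the one stated in this section.

```python
import platform
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in this section


def interpreter_problems(version=None, implementation=None):
    """Return reasons the interpreter falls outside the stated support.

    Both arguments default to the running interpreter. They are parameters
    only so the check can be exercised with other values.
    """
    if version is None:
        version = sys.version_info[:2]
    version = tuple(version)
    implementation = implementation or platform.python_implementation()

    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.8")
    if implementation != "CPython":
        problems.append(f"{implementation} is not officially supported")
    return problems


if __name__ == "__main__":
    print(interpreter_problems() or "Interpreter looks supported")
```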

Operating Systems

Linux: x86_64, aarch64 (ARM64)
macOS: x86_64 (Intel), arm64 (Apple Silicon)
Windows: x86_64
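
A similar check works for the platform list. The sketch below encodes the table above as a dictionary; `is_supported_platform` is a hypothetical helper, and the `AMD64` alias is included because that is the string `platform.machine()` reports on 64-bit Windows.

```python
import platform

# Platform list copied from this section. "Darwin" is the name
# platform.system() reports on macOS.
SUPPORTED = {
    "Linux": {"x86_64", "aarch64"},
    "Darwin": {"x86_64", "arm64"},
    "Windows": {"x86_64", "AMD64"},
}


def is_supported_platform(system=None, machine=None):
    """Return True if the OS/architecture pair appears in SUPPORTED."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return machine in SUPPORTED.get(system, set())


if __name__ == "__main__":
    print(platform.system(), platform.machine(), is_supported_platform())
```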

Optional Dependencies

PyArrow has optional dependencies for additional functionality:

# Pandas integration
pip install pandas

# NumPy for array operations
pip install numpy

# Cloud filesystems (S3, Google Cloud Storage, Azure Blob Storage)
# PyArrow does not publish pip extras for these; support is compiled into
# the package itself, and availability depends on the build you installed.
# See "Checking Available Features" below to test for each filesystem.

Configuration

Memory Pool

Configure the memory allocator:

import pyarrow as pa

# Use jemalloc if this build includes it
try:
    pa.set_memory_pool(pa.jemalloc_memory_pool())
except NotImplementedError:
    print("jemalloc is not available in this build")

# Or use system allocator
pa.set_memory_pool(pa.system_memory_pool())

# Check current pool
pool = pa.default_memory_pool()
print(f"Using memory pool: {pool.backend_name}")
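
When the same script runs on several builds, it can be convenient to fall back through a list of allocators. This sketch assumes the `*_memory_pool()` factory functions raise `NotImplementedError` for backends that were not compiled in; `pick_memory_pool` and its `preferred` parameter are hypothetical names.

```python
def pick_memory_pool(preferred=("jemalloc", "mimalloc", "system")):
    """Return the first memory pool from `preferred` that this build provides.

    Falls back to the current default pool if none of the names work.
    """
    import pyarrow as pa

    for name in preferred:
        # Looks up pa.jemalloc_memory_pool, pa.mimalloc_memory_pool, ...
        factory = getattr(pa, f"{name}_memory_pool", None)
        if factory is None:
            continue
        try:
            return factory()
        except NotImplementedError:
            # Backend not compiled into this build; try the next one.
            continue
    return pa.default_memory_pool()


if __name__ == "__main__":
    import pyarrow as pa

    pa.set_memory_pool(pick_memory_pool())
    print(pa.default_memory_pool().backend_name)
```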

CPU Count

Control parallelism:

import pyarrow as pa

# Get default CPU count
print(f"Default CPU count: {pa.cpu_count()}")

# Set custom CPU count for parallel operations
pa.set_cpu_count(4)

# Get I/O thread pool size
print(f"I/O threads: {pa.io_thread_count()}")

# Set I/O thread count
pa.set_io_thread_count(8)
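
Because `pa.set_cpu_count()` changes a process-wide setting, it can be useful to restore the previous value after a bounded piece of work. The context manager below is a sketch of that pattern; `temporary_cpu_count` is a hypothetical helper, and it is not safe if other threads change the setting at the same time.

```python
import contextlib


@contextlib.contextmanager
def temporary_cpu_count(count):
    """Set PyArrow's CPU thread count for the duration of a with-block."""
    import pyarrow as pa

    previous = pa.cpu_count()
    pa.set_cpu_count(count)
    try:
        yield
    finally:
        # Restore the earlier value even if the block raised.
        pa.set_cpu_count(previous)


if __name__ == "__main__":
    import pyarrow as pa

    with temporary_cpu_count(2):
        print(f"Inside block: {pa.cpu_count()}")
    print(f"After block: {pa.cpu_count()}")
```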

Environment Variables

PyArrow respects several environment variables:

# Arrow installation prefix (used when building from source)
export ARROW_HOME=/path/to/arrow

# Select the default memory pool backend (jemalloc, mimalloc, or system)
export ARROW_DEFAULT_MEMORY_POOL=system

# Set timezone database path
export ARROW_TIMEZONE_DATABASE=/path/to/tzdata
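
Variables exported in one shell are easy to lose in another, so it can help to confirm what the Python process actually sees. This standard-library sketch reports only the variable names used in the example above; `arrow_environment` is a hypothetical helper.

```python
import os

# Variable names taken from the shell example above.
ARROW_ENV_VARS = (
    "ARROW_HOME",
    "ARROW_DEFAULT_MEMORY_POOL",
    "ARROW_TIMEZONE_DATABASE",
)


def arrow_environment(environ=None):
    """Map each variable name to its value, or None if it is unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in ARROW_ENV_VARS}


if __name__ == "__main__":
    for name, value in arrow_environment().items():
        print(f"{name}={value if value is not None else '(unset)'}")
```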

Building from Source

Building from source is only necessary if you need development features or custom build options.

Prerequisites

# Ubuntu/Debian
sudo apt-get install build-essential cmake

# macOS
brew install cmake

# Windows
# Install Visual Studio 2019 or later with C++ support

Build Steps

# Clone the repository
git clone https://github.com/apache/arrow.git
cd arrow

# Create build directory
mkdir cpp/build
cd cpp/build

# Configure with CMake
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DARROW_PYTHON=ON \
  -DARROW_PARQUET=ON \
  -DARROW_DATASET=ON

# Build C++ library
make -j8

# Install Python package
cd ../../python
pip install -e .

Build Options

Minimal Build

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DARROW_PYTHON=ON

With Parquet

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DARROW_PYTHON=ON \
  -DARROW_PARQUET=ON

Full Features

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DARROW_PYTHON=ON \
  -DARROW_PARQUET=ON \
  -DARROW_DATASET=ON \
  -DARROW_CSV=ON \
  -DARROW_JSON=ON \
  -DARROW_COMPUTE=ON \
  -DARROW_S3=ON
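
The three configurations differ only in which `-DARROW_*` flags are switched on, so a build script can assemble the command from a feature list. The sketch below is one way to do that; `cmake_command` is a hypothetical helper, and it knows only the flags that appear in this section.

```python
# Flags drawn from the build configurations in this section.
BASE_FLAGS = ["-DCMAKE_BUILD_TYPE=Release", "-DARROW_PYTHON=ON"]
KNOWN_FEATURES = ("PARQUET", "DATASET", "CSV", "JSON", "COMPUTE", "S3")


def cmake_command(features=(), source_dir=".."):
    """Return the cmake argument list for the requested optional features."""
    names = [feature.upper() for feature in features]
    unknown = [name for name in names if name not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"Unknown feature(s): {', '.join(unknown)}")
    feature_flags = [f"-DARROW_{name}=ON" for name in names]
    return ["cmake", source_dir, *BASE_FLAGS, *feature_flags]


if __name__ == "__main__":
    # Equivalent to the "With Parquet" configuration.
    print(" ".join(cmake_command(["parquet"])))
```

The returned list can be passed to `subprocess.run(..., check=True, cwd="cpp/build")` to run the configure step from Python.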

Checking Available Features

Verify which features are enabled:

import pyarrow as pa

# Check module availability
modules = ['csv', 'parquet', 'dataset', 'compute', 'flight', 'fs']
for module in modules:
    try:
        __import__(f'pyarrow.{module}')
        print(f"{module}: Available")
    except ImportError:
        print(f"{module}: Not available")

# Check filesystem support
from pyarrow import fs
print(f"S3: {hasattr(fs, 'S3FileSystem')}")
print(f"GCS: {hasattr(fs, 'GcsFileSystem')}")
print(f"HDFS: {hasattr(fs, 'HadoopFileSystem')}")

# Check compression codecs
from pyarrow import Codec
codecs = ['gzip', 'snappy', 'lz4', 'zstd', 'brotli']
for codec in codecs:
    print(f"{codec}: {Codec.is_available(codec)}")
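
For scripts that need the result rather than printed output, the module check can return a dictionary instead. This sketch uses `importlib.import_module` from the standard library; `available_modules` and its `package` parameter are hypothetical names.

```python
import importlib


def available_modules(names, package="pyarrow"):
    """Map each submodule name to True if it imports, False otherwise."""
    status = {}
    for name in names:
        try:
            importlib.import_module(f"{package}.{name}")
            status[name] = True
        except ImportError:
            # Also covers a missing parent package (ModuleNotFoundError).
            status[name] = False
    return status


if __name__ == "__main__":
    modules = ["csv", "parquet", "dataset", "compute", "flight", "fs"]
    for name, ok in available_modules(modules).items():
        print(f"{name}: {'Available' if ok else 'Not available'}")
```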

Troubleshooting

Import Errors

If you encounter import errors:

# Check if PyArrow is installed
try:
    import pyarrow as pa
    print(f"PyArrow {pa.__version__} found")
except ImportError as e:
    print(f"PyArrow not found: {e}")
    print("Install with: pip install pyarrow")
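
A common cause of import errors is installing the package into one environment and running a different interpreter. The sketch below prints the active interpreter and asks the standard library's `importlib.metadata` whether a `pyarrow` distribution is installed for it, without importing the package; `installed_version` is a hypothetical helper.

```python
import sys
from importlib import metadata


def installed_version(distribution="pyarrow"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(f"Interpreter: {sys.executable}")
    print(f"pyarrow distribution: {installed_version() or 'not installed'}")
```

If the distribution is reported as installed but the import still fails, the problem is more likely in the native libraries than in the Python packaging.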

Version Conflicts

Check for version mismatches:

import pyarrow as pa

# Show detailed build information
print(pa.build_info)
print(f"Package kind: {pa.build_info.cpp_build_info.package_kind}")
print(f"Build type: {pa.build_info.build_type}")

Memory Issues

If you encounter memory problems:

import pyarrow as pa

# Monitor memory usage
pool = pa.default_memory_pool()
print(f"Allocated: {pool.bytes_allocated()} bytes")
print(f"Peak allocation: {pool.max_memory()} bytes")

# Use a different allocator, if this build includes it
try:
    pa.set_memory_pool(pa.jemalloc_memory_pool())
except NotImplementedError:
    print("jemalloc is not available in this build")
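
To see how much memory a particular operation takes, compare `bytes_allocated()` before and after it. The following is a rough sketch: `measure_allocation` is a hypothetical helper, and the figure is only meaningful when nothing else allocates from the default pool at the same time.

```python
def measure_allocation(build):
    """Call `build()` and return (result, bytes added to the default pool).

    The result is returned so the caller keeps it alive; once it is
    garbage-collected, the pool's allocated byte count drops again.
    """
    import pyarrow as pa

    pool = pa.default_memory_pool()
    before = pool.bytes_allocated()
    result = build()
    return result, pool.bytes_allocated() - before


if __name__ == "__main__":
    import pyarrow as pa

    array, delta = measure_allocation(lambda: pa.array(list(range(1_000_000))))
    print(f"{delta} bytes allocated for {len(array)} values")
```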

Library Symlinks

On Linux/macOS, create library symlinks if needed:

import pyarrow as pa

# Create symlinks for bundled libraries
# (Only needed for building C++ extensions against PyArrow)
pa.create_library_symlinks()

IDE Setup

VS Code
PyCharm
Jupyter

// .vscode/settings.json
{
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": true,
  "python.analysis.extraPaths": [],
  "python.autoComplete.extraPaths": []
}

# Install Jupyter
pip install jupyter

# Start Jupyter notebook
jupyter notebook

# In notebook:
import pyarrow as pa
pa.show_versions()

Docker Setup

Use PyArrow in Docker:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install PyArrow
RUN pip install --no-cache-dir pyarrow pandas

# Copy application
COPY . /app
WORKDIR /app

CMD ["python", "app.py"]

Next Steps

Now that PyArrow is installed:

Overview - Learn about PyArrow’s core concepts
Tables and Arrays - Work with columnar data
Compute Functions - Perform vectorized operations
Parquet Files - Read and write Parquet files
