This guide covers debugging techniques, tools, and best practices for Apache Arrow development across C++, Python, and R implementations.

Build Configuration for Debugging

Proper build configuration is essential for effective debugging.

Debug Build

Build Arrow C++ in debug mode for better debugging:
cd arrow/cpp
mkdir build-debug
cd build-debug
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_BUILD_TESTS=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON
make -j8
Key options:
  • CMAKE_BUILD_TYPE=Debug: Disables optimizations, enables debug symbols
  • ARROW_EXTRA_ERROR_CONTEXT=ON: Provides additional error context information
  • Debug builds run noticeably slower than release builds, so use them for debugging rather than for benchmarking

RelWithDebInfo Build

For debugging with some optimizations:
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
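This keeps debug symbols while leaving optimizations enabled, so the program runs close to release speed, but stepping may skip lines and some variables may be reported as optimized out.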
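The configure commands in this section differ only in their -D options. As a minimal sketch, the helper below assembles such a command line from a build type and a dictionary of options, which can be handy in scripts that switch between Debug and RelWithDebInfo trees. The name `cmake_configure_args` is hypothetical and not part of Arrow's tooling; the option names are the ones shown above.

```python
def cmake_configure_args(build_type, options=None, source_dir=".."):
    """Return the argv list for a CMake configure step.

    Hypothetical helper for illustration. Boolean option values are
    rendered as ON/OFF; everything else is passed through as a string.
    """
    args = ["cmake", source_dir, f"-DCMAKE_BUILD_TYPE={build_type}"]
    for name, value in (options or {}).items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        args.append(f"-D{name}={value}")
    return args


debug_args = cmake_configure_args(
    "Debug",
    {"ARROW_BUILD_TESTS": True, "ARROW_EXTRA_ERROR_CONTEXT": True},
)
print(" ".join(debug_args))
```

The returned list can be passed directly to `subprocess.run(debug_args, cwd="build-debug", check=True)`.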

Compiler Warning Level

Set BUILD_WARNING_LEVEL=CHECKIN for stricter warnings:
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DBUILD_WARNING_LEVEL=CHECKIN
With GCC or Clang, this adds -Werror (treat warnings as errors); with MSVC, it adds /WX.

Debugging Tools

GDB (GNU Debugger)

GDB is the primary debugger for C++ on Linux.

1. Launch program in GDB

gdb --args ./build-debug/arrow-array-test

2. Set breakpoints

# Break at function
(gdb) break arrow::Array::Validate

# Break at file:line
(gdb) break array.cc:123

# Conditional breakpoint
(gdb) break array.cc:123 if length > 1000

3. Run the program

(gdb) run

4. Navigate execution

(gdb) next      # Step over
(gdb) step      # Step into
(gdb) continue  # Continue to next breakpoint
(gdb) finish    # Run until function returns

5. Inspect variables

(gdb) print variable_name
(gdb) print *pointer
(gdb) print array->length()

6. View backtrace

(gdb) backtrace
(gdb) bt full  # With local variables
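
The interactive steps above can also be scripted. GDB's --batch mode runs the commands given with -ex in order and then exits, which is convenient for capturing a backtrace from a crashing test without an interactive session. The sketch below only assembles the command line; `gdb_batch_command` is a hypothetical helper name, and the test binary path is the one used in this guide.

```python
import shlex


def gdb_batch_command(program, program_args=(), gdb_commands=("run", "bt")):
    """Build a non-interactive gdb invocation (hypothetical helper).

    Each entry in gdb_commands is passed with -ex and executed in order;
    --batch makes gdb exit once the commands have run.
    """
    cmd = ["gdb", "--batch"]
    for gdb_command in gdb_commands:
        cmd += ["-ex", gdb_command]
    cmd += ["--args", program, *program_args]
    return cmd


cmd = gdb_batch_command(
    "./build-debug/arrow-array-test",
    ["--gtest_filter=TestArray.TestBasics"],
    gdb_commands=("break arrow::Array::Validate", "run", "bt"),
)
# shlex.join quotes each argument so the line can be pasted into a shell
print(shlex.join(cmd))
```

To execute it from a script, pass the list to `subprocess.run(cmd)` and capture the output.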

LLDB (LLVM Debugger)

LLDB is the primary debugger on macOS and an alternative on Linux.
# Launch in LLDB
lldb -- ./build-debug/arrow-array-test

# Set breakpoint
(lldb) breakpoint set --name arrow::Array::Validate
(lldb) b array.cc:123

# Run
(lldb) run

# Navigate
(lldb) next
(lldb) step
(lldb) continue

# Inspect
(lldb) print variable_name
(lldb) frame variable

# Backtrace
(lldb) bt

Visual Studio Code

VS Code can debug Arrow C++ tests through its C/C++ extension, which drives GDB or LLDB under the hood:

1. Install C++ extension

Install the “C/C++” extension by Microsoft.

2. Create launch configuration

Create .vscode/launch.json:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Arrow Test",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/cpp/build-debug/arrow-array-test",
      "args": [],
      "stopAtEntry": false,
      "cwd": "${workspaceFolder}",
      "environment": [],
      "MIMode": "gdb"
    }
  ]
}

3. Set breakpoints and debug

Click in the gutter next to line numbers to set breakpoints, then press F5 to start debugging.

Python Debugging

Built-in Python debugger:
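import pdb

import pyarrow as pa

# Pause here and open the pdb prompt
pdb.set_trace()

# On Python 3.7+, the built-in breakpoint() does the same without the import:
# breakpoint()

array = pa.array([1, 2, 3])  # step over this line with "n", then inspect with "p array"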
Common pdb commands:
n       # Next line
s       # Step into
c       # Continue
l       # List code
p var   # Print variable
w       # Where (show stack trace)
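
When the failure is an exception rather than a place you already know, pdb can also be opened after the fact at the frame that raised. The following is a minimal sketch using the standard library's `pdb.post_mortem`; the wrapper `run_with_post_mortem` and the `failing_step` function are hypothetical names used only for illustration.

```python
import pdb
import sys
import traceback


def run_with_post_mortem(func, *args, interactive=True, **kwargs):
    """Call func; on an exception, print it and open pdb at the failing frame.

    Hypothetical helper for illustration. Pass interactive=False (for
    example in CI) to skip the debugger prompt and just re-raise.
    """
    try:
        return func(*args, **kwargs)
    except Exception:
        traceback.print_exc()
        if interactive:
            # The pdb commands listed above (p, w, l, ...) work at this prompt
            pdb.post_mortem(sys.exc_info()[2])
        raise


def failing_step():
    values = [1, 2, 3]
    return values[5]  # raises IndexError; pdb would open in this frame


try:
    # interactive=False keeps this demo from blocking on a prompt
    run_with_post_mortem(failing_step, interactive=False)
except IndexError as exc:
    caught = exc
```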

Sanitizers

Sanitizers instrument the compiled code to detect memory, threading, and undefined-behavior bugs at runtime.

Address Sanitizer (ASan)

Detects memory errors like use-after-free, buffer overflows, and memory leaks.
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_USE_ASAN=ON
make -j8

# Run tests
export ASAN_OPTIONS=detect_leaks=1
./build-debug/arrow-array-test

Undefined Behavior Sanitizer (UBSan)

Detects undefined behavior such as signed integer overflow and null pointer dereference.
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_USE_UBSAN=ON
make -j8

Thread Sanitizer (TSan)

Detects data races and thread-related issues.
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_USE_TSAN=ON
make -j8
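Don’t combine ThreadSanitizer with AddressSanitizer in the same build; the two are incompatible. AddressSanitizer and UndefinedBehaviorSanitizer can share a build, but every enabled sanitizer adds runtime overhead.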
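
Sanitizer runtime behavior is controlled through environment variables such as ASAN_OPTIONS, whose value is a colon-separated list of name=value flags. As a rough sketch, the helpers below build that string and an environment for launching a test from a script. The function names are hypothetical; `detect_leaks` and `abort_on_error` are standard sanitizer runtime flags.

```python
import os


def sanitizer_options(**flags):
    """Format keyword flags as a colon-separated sanitizer options string."""
    return ":".join(f"{name}={value}" for name, value in flags.items())


def env_with_asan(base_env=None, **flags):
    """Return a copy of the environment with ASAN_OPTIONS set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["ASAN_OPTIONS"] = sanitizer_options(**flags)
    return env


env = env_with_asan(base_env={"PATH": "/usr/bin"}, detect_leaks=1, abort_on_error=1)
print(env["ASAN_OPTIONS"])
```

The resulting mapping can be passed as `env=` to `subprocess.run` when starting the instrumented test binary.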

Common Debugging Scenarios

Segmentation Faults

1. Get a backtrace

gdb --args ./program
(gdb) run
# When it crashes:
(gdb) backtrace

2. Enable core dumps

ulimit -c unlimited
./program
# After crash:
gdb ./program core
(gdb) backtrace

3. Use AddressSanitizer

Rebuild with ASan and re-run. It often pinpoints the exact error.
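
When the crash happens inside native code called from Python, the interpreter normally dies without a Python traceback. The standard library's `faulthandler` module can fill that gap by dumping the Python stack of every thread when a fatal signal arrives. This is a general Python facility rather than anything Arrow-specific; a minimal sketch:

```python
import faulthandler

# Dump the Python traceback of all threads to stderr if the process
# receives a fatal signal such as SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL.
faulthandler.enable(all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())

# ... import pyarrow and run the code that crashes ...
```

The same effect is available without code changes by running `python -X faulthandler script.py` or setting `PYTHONFAULTHANDLER=1`. The Python traceback shows which call entered native code; the GDB backtrace then covers the C++ side.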

Memory Leaks

1. Use Valgrind

valgrind --leak-check=full ./build-debug/arrow-array-test

2. Use AddressSanitizer

export ASAN_OPTIONS=detect_leaks=1
./build-debug/arrow-array-test

Build Failures

Start from a clean build:
# C++
rm -rf build-debug
mkdir build-debug && cd build-debug
cmake ..
make -j8

# Python
cd arrow/python
rm -rf build/
python setup.py clean --all
python setup.py build_ext --inplace

Enable verbose build output:
# CMake
make VERBOSE=1

# Python
export PYARROW_BUILD_VERBOSE=1
python setup.py build_ext --inplace

Trace how CMake searches for dependencies:
# Print the locations CMake probes while resolving find_package() and related calls
cmake .. -DCMAKE_FIND_DEBUG_MODE=ON

Test Failures

Run a single test:
# C++
./build-debug/arrow-array-test --gtest_filter=TestArray.TestBasics

# Python
pytest pyarrow/tests/test_array.py::test_basics -v

# R
devtools::test_active_file()

Debug a failing test in GDB:
gdb --args ./build-debug/arrow-array-test --gtest_filter=TestArray.TestBasics
(gdb) break array.cc:123
(gdb) run

Show more test output:
# C++ (from the build directory; prints a test's output when it fails)
ctest --output-on-failure -R arrow-array-test

# Python
pytest pyarrow/tests/test_array.py -vv -s
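
The --gtest_filter option used above accepts more than a single test name: its value is a ':'-separated list of positive patterns, optionally followed by '-' and a ':'-separated list of negative patterns. The sketch below builds such a value; `gtest_filter` is a hypothetical helper name, and the test names are the ones from this guide.

```python
def gtest_filter(include=("*",), exclude=()):
    """Build a --gtest_filter argument (hypothetical helper).

    GoogleTest's syntax is ':'-separated positive patterns, optionally
    followed by '-' and ':'-separated negative patterns.
    """
    value = ":".join(include)
    if exclude:
        value += "-" + ":".join(exclude)
    return f"--gtest_filter={value}"


# Run every TestArray case except TestBasics
print(gtest_filter(["TestArray.*"], exclude=["TestArray.TestBasics"]))
```

Narrowing the filter this way is a quick way to bisect which test in a suite triggers a crash.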

Import Errors (Python)

import sys
print(sys.path)

import pyarrow
print(pyarrow.__file__)
print(pyarrow.__version__)

# Check if C++ library loads
import pyarrow.lib
If imports fail:
# Ensure you're in the right directory
cd arrow/python

# Check if built in-place
ls -la pyarrow/*.so

# Rebuild if needed
python setup.py build_ext --inplace
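
A frequent cause of import problems is Python picking up a different copy of a package than the one just built. As a minimal sketch, the standard library's `importlib.util.find_spec` reports where a top-level module would be loaded from without executing it, which makes it easy to compare against the expected checkout. `describe_module` is a hypothetical helper name.

```python
import importlib.util


def describe_module(name):
    """Report where a module would be imported from (hypothetical helper).

    For a dotted name, find_spec imports the parent package first, so a
    broken or missing parent surfaces here as an ImportError.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ImportError as exc:
        return f"{name}: import of parent package failed ({exc})"
    if spec is None:
        return f"{name}: not found on sys.path"
    return f"{name}: {spec.origin}"


print(describe_module("pyarrow"))
```

If the printed path points at site-packages rather than the in-place build under arrow/python, an installed pyarrow is shadowing the development copy.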

Debugging CI Failures

When tests pass locally but fail in CI:

1. Check the CI logs

Look for the specific error message and stack trace in the CI output.

2. Reproduce the CI environment

Use Docker to reproduce the exact CI environment:
# See the Compose file in the repository root for available services
docker-compose run ubuntu-cpp

3. Check for platform-specific issues

Test on the same platform where CI failed (Linux, macOS, Windows).

4. Review sanitizer reports

CI runs with AddressSanitizer and UndefinedBehaviorSanitizer. Check for sanitizer warnings in logs.
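
Sanitizer reports are easy to miss in long CI logs. The sketch below scans log text for the usual report markers: AddressSanitizer prefixes its reports with "ERROR: AddressSanitizer:", and UndefinedBehaviorSanitizer emits "runtime error:" diagnostics. The function name is hypothetical, and the sample log is synthetic text written for this example, not output from an actual Arrow CI run.

```python
import re

SANITIZER_PATTERNS = {
    "AddressSanitizer": re.compile(r"ERROR: AddressSanitizer: (\S+)"),
    "UndefinedBehaviorSanitizer": re.compile(r"runtime error: (.+)"),
}


def find_sanitizer_reports(log_text):
    """Return (line_number, tool, detail) tuples for sanitizer report lines."""
    reports = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        for tool, pattern in SANITIZER_PATTERNS.items():
            match = pattern.search(line)
            if match:
                reports.append((line_number, tool, match.group(1)))
    return reports


# Synthetic sample, modeled on the usual report headers (not real CI output)
sample_log = """\
[ RUN      ] TestArray.TestBasics
==12345==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010
array.cc:123:10: runtime error: signed integer overflow
[  FAILED  ] TestArray.TestBasics
"""
reports = find_sanitizer_reports(sample_log)
for report in reports:
    print(report)
```

The line numbers in the result make it quick to jump to the full report and its stack trace in the original log.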

Logging and Error Messages

C++ Logging

Arrow uses a custom logging system:
#include <arrow/util/logging.h>

ARROW_LOG(INFO) << "Processing array with length: " << array->length();
ARROW_LOG(WARNING) << "Unexpected null values";
ARROW_LOG(ERROR) << "Failed to allocate memory";
Control log level:
export ARROW_LOG_LEVEL=DEBUG
./program

Python Logging

import logging
import pyarrow as pa

logging.basicConfig(level=logging.DEBUG)
pa.set_cpu_count(4)  # Will log CPU count changes

Performance Debugging

See the Benchmarking Guide for:
  • Running performance benchmarks
  • Comparing performance across versions
  • Identifying performance regressions
Additional profiling tools:
  • perf (Linux): CPU profiling
  • Instruments (macOS): System-wide profiling
  • py-spy (Python): Python profiling without code changes
  • valgrind --tool=callgrind: Call graph profiling

Resources

  • GDB Documentation: Official GDB documentation
  • LLDB Tutorial: Getting started with LLDB
  • AddressSanitizer: AddressSanitizer documentation
  • Valgrind Manual: Valgrind quick start guide
