Running Tests
Apache Arrow’s Python implementation uses pytest for unit testing.
Test Structure
Tests in PyArrow follow the pytest convention for “Tests as part of application code”. Tests for Parquet are located in a separate folder: pyarrow/tests/parquet/.
Running PyArrow Tests
You can also run tests with python -m pytest [...], which adds the current directory to sys.path and can help if pytest [...] results in an ImportError.
Test Groups
Many tests are grouped using pytest marks. Some groups are disabled by default. The following flags control which groups run:
- Enable a group: --$GROUP_NAME (e.g., --parquet)
- Disable a group: --disable-$GROUP_NAME (e.g., --disable-parquet)
- Run only a group: --only-$GROUP_NAME (e.g., --only-parquet)
| Group | Description |
|---|---|
| dataset | Apache Arrow Dataset tests |
| flight | Flight RPC tests |
| gandiva | Gandiva expression compiler tests (uses LLVM) |
| hdfs | Tests using libhdfs for Hadoop filesystem |
| hypothesis | Tests using hypothesis for random test cases (use --enable-hypothesis) |
| large_memory | Tests requiring large amounts of system RAM |
| orc | Apache ORC tests |
| parquet | Apache Parquet tests |
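The flag behavior described above can be modeled in a few lines of Python. This is a simplified sketch, not PyArrow's actual test configuration: the function name resolve_groups and the defaults in DEFAULT_ENABLED are assumptions chosen only to illustrate how the enable, disable, and only flags interact.

```python
# Simplified model of how per-group flags could be resolved.
# Assumption: this is NOT PyArrow's real configuration code; resolve_groups
# and the defaults below are hypothetical and only illustrate the flags.

DEFAULT_ENABLED = {
    "dataset": True,        # illustrative default
    "parquet": True,        # illustrative default
    "hypothesis": False,    # illustrative default
    "large_memory": False,  # illustrative default
}


def resolve_groups(args, defaults=DEFAULT_ENABLED):
    """Return {group: bool} saying which groups run for the given flags."""
    only = [g for g in defaults if f"--only-{g}" in args]
    if only:
        # --only-$GROUP_NAME: run the named group(s) and nothing else.
        return {g: g in only for g in defaults}

    enabled = dict(defaults)
    for group in defaults:
        if f"--disable-{group}" in args:
            enabled[group] = False
        elif f"--{group}" in args or f"--enable-{group}" in args:
            enabled[group] = True
    return enabled


if __name__ == "__main__":
    print(resolve_groups(["--large_memory", "--disable-parquet"]))
```

In this model an only flag takes precedence over everything else, which is one plausible reading of "run only a group"; the real suite may resolve conflicting flags differently.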
Troubleshooting
If tests start failing, try recompiling PyArrow or Arrow C++.
Test Fixtures
PyArrow test files contain helper functions and fixtures. Common examples:
- _alltypes_example in test_pandas: Supplies a dataframe with 100 rows for all data types
- _check_pandas_roundtrip in test_pandas: Asserts roundtrip conversion from Pandas through Arrow structures
- large_buffer fixture: Supplies a PyArrow buffer of fixed size
Best Practices
When to Add Tests
In general, any change to source code needs accompanying unit tests:
- Add functionality → Add unit tests
- Modify functionality → Update unit tests
- Solve a bug → Add unit test before fixing (helps prove the bug and its fix)
- Performance improvements → Reflect in benchmarks (which are also tests)
- Refactoring → May not need test changes if fully covered by existing tests
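The bug-fix case above can be sketched as follows. The regression test is written first, fails against the buggy code, and passes once the fix lands. The function chunk_sizes and the bug it once had are hypothetical and not taken from Arrow.

```python
# Hypothetical example: chunk_sizes is not an Arrow function.
def chunk_sizes(total_rows, chunk_size):
    """Split total_rows into chunk lengths of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    full, remainder = divmod(total_rows, chunk_size)
    sizes = [chunk_size] * full
    # The (hypothetical) bug: the trailing partial chunk used to be dropped.
    if remainder:
        sizes.append(remainder)
    return sizes


def test_chunk_sizes_keeps_trailing_partial_chunk():
    # Written before the fix: this failed while the remainder was dropped.
    assert chunk_sizes(10, 4) == [4, 4, 2]
    assert sum(chunk_sizes(10, 4)) == 10
```

Keeping the test in the suite afterwards guards against the same bug returning.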
Writing Quality Tests
Keep tests focused
Each test should verify a single behavior or feature. Avoid overloading tests with multiple assertions for unrelated functionality.
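A minimal sketch of this idea, using a hypothetical helper normalize_name that is not part of PyArrow: each behavior gets its own test, so a failure points at exactly one thing.

```python
# Hypothetical helper used only for illustration.
def normalize_name(name):
    """Strip surrounding whitespace and lowercase a column name."""
    return name.strip().lower()


# One behavior per test, rather than a single test asserting both.
def test_normalize_name_strips_whitespace():
    assert normalize_name("  id ") == "id"


def test_normalize_name_lowercases():
    assert normalize_name("UserID") == "userid"
```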
Use descriptive names
Test names should clearly describe what they’re testing.
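As an illustration, the names below state the input and the expected outcome, so a failure report is readable without opening the test. The function parse_bool is a hypothetical stand-in, not an Arrow API.

```python
# Hypothetical function under test; not an Arrow API.
def parse_bool(text):
    """Parse 'true' or 'false' (any case); raise ValueError otherwise."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Vague: a name like test_parse_1 says nothing about what broke.
# Descriptive: the name states the input and the expected behavior.
def test_parse_bool_accepts_mixed_case_true():
    assert parse_bool("TrUe") is True


def test_parse_bool_rejects_unknown_text():
    try:
        parse_bool("maybe")
    except ValueError:
        return
    raise AssertionError("expected ValueError for unknown text")
```

In a pytest suite, the second test would more typically use pytest.raises(ValueError); the try/except form is used here only to keep the sketch free of imports.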
Minimize dependencies
Tests should have as few external dependencies as possible. If testing file reading, provide the smallest possible example file or code to create one.
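One way to follow this advice is to have the test create its own tiny input instead of relying on a checked-in data file. The sketch below uses the standard library's csv module as a stand-in for whatever reader is under test; write_tiny_csv is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def write_tiny_csv(directory):
    """Create the smallest file that still exercises the reader."""
    path = Path(directory) / "tiny.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])  # header
        writer.writerow([1, "a"])        # a single data row
    return path


def test_reader_returns_header_and_one_row():
    # The temporary directory is removed when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_tiny_csv(tmp)
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
    # csv.reader yields every field as a string.
    assert rows == [["id", "name"], ["1", "a"]]
```

In a pytest suite, the built-in tmp_path fixture would usually take the place of tempfile.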
Make tests reproducible
Tests should produce consistent results across different environments and runs. Avoid depending on timing, network conditions, or external state.
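For tests that need random input, a common approach is to draw it from a locally seeded generator so each run sees the same data. A minimal sketch, where make_sample is a hypothetical helper and the seed value is arbitrary:

```python
import random


def make_sample(n, seed=0):
    """Generate n pseudo-random integers from a local, seeded generator."""
    rng = random.Random(seed)  # local generator: no shared global state
    return [rng.randint(0, 100) for _ in range(n)]


def test_sorted_sample_is_ordered():
    data = make_sample(50, seed=1234)
    result = sorted(data)
    assert all(a <= b for a, b in zip(result, result[1:]))
    # The same seed reproduces the same input on every run.
    assert data == make_sample(50, seed=1234)
```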
Continuous Integration
All tests run automatically in CI pipelines when you submit a pull request. The CI system tests:
- Multiple platforms (Linux, macOS, Windows)
- Different compiler versions
- Various build configurations
- Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan)
Running CI Checks Locally
Before submitting a PR, you can run some CI checks locally.
Resources
- pytest Documentation: Complete guide to pytest framework
- testthat Documentation: R package testing with testthat
- Google Test Primer: Introduction to Google Test framework
- Arrow CI Overview: Learn about Arrow’s CI infrastructure