Introduction
Choosing the right data format significantly impacts your ML pipeline performance. This guide covers format comparison, benchmarking, and optimization strategies.
Format Comparison
Different formats offer tradeoffs between speed, size, and compatibility.
Common Formats
CSV
Pros: Human-readable, universal support
Cons: Slow, large file size, no type preservation
Use case: Small datasets, data interchange
Parquet
Pros: Columnar, compressed, fast queries
Cons: Not human-readable
Use case: Large datasets, analytics, production
Feather
Pros: Fast read/write, language-agnostic
Cons: Limited compression
Use case: Intermediate data, caching
HDF5
Pros: Hierarchical, append mode, large arrays
Cons: Complex API, portability issues
Use case: Scientific data, time series
Pandas Format Benchmarking
Performance Comparison
Based on the Python for Scientific Computing guide:
Key Findings:
- Parquet: Best overall performance for read/write
- Feather: Fastest read times, good write performance
- HDF5: Good for append operations
- CSV: Slowest, but most portable
Benchmark Code
Create your own benchmarks on your actual data and hardware, since results vary by workload.
Inference Performance
Processing large datasets efficiently requires parallelization.
Single Worker Baseline
processing/inference_example.py
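The referenced script is not reproduced here; a minimal baseline along these lines, with a hypothetical `predict()` standing in for a real model, illustrates the sequential pattern:

```python
import time

import numpy as np

def predict(batch: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a real model's forward pass.
    return np.tanh(batch @ np.ones((batch.shape[1], 1)))

def run_single_worker(data: np.ndarray, batch_size: int = 10_000) -> list:
    # Process every batch sequentially on one worker.
    return [predict(data[i:i + batch_size])
            for i in range(0, len(data), batch_size)]

data = np.random.default_rng(0).normal(size=(100_000, 8))
t0 = time.perf_counter()
outputs = run_single_worker(data)
print(f"single worker: {time.perf_counter() - t0:.3f}s, "
      f"{sum(len(o) for o in outputs):,} predictions")
```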
ProcessPoolExecutor
Parallelize with multiple processes:
processing/inference_example.py
Ray for Distributed Processing
processing/inference_example.py
Benchmark Results
Running experiments with 10M samples:
Performance Comparison
| Approach | Time (seconds) | Speedup |
|---|---|---|
| Single worker | 12.64 | 1.0x |
| ThreadPoolExecutor (16 workers) | 0.85 | 14.9x |
| ProcessPoolExecutor (16 workers) | 4.03 | 3.1x |
| Ray (16 workers) | 2.19 | 5.8x |
Key Insights:
- ThreadPoolExecutor is fastest here because this workload is I/O-bound and the GIL is released during I/O waits
- ProcessPoolExecutor pays overhead for process startup and for pickling data between processes
- Ray provides good balance with distributed capabilities
- Choose based on your workload: I/O-bound vs CPU-bound
Optimization Strategies
Columnar Formats
Use Parquet or Feather for:
- Selective column reading
- Better compression ratios
- Faster aggregations
- Type preservation
Compression
Enable compression for:
- Network transfers
- Storage cost reduction
- I/O-bound workloads
Common codecs:
- Snappy: Fast, moderate compression
- Gzip: Slower, high compression
- LZ4: Very fast, light compression
- Zstd: Balanced speed and ratio
Chunking
Process data in chunks.
Benefits:
- Lower memory usage
- Parallelization opportunities
- Early termination possible
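A sketch of chunked processing with pandas' `chunksize` (file name and column are illustrative):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "events.csv"
    pd.DataFrame({"value": np.arange(1_000_000)}).to_csv(path, index=False)

    # chunksize= returns an iterator; only one chunk is in memory at a time.
    total = 0
    with pd.read_csv(path, chunksize=100_000) as reader:
        for chunk in reader:  # each chunk is an ordinary DataFrame
            total += chunk["value"].sum()

print(f"sum over 10 chunks: {total}")
```

The same loop could `break` early once a condition is met, which is the "early termination" benefit above.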
Memory Mapping
Use memory-mapped arrays for:
- Very large arrays
- Random access patterns
- Shared memory across processes
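A sketch with NumPy's `np.memmap` (file name and shape are illustrative):

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "big.dat"

    # Create a file-backed array; pages are flushed to disk, not held in RAM.
    arr = np.memmap(path, dtype="float32", mode="w+", shape=(1_000_000, 8))
    arr[:] = 1.0
    arr.flush()
    del arr  # close the write mapping

    # Reopen read-only; only touched rows are paged into memory.
    view = np.memmap(path, dtype="float32", mode="r", shape=(1_000_000, 8))
    row_sum = float(view[500_000].sum())  # random access without a full load
    del view

print(row_sum)
```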
Format Selection Guide
- Small Data (<1GB): CSV or Feather recommended
  - CSV for shareability
  - Feather for speed
  - Format choice less critical at this scale
- Medium Data (1-100GB)
- Large Data (>100GB)
- Streaming Data
Resources
- Data Formats with Pandas and NumPy
- An Empirical Evaluation of Columnar Storage Formats (PDF)
- How to Choose the Right Python Concurrency API
- Speed up your Data Science Code
Next Steps
- Learn about Streaming Datasets for training pipelines
- Explore Vector Databases for embeddings