Large Dataset Best Practices
When sharing large datasets (hundreds of GBs to TBs), implement these optimization strategies.

Partition Your Data
Partitioning significantly improves query performance by enabling partition pruning:
- Reduced data scanning - Only relevant partitions are read
- Faster query execution - Fewer files to process
- Lower network transfer - Less data moved over the network
- Cost savings - Reduced compute and data transfer costs
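On the provider side, a minimal PySpark sketch of writing the shared table partitioned by a column recipients commonly filter on; the paths and column name are illustrative, and the snippet assumes Delta Lake is configured for the Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source data; partitioning by a commonly filtered column lets
# the sharing server prune whole partitions for filtered queries.
df = spark.read.parquet("/raw/events")
(df.write.format("delta")
   .partitionBy("event_date")
   .mode("overwrite")
   .save("/shared/events"))
```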
File Size Optimization
Optimal file sizes balance parallelism and overhead. Recommended File Sizes:
- Minimum: 128 MB - Avoids excessive file listing overhead
- Optimal: 256 MB to 1 GB - Good balance for most workloads
- Maximum: 2 GB - Prevents memory issues in readers
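To keep files in that range, providers can periodically compact small files. A sketch assuming Delta Lake's Python OPTIMIZE API and an illustrative table path:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Compact many small files into fewer large ones on the provider side;
# resulting file sizes are governed by the table/session configuration.
DeltaTable.forPath(spark, "/shared/events").optimize().executeCompaction()
```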
Limit Hint Usage
Use limitHint to reduce data transfer when you only need a subset:
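A minimal sketch, assuming the Python connector's limit parameter (which is passed to the server as a limitHint); the profile path and table coordinates are placeholders:

```python
import delta_sharing

# "profile.share" and the share/schema/table coordinates are placeholders.
table_url = "profile.share#my_share.my_schema.my_table"

# The limit is sent as a limitHint, so the server only returns enough files
# to cover roughly the first N rows.
preview = delta_sharing.load_as_pandas(table_url, limit=1000)
```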
Delta Format vs Parquet Format
Delta Sharing supports two response formats with different performance characteristics.

Response Format Comparison
Parquet Format (Default)
When to Use:
- Legacy clients (delta-sharing-spark < 3.1)
- Simple tables without advanced Delta features
- Maximizing client compatibility
Response Structure:
- Compatible with all Delta Sharing clients
- Server converts Delta metadata to simplified format
- Limited to basic Delta features (minReaderVersion = 1)
- No support for Deletion Vectors or Column Mapping
Delta Format (Advanced)
When to Use:
- Modern clients (delta-sharing-spark >= 3.1)
- Tables with advanced Delta features
- Need for Deletion Vectors or Column Mapping
- Performance optimization through native Delta reading
Request Header: the client advertises Delta-format support in its query request (see the sketch after this list).
Response Structure:
- Supports advanced Delta features (minReaderVersion > 1)
- Native Delta log structure preserved
- Better performance with Delta Kernel
- Deletion Vectors for efficient updates
- Column Mapping for schema evolution
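A sketch of how a Delta-format-capable client can signal this in its query request, assuming the protocol's delta-sharing-capabilities header; the endpoint, token, and table coordinates are placeholders:

```python
import requests

# Endpoint, token, and table coordinates are placeholders.
endpoint = "https://sharing.example.com/delta-sharing"
headers = {
    "Authorization": "Bearer <token>",
    # Advertise Delta-format support so the server returns native Delta metadata.
    "delta-sharing-capabilities": "responseformat=delta",
}
resp = requests.post(
    f"{endpoint}/shares/my_share/schemas/my_schema/tables/my_table/query",
    headers=headers,
    json={},
)
```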
Format Selection Strategy
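In Spark, the format can be chosen explicitly; a minimal PySpark sketch, assuming the delta-sharing-spark connector's responseFormat option, with placeholder profile path and table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("deltaSharing")
      .option("responseFormat", "delta")   # use "parquet" for maximum compatibility
      .load("profile.share#my_share.my_schema.my_table"))
```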
Performance Impact
Delta format can provide 2-5x better query performance for tables with:
- Deletion Vectors (updates/deletes)
- Column Mapping (schema evolution)
- Large partition counts
Batch Conversion for Memory Optimization
For large tables that don’t fit in memory, use batch conversion.

Memory-Efficient Data Loading
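A minimal sketch, assuming the Python connector's convert_in_batches flag (also referenced in the CDF section below); the table coordinates are placeholders:

```python
import delta_sharing

table_url = "profile.share#my_share.my_schema.my_large_table"

# Download and convert Parquet files in smaller batches instead of all at
# once, trading some speed for a much lower peak memory footprint.
df = delta_sharing.load_as_pandas(table_url, convert_in_batches=True)
```

With batching enabled, the loader: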
- Fetches file metadata from server
- Downloads and converts Parquet files in smaller batches
- Concatenates results incrementally
- Reduces peak memory consumption
| Method | Peak Memory | Best For |
|---|---|---|
| Standard | ~2x table size | Tables < 10 GB |
| Batch Conversion | ~1.2x table size | Tables > 10 GB |
| Streaming (Spark) | ~100 MB per partition | Tables > 100 GB |
Spark Streaming for Continuous Data
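For continuously arriving data, a shared table can be read as a stream. A sketch assuming the delta-sharing-spark streaming source and its maxVersionsPerRpc option; paths, trigger interval, and table coordinates are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
          .format("deltaSharing")
          .option("maxVersionsPerRpc", 10)   # keep each RPC to a small version window
          .load("profile.share#my_share.my_schema.my_table"))

query = (stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/checkpoints/shared_events")  # fault tolerance
         .trigger(processingTime="5 minutes")                         # check interval
         .start("/local/shared_events_copy"))
```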
Predicate Pushdown Optimization
Predicate pushdown filters data at the source, reducing network transfer.

SQL Predicate Hints
| Operator | Example | Use Case |
|---|---|---|
| = | col = 123 | Exact match |
| >, < | col > 1000 | Range queries |
| >=, <= | col >= '2024-01-01' | Inclusive ranges |
| <> | col <> 'US' | Exclusion |
| IS NULL | col IS NULL | Null checks |
| IS NOT NULL | col IS NOT NULL | Non-null checks |
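An illustrative query request carrying SQL predicate hints, assuming the protocol's table query endpoint; the endpoint, token, and table coordinates are placeholders, and servers may ignore hints, so clients must still apply the filter locally:

```python
import requests

# predicateHints are free-form SQL expressions the server MAY use to skip files.
body = {
    "predicateHints": ["event_date >= '2024-01-01'", "country = 'US'"],
    "limitHint": 10000,
}
resp = requests.post(
    "https://sharing.example.com/delta-sharing/shares/my_share/schemas/my_schema/tables/my_table/query",
    headers={"Authorization": "Bearer <token>"},
    json=body,
)
```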
JSON Predicate Hints (Recommended)
JSON predicates provide structured, type-safe filtering:
- Type-safe filtering with explicit valueType
- Complex boolean logic (and, or, not)
- Easier for servers to parse and optimize
- Preferred over SQL predicateHints
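A sketch of a structured filter, assuming the Python connector's jsonPredicateHints parameter; column names, types, and values are illustrative:

```python
import json
import delta_sharing

hints = json.dumps({
    "op": "and",
    "children": [
        {"op": "greaterThanOrEqual", "children": [
            {"op": "column", "name": "event_date", "valueType": "date"},
            {"op": "literal", "value": "2024-01-01", "valueType": "date"},
        ]},
        {"op": "equal", "children": [
            {"op": "column", "name": "country", "valueType": "string"},
            {"op": "literal", "value": "US", "valueType": "string"},
        ]},
    ],
})

df = delta_sharing.load_as_pandas(
    "profile.share#my_share.my_schema.my_table",
    jsonPredicateHints=hints,
)
```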
Version and Time Travel Performance
Querying historical versions efficiently:

Version-Based Queries
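A minimal sketch, assuming the Python connector's version and timestamp parameters; the table coordinates and values are placeholders:

```python
import delta_sharing

table_url = "profile.share#my_share.my_schema.my_table"

# Pin a specific table version (fast: direct metadata lookup).
df_v5 = delta_sharing.load_as_pandas(table_url, version=5)

# Query by timestamp (slower: the server maps the timestamp to a version first).
df_ts = delta_sharing.load_as_pandas(table_url, timestamp="2024-06-01T00:00:00Z")
```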
- Version queries: Fast - direct metadata lookup
- Timestamp queries: Slower - requires timestamp-to-version mapping
- Latest version: Fastest - no additional lookups
Change Data Feed (CDF) Performance
- Use ending_version to limit the query window
- Enable convert_in_batches for large change sets
- Query smaller version ranges for faster results
- Use Delta format for tables with frequent updates
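A minimal sketch of a bounded CDF read with the Python connector's load_table_changes_as_pandas; the table coordinates and version range are placeholders:

```python
import delta_sharing

# Keep the version window small so the change set stays manageable.
changes = delta_sharing.load_table_changes_as_pandas(
    "profile.share#my_share.my_schema.my_table",
    starting_version=10,
    ending_version=20,
)
```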
Caching Strategies
Implement caching to reduce repeated data transfers.

Client-Side Caching
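A simple sketch of caching a converted result locally; the cache path and the load_cached helper are hypothetical, and the approach assumes the table changes infrequently enough for a stale copy to be acceptable:

```python
import os
import delta_sharing
import pandas as pd

CACHE_PATH = "/tmp/shared_table.parquet"   # local cache location (illustrative)

def load_cached(table_url: str) -> pd.DataFrame:
    """Reuse a local copy when present instead of re-downloading the table."""
    if os.path.exists(CACHE_PATH):
        return pd.read_parquet(CACHE_PATH)
    df = delta_sharing.load_as_pandas(table_url)
    df.to_parquet(CACHE_PATH)
    return df
```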
File-Level Caching
Delta Sharing uses the file id for consistent file identification:
- Cache downloaded Parquet files by id
- Check expirationTimestamp before reusing URLs
- Implement LRU eviction for cache size management
- Use local SSD for cache storage
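An illustrative, hypothetical cache sketch keyed by the file id from the query response, reusing a presigned URL only while its expirationTimestamp (assumed to be epoch milliseconds) is still valid:

```python
import time

class PresignedUrlCache:
    """Illustrative cache keyed by the file id returned in the query response."""

    def __init__(self):
        self._entries = {}   # file id -> (presigned_url, expiration_ms)

    def put(self, file_id: str, url: str, expiration_ms: int) -> None:
        self._entries[file_id] = (url, expiration_ms)

    def get(self, file_id: str):
        # Reuse a URL only while its expirationTimestamp is still in the
        # future; otherwise the caller should issue a fresh query.
        entry = self._entries.get(file_id)
        if entry and entry[1] > time.time() * 1000:
            return entry[0]
        return None
```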
Network and Connection Optimization
Connection Pooling
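A generic Python sketch of reusing HTTP connections for repeated calls to the sharing server and for presigned URL downloads; the endpoint and token are placeholders, and pool sizes are illustrative:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Keep connections alive and retry transient failures.
session.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=20, max_retries=3))

resp = session.get(
    "https://sharing.example.com/delta-sharing/shares",
    headers={"Authorization": "Bearer <token>"},
)
```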
Parallel Downloads
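A sketch of downloading multiple presigned file URLs concurrently; the URL/path pairs are hypothetical stand-ins for files listed in a query response:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def download(url_and_path):
    url, path = url_and_path
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    return path

# Hypothetical (presigned URL, local path) pairs taken from a query response.
files = [
    ("https://storage.example.com/part-00000.parquet?sig=...", "/tmp/part-00000.parquet"),
    ("https://storage.example.com/part-00001.parquet?sig=...", "/tmp/part-00001.parquet"),
]

with ThreadPoolExecutor(max_workers=8) as pool:
    local_paths = list(pool.map(download, files))
```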
Monitoring and Profiling
Track performance metrics to identify bottlenecks.

Query Timing
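A minimal sketch of timing a load and deriving a rough throughput figure; the table coordinates are placeholders:

```python
import time
import delta_sharing

start = time.perf_counter()
df = delta_sharing.load_as_pandas("profile.share#my_share.my_schema.my_table")
elapsed = time.perf_counter() - start

# Rows per second is a rough throughput signal; track it over time to spot regressions.
print(f"Loaded {len(df):,} rows in {elapsed:.1f}s ({len(df) / max(elapsed, 1e-9):,.0f} rows/s)")
```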
Server-Side Metrics
Monitor Delta Sharing server performance. Key Metrics:
- Request latency (p50, p95, p99)
- Requests per second
- Error rates (401, 403, 500)
- Data transfer volume
- Active connections
Performance Tuning Checklist
Optimization Checklist
Data Organization:
- Partition tables by common query dimensions
- Maintain optimal file sizes (256 MB - 1 GB)
- Use Z-ordering for multi-dimensional queries (Delta only)
- Regularly compact small files
Query Optimization:
- Use predicateHints or jsonPredicateHints for filtering
- Set appropriate limitHint values
- Enable Delta format for advanced features
- Use batch conversion for large tables
Network and Caching:
- Implement connection pooling
- Use parallel downloads for multiple files
- Cache frequently accessed data
- Monitor and optimize bandwidth usage
Streaming:
- Configure appropriate check intervals
- Limit versions per RPC (10-20)
- Use checkpoints for fault tolerance
- Monitor consumer lag
Monitoring:
- Track query latencies
- Monitor data transfer volumes
- Set up alerts for performance degradation
- Profile slow queries regularly