Performance Fundamentals
Key Performance Metrics
Latency Metrics:
- Read latency (p50, p99, p999)
- Write latency (p50, p99, p999)
- Transaction latency
- Replication lag
Throughput Metrics:
- Operations per second (reads/writes)
- Queries per second
- Network throughput
- Disk I/O throughput
Resource Metrics:
- CPU usage per core
- Memory utilization
- Disk space and IOPS
- Network bandwidth
Configuration Flags
YB-TServer Performance Flags
Core Server Configuration:
YB-Master Performance Flags
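The flag listings for both sections appear to have been lost in extraction. A hedged sketch of common startup flags for each server (flag names taken from YugabyteDB documentation; hosts, paths, and values are illustrative only, so verify against your version):

```shell
# YB-TServer core flags (illustrative values)
yb-tserver \
  --fs_data_dirs=/mnt/d0,/mnt/d1 \
  --rpc_bind_addresses=0.0.0.0:9100 \
  --tserver_master_addrs=m1:7100,m2:7100,m3:7100 \
  --memory_limit_hard_bytes=51539607552   # ~48GB hard memory cap

# YB-Master core flags (illustrative values)
yb-master \
  --fs_data_dirs=/mnt/d0 \
  --master_addresses=m1:7100,m2:7100,m3:7100 \
  --replication_factor=3
```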
Memory Management
Block Cache Sizing
The block cache stores recently accessed data blocks in memory. Metrics to monitor:
- Cache hit rate (target: >95%)
- Cache eviction rate
- Memory pressure indicators
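The sizing calculation referenced above reduces to a rule of thumb: give the block cache roughly half of node RAM (the guideline this document uses elsewhere). A minimal sketch, with the flag name taken from YugabyteDB documentation:

```shell
# Back-of-envelope block cache sizing (values illustrative)
ram_gb=64
cache_pct=50
cache_gb=$(( ram_gb * cache_pct / 100 ))
echo "block cache target: ${cache_gb} GB"
# Applied via a server flag such as --db_block_cache_size_percentage
# (verify the flag name against your YugabyteDB version):
#   yb-tserver --db_block_cache_size_percentage=50 ...
```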
Memstore Configuration
Memstores buffer writes before flushing to disk. Per-Tablet Memstore:
- Higher memstore = fewer flushes, more memory usage
- Monitor flush frequency and write amplification
- Adjust based on write patterns
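A sketch of the memstore-related flags (names from YugabyteDB documentation; defaults vary by version, so treat the values as assumptions to validate):

```shell
# --memstore_size_mb: per-tablet memstore flush threshold (~128MB default)
# --global_memstore_size_percentage: cap on total memstore memory
#   across all tablets, as a percentage of server memory
yb-tserver \
  --memstore_size_mb=128 \
  --global_memstore_size_percentage=10
```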
Memory Pressure Handling
Disk and I/O Optimization
Storage Configuration
Multiple Data Directories:
- Parallel I/O across devices
- Better load distribution
- Increased throughput
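Multiple data directories are configured with a single comma-separated flag (paths illustrative; use one directory per physical device):

```shell
# Spread tablet data across devices for parallel I/O
yb-tserver --fs_data_dirs=/mnt/d0,/mnt/d1,/mnt/d2,/mnt/d3
```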
Compaction Tuning
Compactions reorganize data on disk for efficient reads.
I/O Scheduling
Disk Performance Requirements:
- Minimum: 1000 IOPS per TB
- Recommended: NVMe SSD or high-performance SSD
- Latency target: < 10ms for writes, < 5ms for reads
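One way to check a volume against these targets is fio, a standard Linux benchmarking tool (path and sizes illustrative):

```shell
# 4k random read test against a data directory
fio --name=randread --filename=/mnt/d0/fio.test \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --size=1G --runtime=30 --time_based --group_reporting
# Compare the reported IOPS and clat percentiles to the targets above,
# then remove the test file.
rm -f /mnt/d0/fio.test
```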
Network Optimization
Network Configuration
Compression
Use when:
- Cross-region replication
- High-latency networks
- Bandwidth-constrained environments
Trade-offs:
- Reduces network bandwidth (30-50%)
- Increases CPU usage (5-10%)
- Best for large data transfers
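A hedged sketch of enabling inter-node RPC compression. The flag names here are assumptions based on YugabyteDB documentation; confirm the names and supported algorithm ids for your version:

```shell
yb-tserver \
  --enable_stream_compression=true \
  --stream_compression_algo=2   # algorithm id; see docs for supported values
```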
Query Performance
Connection Management
Connection Pooling:
- Reduced connection overhead
- Better resource utilization
- Support for more concurrent clients
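Since YSQL speaks the PostgreSQL wire protocol, a standard pooler such as pgbouncer can sit in front of it. A minimal sketch (ports and pool sizes illustrative; 5433 is the default YSQL port):

```shell
cat > pgbouncer.ini <<'EOF'
[databases]
yugabyte = host=127.0.0.1 port=5433 dbname=yugabyte

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = userlist.txt
pool_mode = transaction
default_pool_size = 40
EOF
echo "wrote pgbouncer.ini"
```

Transaction pooling (`pool_mode = transaction`) gives the best connection reuse for short transactions, at the cost of session-level features like prepared statements spanning transactions.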
Query Optimization
Prepared Statements:
- Use prepared statements for repeated queries
- Reduces parsing overhead
- Enables better query plan caching
Indexing:
- Create indexes on frequently filtered columns
- Use covering indexes to avoid table lookups
- Monitor index usage and remove unused indexes
- Consider partial indexes for filtered queries
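Index examples via ysqlsh, the YSQL shell. Table and column names here are hypothetical; prepared statements are normally issued per session by the driver, not via one-off shell calls:

```shell
# Covering index: SELECT status, total ... is served without a table lookup
ysqlsh -h 127.0.0.1 -c "CREATE INDEX idx_orders_status ON orders (status) INCLUDE (total);"
# Partial index: only rows matching the predicate are indexed
ysqlsh -h 127.0.0.1 -c "CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';"
```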
Read Replicas
Offload read traffic to dedicated read replicas:
- Isolate analytics workloads
- Reduce load on primary cluster
- Serve reads closer to users geographically
Tablet Management
Tablet Splitting
Automatic tablet splitting improves performance as data grows. Enable it when:
- Tables outgrow initial tablet count
- Range-scanned tables with unpredictable distribution
- Hot spots on specific key ranges
- Node count exceeds tablet count
Monitoring:
- Track tablet size distribution
- Monitor split operations in master logs
- Watch for overloaded tablets
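A sketch of the splitting flags (yb-master flags per YugabyteDB documentation; the threshold value is illustrative):

```shell
# Enable automatic splitting; split low-phase tablets above 512MB
yb-master \
  --enable_automatic_tablet_splitting=true \
  --tablet_split_low_phase_size_threshold_bytes=536870912
```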
Tablet Co-location
Co-locate small tables to reduce tablet overhead:
- Reduced memory overhead
- Fewer Raft groups
- Better resource utilization
- Ideal for small reference tables
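Colocation is chosen at database creation time in YSQL. A sketch (database and table names hypothetical; older releases use `colocated = true` instead of `COLOCATION = true`):

```shell
ysqlsh -c "CREATE DATABASE refdata WITH COLOCATION = true;"
ysqlsh -d refdata -c "CREATE TABLE countries (code text PRIMARY KEY, name text);"
```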
Replication and Consistency
Replication Factor
- RF=3: Standard, good balance
- RF=5: Higher durability, higher write latency
- RF=1: Testing only, no fault tolerance
Read Consistency
Follower Reads are suited to:
- Analytics queries tolerating staleness
- Read-heavy workloads
- Geo-distributed reads from local replicas
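Follower reads in YSQL are enabled per session and require read-only transactions (settings per YugabyteDB documentation; staleness value and table name illustrative):

```shell
ysqlsh <<'SQL'
SET yb_read_from_followers = true;           -- allow serving reads from followers
SET default_transaction_read_only = on;      -- follower reads require read-only txns
SET yb_follower_read_staleness_ms = 15000;   -- max acceptable staleness
SELECT count(*) FROM events;                 -- hypothetical table
SQL
```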
Clock Skew Management
Time Synchronization
NTP Configuration:
Benefits of tight clock synchronization:
- Reduced transaction latency
- Tighter timestamp bounds
- Better snapshot isolation
Advanced options:
- Hardware time sync (PTP, GPS, atomic clock)
- Clockbound daemon on all nodes
- Sub-millisecond clock accuracy
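Clock sync health can be checked with standard chrony tooling on each node:

```shell
chronyc tracking      # "System time" line shows the current offset from NTP
chronyc sources -v    # per-source reachability, stratum, and offsets
```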
Load Balancing
Cluster Balancing
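The built-in load balancer is controlled through yb-admin (command names per YugabyteDB documentation; master addresses illustrative):

```shell
MASTERS=m1:7100,m2:7100,m3:7100
yb-admin -master_addresses "$MASTERS" set_load_balancer_enabled 1
yb-admin -master_addresses "$MASTERS" get_load_move_completion   # progress of data moves
```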
Preferred Leaders
Pin leaders to specific zones for read performance:
- Reduced read latency in primary zone
- Predictable failover behavior
- Optimized for asymmetric read/write patterns
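Leader placement is pinned with yb-admin's set_preferred_zones command; zone identifiers follow the cloud.region.zone convention (addresses and zone illustrative):

```shell
yb-admin -master_addresses m1:7100,m2:7100,m3:7100 \
  set_preferred_zones aws.us-east-1.us-east-1a
```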
Troubleshooting Performance
Slow Query Analysis
Enable slow query logging to identify problem statements.
RPC Slow Logs
Slow RPCs (>75% of timeout) are logged with detailed traces. Common causes:
- Lock contention
- Disk I/O saturation
- Network delays
- Large batch operations
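The YSQL slow query logging mentioned above uses the standard postgres setting, passed through the tserver (flag per YugabyteDB documentation; threshold illustrative):

```shell
# Log statements slower than 1000ms; add alongside your other tserver flags
yb-tserver --ysql_pg_conf_csv="log_min_duration_statement=1000"
```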
Memory Analysis
Check memory usage for warning signs:
- High swap usage
- Frequent compaction stalls
- OOM killer activations
- Slow allocation times
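Memory breakdowns are exposed on each server's web UI (default tserver webserver port 9000; host illustrative):

```shell
curl -s http://localhost:9000/memz           # allocator summary
curl -s http://localhost:9000/mem-trackers   # hierarchical memory trackers
```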
Compaction Issues
Signs of compaction problems:
- Increasing read latency
- Growing disk usage
- Level-0 file accumulation
- Write stalls
Remediation:
- Increase compaction threads
- Reduce write rate temporarily
- Add more disk throughput
- Adjust compaction triggers
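A hedged sketch of compaction-related flags (names from YugabyteDB documentation; values are illustrative and version dependent):

```shell
# More threads for the shared compaction/flush pool, plus an I/O rate cap
yb-tserver \
  --priority_thread_pool_size=8 \
  --rocksdb_compact_flush_rate_limit_bytes_per_sec=268435456   # 256MB/s
```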
Performance Benchmarking
Baseline Metrics
Establish baseline performance for your workload.
Load Testing
Incremental load testing:
- Start with 25% of target load
- Monitor metrics for 1 hour
- Increase by 25% increments
- Identify breaking point
- Configure for 70-80% of max capacity
Metrics to record at each stage:
- Latency percentiles (p50, p99, p999)
- CPU utilization per node
- Disk IOPS and throughput
- Network bandwidth
- Memory usage and pressure
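The ramp schedule above can be sketched as a simple driver loop (the workload tool is hypothetical; substitute your own benchmark driver for `run_workload`):

```shell
# Ramp in 25% increments toward a target rate, monitoring each stage
target_qps=40000
for pct in 25 50 75 100; do
  qps=$(( target_qps * pct / 100 ))
  echo "stage ${pct}%: driving ${qps} QPS for 1 hour"
  # run_workload --qps "$qps" --duration 3600   # hypothetical driver
done
```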
Best Practices Summary
Resource Allocation
- CPU: 16+ cores per node, less than 70% average utilization
- Memory: 32-64GB+, 50% for block cache
- Disk: NVMe SSD, 1000+ IOPS/TB, less than 70% utilization
- Network: 10Gbps+, less than 50% utilization
Configuration Priorities
- Size block cache appropriately (50-70% of RAM)
- Configure multiple data directories for parallel I/O
- Enable automatic tablet splitting for growing tables
- Set appropriate compaction threads based on cores
- Use connection pooling for high-concurrency workloads
Monitoring and Maintenance
- Monitor key metrics continuously (latency, throughput, resources)
- Review slow queries regularly and optimize
- Check compaction health and adjust if needed
- Balance load across nodes periodically
- Test configuration changes in staging first
Next Steps
- Monitoring - Set up comprehensive performance monitoring
- Troubleshooting - Diagnose performance issues
- Admin Guide - Administrative operations
- Backup and Restore - Backup performance considerations

