Overview
Write stalls occur when RocksDB slows down or temporarily stops accepting writes to prevent running out of resources. This is a critical flow control mechanism that protects database stability.
Write stalls indicate that writes are arriving faster than compaction can process them. While stalls protect the database, they impact write throughput and latency.
Why Write Stalls Happen
From advanced_options.h:539-716, RocksDB triggers write stalls when:
- Too many L0 files - exceeds level0_slowdown_writes_trigger or level0_stop_writes_trigger
- Pending compaction bytes - exceeds soft_pending_compaction_bytes_limit or hard_pending_compaction_bytes_limit
- Too many memtables - immutable memtables waiting for flush
Stall Conditions
Level 0 File Thresholds
From advanced_options.h:539-553:
// Soft limit - starts slowing down writes
int level0_slowdown_writes_trigger = 20; // Default
// Hard limit - stops writes completely
int level0_stop_writes_trigger = 36; // Default
How it works:
Options options;
options.level0_slowdown_writes_trigger = 20;
options.level0_stop_writes_trigger = 36;
// L0 files: 0-19 → Normal write speed
// L0 files: 20-35 → Writes slowed (delayed_write_rate)
// L0 files: 36+ → Writes stopped until compaction catches up
Both triggers are dynamically changeable at runtime through the SetOptions() API.
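For example, the triggers can be relaxed on a live database without reopening it (a sketch; SetOptions() takes string-valued option names and returns non-OK for unknown or invalid options):

```cpp
#include <string>
#include <unordered_map>

// Relax the L0 stall triggers on a running DB. The example values (30/50)
// are illustrative, not recommendations.
std::unordered_map<std::string, std::string> new_opts = {
    {"level0_slowdown_writes_trigger", "30"},
    {"level0_stop_writes_trigger", "50"},
};
rocksdb::Status s = db->SetOptions(new_opts);
// Check s.ok() before assuming the new thresholds took effect.
```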
Pending Compaction Bytes
From advanced_options.h:702-716:
// Soft limit - slow down writes
uint64_t soft_pending_compaction_bytes_limit = 64 * 1024 * 1024 * 1024ULL; // 64 GB
// Hard limit - stop writes
uint64_t hard_pending_compaction_bytes_limit = 256 * 1024 * 1024 * 1024ULL; // 256 GB
Purpose: Prevents unbounded accumulation of data awaiting compaction.
Options options;
options.soft_pending_compaction_bytes_limit = 64ULL * 1024 * 1024 * 1024;
options.hard_pending_compaction_bytes_limit = 256ULL * 1024 * 1024 * 1024;
// Estimated pending bytes < 64 GB → Normal
// Estimated pending bytes 64-256 GB → Delayed writes
// Estimated pending bytes > 256 GB → Writes stopped
Memtable Limits
From advanced_options.h:259-270:
int max_write_buffer_number = 2; // Maximum write buffers
Write stalls when:
- Active memtable is full
- All max_write_buffer_number slots are occupied
- Flush can’t keep up with write rate
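You can watch for this condition by polling the memtable-related database properties (a sketch; these property names are the documented ones from db.h):

```cpp
#include <cstdint>

// Number of immutable memtables waiting for flush. A value near
// max_write_buffer_number - 1 means flush is falling behind writes.
uint64_t imm_memtables = 0;
db->GetIntProperty("rocksdb.num-immutable-mem-table", &imm_memtables);

// 1 if a memtable flush is pending, 0 otherwise.
uint64_t flush_pending = 0;
db->GetIntProperty("rocksdb.mem-table-flush-pending", &flush_pending);
```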
Write Delay Rate
When slowdown is triggered, writes are delayed to this rate:
DBOptions db_options;
db_options.delayed_write_rate = 16 * 1024 * 1024; // 16 MB/s (default)
RocksDB may automatically adjust delayed_write_rate based on flush/compaction speed to find the optimal rate.
Monitoring Write Stalls
Statistics
From statistics.h:205:
STALL_MICROS // Total time writes were stalled (in microseconds)
Check if stalls are occurring:
auto stats = options.statistics;
uint64_t stall_micros = stats->getTickerCount(STALL_MICROS);
if (stall_micros > 0) {
double stall_seconds = stall_micros / 1000000.0;
printf("Writes stalled for %.2f seconds\n", stall_seconds);
}
Histogram
From statistics.h:622:
WRITE_STALL // Distribution of stall durations
HistogramData stall_hist;
stats->histogramData(WRITE_STALL, &stall_hist);
printf("Stall Duration:\n");
printf(" Median: %.2f us\n", stall_hist.median);
printf(" P95: %.2f us\n", stall_hist.percentile95);
printf(" P99: %.2f us\n", stall_hist.percentile99);
printf(" Max: %.2f us\n", stall_hist.max);
Database Properties
// Check pending compaction bytes
uint64_t pending_bytes;
db->GetIntProperty(
"rocksdb.estimate-pending-compaction-bytes",
&pending_bytes
);
// Check L0 file count
uint64_t l0_files;
db->GetIntProperty("rocksdb.num-files-at-level0", &l0_files);
// Check memtable flush pending
std::string flush_pending;
db->GetProperty("rocksdb.mem-table-flush-pending", &flush_pending);
printf("Pending compaction: %llu bytes\n", (unsigned long long)pending_bytes);
printf("L0 files: %llu\n", (unsigned long long)l0_files);
printf("Flush pending: %s\n", flush_pending.c_str());
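Beyond polling, stall transitions can be observed push-style with an EventListener. This is a sketch assuming the OnStallConditionsChanged callback and WriteStallInfo struct from listener.h; verify the member names against your RocksDB version:

```cpp
#include "rocksdb/listener.h"

// Fires whenever a column family moves between kNormal, kDelayed,
// and kStopped write-stall states.
class StallListener : public rocksdb::EventListener {
 public:
  void OnStallConditionsChanged(const rocksdb::WriteStallInfo& info) override {
    if (info.condition.cur != rocksdb::WriteStallCondition::kNormal) {
      // Column family info.cf_name entered a slowdown or stop state;
      // log or alert here.
    }
  }
};

// Register before opening the DB:
// db_options.listeners.push_back(std::make_shared<StallListener>());
```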
Preventing Write Stalls
1. Increase L0 Thresholds
Options options;
// More relaxed thresholds (requires more compaction resources)
options.level0_slowdown_writes_trigger = 30; // Was 20
options.level0_stop_writes_trigger = 50; // Was 36
Increasing thresholds allows more L0 files, which can slow down reads and require more compaction resources.
2. Increase Write Buffers
From advanced_options.h:259-282:
Options options;
options.write_buffer_size = 128 * 1024 * 1024; // 128 MB (was 64 MB)
options.max_write_buffer_number = 4; // More buffers
options.min_write_buffer_number_to_merge = 2;
Larger buffers mean:
- Fewer L0 files created
- More time between flushes
- Higher memory usage
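As a rule of thumb, the worst-case memtable memory for one column family is write_buffer_size multiplied by max_write_buffer_number. A quick sketch of the arithmetic (MemtableBudgetBytes is an illustrative helper, not a RocksDB API):

```cpp
#include <cstdint>

// Worst-case memtable memory for one column family, in bytes:
// all write buffers full at the same time.
inline uint64_t MemtableBudgetBytes(uint64_t write_buffer_size,
                                    int max_write_buffer_number) {
  return write_buffer_size * static_cast<uint64_t>(max_write_buffer_number);
}

// Example: 128 MB buffers x 4 buffers = 512 MB worst case per CF.
```

Multiply by the number of column families (or cap globally with db_write_buffer_size) when budgeting memory.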
3. Increase Compaction Resources
DBOptions db_options;
db_options.max_background_jobs = 8; // More threads for flush & compaction
// Or separately (deprecated in newer RocksDB releases in favor of
// max_background_jobs)
db_options.max_background_compactions = 6;
db_options.max_background_flushes = 2;
4. Adjust Compaction Limits
Options options;
options.soft_pending_compaction_bytes_limit = 128ULL * 1024 * 1024 * 1024; // 128 GB
options.hard_pending_compaction_bytes_limit = 512ULL * 1024 * 1024 * 1024; // 512 GB
5. Optimize Compaction Speed
// Use faster compression for hot levels
options.compression_per_level = {
kNoCompression, // L0 - no compression overhead
kLZ4Compression, // L1 - fast compression
kLZ4Compression, // L2
kZSTD, // L3+ - balanced
};
// Reduce compaction work per file
options.target_file_size_base = 128 * 1024 * 1024; // Larger files, fewer of them
6. Enable Dynamic Level Bytes
From advanced_options.h:581-665:
options.level_compaction_dynamic_level_bytes = true; // Default in newer versions
Dynamic level bytes:
- Adapts to write traffic
- Reduces write amplification
- More predictable LSM shape
Handling Stalls in Applications
Detect and Retry
Status PutWithRetry(DB* db, const Slice& key, const Slice& value) {
  WriteOptions write_opts;
  // Without this, Put() blocks inside RocksDB during a stall instead of
  // returning an error. With it, a stalled write fails fast with
  // Status::Incomplete().
  write_opts.no_slowdown = true;
  int retry_count = 0;
  while (retry_count < 3) {
    Status s = db->Put(write_opts, key, value);
    if (s.ok()) {
      return s;
    }
    if (s.IsIncomplete()) {
      // Write would have stalled; wait and retry
      LOG(WARNING) << "Write stalled, retrying in 100ms...";
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      retry_count++;
    } else {
      // Other error, don't retry
      return s;
    }
  }
  return Status::TimedOut("Write stalled after retries");
}
Proactive Monitoring
class StallMonitor {
public:
void CheckAndAlert(DB* db, std::shared_ptr<Statistics> stats) {
// Check L0 file count
uint64_t l0_files;
db->GetIntProperty("rocksdb.num-files-at-level0", &l0_files);
if (l0_files > level0_slowdown_threshold_ * 0.8) {
LOG(WARNING) << "L0 files approaching stall threshold: " << l0_files;
// Alert monitoring system
SendAlert("L0 files high", l0_files);
}
// Check pending compaction
uint64_t pending_bytes;
db->GetIntProperty(
"rocksdb.estimate-pending-compaction-bytes",
&pending_bytes
);
if (pending_bytes > soft_pending_limit_ * 0.8) {
LOG(WARNING) << "Pending compaction approaching limit: "
<< pending_bytes / (1024 * 1024 * 1024) << " GB";
SendAlert("Pending compaction high", pending_bytes);
}
// Check recent stalls
uint64_t stall_micros = stats->getTickerCount(STALL_MICROS);
if (stall_micros > last_stall_micros_) {
uint64_t new_stalls = stall_micros - last_stall_micros_;
LOG(ERROR) << "Write stalled for " << new_stalls / 1000 << " ms";
SendAlert("Write stall detected", new_stalls);
}
last_stall_micros_ = stall_micros;
}
private:
int level0_slowdown_threshold_ = 20;
uint64_t soft_pending_limit_ = 64ULL * 1024 * 1024 * 1024;
uint64_t last_stall_micros_ = 0;
};
Advanced Configuration
Rate Limiter
Control flush/compaction I/O to leave headroom for user requests:
#include "rocksdb/rate_limiter.h"
DBOptions db_options;
db_options.rate_limiter.reset(
NewGenericRateLimiter(
100 * 1024 * 1024, // 100 MB/s for background I/O
100 * 1000, // Refill period: 100ms
10 // Fairness
)
);
Disable Auto Compaction (Advanced)
ColumnFamilyOptions cf_options;
cf_options.disable_auto_compactions = true;
// Manually trigger compaction during low-traffic periods
db->CompactRange(CompactRangeOptions(), nullptr, nullptr);
Disabling auto compaction is dangerous. Only use if you have a sophisticated manual compaction strategy.
Troubleshooting Common Scenarios
Scenario 1: Sudden Write Spike
Symptoms: Writes stall during high-traffic periods
Solutions:
// 1. Increase write buffers to absorb spikes
options.write_buffer_size = 256 * 1024 * 1024; // Larger buffers
options.max_write_buffer_number = 6; // More buffers
// 2. More aggressive compaction
db_options.max_background_jobs = 12;
// 3. Relax stall triggers
options.level0_slowdown_writes_trigger = 30;
options.level0_stop_writes_trigger = 50;
Scenario 2: Slow Compaction
Symptoms: Persistent stalls even with moderate write rate
Solutions:
// 1. Faster compression
options.compression = kLZ4Compression; // Was kZSTD
// 2. More compaction threads
db_options.max_background_compactions = 8;
// 3. Larger files (fewer compactions)
options.target_file_size_base = 256 * 1024 * 1024;
// 4. Check if I/O is bottleneck - add rate limiter headroom
db_options.rate_limiter->SetBytesPerSecond(200 * 1024 * 1024);
Scenario 3: Small Writes, Many L0 Files
Symptoms: Frequent small flushes creating many L0 files
Solutions:
// 1. Larger write buffers
options.write_buffer_size = 128 * 1024 * 1024;
// 2. Merge multiple memtables before flush
options.min_write_buffer_number_to_merge = 3;
// 3. Level compaction dynamic level bytes
options.level_compaction_dynamic_level_bytes = true;
Best Practices
Prevention is better than cure: Configure RocksDB to handle your expected peak write rate with headroom.
- Monitor proactively: Alert when approaching stall conditions (80% of threshold)
- Size write buffers appropriately: Balance memory usage and flush frequency
- Provision compaction resources: Ensure max_background_jobs can handle write rate
- Use dynamic level bytes: Enables better adaptation to write patterns
- Test under load: Validate configuration with realistic write workloads
- Plan for bursts: Size buffers to absorb temporary write spikes
Recommended Starting Point
Options options;
// Write buffers
options.write_buffer_size = 128 * 1024 * 1024; // 128 MB
options.max_write_buffer_number = 4;
options.min_write_buffer_number_to_merge = 2;
// L0 stall conditions
options.level0_slowdown_writes_trigger = 20;
options.level0_stop_writes_trigger = 36;
// Compaction limits
options.soft_pending_compaction_bytes_limit = 64ULL * 1024 * 1024 * 1024;
options.hard_pending_compaction_bytes_limit = 256ULL * 1024 * 1024 * 1024;
// Background jobs
DBOptions db_options;
db_options.max_background_jobs = 8; // Adjust based on CPU cores
// Compression
options.compression_per_level = {
kNoCompression,
kLZ4Compression,
kLZ4Compression,
kZSTD,
};
// Dynamic level bytes
options.level_compaction_dynamic_level_bytes = true;
// Enable statistics
db_options.statistics = CreateDBStatistics();
See Also