What is Exposed
Metrics are exposed from two places:
Operator Controller (:9090/metrics)
The operator controller exposes metrics about cluster management:
- redis_cluster_phase - Current phase of each Redis cluster
- redis_cluster_instances_total - Total number of instances per cluster
- redis_failover_total - Count of failover operations
- redis_reconcile_duration_seconds - Reconciliation loop duration
- redis_last_successful_backup_timestamp - Timestamp of last successful backup
- redis_backup_phase_count - Count of backups by phase
Redis Instance Manager (:8080/metrics)
Each data pod runs an instance manager that exposes Redis-level metrics:
Availability and role:
- redis_up - Redis instance health (1 = up, 0 = down)
- redis_instance_info - Instance metadata (role, version, etc.)
Replication:
- redis_replication_lag_bytes - Replication lag in bytes
- redis_connected_replicas - Number of connected replicas
- redis_master_link_up - Master link status for replicas
Memory:
- redis_used_memory_bytes - Current memory usage
- redis_maxmemory_bytes - Maximum memory limit
- redis_mem_fragmentation_ratio - Memory fragmentation ratio
- redis_evicted_keys_total - Total number of evicted keys
Connections:
- redis_connected_clients - Number of connected clients
- redis_blocked_clients - Number of blocked clients
- redis_rejected_connections_total - Total rejected connections
Command throughput:
- redis_instantaneous_ops_per_sec - Current operations per second
- redis_command_calls_total - Total command calls by command type
- redis_keyspace_hits_total - Total keyspace hits
- redis_keyspace_misses_total - Total keyspace misses
Persistence:
- redis_rdb_last_save_timestamp - Last RDB save timestamp
- redis_rdb_last_bgsave_duration_seconds - Last BGSAVE duration
- redis_aof_last_rewrite_duration_seconds - Last AOF rewrite duration
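
Both endpoints serve the Prometheus text exposition format, so the metrics listed above can be read with a few lines of code when debugging outside Prometheus. The following is a minimal sketch, not part of the operator: the `parse_metrics` helper, the sample payload, and the `pod` label are illustrative assumptions, and the regular expression ignores timestamps and does not handle label values that contain `}`.

```python
import re

# Matches 'name{labels} value' or 'name value'. Simplified: label values
# containing '}' are not handled, and trailing timestamps are ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {metric_name: [(labels_string, float_value), ...]}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name = match.group(1)
        labels = match.group(2) or ""
        samples.setdefault(name, []).append((labels, float(match.group(3))))
    return samples


# Hypothetical scrape output; the label name "pod" is an assumption.
SAMPLE = """\
# HELP redis_up Redis instance health
# TYPE redis_up gauge
redis_up{pod="example-0"} 1
redis_connected_replicas{pod="example-0"} 2
redis_used_memory_bytes{pod="example-0"} 1048576
"""

metrics = parse_metrics(SAMPLE)
print(metrics["redis_up"])
```

For anything beyond ad hoc inspection, a maintained Prometheus client parser is a safer choice than a hand-written regular expression.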
Helm Setup
Enable monitoring components in your Helm values (values.yaml). The monitoring components are:
- ServiceMonitor for operator metrics scraping
- PodMonitor for instance-manager metrics scraping
- PrometheusRule with default Redis alerts
- ConfigMap with Grafana dashboard (auto-discovered via grafana_dashboard: "1" label)
Requires Prometheus Operator to be installed in your cluster.
Grafana Dashboard
The bundled dashboard is located at charts/redis-operator/dashboards/redis-overview.json and includes:
- Cluster overview - Phase, desired/healthy/unhealthy instances
- Replication health - Lag, connected replicas, master link status
- Memory and connections - Usage, fragmentation, client connections
- Command throughput - Operations per second, hit ratio
- Persistence - RDB/AOF backup visibility
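
Two of the panel values above, hit ratio and memory usage, are derived from raw metrics rather than exposed directly. The sketch below shows one common way to compute them from redis_keyspace_hits_total, redis_keyspace_misses_total, redis_used_memory_bytes, and redis_maxmemory_bytes. It is an assumption about how such panels are typically built, not a copy of the bundled dashboard's queries, and the function names are hypothetical.

```python
def hit_ratio(hits, misses):
    """Fraction of keyspace lookups that found a key, or None if there were none."""
    total = hits + misses
    return hits / total if total > 0 else None


def memory_usage_ratio(used_bytes, max_bytes):
    """Used memory as a fraction of maxmemory.

    Redis reports maxmemory as 0 when no limit is configured, in which case
    a ratio is not meaningful and None is returned.
    """
    return used_bytes / max_bytes if max_bytes > 0 else None


print(hit_ratio(900, 100))
print(memory_usage_ratio(850, 1000))
print(memory_usage_ratio(850, 0))
```

Because the hit and miss metrics are counters, a dashboard would normally apply this formula to per-window increases rather than to the lifetime totals.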
Installation Options
Automatic (Recommended)
Keep monitoring.grafanaDashboard.enabled=true in Helm values. Grafana’s sidecar will automatically discover and import the dashboard.
Alerting Rules
The operator creates the following default alerts:
RedisPrimaryUnavailable
Fires when no primary instance is available for a cluster. Severity: Critical
RedisReplicationLagHigh
Fires when replication lag exceeds the configured threshold. Severity: Warning. Default threshold: 10 MB
RedisMemoryUsageHigh
Fires when memory usage exceeds the configured ratio. Severity: Warning. Default threshold: 85%
RedisBackupMissing
Fires when no successful backup has completed within the configured window. Severity: Warning. Default threshold: 24 hours
RedisInstanceDown
Fires when a Redis instance is down. Severity: Critical
Tuning Alert Thresholds
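
To make the default thresholds concrete before changing them, the sketch below restates four of the default alerts as plain predicates over a single instance's metric values. It is illustrative only: the real rules are PromQL expressions in the PrometheusRule, which are not reproduced here, and the `Thresholds` class, the `firing_alerts` function, and the reading of "10 MB" as 10 × 1024 × 1024 bytes are assumptions. RedisPrimaryUnavailable is omitted because it depends on cluster-wide state.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # "10 MB" from the defaults; the binary interpretation is an assumption.
    replication_lag_bytes: float = 10 * 1024 * 1024
    memory_usage_ratio: float = 0.85
    backup_max_age_seconds: float = 24 * 3600


def firing_alerts(sample, now, thresholds=None):
    """Return the names of the default alerts that this sample would trip."""
    t = thresholds or Thresholds()
    alerts = []
    if sample["redis_up"] == 0:
        alerts.append("RedisInstanceDown")
    if sample["redis_replication_lag_bytes"] > t.replication_lag_bytes:
        alerts.append("RedisReplicationLagHigh")
    max_bytes = sample["redis_maxmemory_bytes"]
    # A maxmemory of 0 means no limit, so the ratio is skipped.
    if max_bytes > 0 and sample["redis_used_memory_bytes"] / max_bytes > t.memory_usage_ratio:
        alerts.append("RedisMemoryUsageHigh")
    if now - sample["redis_last_successful_backup_timestamp"] > t.backup_max_age_seconds:
        alerts.append("RedisBackupMissing")
    return alerts


now = 1_700_000_000  # fixed Unix timestamp so the example is reproducible
sample = {
    "redis_up": 1,
    "redis_replication_lag_bytes": 20 * 1024 * 1024,
    "redis_used_memory_bytes": 900,
    "redis_maxmemory_bytes": 1000,
    "redis_last_successful_backup_timestamp": now - 3600,
}
result = firing_alerts(sample, now)
print(result)
```

Passing a custom `Thresholds` instance, for example `Thresholds(memory_usage_ratio=0.95)`, shows how a looser threshold changes which alerts would fire for the same sample.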
Global Thresholds
Configure global alert thresholds in Helm values (values.yaml).
Per-Cluster Customization
Create additional PrometheusRule resources for cluster-specific thresholds (for example, in custom-alerts.yaml).
Keep the default PrometheusRule enabled for baseline coverage, then layer on additional rules for specific clusters.
Custom Query Packs
For organization-specific monitoring needs, version custom PromQL queries in ConfigMaps (for example, in custom-queries.yaml).
Validation
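
One way to validate the setup is to confirm that the expected metric names actually appear in a scraped payload. The sketch below is a hypothetical check, not the procedure from the original documentation: the helper names are invented, and the URL in the usage note assumes the operator's metrics port has been forwarded to localhost. A metric counts as present if a series with that exact name exists or if a suffixed series (such as `_sum` or `_count`) exists, since the metric types are not stated here.

```python
import re
from urllib.request import urlopen

# Operator controller metric names, taken from the list earlier in this page.
EXPECTED_OPERATOR_METRICS = [
    "redis_cluster_phase",
    "redis_cluster_instances_total",
    "redis_failover_total",
    "redis_reconcile_duration_seconds",
    "redis_last_successful_backup_timestamp",
    "redis_backup_phase_count",
]


def exposed_names(text):
    """Collect the series names found in a Prometheus text payload."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # The series name ends at the first '{' or whitespace character.
            names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


def missing_metrics(text, expected):
    """Return the expected metrics that have no exact or suffixed series."""
    present = exposed_names(text)
    return [
        name
        for name in expected
        if not any(p == name or p.startswith(name + "_") for p in present)
    ]


def fetch(url, timeout=5):
    """Download a metrics payload; defined for convenience, not called here."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical partial payload used to exercise the check offline.
PAYLOAD = """\
# TYPE redis_cluster_phase gauge
redis_cluster_phase{cluster="example"} 1
redis_reconcile_duration_seconds_sum 12.5
redis_reconcile_duration_seconds_count 40
"""

missing = missing_metrics(PAYLOAD, EXPECTED_OPERATOR_METRICS)
print(missing)
```

Against a live operator, `missing_metrics(fetch("http://localhost:9090/metrics"), EXPECTED_OPERATOR_METRICS)` would run the same check, assuming that address reaches the controller's metrics endpoint.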
Performance Impact
Metrics collection has minimal overhead:
- Controller metrics: ~1 MB memory, negligible CPU
- Instance manager metrics: ~2-5 MB memory per pod, 1% CPU
- Scrape interval: Default 30s (configurable in ServiceMonitor/PodMonitor)
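
The scrape interval also determines how many samples Prometheus ingests, which is worth estimating before lowering it. The arithmetic below is a rough sketch: `samples_per_day` is a hypothetical helper, and the series count in the second call is an arbitrary illustration rather than a measured figure for this operator.

```python
SECONDS_PER_DAY = 86400


def samples_per_day(scrape_interval_seconds, series_count=1):
    """Samples ingested per day for a number of series at a fixed interval."""
    return (SECONDS_PER_DAY // scrape_interval_seconds) * series_count


# Default 30s interval, one series.
print(samples_per_day(30))  # 2880

# Halving the interval doubles the samples; 100 series is an arbitrary example.
print(samples_per_day(15, 100))  # 576000
```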