About the Prometheus integration
The Prometheus integration allows you to monitor your Aiven services and understand their resource usage. Using this integration, you can track non-service-specific system metrics that provide insights into infrastructure performance. To start using Prometheus for monitoring metrics, configure the Prometheus integration and set up the Prometheus server.
Get a list of available service metrics
To discover the metrics available for your services, make an HTTP GET request to your Prometheus service endpoint.
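As a rough illustration, such a request could be sketched in Python with only the standard library. This is a sketch under assumptions, not a confirmed recipe: it assumes the endpoint serves metrics under a `/metrics` path and accepts HTTP basic authentication. The function names `build_metrics_request` and `fetch_metrics` and the example host are hypothetical.

```python
import base64
import urllib.request


def build_metrics_request(prometheus_url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for a metrics snapshot."""
    # Assumption: metrics are exposed under /metrics and protected by HTTP basic auth.
    url = prometheus_url.rstrip("/") + "/metrics"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )


def fetch_metrics(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the plain-text snapshot."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical host; substitute the Prometheus URL, username, and password of your service.
example_request = build_metrics_request(
    "https://prometheus.example.com:9273", "USERNAME", "PASSWORD"
)
print(example_request.get_method(), example_request.full_url)
```

Building the request separately from sending it makes it possible to inspect the URL and headers before any network call is made. Certificate verification follows Python's defaults here; a service with its own CA certificate would likely need an explicit SSL context.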
Collect connection information
Once your Prometheus integration is configured, collect the following details from Aiven Console:
- Navigate to the Overview page of your service
- Go to the Connection information section
- Open the Prometheus tab
- Copy:
  - Prometheus URL
  - Username
  - Password
Request metrics snapshot
Make a request to get a snapshot of your metrics. Replace USERNAME, PASSWORD, and PROMETHEUS_URL in the request with your service values.
Key system metrics
CPU usage metrics
CPU usage metrics help determine if the CPU is constantly being maxed out.
High-level CPU usage
For a single CPU service, get an overview with 100 - cpu_usage_idle{cpu="cpu-total"}, the percentage of time the CPU is not idle. A process with a nice value larger than 0 is categorized as cpu_usage_nice, which is not included in cpu_usage_user.
CPU I/O wait monitoring
Monitor cpu_usage_iowait{cpu="cpu-total"} to detect I/O bottlenecks. A high value indicates that the service node is working on something I/O intensive. For example, if cpu_usage_iowait{cpu="cpu-total"} equals 40, the CPU is idle waiting for disk or network I/O operations for 40% of the time.
CPU metrics reference
These metrics are generated from the Telegraf CPU plugin:
| Metric | Description |
|---|---|
| cpu_usage_idle | Percentage of time the CPU is idle |
| cpu_usage_system | Percentage of time kernel code is consuming the CPU |
| cpu_usage_user | Percentage of time the CPU is in user-space programs with nice ≤ 0 |
| cpu_usage_nice | Percentage of time the CPU is in user-space programs with nice > 0 |
| cpu_usage_iowait | Percentage of time the CPU is idle when the system has pending disk I/O operations |
| cpu_usage_steal | Percentage of time waiting for the hypervisor to give CPU cycles to the VM |
| cpu_usage_irq | Percentage of time the system is handling interrupts |
| cpu_usage_softirq | Percentage of time the system is handling software interrupts |
| cpu_usage_guest | Percentage of time the CPU is running for a guest OS |
| cpu_usage_guest_nice | Percentage of time the CPU is running for a guest OS with low priority |
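To make the CPU metrics concrete, the sketch below parses a few lines in the Prometheus text exposition format and derives the high-level usage as 100 minus the idle percentage. The sample values and the `host` label are invented for illustration, and the parser is deliberately simplified (for instance, it does not handle escaped quotes inside label values).

```python
import re

# Invented sample; a real snapshot contains many more series and labels.
SAMPLE = """\
# TYPE cpu_usage_idle gauge
cpu_usage_idle{cpu="cpu-total",host="demo-1"} 55
cpu_usage_user{cpu="cpu-total",host="demo-1"} 3
cpu_usage_system{cpu="cpu-total",host="demo-1"} 2
cpu_usage_iowait{cpu="cpu-total",host="demo-1"} 40
"""

LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_snapshot(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def cpu_totals(samples):
    """Keep only cpu_usage_* series aggregated over all cores (cpu="cpu-total")."""
    return {
        name: value
        for name, labels, value in samples
        if name.startswith("cpu_usage_") and labels.get("cpu") == "cpu-total"
    }


totals = cpu_totals(parse_snapshot(SAMPLE))
busy_percent = 100.0 - totals["cpu_usage_idle"]
print(f"busy: {busy_percent}%, of which iowait: {totals['cpu_usage_iowait']}%")
```

In this invented sample most of the non-idle time is I/O wait rather than user or system time, which is the pattern the I/O wait section above describes.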
Disk usage metrics
Monitoring disk usage ensures that applications or processes don’t fail due to insufficient disk storage. Consider monitoring disk_used_percent and disk_free to prevent storage-related issues.
Disk metrics reference
| Metric | Description |
|---|---|
| disk_free | Free space on the service disk |
| disk_used | Used space on the disk (e.g., 8.0e+9 = 8,000,000,000 bytes) |
| disk_total | Total space on the disk (free and used) |
| disk_used_percent | Percentage of disk space used: disk_used / disk_total * 100 (e.g., 80 = 80% usage) |
| disk_inodes_free | Number of index nodes available on the service disk |
| disk_inodes_used | Number of index nodes used on the service disk |
| disk_inodes_total | Total number of index nodes on the service disk |
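The percentage formula in the table can be checked with a few lines of Python. The helper name `used_percent` and the input values are illustrative. Applying the same ratio to the inode counters is an assumption of this sketch, not a metric listed above.

```python
def used_percent(used: float, total: float) -> float:
    """Share of a resource in use, following used / total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


# Illustrative values: 8,000,000,000 bytes used on a 10,000,000,000 byte disk.
disk_pct = used_percent(8.0e9, 10.0e9)

# Same ratio applied to inode counts (disk_inodes_used / disk_inodes_total).
inode_pct = used_percent(250_000, 1_000_000)

print(disk_pct, inode_pct)
```

Watching the inode ratio alongside the byte ratio can be useful because a disk holding many small files may run out of index nodes before it runs out of bytes.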
Memory usage metrics
Memory consumption metrics are essential to ensure the performance of your service. Consider monitoring mem_available (in bytes) or mem_available_percent, as this represents the estimated amount of memory available for applications without swapping.
Key memory metrics
- mem_available - Available memory for applications (bytes)
- mem_available_percent - Available memory as a percentage
- mem_total - Total system memory
- mem_used - Used memory
- mem_free - Free memory
- mem_cached - Cached memory
- mem_buffered - Buffered memory
Network usage metrics
Monitoring the network provides visibility into your network utilization and traffic, allowing you to act immediately in case of network issues. It may be worth monitoring the number of established TCP sessions available in the netstat_tcp_established metric.
Key network metrics
- netstat_tcp_established - Number of established TCP connections
- net_bytes_sent - Bytes sent over the network
- net_bytes_recv - Bytes received from the network
- net_packets_sent - Packets sent over the network
- net_packets_recv - Packets received from the network
- net_err_in - Inbound network errors
- net_err_out - Outbound network errors
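Byte and packet metrics are usually most useful as rates. The sketch below assumes that net_bytes_sent and the related metrics are cumulative counters (as they are in Telegraf's net input plugin) and derives a per-second rate from two snapshots. The function name, sample values, and the counter-reset handling are illustrative choices.

```python
def rate_per_second(prev_value: float, prev_ts: float, curr_value: float, curr_ts: float) -> float:
    """Average per-second increase of a cumulative counter between two samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_value < prev_value:
        # Assumption: a lower value means the counter restarted from zero
        # (for example after a node restart), so count only the new value.
        return curr_value / elapsed
    return (curr_value - prev_value) / elapsed


# Two invented net_bytes_sent samples taken 60 seconds apart.
bytes_per_second = rate_per_second(1_000_000, 0.0, 4_000_000, 60.0)
print(bytes_per_second)
```

Within Prometheus itself, the `rate()` function performs this kind of calculation, including counter-reset handling, over a range of samples.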
Monitoring best practices
Alerting thresholds
Consider setting alerts for:
- CPU usage - Alert when 100 - cpu_usage_idle > 80% for extended periods
- CPU I/O wait - Alert when cpu_usage_iowait > 30% consistently
- Disk usage - Alert when disk_used_percent > 85%
- Memory available - Alert when mem_available_percent < 15%
- TCP connections - Alert on unusual changes in netstat_tcp_established
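The numeric thresholds above can be expressed as a small check over a single snapshot. This is only a sketch: the function name and alert labels are hypothetical, the sample values are invented, and it evaluates one point in time, so it does not capture "for extended periods" or "consistently". In practice those conditions would typically be handled by alerting rules with a duration.

```python
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float]) -> List[str]:
    """Return the names of the thresholds that a single snapshot exceeds."""
    alerts = []
    if 100.0 - metrics["cpu_usage_idle"] > 80.0:
        alerts.append("cpu_usage")
    if metrics["cpu_usage_iowait"] > 30.0:
        alerts.append("cpu_iowait")
    if metrics["disk_used_percent"] > 85.0:
        alerts.append("disk_usage")
    if metrics["mem_available_percent"] < 15.0:
        alerts.append("memory_available")
    return alerts


# Invented snapshot values for illustration.
snapshot = {
    "cpu_usage_idle": 12.0,
    "cpu_usage_iowait": 35.0,
    "disk_used_percent": 70.0,
    "mem_available_percent": 10.0,
}
firing = check_thresholds(snapshot)
print(firing)
```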
Query examples
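As an illustration, the expressions below are assembled from the metric names and labels used on this page and wrapped into instant-query URLs for the Prometheus HTTP API (`/api/v1/query`). The server address, the dictionary keys, and the helper name are assumptions. Depending on your setup, the expressions may need extra label matchers to select a specific service or host.

```python
from urllib.parse import urlencode

# PromQL expressions built from metrics described on this page (illustrative).
QUERIES = {
    "cpu_busy_percent": '100 - cpu_usage_idle{cpu="cpu-total"}',
    "cpu_iowait_percent": 'cpu_usage_iowait{cpu="cpu-total"}',
    "disk_used_percent": "disk_used_percent",
    "mem_available_percent": "mem_available_percent",
    "tcp_established": "netstat_tcp_established",
}


def instant_query_url(server: str, expression: str) -> str:
    """Build an instant-query URL for a Prometheus server's HTTP API."""
    return server.rstrip("/") + "/api/v1/query?" + urlencode({"query": expression})


# Hypothetical address of your own Prometheus server.
SERVER = "http://localhost:9090"
for name, expression in QUERIES.items():
    print(name, instant_query_url(SERVER, expression))
```

The same expressions can be pasted directly into the Prometheus expression browser or a dashboard panel. The URL form is only needed when querying programmatically.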
Visualization recommendations
For effective monitoring dashboards:
- Overview panel - Display CPU, memory, disk, and network at a glance
- CPU breakdown - Show user, system, and I/O wait separately
- Disk trends - Plot usage over time to predict capacity needs
- Memory pressure - Track available memory and swap usage
- Network traffic - Monitor bytes sent/received and error rates
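For the disk trends panel, one simple way to turn a usage plot into a capacity estimate is a straight-line fit. The sketch below is a minimal illustration that assumes roughly linear growth. The function name and the sample data are invented, and real usage patterns may call for a more careful model.

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]], full_percent: float = 100.0) -> Optional[float]:
    """Estimate days from the last sample until usage reaches full_percent.

    samples holds (day, disk_used_percent) pairs; returns None if usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (full_percent - intercept) / slope
    return day_full - samples[-1][0]


# Invented daily readings of disk_used_percent, growing by 2 points per day.
history = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]
remaining = days_until_full(history)
print(remaining)
```

PromQL offers `predict_linear()` for a similar extrapolation directly inside Prometheus.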
Troubleshooting
No metrics appearing
- Verify the Prometheus integration is enabled on your service
- Check credentials (username and password) are correct
- Confirm the Prometheus URL is accessible
- Allow a few minutes for initial metrics to populate
Incomplete metrics
- Some metrics may not be available for all service types
- Check service logs for any integration errors
- Verify the Prometheus server is configured correctly
High cardinality warnings
- Limit the number of unique label combinations
- Use Prometheus recording rules for frequently queried metrics
- Consider adjusting retention policies