Learn how to check what metrics are available for monitoring your service using Prometheus, and find out which of the available metrics are particularly worth monitoring and why.

About the Prometheus integration

The Prometheus integration allows you to monitor your Aiven services and understand their resource usage. Using this integration, you can track system metrics that are not specific to any one service type and that provide insight into infrastructure performance. To start monitoring metrics with Prometheus, configure the Prometheus integration and set up the Prometheus server.

Get a list of available service metrics

To discover the metrics available for your services, make an HTTP GET request to your Prometheus service endpoint.

Step 1: Collect connection information

Once your Prometheus integration is configured, collect the following details from the Aiven Console:
  1. Navigate to the Overview page of your service
  2. Go to the Connection information section
  3. Open the Prometheus tab
  4. Copy:
    • Prometheus URL
    • Username
    • Password

Step 2: Request metrics snapshot

Make a request to get a snapshot of your metrics:
curl -k --user USERNAME:PASSWORD PROMETHEUS_URL/metrics
Replace USERNAME, PASSWORD, and PROMETHEUS_URL with your service values.

Step 3: Review available metrics

The output provides a full list of metrics available for your service.
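
The snapshot is returned in the Prometheus text exposition format: one metric per line, with optional labels in braces, followed by the value. As a rough illustration, here is a minimal Python sketch that parses a few sample lines; the sample values are made up, and parse_metrics is a hypothetical helper, not part of any Aiven tooling:

```python
import re

# Sample lines in the Prometheus text exposition format.
# Values are illustrative, not real service output.
sample = """\
# HELP cpu_usage_idle Percentage of time the CPU is idle
# TYPE cpu_usage_idle gauge
cpu_usage_idle{cpu="cpu-total"} 97.5
disk_used_percent{path="/"} 41.2
"""

# Metric name, optional {label="value"} block, then the sample value.
METRIC_LINE = re.compile(r'^(\w+)(\{[^}]*\})?\s+(\S+)$')

def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comment lines."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = METRIC_LINE.match(line)
        if m:
            name, labels, value = m.groups()
            out.append((name, labels or "", float(value)))
    return out

for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```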

Key system metrics

CPU usage metrics

CPU usage metrics help determine if the CPU is constantly being maxed out.

High-level CPU usage

For a single-CPU service, get an overview of CPU usage with:
100 - cpu_usage_idle{cpu="cpu-total"}
A process with a nice value larger than 0 is categorized as cpu_usage_nice, which is not included in cpu_usage_user.

CPU I/O wait monitoring

Monitor cpu_usage_iowait{cpu="cpu-total"} to detect I/O bottlenecks. A high value indicates that the service node is working on something I/O intensive. For example, if cpu_usage_iowait{cpu="cpu-total"} equals 40, the CPU is idle while waiting for disk or network I/O operations 40% of the time.

CPU metrics reference

These metrics are generated from the Telegraf CPU plugin:
Metric | Description
cpu_usage_idle | Percentage of time the CPU is idle
cpu_usage_system | Percentage of time kernel code is consuming the CPU
cpu_usage_user | Percentage of time the CPU is in user-space programs with nice ≤ 0
cpu_usage_nice | Percentage of time the CPU is in user-space programs with nice > 0
cpu_usage_iowait | Percentage of time the CPU is idle while the system has pending disk I/O operations
cpu_usage_steal | Percentage of time waiting for the hypervisor to give CPU cycles to the VM
cpu_usage_irq | Percentage of time the system is handling interrupts
cpu_usage_softirq | Percentage of time the system is handling software interrupts
cpu_usage_guest | Percentage of time the CPU is running for a guest OS
cpu_usage_guest_nice | Percentage of time the CPU is running for a guest OS with low priority
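
These categories partition CPU time, so across one measurement interval they sum to roughly 100%. A small sketch with illustrative values (guest time is accounted within user time in Linux, so the guest categories are omitted here):

```python
# Illustrative per-category CPU percentages; real values come from the
# Prometheus metrics endpoint, not from this dictionary.
cpu = {
    "idle": 85.0, "system": 4.0, "user": 6.0, "nice": 1.0,
    "iowait": 2.0, "steal": 0.5, "irq": 0.5, "softirq": 1.0,
}

# The categories cover all CPU time, so they sum to (approximately) 100%.
total = sum(cpu.values())

# Total CPU usage, as in the query 100 - cpu_usage_idle{cpu="cpu-total"}.
busy = 100 - cpu["idle"]
print(total, busy)
```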

Disk usage metrics

Monitoring disk usage ensures that applications or processes don’t fail due to insufficient disk storage.
Consider monitoring disk_used_percent and disk_free to prevent storage-related issues.

Disk metrics reference

Metric | Description
disk_free | Free space on the service disk
disk_used | Used space on the disk, in bytes (e.g., 8.0e+9 = 8,000,000,000 bytes)
disk_total | Total space on the disk (free and used)
disk_used_percent | Percentage of disk space used: disk_used / disk_total * 100 (e.g., 80 = 80% usage)
disk_inodes_free | Number of index nodes available on the service disk
disk_inodes_used | Number of index nodes used on the service disk
disk_inodes_total | Total number of index nodes on the service disk
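
The relationship between these metrics is simple arithmetic; a sketch with illustrative values shows how disk_used_percent is derived:

```python
# Illustrative values in bytes; real values come from the metrics endpoint.
disk_total = 80_000_000_000
disk_used = 64_000_000_000

# disk_free is the remainder of total after used space.
disk_free = disk_total - disk_used

# disk_used_percent = disk_used / disk_total * 100
disk_used_percent = disk_used / disk_total * 100
print(disk_used_percent)  # 80.0
```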

Memory usage metrics

Memory consumption metrics are essential to ensure the performance of your service.
Consider monitoring mem_available (in bytes) or mem_available_percent, as this represents the estimated amount of memory available for applications without swapping.

Key memory metrics

  • mem_available - Available memory for applications (bytes)
  • mem_available_percent - Available memory as a percentage
  • mem_total - Total system memory
  • mem_used - Used memory
  • mem_free - Free memory
  • mem_cached - Cached memory
  • mem_buffered - Buffered memory
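
mem_available differs from mem_free because memory held in caches and buffers can be reclaimed for applications. A rough sketch with illustrative values; note that free + cached + buffered is only an approximation of the kernel's actual estimate:

```python
# Illustrative values in bytes; real values come from the metrics endpoint.
mem_total = 8_000_000_000
mem_free = 500_000_000
mem_cached = 2_500_000_000
mem_buffered = 200_000_000

# mem_available estimates memory usable without swapping. As a rough
# approximation it is close to free + cached + buffered, though the
# kernel's actual estimate differs slightly.
mem_available = mem_free + mem_cached + mem_buffered
mem_available_percent = mem_available / mem_total * 100
print(round(mem_available_percent, 1))  # 40.0
```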

Network usage metrics

Monitoring the network provides visibility into your network utilization and traffic, allowing you to act immediately in case of network issues.
It may be worth monitoring the number of established TCP sessions, available in the netstat_tcp_established metric.

Key network metrics

  • netstat_tcp_established - Number of established TCP connections
  • net_bytes_sent - Bytes sent over the network
  • net_bytes_recv - Bytes received from the network
  • net_packets_sent - Packets sent over the network
  • net_packets_recv - Packets received from the network
  • net_err_in - Inbound network errors
  • net_err_out - Outbound network errors
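
The byte and packet metrics are cumulative counters, so throughput comes from the difference between two samples over time, which is what PromQL's rate() function computes. A sketch with illustrative values:

```python
# Two samples of the net_bytes_sent counter, 60 seconds apart.
# Values are illustrative.
t1, bytes_sent_1 = 0, 1_000_000
t2, bytes_sent_2 = 60, 7_000_000

# Throughput in bytes per second, analogous to rate(net_bytes_sent[1m]).
throughput_bps = (bytes_sent_2 - bytes_sent_1) / (t2 - t1)
print(throughput_bps)  # 100000.0
```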

Monitoring best practices

Alerting thresholds

Consider setting alerts for:
  • CPU usage - Alert when 100 - cpu_usage_idle > 80% for extended periods
  • CPU I/O wait - Alert when cpu_usage_iowait > 30% consistently
  • Disk usage - Alert when disk_used_percent > 85%
  • Memory available - Alert when mem_available_percent < 15%
  • TCP connections - Alert on unusual changes in netstat_tcp_established
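
As a sketch, the thresholds above could be expressed as a Prometheus rule file. The group name, alert names, for durations, and severity labels are illustrative; this assumes your Prometheus server loads rule files via its rule_files configuration:

```yaml
# Illustrative alerting rules for the thresholds listed above.
groups:
  - name: aiven-service-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - cpu_usage_idle{cpu="cpu-total"} > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% for an extended period"
      - alert: HighCpuIoWait
        expr: cpu_usage_iowait{cpu="cpu-total"} > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU I/O wait above 30%, possible I/O bottleneck"
      - alert: HighDiskUsage
        expr: disk_used_percent > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage above 85%"
      - alert: LowMemoryAvailable
        expr: mem_available_percent < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Available memory below 15%"
```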

Query examples

Total CPU usage for a single-CPU service:
100 - cpu_usage_idle{cpu="cpu-total"}
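
A few more expressions built from the metrics listed above may be worth keeping at hand; label matchers may need adjusting for your service:

```promql
# I/O wait on the total CPU
cpu_usage_iowait{cpu="cpu-total"}

# Disk usage percentage
disk_used_percent

# Available memory as a percentage of total
mem_available_percent

# Established TCP connections
netstat_tcp_established
```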

Visualization recommendations

For effective monitoring dashboards:
  1. Overview panel - Display CPU, memory, disk, and network at a glance
  2. CPU breakdown - Show user, system, and I/O wait separately
  3. Disk trends - Plot usage over time to predict capacity needs
  4. Memory pressure - Track available memory and swap usage
  5. Network traffic - Monitor bytes sent/received and error rates

Troubleshooting

No metrics appearing

  • Verify the Prometheus integration is enabled on your service
  • Check credentials (username and password) are correct
  • Confirm the Prometheus URL is accessible
  • Allow a few minutes for initial metrics to populate

Incomplete metrics

  • Some metrics may not be available for all service types
  • Check service logs for any integration errors
  • Verify the Prometheus server is configured correctly

High cardinality warnings

  • Limit the number of unique label combinations
  • Use Prometheus recording rules for frequently queried metrics
  • Consider adjusting retention policies
