Monitoring overview
Aiven offers multiple levels of monitoring:
Built-in Metrics
- Real-time service metrics
- CPU, memory, disk, network
- Service-specific metrics
- Available in Console
Service Logs
- Service operation logs
- Error and debug messages
- Connection logs
- 4-day retention
Audit Logs
- Organization events
- Project events
- User actions
- Configuration changes
Service metrics
View real-time metrics for all your services:
Built-in metrics
Available for every service without additional configuration:
- Host Metrics
- Service-Specific
Host metrics are infrastructure-level metrics:
- Percentage of CPU resources consumed by the service
- Percentage of memory utilized by the service
- Percentage of disk space used
- 5-minute average CPU load indicating system computational load
- Input/output operations per second for disk reads
- Input/output operations per second for disk writes
- Network traffic received by the service
- Network traffic transmitted by the service
Viewing metrics
Advanced metrics integration
For detailed service-specific metrics, set up metrics integration:
Metrics integration requires separate PostgreSQL and Grafana services (additional cost). Predefined dashboards are automatically created and maintained.
Service logs
View logs for troubleshooting and monitoring:
Accessing service logs
- Aiven Console
- Aiven CLI
Log retention
Log integration
Send logs to OpenSearch for long-term storage and analysis:
All services in the project can send logs to the same OpenSearch service. Create one OpenSearch service for centralized logging.
Audit logs
Track administrative actions and changes:
Organization audit logs
View organization-level events:
Project event logs
View project-level events:
API access to audit logs
Organization audit logs require organization:audit_logs:read permission. Project logs require project:audit_logs:read permission.
Prometheus integration
Expose metrics in Prometheus format for scraping:
Enabling Prometheus
Create Prometheus endpoint
Navigate to project → Integration endpoints → Add new endpoint → Prometheus
Enable for service
Service → Overview → Service integrations → Manage integrations → Prometheus → Enable
Get metrics endpoint
Service → Overview → Connection information → Prometheus tab
Copy the Service URI and credentials
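Once the Service URI and credentials are copied, a scraper needs to attach them to each request. The following is a minimal sketch, assuming the endpoint accepts HTTP Basic authentication with those credentials; the URL, username, and password are placeholders, and `build_scrape_request` is a hypothetical helper name, not part of any Aiven tooling.

```python
import base64
import urllib.request


def build_scrape_request(service_uri: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request that carries HTTP Basic credentials (assumed auth scheme)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(service_uri, headers={"Authorization": f"Basic {token}"})


if __name__ == "__main__":
    # Placeholder values: substitute the ones copied from the Prometheus tab.
    request = build_scrape_request(
        "https://example.invalid:9273/metrics", "prom-user", "prom-password"
    )
    print(request.full_url)
    print(request.get_header("Authorization"))
```

Passing the returned request to `urllib.request.urlopen` would perform the actual scrape. Depending on how the endpoint's TLS certificate is issued, you may also need to supply a CA certificate through an `ssl.SSLContext`.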
Prometheus in VPC
If using VPC peering, enable public Prometheus access:
Prometheus metrics
Available metrics include:
- System metrics: CPU, memory, disk, network
- Service metrics: Connections, queries, cache hits
- Custom metrics: Application-specific (service dependent)
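To see which of these categories a service actually exposes, it can help to group a scraped payload by metric name. The sketch below parses the Prometheus text exposition format under simplifying assumptions: one sample per line, no trailing timestamps, and no escaped braces in label values. The metric names in `SAMPLE` are invented for illustration and are not taken from a real Aiven service.

```python
from collections import defaultdict

# Hypothetical payload; real metric names depend on the service type.
SAMPLE = """\
# HELP cpu_usage_user hypothetical metric for illustration
# TYPE cpu_usage_user gauge
cpu_usage_user{cpu="cpu-total",host="demo-1"} 12.5
disk_used_percent{host="demo-1"} 41
"""


def parse_samples(text: str) -> dict:
    """Group samples by metric name; each entry is (label string, value)."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated field (assumes no timestamp).
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[name].append((labels.rstrip("}"), float(value)))
    return dict(samples)


if __name__ == "__main__":
    for name, entries in parse_samples(SAMPLE).items():
        print(name, entries)
```

For production scraping, a Prometheus server or an established client library is the more robust choice; this parser is only meant for quick inspection.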
External integrations
Integrate Aiven metrics and logs with external platforms:
Datadog integration
Send metrics to Datadog:
Create Datadog endpoint
Project → Integration endpoints → Datadog
Enter your Datadog API key and site (US/EU)
Jolokia (JMX) integration
Access JMX metrics for Kafka and other Java services:
Rsyslog integration
Send logs to external syslog servers:
Alerts and notifications
Set up alerts for service issues:
Email notifications
Manage project and service notifications:
Service contacts
Add contacts per service:
Grafana alerts
Set up alerts in Grafana for metrics:
Configure alert rule
- Set threshold (e.g., CPU > 80%)
- Set evaluation interval
- Configure the for clause (how long the condition must hold before the alert fires)
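The interaction between threshold, evaluation interval, and for clause is easier to see in a small simulation. This is a sketch of the general idea only, not Grafana's implementation: `AlertRule` and its state names are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float      # e.g. 80.0 for "CPU > 80%"
    for_seconds: float    # how long the breach must persist before firing
    breach_started: Optional[float] = None  # time the current breach began

    def evaluate(self, value: float, now: float) -> str:
        """Return "normal", "pending", or "firing" for one evaluation tick."""
        if value <= self.threshold:
            self.breach_started = None  # any recovery resets the timer
            return "normal"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    rule = AlertRule(threshold=80.0, for_seconds=600.0)
    # One reading per 5-minute evaluation interval (made-up values).
    for tick, cpu in enumerate([85.0, 90.0, 70.0, 85.0, 88.0, 91.0]):
        print(tick * 300, cpu, rule.evaluate(cpu, now=tick * 300.0))
```

The single reading below the threshold resets the timer, which is why a for clause suppresses alerts on short spikes.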
Common alert scenarios
High CPU usage
Alert: CPU usage > 80% for 10 minutes
Actions:
- Review slow queries or processes
- Check for unusual traffic patterns
- Consider upgrading service plan
Low disk space
Alert: Disk usage > 85%
Actions:
- Review disk usage by table/index
- Clean up unnecessary data
- Enable disk autoscaler
- Upgrade to larger plan
High connection count
Alert: Connections > 80% of max
Actions:
- Check for connection leaks in applications
- Implement connection pooling
- Review connection limits
- Upgrade service plan
Replication lag
Alert: Replication lag > 60 seconds
Actions:
- Check network connectivity
- Review write load on primary
- Check replica performance
- Contact support if the lag persists
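The four scenarios above can be expressed as simple checks against a snapshot of metrics. The sketch below is illustrative: the dictionary keys such as `cpu_percent` and `max_connections` are hypothetical names, and the 10-minute duration on the CPU alert is deliberately left out, so this covers only the instantaneous thresholds.

```python
# Thresholds taken from the scenarios above; metric key names are hypothetical.
SCENARIOS = {
    "high_cpu": lambda m: m["cpu_percent"] > 80,
    "low_disk_space": lambda m: m["disk_percent"] > 85,
    "high_connections": lambda m: m["connections"] > 0.8 * m["max_connections"],
    "replication_lag": lambda m: m["replication_lag_seconds"] > 60,
}


def triggered(metrics: dict) -> list:
    """Return the names of all scenarios whose threshold is exceeded."""
    return sorted(name for name, check in SCENARIOS.items() if check(metrics))


if __name__ == "__main__":
    snapshot = {  # made-up readings for illustration
        "cpu_percent": 91,
        "disk_percent": 60,
        "connections": 85,
        "max_connections": 100,
        "replication_lag_seconds": 5,
    }
    print(triggered(snapshot))
```

Keeping the thresholds in one table makes it straightforward to review them alongside the alert rules configured in Grafana.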
Monitoring best practices
Monitor key service metrics
Focus on:
- Resource utilization (CPU, memory, disk)
- Connection count and errors
- Query performance and slow queries
- Replication lag (if applicable)
Troubleshooting
Metrics not appearing in Grafana
Cause: Integration not properly configured or needs time to populate
Solution:
- Verify metrics integration is active
- Wait 1-2 minutes for initial data
- Check PostgreSQL has space and is running
- Verify network connectivity if using VPC
Cannot access Prometheus endpoint
Cause: Service in VPC without public Prometheus access
Solution:
- Enable public access: public_access.prometheus=true
- Or access from peered VPC
- Check IP allowlist includes your scraper’s IP
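Before changing any settings, it can be useful to confirm whether the scraper host can reach the endpoint at the network level at all. The following is a minimal sketch using a plain TCP connection attempt; `can_connect` is a hypothetical helper, and the host and port must come from your own service's connection information.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

Calling `can_connect(host, port)` from the scraper machine with the values shown on the Prometheus tab returns `False` when the connection is refused, times out, or the hostname does not resolve. A `False` result points to a network-level cause such as VPC routing or the IP allowlist, while `True` suggests looking at credentials or TLS settings instead.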
Logs not appearing in OpenSearch
Cause: Integration not enabled or OpenSearch full
Solution:
- Verify log integration is active
- Check OpenSearch disk space
- Review index lifecycle management settings
- Check for ingestion errors in OpenSearch
Not receiving alert emails
Cause: Email addresses not configured or notifications disabled
Solution:
- Verify email addresses in project notifications
- Check spam folder
- Verify notification types are enabled
- Test with manual service restart
API reference
Next steps
Service Integrations
Set up metrics and log integrations
Security
Review security audit logs
VPC & Networking
Configure network for Prometheus access
Users & Permissions
Grant audit log access permissions