Monitoring tools
Oro internal monitoring
Oro uses industry-standard and in-house monitoring tools for all OroCloud environments. These tools power a comprehensive monitoring system that covers all vital aspects of the infrastructure and the application. The Oro support team uses an alert management system, a defined escalation procedure, and an incident response plan to manage detected incidents.
Oro does not provide access to its internal monitoring system, nor does it subscribe customers to internal alerts.
Google Cloud’s operations suite
Oro customers and partners can configure additional monitoring metrics using Google Cloud’s Operations Suite.
Uptime monitoring
Google Cloud’s Operations Suite allows monitoring application availability using uptime checks. An uptime check:
- Tries to open the application URL and measures response time
- Can connect from multiple locations in North America, South America, Europe, and Asia-Pacific
- Can check the main page or any other page, including authenticated pages (use a dedicated application user)
Keep the number of uptime checks reasonable to avoid adding unnecessary workload to the application.
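To illustrate what a single uptime check does (open the application URL and measure the response time), here is a minimal sketch that uses only the Python standard library. It is not how Google Cloud's Operations Suite implements its checks; the names `check_uptime` and `UptimeResult`, the 10-second timeout, and the rule that 2xx and 3xx count as available are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UptimeResult:
    url: str
    ok: bool
    status: Optional[int]      # None when no HTTP response was received
    response_seconds: float


def check_uptime(url: str, timeout: float = 10.0,
                 opener: Callable = urllib.request.urlopen) -> UptimeResult:
    """Open the URL once and record the HTTP status and elapsed time."""
    started = time.monotonic()
    status = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # No HTTP response at all (DNS failure, refused connection, timeout).
        status = None
    elapsed = time.monotonic() - started
    # Assumption for this sketch: 2xx and 3xx mean "available".
    ok = status is not None and 200 <= status < 400
    return UptimeResult(url, ok, status, elapsed)
```

Calling `check_uptime("https://your-application.example/")` (a placeholder URL) returns one result per request. Run it at a modest interval so the check itself does not add meaningful load to the application.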
OS metrics monitoring
Google Cloud’s Operations Suite Metrics Explorer provides collection, visualization, and alerting on OS metrics such as CPU load, disk load, and load balancer metrics.
The Oro support team monitors all key OS metrics and responds to alerts triggered by threshold violations.
See GCP Metrics Explorer documentation for more information.
NewRelic and Blackfire
Customers can enable NewRelic and Blackfire monitoring solutions for their OroCloud environment. You must obtain your own license for any such tool.
Other proprietary monitoring suites require additional examination before Oro commits to implementation and support.
Metrics monitored by OroCloud support
This section describes the metrics monitored for every OroCloud environment. Use this as a reference for creating your own monitoring system.
OS metrics
- CPU usage and load average
- Disk space utilization
- Disk IO metrics
- RAM utilization
- SWAP usage
- Network bandwidth utilization and statistics
- Process count
- Zombie process count
- Logged users count
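If you build your own monitoring, a few of the OS metrics listed above can be sampled with the Python standard library alone. The sketch below assumes a Unix-like host (`os.getloadavg` is not available on Windows). The function name `os_metrics_snapshot` and the dictionary keys are hypothetical, and the sketch covers only load average and disk space utilization, not the full list.

```python
import os
import shutil


def os_metrics_snapshot(path: str = "/") -> dict:
    """Sample load average and disk utilization for the filesystem at `path`."""
    load_1m, load_5m, load_15m = os.getloadavg()
    disk = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    used_percent = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    return {
        "cpu_count": os.cpu_count(),
        "load_average_1m": load_1m,
        "load_average_5m": load_5m,
        "load_average_15m": load_15m,
        "disk_used_percent": used_percent,
    }


if __name__ == "__main__":
    for name, value in os_metrics_snapshot().items():
        print(f"{name}: {value}")
```

Comparing the 1-minute load average against `cpu_count` gives a rough sense of CPU saturation. The other metrics in the list (RAM, swap, network, process counts) usually need a platform-specific source or a dedicated agent.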
Component server metrics
| Component | Monitored metrics |
|---|---|
| Nginx | Internal server statistics, connection count, requests rate, PHP-FPM process count |
| PostgreSQL | Connection count, index usage, internal memory allocation, requests rate, slow requests, replication, backup status, locks |
| Redis | Collection size, allocated memory, requests rate, cluster status |
| RabbitMQ | Queue count and sizes, memory consumption, connection count, cluster state |
| Elasticsearch | JVM metrics, cluster state, requests rate, backup status |
Application metrics
- Web check — The main page is opened every few minutes; the primary availability indicator.
- SSL checks — Verifies SSL certificate validity and renewal date.
- DNS check — Verifies DNS record correctness.
- HTTP status statistics — Tracks the ratio of non-OK responses (4xx and 5xx).
- Application error statistics — Detects abnormalities and faults in application errors.
- RabbitMQ application queues — Verifies that all application-specific message queues are present and processing.
- Oro consumers — Checks that consumers are processing messages from RabbitMQ.
- Application orders, users, and SKU statistics
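As an illustration of the HTTP status statistic above, the ratio of non-OK responses can be computed from any sequence of status codes, for example ones parsed from an access log. This is a sketch under that assumption; the function name `non_ok_ratio` and the sample data are made up for the example and do not reflect how OroCloud support computes the metric.

```python
from collections import Counter
from typing import Iterable


def non_ok_ratio(status_codes: Iterable[int]) -> float:
    """Return the share of responses with a 4xx or 5xx status code."""
    # Group codes by their class: 2 for 2xx, 4 for 4xx, and so on.
    counts = Counter(code // 100 for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic observed, so nothing to report
    return (counts[4] + counts[5]) / total


sample = [200, 200, 301, 404, 500, 200, 200, 200]
print(f"{non_ok_ratio(sample):.1%}")  # prints 25.0% (2 of 8 responses)
```

Tracking this ratio over a sliding time window, rather than over all traffic, makes sudden spikes in errors easier to see.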
Incident response
Alert thresholds
Oro monitoring defines two alert levels:
Warning
A warning threshold violation indicates the application may experience issues if the metric does not recover. Warnings allow proactive prevention before an incident occurs (e.g., disk usage warning).
Warnings do not initiate an incident response and are processed routinely during business hours.
Critical
A critical threshold violation indicates an application incident is imminent or already in progress. Once triggered, these alerts initiate the incident response process.
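The two alert levels can be modeled as a simple threshold comparison. The sketch below is only an illustration for customers building their own alerting: the names `AlertLevel`, `classify`, and `starts_incident_response` are hypothetical, the thresholds of 80 and 95 percent disk usage in the usage lines are invented values, and the code assumes a metric where higher values are worse.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value: float, warning: float, critical: float) -> AlertLevel:
    """Map a metric value to an alert level; higher values are worse."""
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= critical:
        return AlertLevel.CRITICAL
    if value >= warning:
        return AlertLevel.WARNING
    return AlertLevel.OK


def starts_incident_response(level: AlertLevel) -> bool:
    """Only critical alerts start the incident response process."""
    return level is AlertLevel.CRITICAL


# Hypothetical disk usage thresholds: warn at 80 percent, critical at 95 percent.
for usage in (42.0, 85.0, 97.5):
    level = classify(usage, warning=80.0, critical=95.0)
    print(usage, level.value, starts_incident_response(level))
```

For metrics where lower values are worse (for example free disk space), the comparisons would need to be inverted.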
Incident management
The OroCloud team uses an Incident Response Plan that covers:
- SWAT team members and roles — Contact details, office and emergency numbers for the incident resolution team.
- Incident triggers — Conditions that trigger service recovery actions.
- Notification flow — Who should be informed and when during incident response.
- Escalation process — How and why an incident may be escalated; may involve additional resources.
- Incident closing steps — Actions to take after the incident is resolved.
- Post-mortem analysis — Root cause identification and preventive measures (product fixes, infrastructure changes, process improvements, training, etc.).