Maintaining service availability and monitoring resource metrics are vital for service operations. OroCloud monitoring processes ensure service continuity, efficient troubleshooting, and proactive resource management.

Monitoring tools

Oro internal monitoring

Oro uses industry-standard and in-house monitoring tools for all OroCloud environments. These tools power a comprehensive monitoring system that covers all vital aspects of the infrastructure and the application. The Oro support team uses an alert management system, a defined escalation procedure, and an incident response plan to manage detected incidents.
Oro does not provide access to its internal monitoring system, nor does it subscribe customers to internal alerts.

Google Cloud’s Operations Suite

Oro customers and partners can configure additional monitoring metrics using Google Cloud’s Operations Suite.
Google Cloud’s Operations Suite allows monitoring application availability using uptime checks. An uptime check:
  • Tries to open the application URL and measures response time
  • Can connect from multiple locations in North America, South America, Europe, and Asia-Pacific
  • Can check the main page or any other page, including authenticated pages (use a dedicated application user)
Results are available in the GCP web GUI. See the GCP uptime checks documentation for more information.
Keep the number of uptime checks reasonable to avoid adding unnecessary workload to the application.
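The idea behind an uptime check (open a URL, record the status and the response time) can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Operations Suite implementation; the function name `uptime_check`, the returned dictionary keys, and the rule that any 2xx or 3xx status counts as "ok" are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def uptime_check(url, timeout=10.0):
    """Open `url` once and report the HTTP status and elapsed seconds."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # include the body transfer in the timing
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, timeout, etc.
    elapsed = time.monotonic() - started
    return {
        "url": url,
        "status": status,
        # Assumption for this sketch: 2xx and 3xx count as "available".
        "ok": status is not None and 200 <= status < 400,
        "seconds": elapsed,
    }


# Example call (the URL is a placeholder):
# result = uptime_check("https://shop.example.com/", timeout=5)
```

Running a check like this from a scheduler every few minutes, rather than every few seconds, keeps the extra load on the application small.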
Google Cloud’s Operations Suite Metrics Explorer provides collection, visualization, and alerting on OS metrics such as CPU load, disk load, load balancer, and more.
The Oro support team monitors all key OS metrics and responds to alerts triggered by threshold violations.
See GCP Metrics Explorer documentation for more information.

NewRelic and Blackfire

Customers can enable NewRelic and Blackfire monitoring solutions for their OroCloud environment. You must obtain your own license for any such tool.
Other proprietary monitoring suites require additional examination before Oro commits to implementation and support.

Metrics monitored by OroCloud support

This section describes the metrics monitored for every OroCloud environment. Use this as a reference for creating your own monitoring system.
There is no customer access or visibility into the metrics described here. The exact set of metrics, alerting, and escalation rules depends on the environment type (e.g., staging vs. production) and evolves as the OroCloud team improves monitoring.

OS metrics

  • CPU usage and load average
  • Disk space utilization
  • Disk IO metrics
  • RAM utilization
  • SWAP usage
  • Network bandwidth utilization and statistics
  • Process count
  • Zombie process count
  • Logged users count
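If you are building your own monitoring, a few of the OS metrics above can be sampled with the Python standard library alone. The sketch below assumes a Unix-like host (`os.getloadavg` is not available on Windows); the function name `os_snapshot` and the dictionary keys are made up for this example and do not reflect Oro's internal tooling.

```python
import os
import shutil


def os_snapshot(path="/"):
    """Sample load average and disk utilization for the given mount point."""
    load1, load5, load15 = os.getloadavg()  # 1, 5, and 15 minute averages
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "cpu_count": os.cpu_count(),
        "load_average": (load1, load5, load15),
        "disk_used_percent": 100.0 * disk.used / disk.total,
    }


snapshot = os_snapshot()
print(snapshot)
```

Comparing the load average against `cpu_count` gives a rough sense of saturation: a sustained 1-minute load well above the number of CPUs usually means processes are waiting for CPU time.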

Component server metrics

Component | Monitored metrics
Nginx | Internal server statistics, connection count, requests rate, PHP-FPM process count
PostgreSQL | Connection count, index usage, internal memory allocation, requests rate, slow requests, replication, backup status, locks
Redis | Collection size, allocated memory, requests rate, cluster status
RabbitMQ | Queue count and sizes, memory consumption, connection count, cluster state
Elasticsearch | JVM metrics, cluster state, requests rate, backup status

Application metrics

  • Web check — The main page is opened every few minutes; the primary availability indicator.
  • SSL checks — Verifies SSL certificate validity and renewal date.
  • DNS check — Verifies DNS record correctness.
  • HTTP status statistics — Tracks the ratio of non-OK responses (4xx and 5xx).
  • Application error statistics — Detects abnormalities and faults in application errors.
  • RabbitMQ application queues — Verifies that all application-specific message queues are present and processing.
  • Oro consumers — Checks that consumers are processing messages from RabbitMQ.
  • Application orders, users, and SKU statistics

Incident response

Alert thresholds

Oro monitoring defines two alert levels:
A warning threshold violation indicates the application may experience issues if the metric does not recover. Warnings allow proactive prevention before an incident occurs (e.g., disk usage warning). Warnings do not initiate an incident response and are processed routinely during business hours.
A critical threshold violation indicates an application incident is imminent or already in progress. Once triggered, these alerts initiate the incident response process.
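The two-level scheme can be illustrated with a small classifier. This is a minimal sketch, assuming a metric where higher values are worse; the `Thresholds` class, the function names, and the 80/95 percent disk usage values are invented for the example and are not Oro's actual thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float
    critical: float


def alert_level(value, thresholds):
    """Map a metric value to "ok", "warning", or "critical"."""
    if value >= thresholds.critical:
        return "critical"
    if value >= thresholds.warning:
        return "warning"
    return "ok"


def starts_incident_response(level):
    """Only critical alerts start the incident response process."""
    return level == "critical"


# Hypothetical thresholds for disk space utilization, in percent.
DISK_USAGE = Thresholds(warning=80.0, critical=95.0)

print(alert_level(85.0, DISK_USAGE))  # warning
print(starts_incident_response(alert_level(97.0, DISK_USAGE)))  # True
```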

Incident management

The OroCloud team uses an Incident Response Plan that covers:
  • SWAT team members and roles — Contact details, office and emergency numbers for the incident resolution team.
  • Incident triggers — Conditions that trigger service recovery actions.
  • Notification flow — Who should be informed and when during incident response.
  • Escalation process — How and why an incident may be escalated; may involve additional resources.
  • Incident closing steps — Actions to take after the incident is resolved.
  • Post-mortem analysis — Root cause identification and preventive measures (product fixes, infrastructure changes, process improvements, training, etc.).
When an incident occurs, affected OroCloud customers receive an email notification. The support team may request cooperative actions from the customer’s IT team. Customers are also notified when service is restored.

Planned maintenance windows

Maintenance windows for production OroCloud environments are planned and scheduled in advance. If the OroCloud service team initiates maintenance that involves only infrastructure changes, alerts are muted during the window.
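The muting rule described above can be modeled as a simple time-window check. The sketch below is an assumption about how such a rule might be expressed, not a description of Oro's alerting system; `MaintenanceWindow`, `alerts_muted`, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime
    infrastructure_only: bool  # True when no application changes are involved


def alerts_muted(window, at):
    """Alerts are muted only inside an infrastructure-only window."""
    return window.infrastructure_only and window.start <= at < window.end


window = MaintenanceWindow(
    start=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
    infrastructure_only=True,
)
print(alerts_muted(window, datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)))  # True
```

Using timezone-aware datetimes avoids ambiguity when the team scheduling the window and the monitored environment are in different time zones.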
