Overview
The Halo CMMS supports observability through:
- OpenTelemetry for metric and trace collection
- Google Cloud Monitoring for metrics visualization and alerting
- Google Cloud Trace for distributed tracing
- Automatic Java instrumentation for JVM metrics and RPC monitoring
- Custom Halo metrics for CMMS-specific operations
Architecture
The monitoring setup consists of:
- OpenTelemetry Java Agent - Automatically instruments JVM applications
- OpenTelemetry Collector - Receives, processes, and exports telemetry data
- OpenTelemetry Operator - Manages collector deployment and instrumentation injection
- Google Cloud Services - Stores and visualizes metrics and traces
Prerequisites
Before you start:
- Deploy a Halo component (Kingdom, Duchy, or Reporting Server)
- See: Kingdom Deployment or Duchy Deployment
- Have `kubectl` configured for your cluster
- Have appropriate Google Cloud permissions
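If `kubectl` is not yet pointed at the cluster, fetch credentials first (the cluster name and region below are placeholders):

```shell
# Fetch cluster credentials and confirm the active context
gcloud container clusters get-credentials halo-cluster --region us-central1
kubectl config current-context
```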
Google Cloud Configuration
Enable required APIs
Enable Cloud Monitoring and Cloud Trace APIs from the APIs and Services page:
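If you prefer the CLI, the same APIs can be enabled with `gcloud` (assumes the active project is already set):

```shell
# Enable the Cloud Monitoring and Cloud Trace APIs for the active project
gcloud services enable monitoring.googleapis.com cloudtrace.googleapis.com
```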
OpenTelemetry Deployment
Install cert-manager
The OpenTelemetry Operator requires cert-manager for webhook certificates.
Install OpenTelemetry Operator
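A typical installation applies the operator's released manifest (URL pattern from the opentelemetry-operator project; pinning a specific version instead of `latest` is advisable in practice):

```shell
# Install the OpenTelemetry Operator into the cluster
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```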
Deploy the OpenTelemetry Operator into your cluster.
Create OpenTelemetry Configuration
The Halo dev environment provides reference configurations:
- `open_telemetry_gke.cue` - GKE-specific configuration
- `open_telemetry.cue` - Base configuration
Generate configuration from CUE (optional)
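A sketch of the Bazel invocation, assuming the dev manifests are exposed as targets under `//src/main/k8s/dev` (the target name here is a guess; check the package's BUILD file for the real one):

```shell
# Build the generated Kubernetes YAML from the CUE sources
bazel build //src/main/k8s/dev:open_telemetry_gke
ls bazel-bin/src/main/k8s/dev/
```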
If using the Halo CUE-based configuration, build the manifests with Bazel; the generated YAML will be in
`bazel-bin/src/main/k8s/dev/`.
Enable Instrumentation
The OpenTelemetry Operator can automatically inject the Java agent into your pods.
Create Instrumentation Resource
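A minimal `Instrumentation` resource might look like the following; the resource name and the collector endpoint are assumptions for your environment:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation   # name is an assumption
spec:
  exporter:
    # Service name/port of your OpenTelemetry Collector (assumption)
    endpoint: http://opentelemetry-collector:4317
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
```

Apply it with `kubectl apply -f` in the same namespace as the workloads to be instrumented.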
Annotate Deployments
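In a Deployment's pod template, the operator's standard injection annotation looks like this (set the value to the `Instrumentation` resource name, or `namespace/name`, if the resource lives elsewhere):

```yaml
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
```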
Add the instrumentation annotation to your pod specifications.
Restart Deployments
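For example, assuming the Halo components run in a dedicated namespace (the namespace name is an assumption):

```shell
# Restart every deployment in the namespace so pods are recreated with the agent
kubectl rollout restart deployment -n halo-cmms
# Or restart one deployment at a time:
kubectl rollout restart deployment/<deployment-name> -n halo-cmms
```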
Restart all deployments to pick up the Java agent instrumentation.
Available Metrics
Metrics are visible in Google Cloud Monitoring under the Workload domain.
Automatic Java Instrumentation Metrics
JVM Metrics
Class Loading:
- `jvm.class.count` - Current number of loaded classes
- `jvm.class.loaded` - Total number of classes loaded since JVM start
- `jvm.class.unloaded` - Total number of classes unloaded since JVM start
CPU:
- `jvm.cpu.count` - Number of available processors
- `jvm.cpu.recent_utilization` - Recent CPU utilization
- `jvm.cpu.time` - CPU time used by the JVM
Memory:
- `jvm.memory.committed` - Amount of memory committed for the JVM to use
- `jvm.memory.limit` - Maximum amount of memory available
- `jvm.memory.used` - Amount of memory currently used
- `jvm.memory.used_after_last_gc` - Memory used after the last garbage collection
Garbage Collection:
- `jvm.gc.duration` - Time spent in garbage collection
Threads:
- `jvm.thread.count` - Current number of threads
RPC Metrics
Client-side:
- `rpc.client.duration` - Duration of RPC client requests
Server-side:
- `rpc.server.duration` - Duration of RPC server requests
Both are recorded with attributes for:
- RPC method
- Status code
- Target service
Halo Custom Metrics
CMMS-Specific Metrics
Thread Pool:
- `halo_cmm.thread_pool.size` - Thread pool size
- `halo_cmm.thread_pool.active_count` - Number of active threads
Computation Stages:
- `halo_cmm.computation.stage.crypto.cpu.time` - CPU time spent in cryptographic operations
- `halo_cmm.computation.stage.crypto.time` - Wall-clock time for cryptographic operations
- `halo_cmm.computation.stage.time` - Total time for computation stages
Retention:
- `halo_cmm.retention.deleted_measurements` - Number of measurements deleted by retention policies
- `halo_cmm.retention.deleted_exchanges` - Number of exchanges deleted by retention policies
- `halo_cmm.retention.cancelled_measurements` - Number of measurements cancelled by retention policies
Health Checks and Diagnostics
Collector Health
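A few quick checks, assuming the collector was deployed by the operator with its default labels (the label selector, Service name, and namespace are assumptions):

```shell
# Collector pods should be Running and Ready
kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector
# Look for export errors in the collector logs
kubectl logs -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -i error
# If the health_check extension is enabled, probe its default port
kubectl port-forward svc/opentelemetry-collector 13133:13133 &
curl -s http://localhost:13133/
```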
Check OpenTelemetry Collector health.
Verify Metrics Export
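One way to confirm ingestion is to query the Monitoring API for recent time series. The `workload.googleapis.com/` prefix matches the Workload domain mentioned above; the project ID is a placeholder, and the `date -d` flag assumes GNU coreutils:

```shell
PROJECT_ID=my-project   # placeholder
START=$(date -u -d '-10 minutes' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# List workload-domain time series received in the last 10 minutes
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
  --data-urlencode 'filter=metric.type = starts_with("workload.googleapis.com/")' \
  --data-urlencode "interval.startTime=${START}" \
  --data-urlencode "interval.endTime=${END}"
```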
Check whether metrics are reaching Google Cloud Monitoring.
Verify Traces Export
Check whether traces are reaching Google Cloud Trace: visit Cloud Trace in the Google Cloud Console to see trace data.
Creating Dashboards
Navigate to Monitoring
Go to Google Cloud Monitoring in the Cloud Console.
Setting Up Alerts
Configure conditions
Example alert conditions:
- High memory usage
- High RPC error rate
- High pod restart rate
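As an illustration, a high-memory condition can be expressed in Monitoring Query Language (MQL); the metric is the standard GKE container metric, and the threshold here is arbitrary:

```
fetch k8s_container
| metric 'kubernetes.io/container/memory/used_bytes'
| group_by 1m, [value_used_bytes_mean: mean(value.used_bytes)]
| condition value_used_bytes_mean > 1 'GiB'
```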
Troubleshooting
Metrics not appearing in Cloud Monitoring
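A few commands that help with each check (the label selector, service account names, and project ID are assumptions):

```shell
# Collector status and recent errors
kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector
kubectl logs -l app.kubernetes.io/name=opentelemetry-collector --tail=50 | grep -i error
# Workload Identity: the Kubernetes service account should carry the binding annotation
kubectl get serviceaccount open-telemetry -o yaml | grep iam.gke.io/gcp-service-account
# Roles granted to the bound Google service account
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:open-telemetry@my-project.iam.gserviceaccount.com" \
  --format="value(bindings.role)"
```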
- Check collector status
- Verify service account permissions
- Check the Workload Identity binding: look for the `iam.gke.io/gcp-service-account` annotation
Instrumentation not injected
- Verify the Instrumentation resource
- Check pod annotations
- Check operator logs
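Concretely (the namespace below is the operator's default install namespace; the pod name is a placeholder):

```shell
# The Instrumentation resource should exist in the workload's namespace
kubectl get instrumentation
# The pod should carry the inject annotation
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
# Operator logs often explain why injection was skipped
kubectl logs -n opentelemetry-operator-system \
  deployment/opentelemetry-operator-controller-manager
```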
High cardinality warnings
If you see warnings about high cardinality metrics:
- Review metric labels and reduce unnecessary dimensions
- Use metric filtering in the collector configuration
- Aggregate metrics before export
- Consider sampling high-volume metrics
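For example, the collector's `filter` processor can drop noisy metrics before export; the exact configuration syntax varies by collector version, and the metric pattern here is illustrative:

```yaml
processors:
  filter/drop-noisy:
    metrics:
      metric:
        # Drop class-loading metrics, which rarely need per-pod granularity
        - 'IsMatch(name, "jvm.class.*")'
service:
  pipelines:
    metrics:
      processors: [filter/drop-noisy]
```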
Best Practices
Monitoring Strategy
- Monitor both infrastructure (Kubernetes) and application (Halo) metrics
- Set up alerts for critical issues (OOM, high error rates, pod crashes)
- Create dashboards for different audiences (operators, developers, business)
- Regularly review and update alert thresholds
- Use distributed tracing to debug complex multi-service issues