Overview
The Halo CMMS supports observability through:
- OpenTelemetry for metric and trace collection
- Google Cloud Monitoring for metrics visualization and alerting
- Google Cloud Trace for distributed tracing
- Automatic Java instrumentation for JVM metrics and RPC monitoring
- Custom Halo metrics for CMMS-specific operations
Architecture
The monitoring setup consists of:
- OpenTelemetry Java Agent - Automatically instruments JVM applications
- OpenTelemetry Collector - Receives, processes, and exports telemetry data
- OpenTelemetry Operator - Manages collector deployment and instrumentation injection
- Google Cloud Services - Stores and visualizes metrics and traces
Prerequisites
Before you start:
- Deploy a Halo component (Kingdom, Duchy, or Reporting Server)
- See: Kingdom Deployment or Duchy Deployment
- Have `kubectl` configured for your cluster
- Have appropriate Google Cloud permissions
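If `kubectl` is not yet pointed at the cluster, fetch credentials first (the cluster name and region below are placeholders):

```shell
# Fetch cluster credentials and confirm the active context
gcloud container clusters get-credentials halo-cluster --region us-central1
kubectl config current-context
```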
Google Cloud Configuration
Enable required APIs
Enable Cloud Monitoring and Cloud Trace APIs from the APIs and Services page:
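If you prefer the CLI, the same APIs can be enabled with `gcloud` (assumes the active project is already set):

```shell
# Enable the Cloud Monitoring and Cloud Trace APIs for the active project
gcloud services enable monitoring.googleapis.com cloudtrace.googleapis.com
```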
OpenTelemetry Deployment
Install cert-manager
The OpenTelemetry Operator requires cert-manager for webhook certificates.
Install OpenTelemetry Operator
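A typical installation applies the operator's released manifest (URL pattern from the opentelemetry-operator project; pinning a specific version instead of `latest` is advisable in practice):

```shell
# Install the OpenTelemetry Operator into the cluster
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```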
Deploy the OpenTelemetry Operator into your cluster.
Create OpenTelemetry Configuration
The Halo dev environment provides reference configurations:
- `open_telemetry_gke.cue` - GKE-specific configuration
- `open_telemetry.cue` - Base configuration
Generate configuration from CUE (optional)
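A sketch of the Bazel invocation, assuming the dev manifests are exposed as targets under `//src/main/k8s/dev` (the target name here is a guess; check the package's BUILD file for the real one):

```shell
# Build the generated Kubernetes YAML from the CUE sources
bazel build //src/main/k8s/dev:open_telemetry_gke
ls bazel-bin/src/main/k8s/dev/
```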
If using the Halo CUE-based configuration, build the manifests with Bazel; the generated YAML will be in
`bazel-bin/src/main/k8s/dev/`.
Enable Instrumentation
The OpenTelemetry Operator can automatically inject the Java agent into your pods.
Create Instrumentation Resource
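A minimal `Instrumentation` resource might look like the following; the resource name and the collector endpoint are assumptions for your environment:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation   # name is an assumption
spec:
  exporter:
    # Service name/port of your OpenTelemetry Collector (assumption)
    endpoint: http://opentelemetry-collector:4317
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
```

Apply it with `kubectl apply -f` in the same namespace as the workloads to be instrumented.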
Annotate Deployments
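In a Deployment's pod template, the operator's standard injection annotation looks like this (set the value to the `Instrumentation` resource name, or `namespace/name`, if the resource lives elsewhere):

```yaml
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
```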
Add the instrumentation annotation to your pod specifications.
Restart Deployments
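For example, assuming the Halo components run in a dedicated namespace (the namespace name is an assumption):

```shell
# Restart every deployment in the namespace so pods are recreated with the agent
kubectl rollout restart deployment -n halo-cmms
# Or restart one deployment at a time:
kubectl rollout restart deployment/<deployment-name> -n halo-cmms
```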
Restart all deployments to pick up the Java agent instrumentation.
Available Metrics
Metrics are visible in Google Cloud Monitoring under the Workload domain.
Automatic Java Instrumentation Metrics
JVM Metrics
Class Loading:
- `jvm.class.count` - Current number of loaded classes
- `jvm.class.loaded` - Total number of classes loaded since JVM start
- `jvm.class.unloaded` - Total number of classes unloaded since JVM start
CPU:
- `jvm.cpu.count` - Number of available processors
- `jvm.cpu.recent_utilization` - Recent CPU utilization
- `jvm.cpu.time` - CPU time used by the JVM
Memory:
- `jvm.memory.committed` - Amount of memory committed for the JVM to use
- `jvm.memory.limit` - Maximum amount of memory available
- `jvm.memory.used` - Amount of memory currently used
- `jvm.memory.used_after_last_gc` - Memory used after the last garbage collection
Garbage Collection:
- `jvm.gc.duration` - Time spent in garbage collection
Threads:
- `jvm.thread.count` - Current number of threads
RPC Metrics
Client-side:
- `rpc.client.duration` - Duration of RPC client requests
Server-side:
- `rpc.server.duration` - Duration of RPC server requests
Both are recorded with attributes for:
- RPC method
- Status code
- Target service
Halo Custom Metrics
CMMS-Specific Metrics
Thread Pool:
- `halo_cmm.thread_pool.size` - Thread pool size
- `halo_cmm.thread_pool.active_count` - Number of active threads
Computation Stages:
- `halo_cmm.computation.stage.crypto.cpu.time` - CPU time spent in cryptographic operations
- `halo_cmm.computation.stage.crypto.time` - Wall-clock time for cryptographic operations
- `halo_cmm.computation.stage.time` - Total time for computation stages
Retention:
- `halo_cmm.retention.deleted_measurements` - Number of measurements deleted by retention policies
- `halo_cmm.retention.deleted_exchanges` - Number of exchanges deleted by retention policies
- `halo_cmm.retention.cancelled_measurements` - Number of measurements cancelled by retention policies
Health Checks and Diagnostics
Collector Health
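A few quick checks, assuming the collector was deployed by the operator with its default labels (the label selector, Service name, and namespace are assumptions):

```shell
# Collector pods should be Running and Ready
kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector
# Look for export errors in the collector logs
kubectl logs -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -i error
# If the health_check extension is enabled, probe its default port
kubectl port-forward svc/opentelemetry-collector 13133:13133 &
curl -s http://localhost:13133/
```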
Check OpenTelemetry Collector health.
Verify Metrics Export
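One way to confirm ingestion is to query the Monitoring API for recent time series. The `workload.googleapis.com/` prefix matches the Workload domain mentioned above; the project ID is a placeholder, and the `date -d` flag assumes GNU coreutils:

```shell
PROJECT_ID=my-project   # placeholder
START=$(date -u -d '-10 minutes' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# List workload-domain time series received in the last 10 minutes
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
  --data-urlencode 'filter=metric.type = starts_with("workload.googleapis.com/")' \
  --data-urlencode "interval.startTime=${START}" \
  --data-urlencode "interval.endTime=${END}"
```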
Check whether metrics are reaching Google Cloud Monitoring.
Verify Traces Export
Check whether traces are reaching Google Cloud Trace: visit Cloud Trace in the Google Cloud Console to see trace data.
Creating Dashboards
Navigate to Monitoring
Go to Google Cloud Monitoring in the Cloud Console.
Setting Up Alerts
Configure conditions
Example alert conditions:
- High memory usage
- High RPC error rate
- High pod restart rate
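As an illustration, a high-memory condition can be expressed in Monitoring Query Language (MQL); the metric is the standard GKE container metric, and the threshold here is arbitrary:

```
fetch k8s_container
| metric 'kubernetes.io/container/memory/used_bytes'
| group_by 1m, [value_used_bytes_mean: mean(value.used_bytes)]
| condition value_used_bytes_mean > 1 'GiB'
```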
Troubleshooting
Metrics not appearing in Cloud Monitoring
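A few commands that help with each check (the label selector, service account names, and project ID are assumptions):

```shell
# Collector status and recent errors
kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector
kubectl logs -l app.kubernetes.io/name=opentelemetry-collector --tail=50 | grep -i error
# Workload Identity: the Kubernetes service account should carry the binding annotation
kubectl get serviceaccount open-telemetry -o yaml | grep iam.gke.io/gcp-service-account
# Roles granted to the bound Google service account
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:open-telemetry@my-project.iam.gserviceaccount.com" \
  --format="value(bindings.role)"
```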
- Check collector status
- Verify service account permissions
- Check the Workload Identity binding: look for the `iam.gke.io/gcp-service-account` annotation
Instrumentation not injected
- Verify the Instrumentation resource
- Check pod annotations
- Check operator logs
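Concretely (the namespace below is the operator's default install namespace; the pod name is a placeholder):

```shell
# The Instrumentation resource should exist in the workload's namespace
kubectl get instrumentation
# The pod should carry the inject annotation
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
# Operator logs often explain why injection was skipped
kubectl logs -n opentelemetry-operator-system \
  deployment/opentelemetry-operator-controller-manager
```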
High cardinality warnings
If you see warnings about high cardinality metrics:
- Review metric labels and reduce unnecessary dimensions
- Use metric filtering in the collector configuration
- Aggregate metrics before export
- Consider sampling high-volume metrics
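For example, the collector's `filter` processor can drop noisy metrics before export; the exact configuration syntax varies by collector version, and the metric pattern here is illustrative:

```yaml
processors:
  filter/drop-noisy:
    metrics:
      metric:
        # Drop class-loading metrics, which rarely need per-pod granularity
        - 'IsMatch(name, "jvm.class.*")'
service:
  pipelines:
    metrics:
      processors: [filter/drop-noisy]
```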
Best Practices
Monitoring Strategy
- Monitor both infrastructure (Kubernetes) and application (Halo) metrics
- Set up alerts for critical issues (OOM, high error rates, pod crashes)
- Create dashboards for different audiences (operators, developers, business)
- Regularly review and update alert thresholds
- Use distributed tracing to debug complex multi-service issues