## Overview

The microservice-infra stack includes a complete observability platform:

- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
- Alertmanager - Alert routing and management
- Loki - Log aggregation
- Tempo - Distributed tracing
- OpenTelemetry Collector - Telemetry data collection
## Access Points

All monitoring services are exposed via NodePort:

| Service | URL | Credentials | Notes |
|---|---|---|---|
| Grafana | http://localhost:30300 | admin/admin | Primary dashboards |
| Prometheus | http://localhost:30090 | None | Metrics and queries |
| Alertmanager | http://localhost:30093 | None | Alert management |
| Traefik | http://localhost:30081 | None | Ingress dashboard |
| Hubble UI | http://localhost:31235 | None | Cilium/Full mode only |
| ArgoCD HTTP | http://localhost:30080 | See ArgoCD docs | Full mode only |
| ArgoCD HTTPS | https://localhost:30443 | See ArgoCD docs | Full mode only |
## Grafana

### First Login

- Navigate to http://localhost:30300
- Log in with username `admin` and password `admin`
- You’ll be prompted to change the password (you can skip this)
### Available Dashboards

Dashboards are automatically provisioned from `dashboards/src/` using Grafonnet (Jsonnet for Grafana).

Source files:

- `dashboards/src/k8s-cluster.jsonnet` - Kubernetes cluster metrics
- `dashboards/src/sample-app.jsonnet` - Sample application metrics
- `dashboards/src/g.libsonnet` - Shared Grafonnet library
### Create Custom Dashboard

#### Using Grafonnet (Recommended)

- Create a new Jsonnet file
- Define the dashboard
- Build the dashboards
- Import via nixidy module or Grafana UI
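The Grafonnet workflow above might look like the following sketch. The file name, the `-J` search path, and the exact `g.libsonnet` API are assumptions for illustration, not taken from this repo:

```shell
# Create a minimal dashboard definition (Grafonnet function names are illustrative)
cat > dashboards/src/my-dashboard.jsonnet <<'EOF'
local g = import 'g.libsonnet';

g.dashboard.new('My Dashboard')
EOF

# Compile Jsonnet to Grafana's JSON dashboard model
jsonnet -J dashboards/src dashboards/src/my-dashboard.jsonnet > my-dashboard.json
```

The resulting JSON can then be imported through the nixidy module or the Grafana UI.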
#### Using Grafana UI
- Go to http://localhost:30300/dashboard/new
- Add panels with queries
- Save dashboard
- Export JSON: Dashboard settings → JSON Model
- (Optional) Convert to Grafonnet for version control
### Explore Metrics

Grafana Explore view: http://localhost:30300/explore

- Select “Prometheus” data source
- Enter a PromQL query
- Run the query to visualize
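As an illustrative PromQL example (metric names come from the standard cAdvisor metrics exposed in most Kubernetes clusters, not from this repo's docs), per-pod CPU usage could be queried with:

```promql
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
```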
## Prometheus

### Query Interface

Access the Prometheus UI: http://localhost:30090

Graph view:

- Enter a PromQL query
- Click “Execute”
- View as graph or table
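Queries can also be run against Prometheus's standard HTTP API (the `/api/v1/query` endpoint is part of upstream Prometheus; the NodePort is the one from the table above):

```shell
# Instant query: scrape health of all targets
curl -s 'http://localhost:30090/api/v1/query?query=up'
```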
### Targets and Service Discovery

View discovered targets: http://localhost:30090/targets

Shows:

- All Kubernetes service monitors
- Scrape status (UP/DOWN)
- Last scrape time and duration
- Error messages
### Alert Rules

View configured alerts: http://localhost:30090/alerts

Shows:

- Active alerts
- Alert state (pending, firing)
- Alert labels and annotations
## Alertmanager

Access: http://localhost:30093

### View Active Alerts

- Navigate to http://localhost:30093/#/alerts
- See all firing alerts
- Filter by label (namespace, severity, etc.)
### Silence Alerts

- Go to the “Silences” tab
- Click “New Silence”
- Add matchers (e.g., `alertname=KubePodCrashLooping`)
- Set a duration
- Add a comment
- Create the silence
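Silences can also be created from the command line with Alertmanager's `amtool` (a standard upstream tool, not specific to this stack), for example:

```shell
amtool silence add alertname=KubePodCrashLooping \
  --duration=2h \
  --comment="known flaky pod, investigating" \
  --alertmanager.url=http://localhost:30093
```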
### Alert Routing

Configure routing in the nixidy module.

## Loki (Logs)
### Query Logs in Grafana

- Go to http://localhost:30300/explore
- Select “Loki” data source
- Enter a LogQL query
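As an illustrative LogQL example (the label names depend on how Loki is configured here; `namespace` is a common default from the Kubernetes service discovery labels):

```logql
{namespace="default"} |= "error"
```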
### Log Aggregation

Logs are stored in Garage (S3-compatible storage):

- Retention: Configured in Loki settings
- Backend: `garage` namespace, S3 API
- Chunks: Compressed and deduplicated
## Tempo (Traces)

### Query Traces in Grafana

- Go to http://localhost:30300/explore
- Select “Tempo” data source
- Search by:
  - Trace ID
  - Service name
  - Tags
  - Duration
### Trace Ingestion

Traces are sent via the OpenTelemetry Collector:

Application → OTel Collector → Tempo → Garage (storage)

Collector endpoints:

- OTLP gRPC: `otel-collector.observability:4317`
- OTLP HTTP: `otel-collector.observability:4318`
- Jaeger: `otel-collector.observability:14250`
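An instrumented application can be pointed at the collector using the standard OpenTelemetry SDK environment variables (the endpoint matches the OTLP HTTP address above; the service name is only an example):

```shell
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.observability:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="sample-app"
```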
## OpenTelemetry Collector

### Custom Collector Build

The project uses a custom OTel Collector with specific receivers and exporters.

Build definition: `otel-collector/`

Loading it into the cluster:

- Checks for a cached image in R2
- Downloads or builds the collector
- Loads it into the Kind cluster
### Configuration

Collector config is defined in the nixidy module.

## Hubble UI (Cilium Mode)

Available in `bootstrap-full` and `full-bootstrap` modes.
### Access Hubble

Web UI: http://localhost:31235

CLI: use the `hubble` command-line tool.

### Network Policies

Visualize network policies in Hubble UI:

- Select a namespace
- View the service map
- See allowed/denied connections
- Identify policy violations
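The same flow data is available from the Hubble CLI; for example, to list recently dropped connections in a namespace (namespace name is illustrative):

```shell
hubble observe --namespace default --verdict DROPPED
```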
## Monitoring Workflows

### Investigate High CPU

- Check top pods
- Query Prometheus
- View in Grafana:
  - Open the “Kubernetes Cluster” dashboard
  - Filter by namespace
  - Drill down to the pod
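The first two steps might look like this (assumes `kubectl` access and a metrics server in the cluster; the PromQL query uses standard cAdvisor metric names):

```shell
# Show the most CPU-hungry pods cluster-wide
kubectl top pods --all-namespaces --sort-by=cpu | head

# Then in Prometheus (http://localhost:30090), an illustrative CPU query:
#   sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
```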
### Debug Application Errors

- Check logs in Loki
- View traces in Tempo:
  - Find the trace ID from the logs
  - Search in Tempo
  - Analyze span durations
- Check metrics
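For the metrics step, an error-rate query along these lines is typical (the `http_requests_total` metric name and `status` label are conventional examples, not confirmed names from this stack):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
```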
### Alert on Condition

- Test the query in Prometheus
- Create a PrometheusRule in nixidy
- Apply and verify
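A PrometheusRule follows the Prometheus Operator CRD shape; a sketch with illustrative names, namespace, and expression (adjust all of these to the actual stack) might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-app-rules        # example name
  namespace: observability      # assumed namespace
spec:
  groups:
    - name: sample-app
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Error rate above 1 req/s for 10 minutes"
```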
## Troubleshooting

### Grafana Won’t Load

Check the pod status.

### Prometheus Not Scraping

- Check targets: http://localhost:30090/targets
- Look for service monitors
- Verify the Prometheus config
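The checks above can be run with `kubectl` (the `observability` namespace and resource names are assumptions; adjust to where this stack actually deploys its components):

```shell
kubectl get pods -n observability              # Grafana/Prometheus pod status
kubectl get servicemonitors --all-namespaces   # discovered ServiceMonitors
kubectl logs -n observability deploy/grafana   # recent Grafana logs
```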