## Overview

The microservice-infra stack includes a complete observability platform:

- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
- Alertmanager - Alert routing and management
- Loki - Log aggregation
- Tempo - Distributed tracing
- OpenTelemetry Collector - Telemetry data collection
## Access Points

All monitoring services are exposed via NodePort:

| Service | URL | Credentials | Notes |
|---|---|---|---|
| Grafana | http://localhost:30300 | admin/admin | Primary dashboards |
| Prometheus | http://localhost:30090 | None | Metrics and queries |
| Alertmanager | http://localhost:30093 | None | Alert management |
| Traefik | http://localhost:30081 | None | Ingress dashboard |
| Hubble UI | http://localhost:31235 | None | Cilium/Full mode only |
| ArgoCD HTTP | http://localhost:30080 | See ArgoCD docs | Full mode only |
| ArgoCD HTTPS | https://localhost:30443 | See ArgoCD docs | Full mode only |
## Grafana

### First Login

- Navigate to http://localhost:30300
- Log in with username `admin` and password `admin`
- You’ll be prompted to change the password (you can skip this)
### Available Dashboards

Dashboards are automatically provisioned from `dashboards/src/` using Grafonnet (Jsonnet for Grafana).

Source files:

- `dashboards/src/k8s-cluster.jsonnet` - Kubernetes cluster metrics
- `dashboards/src/sample-app.jsonnet` - Sample application metrics
- `dashboards/src/g.libsonnet` - Shared Grafonnet library
### Create Custom Dashboard

#### Using Grafonnet (Recommended)

- Create a new Jsonnet file
- Define the dashboard
- Build the dashboards
- Import via nixidy module or Grafana UI
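The Grafonnet workflow above might look like the following sketch. The file name, the `-J` search path, and the exact `g.libsonnet` API are assumptions for illustration, not taken from this repo:

```shell
# Create a minimal dashboard definition (Grafonnet function names are illustrative)
cat > dashboards/src/my-dashboard.jsonnet <<'EOF'
local g = import 'g.libsonnet';

g.dashboard.new('My Dashboard')
EOF

# Compile Jsonnet to Grafana's JSON dashboard model
jsonnet -J dashboards/src dashboards/src/my-dashboard.jsonnet > my-dashboard.json
```

The resulting JSON can then be imported through the nixidy module or the Grafana UI.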
#### Using Grafana UI
- Go to http://localhost:30300/dashboard/new
- Add panels with queries
- Save dashboard
- Export JSON: Dashboard settings → JSON Model
- (Optional) Convert to Grafonnet for version control
### Explore Metrics

Grafana Explore view: http://localhost:30300/explore

- Select “Prometheus” data source
- Enter a PromQL query
- Run the query to visualize
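As an illustrative PromQL example (metric names come from the standard cAdvisor metrics exposed in most Kubernetes clusters, not from this repo's docs), per-pod CPU usage could be queried with:

```promql
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
```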
## Prometheus

### Query Interface

Access the Prometheus UI: http://localhost:30090

Graph view:

- Enter a PromQL query
- Click “Execute”
- View as graph or table
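Queries can also be run against Prometheus's standard HTTP API (the `/api/v1/query` endpoint is part of upstream Prometheus; the NodePort is the one from the table above):

```shell
# Instant query: scrape health of all targets
curl -s 'http://localhost:30090/api/v1/query?query=up'
```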
### Targets and Service Discovery

View discovered targets: http://localhost:30090/targets

Shows:

- All Kubernetes service monitors
- Scrape status (UP/DOWN)
- Last scrape time and duration
- Error messages
### Alert Rules

View configured alerts: http://localhost:30090/alerts

Shows:

- Active alerts
- Alert state (pending, firing)
- Alert labels and annotations
## Alertmanager

Access: http://localhost:30093

### View Active Alerts

- Navigate to http://localhost:30093/#/alerts
- See all firing alerts
- Filter by label (namespace, severity, etc.)
### Silence Alerts

- Go to the “Silences” tab
- Click “New Silence”
- Add matchers (e.g., `alertname=KubePodCrashLooping`)
- Set a duration
- Add a comment
- Create the silence
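Silences can also be created from the command line with Alertmanager's `amtool` (a standard upstream tool, not specific to this stack), for example:

```shell
amtool silence add alertname=KubePodCrashLooping \
  --duration=2h \
  --comment="known flaky pod, investigating" \
  --alertmanager.url=http://localhost:30093
```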
### Alert Routing

Configure routing in the nixidy module.

## Loki (Logs)
### Query Logs in Grafana

- Go to http://localhost:30300/explore
- Select “Loki” data source
- Enter a LogQL query
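As an illustrative LogQL example (the label names depend on how Loki is configured here; `namespace` is a common default from the Kubernetes service discovery labels):

```logql
{namespace="default"} |= "error"
```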
### Log Aggregation

Logs are stored in Garage (S3-compatible storage):

- Retention: Configured in Loki settings
- Backend: `garage` namespace, S3 API
- Chunks: Compressed and deduplicated
## Tempo (Traces)

### Query Traces in Grafana

- Go to http://localhost:30300/explore
- Select “Tempo” data source
- Search by:
  - Trace ID
  - Service name
  - Tags
  - Duration
### Trace Ingestion

Traces are sent via the OpenTelemetry Collector:

Application → OTel Collector → Tempo → Garage (storage)

Collector endpoints:

- OTLP gRPC: `otel-collector.observability:4317`
- OTLP HTTP: `otel-collector.observability:4318`
- Jaeger: `otel-collector.observability:14250`
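An instrumented application can be pointed at the collector using the standard OpenTelemetry SDK environment variables (the endpoint matches the OTLP HTTP address above; the service name is only an example):

```shell
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.observability:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="sample-app"
```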
## OpenTelemetry Collector

### Custom Collector Build

The project uses a custom OTel Collector with specific receivers and exporters.

Build definition: `otel-collector/`

Loading it into the cluster:

- Checks for a cached image in R2
- Downloads or builds the collector
- Loads it into the Kind cluster
### Configuration

Collector config is defined in the nixidy module.

## Hubble UI (Cilium Mode)

Available in `bootstrap-full` and `full-bootstrap` modes.
### Access Hubble

Web UI: http://localhost:31235

CLI: use the `hubble` command-line tool.

### Network Policies

Visualize network policies in Hubble UI:

- Select a namespace
- View the service map
- See allowed/denied connections
- Identify policy violations
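The same flow data is available from the Hubble CLI; for example, to list recently dropped connections in a namespace (namespace name is illustrative):

```shell
hubble observe --namespace default --verdict DROPPED
```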
## Monitoring Workflows

### Investigate High CPU

- Check top pods
- Query Prometheus
- View in Grafana:
  - Open the “Kubernetes Cluster” dashboard
  - Filter by namespace
  - Drill down to the pod
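The first two steps might look like this (assumes `kubectl` access and a metrics server in the cluster; the PromQL query uses standard cAdvisor metric names):

```shell
# Show the most CPU-hungry pods cluster-wide
kubectl top pods --all-namespaces --sort-by=cpu | head

# Then in Prometheus (http://localhost:30090), an illustrative CPU query:
#   sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
```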
### Debug Application Errors

- Check logs in Loki
- View traces in Tempo:
  - Find the trace ID from the logs
  - Search in Tempo
  - Analyze span durations
- Check metrics
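For the metrics step, an error-rate query along these lines is typical (the `http_requests_total` metric name and `status` label are conventional examples, not confirmed names from this stack):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
```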
### Alert on Condition

- Test the query in Prometheus
- Create a PrometheusRule in nixidy
- Apply and verify
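A PrometheusRule follows the Prometheus Operator CRD shape; a sketch with illustrative names, namespace, and expression (adjust all of these to the actual stack) might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-app-rules        # example name
  namespace: observability      # assumed namespace
spec:
  groups:
    - name: sample-app
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Error rate above 1 req/s for 10 minutes"
```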
## Troubleshooting

### Grafana Won’t Load

Check the pod status.

### Prometheus Not Scraping

- Check targets: http://localhost:30090/targets
- Look for service monitors
- Verify the Prometheus config
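The checks above can be run with `kubectl` (the `observability` namespace and resource names are assumptions; adjust to where this stack actually deploys its components):

```shell
kubectl get pods -n observability              # Grafana/Prometheus pod status
kubectl get servicemonitors --all-namespaces   # discovered ServiceMonitors
kubectl logs -n observability deploy/grafana   # recent Grafana logs
```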