Monitoring and Observability
This guide covers monitoring, logging, and debugging strategies for Mimir AIP deployments, including metrics collection, log analysis, and troubleshooting common issues.

Architecture Overview
Mimir AIP consists of three main components:

- Orchestrator - API server, metadata store, worker scheduler
- Workers - Kubernetes Jobs that execute pipelines and ML tasks
- Storage Plugins - Dynamically loaded storage backends
Logging
Log Locations
Orchestrator Logs: the orchestrator runs as a pod, so its logs are available on stdout via `kubectl logs`; worker logs are attached to the Kubernetes Jobs the orchestrator spawns.

Log Levels
Configure log verbosity in Helm values:

- `debug` - Detailed execution traces, plugin loading, queue operations
- `info` - Normal operations, task start/complete, configuration changes
- `warn` - Recoverable errors, retry attempts, deprecated API usage
- `error` - Critical failures, unhandled exceptions, fatal errors
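A minimal sketch of changing verbosity and retrieving logs. The value key `orchestrator.logLevel`, release name, and pod label selector are assumptions — check the chart's values.yaml for the exact names:

```shell
# Raise orchestrator verbosity (value key is an assumption)
helm upgrade mimir-aip ./charts/mimir-aip --set orchestrator.logLevel=debug

# Tail orchestrator logs (label selector is an assumption)
kubectl logs -l app=mimir-orchestrator -f --tail=200

# Worker logs live on the Jobs the orchestrator spawns
kubectl logs job/<worker-job-name>
```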
Structured Logging
Logs follow a structured format for easy parsing; each entry carries correlation fields such as `task_id` and `pipeline_id` alongside the level and message.

Key Log Patterns
Worker Lifecycle: task enqueue, worker spawn, task start, completion or failure, and retry attempts each produce a structured entry, so a task's full history can be reconstructed by filtering on its `task_id`.

Metrics Collection
Built-in Metrics
Mimir AIP exposes operational metrics through its API and logs.

Queue Metrics: queue depth, enqueue/dequeue rates, and wait times (the `mimir_queue_*` series listed under Key Metrics to Track below).

Kubernetes Resource Metrics
Pod Resource Usage: inspect CPU and memory consumption of the orchestrator and worker pods with `kubectl top pods` (requires metrics-server).

Prometheus Integration
For production deployments, integrate with Prometheus for comprehensive metrics:

Prometheus ServiceMonitor
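If the orchestrator exposes a `/metrics` endpoint, a Prometheus Operator ServiceMonitor can scrape it. A sketch, where the `app: mimir-orchestrator` service label, port name `http`, and metrics path are assumptions to match against your deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mimir-aip
  labels:
    release: prometheus   # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: mimir-orchestrator
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```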
Key Metrics to Track
Queue Metrics:

- `mimir_queue_length` - Current task queue depth
- `mimir_queue_enqueue_rate` - Tasks added per second
- `mimir_queue_dequeue_rate` - Tasks processed per second
- `mimir_queue_wait_time_seconds` - Time tasks spend in queue

Worker Metrics:

- `mimir_workers_active` - Currently running workers
- `mimir_workers_spawned_total` - Total workers spawned
- `mimir_workers_failed_total` - Failed worker count
- `mimir_worker_duration_seconds` - Worker execution time histogram

Task Metrics:

- `mimir_tasks_total{status="completed"}` - Completed tasks counter
- `mimir_tasks_total{status="failed"}` - Failed tasks counter
- `mimir_task_retries_total` - Task retry counter
- `mimir_task_duration_seconds` - Task execution time histogram

Storage Metrics:

- `mimir_storage_operations_total{operation="store"}` - Store operations counter
- `mimir_storage_operations_total{operation="retrieve"}` - Retrieve operations counter
- `mimir_storage_latency_seconds` - Storage operation latency histogram
- `mimir_storage_errors_total` - Storage error counter
Sample Prometheus Queries
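A few starting-point PromQL queries built from the metric names above; adjust label names to match your deployment:

```promql
# Task failure rate over the last 5 minutes
rate(mimir_tasks_total{status="failed"}[5m])

# 95th percentile task duration
histogram_quantile(0.95, rate(mimir_task_duration_seconds_bucket[5m]))

# Average time tasks spend in the queue
rate(mimir_queue_wait_time_seconds_sum[5m])
  / rate(mimir_queue_wait_time_seconds_count[5m])

# Worker failure ratio
rate(mimir_workers_failed_total[5m]) / rate(mimir_workers_spawned_total[5m])
```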
Grafana Dashboards
Create Grafana dashboards to visualize metrics:

Mimir AIP Overview Dashboard:

- Queue depth over time (line graph)
- Active workers (gauge)
- Task completion rate (graph)
- Task failure rate (graph)
- Worker resource usage (heatmap)

Pipeline performance panels:

- Pipeline execution time by pipeline ID
- Step execution time breakdown
- Plugin performance comparison
- Error rate by pipeline

Storage panels:

- Storage operation latency by operation type
- Storage throughput (ops/sec)
- Storage error rate
- Connection pool utilization
Health Checks
Orchestrator Health
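A sketch of a manual orchestrator health check. The service name, port, and `/health` path are assumptions — use whatever endpoint the chart wires into the liveness/readiness probes:

```shell
# Confirm the orchestrator pod itself is healthy
kubectl get pods -l app=mimir-orchestrator

# Probe the API through a port-forward (endpoint path is an assumption)
kubectl port-forward svc/mimir-orchestrator 8080:8080 &
curl -fsS http://localhost:8080/health
```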
Storage Health
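Storage backends are loaded as plugins, so the quickest signals are the plugin initialization logs plus a connectivity probe from inside the cluster. The label selector, grep pattern, and example endpoint below are illustrative:

```shell
# Look for storage plugin initialization errors in orchestrator logs
kubectl logs -l app=mimir-orchestrator | grep -i storage

# Probe the backend from inside the cluster (replace with your backend's endpoint)
kubectl run net-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -fsS --max-time 5 https://<storage-endpoint>
```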
Worker Health
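Workers run as Kubernetes Jobs, so their health is visible through Job and pod status:

```shell
# List worker Jobs and their completion status
kubectl get jobs

# Inspect a failed worker's exit reason (e.g. OOMKilled) and its last output
kubectl describe pod <worker-pod-name>
kubectl logs <worker-pod-name> --previous
```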
Debugging
Common Issues
Issue: Tasks Stuck in Queue

Symptoms:

- Queue length increasing
- No workers spawning
- `kubectl get pods` shows no worker pods

Resolution:

- Verify `rbac.create: true` in Helm values
- Check that the orchestrator has permission to create Jobs
- Ensure `workerNamespace` matches the deployment namespace
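The RBAC points above can be verified directly with `kubectl auth can-i`; the service account and namespace names below are assumptions:

```shell
# Can the orchestrator's service account create Jobs in its namespace?
kubectl auth can-i create jobs.batch \
  --as=system:serviceaccount:default:mimir-orchestrator -n default
```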
Issue: Workers Failing Immediately

Symptoms:

- Workers spawn but fail quickly
- Status shows `Error` or `CrashLoopBackOff`

Common causes:

- Plugin compilation failure (missing dependencies)
- Orchestrator URL unreachable from worker
- Missing WORKTASK_ID or WORKTASK_TYPE environment variables
- Insufficient resources (OOMKilled)
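A few commands to narrow down which of these causes applies; worker pod and Job names are placeholders:

```shell
# Exit reason (OOMKilled appears under Last State for the terminated container)
kubectl describe pod <worker-pod-name> | grep -A3 "Last State"

# Worker output from the failed attempt
kubectl logs <worker-pod-name> --previous

# Confirm the WORKTASK_* env vars were injected into the Job spec
kubectl get job <worker-job-name> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```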
Issue: Plugin Compilation Failures

Symptoms:

- Worker logs show "compilation failed"
- Pipeline execution fails at plugin action

Common causes:

- Plugin has incompatible Go version
- Missing required packages in the plugin's go.mod
- Plugin doesn't export the `Plugin` symbol
- `CGO_ENABLED=0` (should be 1)

Resolution:

- Verify plugin.yaml exists and is valid
- Ensure the plugin exports `var Plugin MyPlugin`
- Test plugin compilation locally:
Issue: Storage Operations Timing Out

Symptoms:

- Store/retrieve operations fail with timeout
- High storage latency

Resolution:

- Verify storage credentials in config
- Check network connectivity to the storage backend
- Increase the timeout in the storage plugin configuration
- Add connection pooling to the storage plugin
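As an illustration only (not the actual plugin schema, which varies by backend), raising a timeout and adding pooling might look like:

```yaml
storage:
  backend: s3            # illustrative backend name
  timeoutSeconds: 30     # raise from the default if operations time out
  connectionPool:        # illustrative keys — check the plugin's documentation
    maxOpen: 20
    maxIdle: 5
```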
Interactive Debugging

Debug Worker Locally: run the worker binary outside the cluster with the WORKTASK_ID and WORKTASK_TYPE environment variables set and the orchestrator URL pointed at a port-forwarded orchestrator, so you can attach a debugger and step through plugin execution.

Alerting
Recommended Alerts
Critical Alerts: at minimum, alert when queue depth keeps growing while no workers are active, when the worker failure count spikes, and when the storage error counter is non-zero; the queue, worker, and storage metrics listed above provide the signals for these rules.

Troubleshooting Checklist
When investigating issues:

1. Check Component Status
   - Orchestrator pod is Running
   - Frontend pod is Running (if enabled)
   - Storage backends are accessible

2. Review Recent Logs
   - Orchestrator logs for errors
   - Worker logs for failures
   - Storage plugin initialization logs

3. Verify Configuration
   - Helm values are correct
   - Storage configs have valid credentials
   - Plugin repositories are accessible

4. Check Resources
   - Pods not OOMKilled
   - Nodes have available capacity
   - PVC has sufficient space

5. Test Connectivity
   - Workers can reach the orchestrator
   - Orchestrator can reach storage backends
   - Frontend can reach the orchestrator (if applicable)
Best Practices
1. Enable Structured Logging
   - Use JSON log format for easier parsing
   - Include correlation IDs (task_id, pipeline_id)
   - Log at appropriate levels

2. Set Up Centralized Logging
   - Use Fluentd/Fluent Bit to collect logs
   - Forward to Elasticsearch, Loki, or CloudWatch
   - Implement log retention policies

3. Monitor Key Metrics
   - Track queue depth and wait time
   - Monitor worker success/failure rates
   - Measure storage operation latency
   - Alert on anomalies

4. Implement Health Checks
   - Configure liveness and readiness probes
   - Test storage health periodically
   - Monitor plugin compilation success

5. Maintain Observability
   - Keep logs for a minimum of 7 days
   - Export metrics to long-term storage
   - Document alert runbooks
   - Review metrics regularly

6. Debug Proactively
   - Test deployments in staging first
   - Validate plugins before production use
   - Implement canary deployments for updates
   - Keep worker image versions consistent