Monitor your cloud infrastructure health, metrics, and costs with Clanker
Clanker provides comprehensive monitoring capabilities across your cloud infrastructure. Use natural language queries to check resource health, view metrics, analyze costs, and detect anomalies.
Use natural language to query the health of your services:
# Lambda functions
clanker ask "What's the status of my Lambda functions?"
clanker ask "Show me Lambda functions with high error rates"

# EC2 instances
clanker ask "Are all EC2 instances running?"
clanker ask "Show me stopped or terminated instances"

# RDS databases
clanker ask "What's the current RDS instance status?"
clanker ask "Show me RDS instances with low storage"

# Load balancers
clanker ask "What's the health status of my load balancers?"
clanker ask "Show me unhealthy targets in my ALBs"
Example response:
# Lambda Function Health Status

## Healthy Functions (3)

### api-handler
- **Invocations (24h)**: 12,453
- **Error Rate**: 0.2%
- **Duration (avg)**: 245ms
- **Throttles**: 0

### image-processor
- **Invocations (24h)**: 3,241
- **Error Rate**: 0.1%
- **Duration (avg)**: 1,823ms
- **Throttles**: 0

## Functions with Issues (1)

### email-sender ⚠️
- **Invocations (24h)**: 891
- **Error Rate**: 12.3% (elevated)
- **Duration (avg)**: 512ms
- **Throttles**: 15 (capacity issue)
- **Recent Errors**:
  - `Task timed out after 3.00 seconds` (8 occurrences)
  - `Unable to connect to SMTP server` (102 occurrences)

**Recommended Actions:**
1. Increase timeout for email-sender (currently 3s)
2. Check SMTP credentials in Secrets Manager
3. Consider increasing reserved concurrency
# Cluster overview
clanker k8s ask "tell me the health of my cluster"

# Node status
clanker k8s ask "are all nodes ready?"
clanker k8s ask "show me nodes with resource pressure"

# Pod health
clanker k8s ask "show me pods that are not running"
clanker k8s ask "which pods have been restarted recently?"

# Resource capacity
clanker k8s ask "how much CPU and memory is available?"
# CPU utilization
clanker ask "Show me EC2 instances with high CPU usage"
clanker ask "What's the CPU trend for my-instance over the last 24 hours?"

# Memory and disk
clanker ask "Show me instances with low disk space"
clanker ask "What's the memory usage pattern for my Lambda?"

# Network throughput
clanker ask "Show me network traffic for my load balancer"
clanker ask "Which instances have the highest network egress?"

# Custom metrics
clanker ask "Show me application error rates from CloudWatch"
The postgres-primary-0 pod is using 120% CPU. Since 100% corresponds to one full core, the pod is consuming about 1.2 cores, which suggests it may need a larger CPU allocation or workload optimization.
# Find cost anomalies and waste
clanker cost anomalies
Anomaly detection output:
========================================
Cost Anomalies Detected
========================================

⚠️ High Impact Anomalies (3)

1. EC2 Instance: i-0a1b2c3d4e5f6
   - Idle for 14 days (0.2% CPU avg)
   - Monthly waste: ~$125
   - Recommendation: Stop or terminate

2. RDS Instance: dev-test-db
   - Running in production region
   - No connections in 30 days
   - Monthly waste: ~$178
   - Recommendation: Take final snapshot and delete

3. EBS Volumes: 8 unattached volumes
   - Total size: 1.2 TB
   - Monthly waste: ~$120
   - Recommendation: Delete or attach to instances

💡 Low Impact Opportunities (5)

4. S3 Bucket: logs-archive-2023
   - 456 GB in Standard tier
   - Not accessed in 180 days
   - Monthly savings: ~$10
   - Recommendation: Move to Glacier

[...]

Total Potential Monthly Savings: $523
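To cross-check a saved report against its stated total, the per-item figures can be summed with standard tools. The sketch below assumes the report text matches the sample above (lines containing `Monthly waste: ~$N` or `Monthly savings: ~$N`); `sum_monthly_amounts` is a hypothetical helper name, not part of Clanker, and the real output format may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum the "~$N" figures on "Monthly waste" and
# "Monthly savings" lines read from stdin. The line format is assumed
# from the sample report, not from a documented schema.
sum_monthly_amounts() {
  grep -oE 'Monthly (waste|savings): ~\$[0-9][0-9,]*' \
    | sed -E 's/.*\$//; s/,//g' \
    | awk '{ total += $1 } END { printf "%d\n", total }'
}

# The four entries visible in the sample report.
printf '%s\n' \
  '- Monthly waste: ~$125' \
  '- Monthly waste: ~$178' \
  '- Monthly waste: ~$120' \
  '- Monthly savings: ~$10' \
  | sum_monthly_amounts   # prints 433
```

The four entries shown add up to $433. The reported total of $523 presumably also covers the entries elided from the sample.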
# Recent errors
clanker ask "Show me recent errors in CloudWatch Logs"

# Lambda logs
clanker ask "Show me the last 100 log entries for my-function"

# Search patterns
clanker ask "Find ERROR in application logs from the last hour"

# Multiple log groups
clanker ask "Search for 'timeout' across all Lambda log groups"
# Kubernetes
clanker k8s ask "show me error logs from nginx pods"
clanker k8s ask "show recent logs from pods in production namespace"

# AWS
clanker ask "show me Lambda errors from the last 24 hours"
clanker ask "find API Gateway 5xx errors"
# View alarms
clanker ask "Show me CloudWatch alarms in ALARM state"
clanker ask "What alarms triggered in the last 24 hours?"

# Create alarms with maker
clanker ask --maker "create a CloudWatch alarm for high CPU on my-instance"
#!/bin/bash
# daily-health-check.sh

echo "=== Daily Infrastructure Health Check ==="
echo ""

echo "EC2 Health:"
clanker ask "Show me EC2 instances that are stopped or have issues"
echo ""

echo "Lambda Health:"
clanker ask "Show me Lambda functions with errors in the last 24 hours"
echo ""

echo "RDS Health:"
clanker ask "Show me RDS instance status and any alerts"
echo ""

echo "Cost Check:"
clanker cost summary --format json | jq -r '.totalCost'
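The cost check at the end of the script prints a raw number. One possible extension is to compare it against a budget and print a warning. This is a minimal sketch, assuming `.totalCost` is a plain numeric value; `over_budget` and the budget of 500 are hypothetical and not part of Clanker.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when the cost is strictly above the budget.
# awk handles the comparison because bash arithmetic is integer-only.
over_budget() {
  local cost="$1" budget="$2"
  awk -v c="$cost" -v b="$budget" 'BEGIN { exit !(c + 0 > b + 0) }'
}

# Possible wiring into the daily script (needs clanker and jq installed):
#   total=$(clanker cost summary --format json | jq -r '.totalCost')
#   over_budget "$total" 500 && echo "WARNING: cost $total exceeds budget"

# Standalone demonstration with made-up numbers.
if over_budget 523.40 500; then
  echo "WARNING: cost 523.40 exceeds budget 500"
fi
```

Returning an exit status rather than printing lets the same helper drive an `if`, an `&&` chain, or a non-zero exit from a scheduled job.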
# Verify CloudWatch agent is running (EC2)
clanker ask "Check if CloudWatch agent is running on my instances"

# Check metrics-server (Kubernetes)
kubectl get deployment metrics-server -n kube-system

# Verify IAM permissions
clanker ask "Check CloudWatch permissions for my instances"
Cost data missing
Cost Explorer data can take up to 24 hours to appear, so queries covering the most recent day or two may return nothing:
# Check if Cost Explorer is enabled
aws ce get-cost-and-usage --time-period Start=2026-03-01,End=2026-03-02 --granularity DAILY --metrics UnblendedCost

# Use a date range at least 2 days old
clanker cost summary --start 2026-02-01 --end 2026-02-28
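Rather than hard-coding dates, a script can compute a range that ends at least 2 days in the past. The sketch below assumes GNU `date` (the `-d` option; BSD/macOS `date` uses different flags). `days_before` and the 30-day window are hypothetical choices for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the date N days before a reference date
# (YYYY-MM-DD), computed in UTC. Requires GNU date.
days_before() {
  local ref="$1" n="$2" ref_epoch
  ref_epoch=$(date -u -d "$ref" +%s) || return 1
  date -u -d "@$(( ref_epoch - n * 86400 ))" +%F
}

today=$(date -u +%F)
end=$(days_before "$today" 2)      # skip the most recent 2 days
start=$(days_before "$today" 30)   # arbitrary 30-day lookback

# Print the command instead of running it, so the dates can be reviewed first.
echo "clanker cost summary --start $start --end $end"
```

Working in UTC with epoch seconds avoids off-by-one results around daylight saving transitions.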
High memory in stats output
If you see >100% CPU or memory:
>100% CPU: Pod is using multiple cores (e.g., 250% = 2.5 cores)
>100% memory: Pod is using more than its requested memory (the percentage is relative to the request, not the limit)
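The CPU rule above is a straight division by 100. As a quick worked example, the hypothetical helper below (not part of Clanker) converts a CPU percentage into a core count, treating 100% as one full core.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a CPU percentage into cores (100% = 1 core).
# Accepts the value with or without a trailing percent sign.
cpu_percent_to_cores() {
  local percent="${1%\%}"
  awk -v p="$percent" 'BEGIN { printf "%.2f\n", p / 100 }'
}

cpu_percent_to_cores 250%   # prints 2.50
cpu_percent_to_cores 120    # prints 1.20
```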