
Container Alerts

Set up automated alerts to monitor container health, performance, and resource usage. Create alerts based on metrics thresholds and receive notifications when conditions are met.

GCP Cloud Run Alerts

Cloud Run alerts use Google Cloud Monitoring to trigger notifications based on metric thresholds.

List Alerts

Retrieve all alert policies for a specific Cloud Run service.

Endpoint

GET /api/gcp/containers/{projectId}/{containerName}/alerts

Parameters

projectId
string
required
GCP project ID
containerName
string
required
Cloud Run service name

Response

{
  "value": [
    {
      "name": "1234567890123456789",
      "displayName": "High Request Latency Alert",
      "enabled": true,
      "description": "Alert when P95 latency exceeds 1000ms for 5 minutes"
    },
    {
      "name": "9876543210987654321",
      "displayName": "High Request Count",
      "enabled": true,
      "description": "Brak opisu."
    }
  ]
}
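Calling this endpoint from a client reduces to interpolating the two path parameters. A minimal sketch; the base URL and `alerts_path` helper are hypothetical, not part of the API:

```python
# Sketch: building the List Alerts request path. BASE_URL is an
# assumption -- substitute your deployment's host.
BASE_URL = "https://api.example.com"  # hypothetical host

def alerts_path(project_id: str, container_name: str) -> str:
    """Build the List Alerts endpoint path for a Cloud Run service."""
    return f"/api/gcp/containers/{project_id}/{container_name}/alerts"

# e.g. requests.get(BASE_URL + alerts_path("my-project", "my-service"))
path = alerts_path("my-project", "my-service")
```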

Implementation

from flask import jsonify, session
from google.cloud import monitoring_v3

def list_gcp_container_alerts(project_id, container_name):
    accounts = session.get("accounts", [])
    gcp_account = next(
        (acc for acc in accounts if acc.get("provider") == "gcp"),
        None
    )
    if gcp_account is None:
        return jsonify({"error": "No active GCP account found in session"}), 401

    # SessionCredentials is an application helper that builds
    # google.auth credentials from the stored account tokens
    credentials = SessionCredentials(gcp_account)
    client = monitoring_v3.AlertPolicyServiceClient(credentials=credentials)
    project_name = f"projects/{project_id}"

    request = monitoring_v3.ListAlertPoliciesRequest(name=project_name)
    policies = client.list_alert_policies(request=request)

    container_alerts = []
    service_filter = f'resource.labels.service_name = "{container_name}"'
    resource_filter = 'resource.type = "cloud_run_revision"'

    # Keep only policies whose condition filter targets this container
    for policy in policies:
        found = False
        for condition in policy.conditions:
            filter_text = ""
            if condition.condition_threshold and condition.condition_threshold.filter:
                filter_text = condition.condition_threshold.filter
            elif condition.condition_absent and condition.condition_absent.filter:
                filter_text = condition.condition_absent.filter

            if service_filter in filter_text and resource_filter in filter_text:
                found = True
                break

        if found:
            container_alerts.append({
                "name": policy.name.split('/')[-1],
                "displayName": policy.display_name,
                "enabled": policy.enabled,
                "description": policy.documentation.content if policy.documentation else "No description."
            })

    return jsonify({"value": container_alerts}), 200
Only alert policies that specifically filter for the Cloud Run service and resource type are returned. Project-wide alerts are excluded.
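The matching step above is a plain substring check against each condition's filter. The standalone sketch below isolates that check; `policy_targets_service` is a hypothetical name for illustration:

```python
def policy_targets_service(filter_text: str, service_name: str) -> bool:
    """Return True when a condition filter pins both the Cloud Run
    resource type and the given service name -- the same substring
    test the List Alerts endpoint applies."""
    return (
        f'resource.labels.service_name = "{service_name}"' in filter_text
        and 'resource.type = "cloud_run_revision"' in filter_text
    )

# A container-scoped filter matches; a project-wide one does not.
matching = (
    'metric.type = "run.googleapis.com/request_count" AND '
    'resource.type = "cloud_run_revision" AND '
    'resource.labels.service_name = "my-service"'
)
project_wide = 'metric.type = "run.googleapis.com/request_count"'
```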

Create Alert

Create a new alert policy for a Cloud Run service.

Endpoint

POST /api/gcp/containers/{projectId}/{region}/{containerName}/alerts

Request Body

{
  "alertName": "High Request Latency Alert",
  "metricType": "run.googleapis.com/request_latencies",
  "threshold": 1000
}

Parameters

projectId
string
required
GCP project ID
region
string
required
Cloud Run service region (e.g., europe-west1)
containerName
string
required
Cloud Run service name
alertName
string
required
Display name for the alert policy
metricType
string
required
Metric type to monitor:
  • run.googleapis.com/request_count - Request count
  • run.googleapis.com/request_latencies - Request latency (P95)
  • run.googleapis.com/container/instance_count - Instance count
threshold
number
required
Threshold value that triggers the alert

Response

{
  "message": "Utworzono alert 'High Request Latency Alert'. (Uwaga: nie skonfigurowano kanałów notyfikacji).",
  "name": "1234567890123456789",
  "displayName": "High Request Latency Alert"
}
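Before building the policy, the three body fields should be validated; missing ones produce the 400 response shown under Error Handling. A minimal validation sketch with a hypothetical `missing_alert_fields` helper:

```python
REQUIRED_FIELDS = ("alertName", "metricType", "threshold")

def missing_alert_fields(data: dict) -> list:
    """Return the required request-body fields absent from data,
    mirroring the 400 error under Error Handling."""
    return [f for f in REQUIRED_FIELDS if data.get(f) is None]
```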

Implementation

from flask import jsonify, request, session
from google.cloud import monitoring_v3

def create_gcp_container_alert(project_id, region, container_name):
    accounts = session.get("accounts", [])
    gcp_account = next(
        (acc for acc in accounts if acc.get("provider") == "gcp"),
        None
    )
    if gcp_account is None:
        return jsonify({"error": "No active GCP account found in session"}), 401

    data = request.get_json()
    display_name = data.get("alertName")
    metric_type = data.get("metricType")
    threshold = data.get("threshold")
    if display_name is None or metric_type is None or threshold is None:
        return jsonify({"error": "Required fields: alertName, metricType, threshold"}), 400

    # SessionCredentials is an application helper that builds
    # google.auth credentials from the stored account tokens
    credentials = SessionCredentials(gcp_account)
    client = monitoring_v3.AlertPolicyServiceClient(credentials=credentials)
    project_name = f"projects/{project_id}"

    # Condition: metric above threshold for 5 consecutive minutes
    condition = monitoring_v3.AlertPolicy.Condition(
        display_name=f"{metric_type} > {threshold} for 5 minutes",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                f'metric.type = "{metric_type}" AND '
                f'resource.type = "cloud_run_revision" AND '
                f'resource.labels.service_name = "{container_name}" AND '
                f'resource.labels.location = "{region}"'
            ),
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period={"seconds": 60},
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                )
            ],
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=float(threshold),
            duration={"seconds": 300},  # 5 minutes
            trigger=monitoring_v3.AlertPolicy.Condition.Trigger(count=1),
        ),
    )

    # Create the alert policy (no notification channels attached)
    policy = monitoring_v3.AlertPolicy(
        display_name=display_name,
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
        conditions=[condition],
    )

    request_data = monitoring_v3.CreateAlertPolicyRequest(
        name=project_name,
        alert_policy=policy
    )
    created_policy = client.create_alert_policy(request=request_data)

    return jsonify({
        "message": f"Created alert '{created_policy.display_name}'. (Note: no notification channels configured.)",
        "name": created_policy.name.split('/')[-1],
        "displayName": created_policy.display_name
    }), 201

Alert Configuration Details

Duration:
  • Alerts trigger after the condition persists for 5 minutes (300 seconds)
  • This prevents false positives from temporary spikes
  • Modify the duration parameter to change the evaluation window
Aggregation:
  • Metrics are aggregated over 60-second intervals
  • ALIGN_MEAN calculates the average value per interval
  • Use ALIGN_SUM for counters or ALIGN_MAX for peak values
Comparison:
  • COMPARISON_GT: Greater than threshold
  • COMPARISON_GE: Greater than or equal
  • COMPARISON_LT: Less than
  • COMPARISON_LE: Less than or equal
Trigger:
  • count=1: The alert fires as soon as the condition is met
  • Increase count to require multiple consecutive violations
This implementation creates alert policies without notification channels. Configure notification channels in the GCP Console or via API to receive alerts via email, SMS, Slack, PagerDuty, etc.
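The duration and alignment settings are protobuf-style dicts keyed by seconds. A small sketch converting human-friendly windows into that shape; `evaluation_windows` is a hypothetical helper, with defaults matching the values used above:

```python
def evaluation_windows(eval_minutes: int = 5, align_seconds: int = 60) -> dict:
    """Translate evaluation windows into the protobuf-style dicts
    used by the condition above (defaults: 5-minute duration to
    suppress transient spikes, 60-second aggregation alignment)."""
    return {
        "duration": {"seconds": eval_minutes * 60},
        "alignment_period": {"seconds": align_seconds},
    }
```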

Delete Alert

Delete an existing alert policy.

Endpoint

DELETE /api/gcp/containers/{projectId}/alerts/{alertName}

Parameters

projectId
string
required
GCP project ID
alertName
string
required
Alert policy name (numeric ID)

Response

{
  "message": "Alert '1234567890123456789' został pomyślnie usunięty."
}

Implementation

from flask import jsonify, session
from google.cloud import monitoring_v3

def delete_gcp_container_alert(project_id, alert_name):
    accounts = session.get("accounts", [])
    gcp_account = next(
        (acc for acc in accounts if acc.get("provider") == "gcp"),
        None
    )
    if gcp_account is None:
        return jsonify({"error": "No active GCP account found in session"}), 401

    # SessionCredentials is an application helper that builds
    # google.auth credentials from the stored account tokens
    credentials = SessionCredentials(gcp_account)
    client = monitoring_v3.AlertPolicyServiceClient(credentials=credentials)
    policy_full_name = f"projects/{project_id}/alertPolicies/{alert_name}"

    request_data = monitoring_v3.DeleteAlertPolicyRequest(name=policy_full_name)
    client.delete_alert_policy(request=request_data)

    return jsonify({
        "message": f"Alert '{alert_name}' was deleted successfully."
    }), 200
Deleting an alert policy is permanent and cannot be undone. The alert will immediately stop monitoring the service.

Common Alert Scenarios

High Request Latency

Alert when P95 request latency exceeds acceptable limits.
{
  "alertName": "High Request Latency",
  "metricType": "run.googleapis.com/request_latencies",
  "threshold": 1000
}
Use case: Detect performance degradation before it impacts users.
Threshold guidance:
  • Web applications: 200-500ms
  • APIs: 100-200ms
  • Background services: 1000-5000ms
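The guidance above can be encoded as a lookup when provisioning alerts across many services. The table and helper below are hypothetical; the ranges are the document's suggestions, not hard rules:

```python
# Suggested latency ranges in milliseconds, per service type
LATENCY_GUIDANCE_MS = {
    "web": (200, 500),
    "api": (100, 200),
    "background": (1000, 5000),
}

def suggested_latency_threshold(service_kind: str) -> int:
    """Return the upper bound of the suggested range as a
    conservative starting threshold."""
    low, high = LATENCY_GUIDANCE_MS[service_kind]
    return high
```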

Request Count Spike

Alert when request count exceeds normal traffic patterns.
{
  "alertName": "Unusual Request Volume",
  "metricType": "run.googleapis.com/request_count",
  "threshold": 10000
}
Use case: Detect traffic spikes from marketing campaigns, DDoS attacks, or viral content.
Threshold guidance:
  • Calculate baseline from historical data
  • Set threshold at 2-3x normal peak traffic
  • Adjust based on scaling capacity
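Deriving the threshold from a measured baseline is simple arithmetic. A sketch with a hypothetical `traffic_spike_threshold` helper; the 2.5 default is an arbitrary midpoint of the 2-3x range suggested above:

```python
def traffic_spike_threshold(peak_baseline: float, multiplier: float = 2.5) -> float:
    """Set the alert threshold at a multiple of normal peak traffic,
    per the 2-3x guidance above."""
    return peak_baseline * multiplier
```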

Instance Count Alert

Alert when container instance count indicates scaling issues.
{
  "alertName": "High Instance Count",
  "metricType": "run.googleapis.com/container/instance_count",
  "threshold": 50
}
Use case: Detect unexpected scaling events or runaway containers.
Threshold guidance:
  • Set below configured max_instance_count
  • Consider cost implications of sustained high instance counts
  • Alert when approaching quota limits
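Placing the threshold below the configured ceiling can also be parameterized. A sketch with a hypothetical helper; the 0.8 headroom ratio is an assumed default, not a recommendation from the platform:

```python
def instance_alert_threshold(max_instance_count: int, headroom: float = 0.8) -> int:
    """Place the threshold below the configured max_instance_count so
    the alert fires before scaling hits its ceiling."""
    return int(max_instance_count * headroom)
```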

Low Request Count

Alert when request count drops unexpectedly (service health check).
{
  "alertName": "Service Availability Issue",
  "metricType": "run.googleapis.com/request_count",
  "threshold": 10
}
Configuration: Change the comparison to COMPARISON_LT (less than). Note that the Create Alert implementation above hardcodes COMPARISON_GT, so a low-traffic alert requires adjusting the implementation or creating the policy directly.
Use case: Detect service outages, DNS issues, or upstream failures.

Alert Best Practices

Threshold Selection

  • Baseline metrics first: Collect 1-2 weeks of data before setting thresholds
  • Avoid false positives: Set thresholds with buffer above normal variance
  • Consider time of day: Use different thresholds for peak vs off-peak hours
  • Test alerts: Trigger test conditions to validate notification delivery

Alert Fatigue Prevention

  • Actionable alerts only: Each alert should require human action
  • Appropriate duration: Use 5+ minute windows to avoid transient noise
  • Consolidate conditions: Combine related metrics into single alerts
  • Regular review: Disable or adjust alerts that trigger frequently without issues

Notification Channels

Configure notification channels for different severity levels.
Critical alerts (immediate action required):
  • PagerDuty for on-call rotation
  • SMS for urgent notifications
  • Phone calls for P0 incidents
Warning alerts (monitor closely):
  • Slack channels for team visibility
  • Email for documentation trail
  • Webhooks for automated responses
Info alerts (awareness):
  • Email digests
  • Dashboard visualization
  • Log aggregation
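The routing above can be represented as a severity-to-channel map in the alerting layer. The mapping and `channels_for` helper below are hypothetical, mirroring the three tiers listed:

```python
# Hypothetical severity-to-channel routing, following the tiers above
CHANNELS_BY_SEVERITY = {
    "critical": ["pagerduty", "sms", "phone"],
    "warning": ["slack", "email", "webhook"],
    "info": ["email_digest", "dashboard", "logs"],
}

def channels_for(severity: str) -> list:
    """Return the channels for a severity, falling back to email
    for unrecognized levels."""
    return CHANNELS_BY_SEVERITY.get(severity, ["email"])
```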

Multi-Condition Alerts

Create sophisticated alerts by combining multiple conditions:
# Alert when both latency is high AND the error rate increases;
# latency_condition and error_rate_condition are built the same way
# as the condition in the Create Alert implementation above
policy = monitoring_v3.AlertPolicy(
    display_name="Service Degradation",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[latency_condition, error_rate_condition],
)

Documentation

Include clear documentation in alert descriptions:
from textwrap import dedent

policy = monitoring_v3.AlertPolicy(
    display_name="High Request Latency",
    documentation=monitoring_v3.AlertPolicy.Documentation(
        # dedent strips the source indentation so the rendered
        # Markdown is flush-left
        content=dedent("""\
            ## Runbook: High Request Latency

            **Severity:** P2
            **Impact:** User experience degradation

            **Investigation steps:**
            1. Check Cloud Run logs for errors
            2. Review recent deployments
            3. Verify database performance
            4. Check external API response times

            **Mitigation:**
            - Increase max_instance_count if at limit
            - Roll back recent deployment if applicable
            - Scale up instance resources

            **Escalation:** Contact backend team lead after 15 minutes
            """),
        mime_type="text/markdown"
    ),
    conditions=[condition],
)

Azure Container Instances Alerts

Azure Container Instances alerts are configured through Azure Monitor. While the current implementation focuses on GCP Cloud Run, similar patterns apply:

Metric Alerts

Create alerts based on CPU and memory metrics:
  • CpuUsage > threshold for X minutes
  • MemoryUsage > threshold for X minutes
  • Container state changes (Running → Failed)

Log Query Alerts

Create alerts based on Log Analytics queries:
ContainerInstanceLog_CL
| where ContainerGroup_s == 'my-container'
| where Message contains 'ERROR'
| summarize ErrorCount=count() by bin(TimeGenerated, 5m)
| where ErrorCount > 10

Configuration via Azure Portal

  1. Navigate to Container Instance in Azure Portal
  2. Select “Alerts” from left menu
  3. Click “New alert rule”
  4. Configure signal, condition, and action group
  5. Save alert rule
Azure alerts support action groups for notifications (email, SMS, webhook, Logic Apps, Azure Functions) and automated remediation.

Error Handling

Common Errors

Unauthorized (401):
{
  "error": "No active GCP account found in session"
}
Solution: Ensure the GCP account is authenticated with a valid refresh token.

Bad Request (400):
{
  "error": "Required fields: alertName, metricType, threshold"
}
Solution: Provide all required parameters in the request body.

Forbidden (403):
{
  "error": "Permission denied on resource project my-project"
}
Solution: Grant the Monitoring Admin or Monitoring Alert Policy Editor role.

Server Error (500):
{
  "error": "Error creating alert: Invalid metric type"
}
Solution: Verify the metric type is valid for Cloud Run.

Monitoring Alert Health

Alert Testing

Test alerts before relying on them in production:
  1. Trigger threshold artificially: Generate load to exceed threshold
  2. Verify notification delivery: Confirm all channels receive alerts
  3. Test escalation paths: Ensure on-call rotations work correctly
  4. Measure response time: Track time from alert to mitigation

Alert Metrics

Track alert effectiveness:
  • True positive rate: Alerts that identified real issues
  • False positive rate: Alerts without actual problems
  • Mean time to acknowledge (MTTA): How quickly alerts are noticed
  • Mean time to resolve (MTTR): How quickly issues are fixed
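These effectiveness metrics are straightforward to compute from incident records. A sketch with a hypothetical `alert_effectiveness` helper, taking raw counts and per-alert acknowledgement times:

```python
def alert_effectiveness(true_positives: int, false_positives: int,
                        ack_minutes: list) -> dict:
    """Compute the tracking metrics listed above: positive rates
    from alert counts, and MTTA from acknowledgement times."""
    total = true_positives + false_positives
    return {
        "true_positive_rate": true_positives / total if total else 0.0,
        "false_positive_rate": false_positives / total if total else 0.0,
        "mtta_minutes": sum(ack_minutes) / len(ack_minutes) if ack_minutes else 0.0,
    }
```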

Regular Maintenance

  • Weekly: Review triggered alerts and response actions
  • Monthly: Adjust thresholds based on traffic patterns
  • Quarterly: Audit all alert policies for relevance
  • After incidents: Update alerts to catch similar issues earlier
