Container Alerts
Set up automated alerts to monitor container health, performance, and resource usage. Create alerts based on metrics thresholds and receive notifications when conditions are met.
GCP Cloud Run Alerts
Cloud Run alerts use Google Cloud Monitoring to trigger notifications based on metric thresholds.
List Alerts
Retrieve all alert policies for a specific Cloud Run service.
Endpoint
GET /api/gcp/containers/{projectId}/{containerName}/alerts
Parameters
projectId - GCP project ID
containerName - Cloud Run service name
Response
{
  "value": [
    {
      "name": "1234567890123456789",
      "displayName": "High Request Latency Alert",
      "enabled": true,
      "description": "Alert when P95 latency exceeds 1000ms for 5 minutes"
    },
    {
      "name": "9876543210987654321",
      "displayName": "High Request Count",
      "enabled": true,
      "description": "No description."
    }
  ]
}
Implementation
from flask import jsonify, session
from google.cloud import monitoring_v3

def list_gcp_container_alerts(project_id, container_name):
    accounts = session.get("accounts", [])
    gcp_account = next(
        (acc for acc in accounts if acc.get("provider") == "gcp"),
        None
    )
    if gcp_account is None:
        return jsonify({"error": "No active GCP account found in session"}), 401

    # SessionCredentials: project helper wrapping the session-stored OAuth credentials
    credentials = SessionCredentials(gcp_account)
    client = monitoring_v3.AlertPolicyServiceClient(credentials=credentials)
    project_name = f"projects/{project_id}"
    request = monitoring_v3.ListAlertPoliciesRequest(name=project_name)
    policies = client.list_alert_policies(request=request)

    container_alerts = []
    filter_str_1 = f'resource.labels.service_name = "{container_name}"'
    filter_str_2 = 'resource.type = "cloud_run_revision"'

    # Keep only policies whose conditions filter on this service and resource type
    for policy in policies:
        found = False
        for condition in policy.conditions:
            filter_text = ""
            if condition.condition_threshold and condition.condition_threshold.filter:
                filter_text = condition.condition_threshold.filter
            elif condition.condition_absent and condition.condition_absent.filter:
                filter_text = condition.condition_absent.filter
            if filter_str_1 in filter_text and filter_str_2 in filter_text:
                found = True
                break
        if found:
            container_alerts.append({
                "name": policy.name.split('/')[-1],
                "displayName": policy.display_name,
                "enabled": policy.enabled,
                "description": policy.documentation.content if policy.documentation else "No description."
            })
    return jsonify({"value": container_alerts}), 200
Only alert policies that specifically filter for the Cloud Run service and resource type are returned. Project-wide alerts are excluded.
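The per-service matching can be isolated into a small, testable helper. A minimal sketch (`matches_cloud_run_service` is a hypothetical name, not part of the implementation above; it mirrors the containment check on condition filters):

```python
def matches_cloud_run_service(filter_text: str, container_name: str) -> bool:
    """Return True when a condition filter targets the given Cloud Run service.

    The filter must name both the cloud_run_revision resource type and the
    service itself; anything else is treated as a project-wide policy.
    """
    service_clause = f'resource.labels.service_name = "{container_name}"'
    resource_clause = 'resource.type = "cloud_run_revision"'
    return service_clause in filter_text and resource_clause in filter_text


# A filter scoped to this service and resource type matches...
scoped = (
    'metric.type = "run.googleapis.com/request_latencies" AND '
    'resource.type = "cloud_run_revision" AND '
    'resource.labels.service_name = "my-service"'
)
# ...while a project-wide CPU alert does not.
project_wide = 'metric.type = "compute.googleapis.com/instance/cpu/utilization"'

print(matches_cloud_run_service(scoped, "my-service"))        # True
print(matches_cloud_run_service(project_wide, "my-service"))  # False
```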
Create Alert
Create a new alert policy for a Cloud Run service.
Endpoint
POST /api/gcp/containers/{projectId}/{region}/{containerName}/alerts
Request Body
{
  "alertName": "High Request Latency Alert",
  "metricType": "run.googleapis.com/request_latencies",
  "threshold": 1000
}
Parameters
projectId - GCP project ID
region - Cloud Run service region (e.g., europe-west1)
containerName - Cloud Run service name
alertName - Display name for the alert policy
metricType - Metric type to monitor:
run.googleapis.com/request_count - Request count
run.googleapis.com/request_latencies - Request latency (P95)
run.googleapis.com/container/instance_count - Instance count
threshold - Threshold value that triggers the alert
Response
{
  "message": "Created alert 'High Request Latency Alert'. (Note: no notification channels configured.)",
  "name": "1234567890123456789",
  "displayName": "High Request Latency Alert"
}
Implementation
from flask import jsonify, request, session
from google.cloud import monitoring_v3

def create_gcp_container_alert(project_id, region, container_name):
    accounts = session.get("accounts", [])
    gcp_account = next(
        (acc for acc in accounts if acc.get("provider") == "gcp"),
        None
    )
    if gcp_account is None:
        return jsonify({"error": "No active GCP account found in session"}), 401

    data = request.get_json()
    display_name = data.get("alertName")
    metric_type = data.get("metricType")
    threshold = data.get("threshold")
    if not display_name or not metric_type or threshold is None:
        return jsonify({"error": "Required fields: alertName, metricType, threshold"}), 400

    credentials = SessionCredentials(gcp_account)
    client = monitoring_v3.AlertPolicyServiceClient(credentials=credentials)
    project_name = f"projects/{project_id}"

    # Threshold condition scoped to this Cloud Run service and region
    condition = monitoring_v3.AlertPolicy.Condition(
        display_name=f"{metric_type} > {threshold} for 5 minutes",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                f'metric.type = "{metric_type}" AND '
                f'resource.type = "cloud_run_revision" AND '
                f'resource.labels.service_name = "{container_name}" AND '
                f'resource.labels.location = "{region}"'
            ),
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period={"seconds": 60},
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                )
            ],
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=float(threshold),
            duration={"seconds": 300},  # 5 minutes
            trigger=monitoring_v3.AlertPolicy.Condition.Trigger(count=1),
        ),
    )

    # Create the alert policy (no notification channels attached)
    policy = monitoring_v3.AlertPolicy(
        display_name=display_name,
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
        conditions=[condition],
    )
    request_data = monitoring_v3.CreateAlertPolicyRequest(
        name=project_name,
        alert_policy=policy
    )
    created_policy = client.create_alert_policy(request=request_data)

    return jsonify({
        "message": f"Created alert '{created_policy.display_name}'. (Note: no notification channels configured.)",
        "name": created_policy.name.split('/')[-1],
        "displayName": created_policy.display_name
    }), 201
Alert Configuration Details
Duration:
Alerts trigger after the condition persists for 5 minutes (300 seconds)
This prevents false positives from temporary spikes
Modify duration parameter to change the evaluation window
Aggregation:
Metrics are aggregated over 60-second intervals
ALIGN_MEAN calculates the average value per interval
Use ALIGN_SUM for counters or ALIGN_MAX for peak values
Comparison:
COMPARISON_GT: Greater than threshold
COMPARISON_GE: Greater than or equal
COMPARISON_LT: Less than
COMPARISON_LE: Less than or equal
Trigger:
count=1: Alert fires as soon as condition is met
Increase count to require multiple consecutive violations
This implementation creates alert policies without notification channels. Configure notification channels in the GCP Console or via API to receive alerts via email, SMS, Slack, PagerDuty, etc.
Delete Alert
Delete an existing alert policy.
Endpoint
DELETE /api/gcp/containers/{projectId}/alerts/{alertName}
Parameters
projectId - GCP project ID
alertName - Alert policy name (numeric ID)
Response
{
  "message": "Alert '1234567890123456789' was deleted successfully."
}
Implementation
from flask import jsonify, session
from google.cloud import monitoring_v3

def delete_gcp_container_alert(project_id, alert_name):
    accounts = session.get("accounts", [])
    gcp_account = next(
        (acc for acc in accounts if acc.get("provider") == "gcp"),
        None
    )
    if gcp_account is None:
        return jsonify({"error": "No active GCP account found in session"}), 401

    credentials = SessionCredentials(gcp_account)
    client = monitoring_v3.AlertPolicyServiceClient(credentials=credentials)
    policy_full_name = f"projects/{project_id}/alertPolicies/{alert_name}"
    request_data = monitoring_v3.DeleteAlertPolicyRequest(name=policy_full_name)
    client.delete_alert_policy(request=request_data)

    return jsonify({
        "message": f"Alert '{alert_name}' was deleted successfully."
    }), 200
Deleting an alert policy is permanent and cannot be undone. The alert will immediately stop monitoring the service.
Common Alert Scenarios
High Request Latency
Alert when P95 request latency exceeds acceptable limits.
{
  "alertName": "High Request Latency",
  "metricType": "run.googleapis.com/request_latencies",
  "threshold": 1000
}
Use case: Detect performance degradation before it impacts users.
Threshold guidance:
Web applications: 200-500ms
APIs: 100-200ms
Background services: 1000-5000ms
Request Count Spike
Alert when request count exceeds normal traffic patterns.
{
  "alertName": "Unusual Request Volume",
  "metricType": "run.googleapis.com/request_count",
  "threshold": 10000
}
Use case: Detect traffic spikes from marketing campaigns, DDoS attacks, or viral content.
Threshold guidance:
Calculate baseline from historical data
Set threshold at 2-3x normal peak traffic
Adjust based on scaling capacity
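The baseline guidance above can be turned into a simple calculation. A minimal sketch (hypothetical helper; assumes per-minute request counts collected from 1-2 weeks of historical data):

```python
def spike_threshold(request_counts: list[int], multiplier: float = 2.5) -> int:
    """Derive an alert threshold from historical per-minute request counts.

    Uses the observed peak as the baseline and applies a 2-3x multiplier
    so normal traffic variance does not trigger the alert.
    """
    peak = max(request_counts)
    return int(peak * multiplier)

# Abbreviated history of per-minute samples; peak is 4000 req/min.
history = [1200, 1800, 2500, 4000, 3100, 900]
print(spike_threshold(history))  # 10000
```

Lower the multiplier toward 2x when scaling capacity is tight, so the alert fires before autoscaling limits are reached.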
Instance Count Alert
Alert when container instance count indicates scaling issues.
{
  "alertName": "High Instance Count",
  "metricType": "run.googleapis.com/container/instance_count",
  "threshold": 50
}
Use case: Detect unexpected scaling events or runaway containers.
Threshold guidance:
Set below configured max_instance_count
Consider cost implications of sustained high instance counts
Alert when approaching quota limits
Low Request Count
Alert when request count drops unexpectedly (service health check).
{
  "alertName": "Service Availability Issue",
  "metricType": "run.googleapis.com/request_count",
  "threshold": 10
}
Configuration: Change comparison to COMPARISON_LT (less than).
Use case: Detect service outages, DNS issues, or upstream failures.
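To support this scenario, the create endpoint could accept an optional comparison field in the request body. A minimal sketch of the mapping (hypothetical extension; the implementation above hard-codes COMPARISON_GT):

```python
from typing import Optional

# Maps an optional request-body "comparison" field to a Cloud Monitoring
# ComparisonType name; these are the four comparisons documented above.
COMPARISONS = {
    "gt": "COMPARISON_GT",  # greater than (default)
    "ge": "COMPARISON_GE",
    "lt": "COMPARISON_LT",  # used for low-request-count alerts
    "le": "COMPARISON_LE",
}

def resolve_comparison(requested: Optional[str]) -> str:
    """Fall back to COMPARISON_GT when the field is absent or unrecognized."""
    return COMPARISONS.get((requested or "gt").lower(), "COMPARISON_GT")

print(resolve_comparison("lt"))  # COMPARISON_LT
print(resolve_comparison(None))  # COMPARISON_GT
```

The resolved name can then be looked up on `monitoring_v3.ComparisonType` when building the condition.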
Alert Best Practices
Threshold Selection
Baseline metrics first: Collect 1-2 weeks of data before setting thresholds
Avoid false positives: Set thresholds with buffer above normal variance
Consider time of day: Use different thresholds for peak vs off-peak hours
Test alerts: Trigger test conditions to validate notification delivery
Alert Fatigue Prevention
Actionable alerts only: Each alert should require human action
Appropriate duration: Use 5+ minute windows to avoid transient noise
Consolidate conditions: Combine related metrics into single alerts
Regular review: Disable or adjust alerts that trigger frequently without issues
Notification Channels
Configure notification channels for different severity levels:
Critical alerts (immediate action required):
PagerDuty for on-call rotation
SMS for urgent notifications
Phone calls for P0 incidents
Warning alerts (monitor closely):
Slack channels for team visibility
Email for documentation trail
Webhooks for automated responses
Info alerts (awareness):
Email digests
Dashboard visualization
Log aggregation
Multi-Condition Alerts
Create sophisticated alerts by combining multiple conditions:
# Alert when both latency is high AND error rate increases
policy = monitoring_v3.AlertPolicy(
    display_name="Service Degradation",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[latency_condition, error_rate_condition],
)
Documentation
Include clear documentation in alert descriptions:
policy = monitoring_v3.AlertPolicy(
    display_name="High Request Latency",
    documentation=monitoring_v3.AlertPolicy.Documentation(
        content="""## Runbook: High Request Latency

**Severity:** P2
**Impact:** User experience degradation

**Investigation steps:**
1. Check Cloud Run logs for errors
2. Review recent deployments
3. Verify database performance
4. Check external API response times

**Mitigation:**
- Increase max_instance_count if at limit
- Roll back recent deployment if applicable
- Scale up instance resources

**Escalation:** Contact backend team lead after 15 minutes
""",
        mime_type="text/markdown"
    ),
    conditions=[condition],
)
Azure Container Instances Alerts
Azure Container Instances alerts are configured through Azure Monitor. While the current implementation focuses on GCP Cloud Run, similar patterns apply:
Metric Alerts
Create alerts based on CPU and memory metrics:
CpuUsage > threshold for X minutes
MemoryUsage > threshold for X minutes
Container state changes (Running → Failed)
Log Query Alerts
Create alerts based on Log Analytics queries:
ContainerInstanceLog_CL
| where ContainerGroup_s == 'my-container'
| where Message contains 'ERROR'
| summarize ErrorCount = count() by bin(TimeGenerated, 5m)
| where ErrorCount > 10
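The same aggregation can be mirrored in plain Python to sanity-check a threshold against exported log lines before creating the alert rule (illustrative sketch, assuming logs as (timestamp, message) tuples):

```python
from collections import Counter
from datetime import datetime, timedelta

def error_bins(logs, window_minutes=5, threshold=10):
    """Count ERROR lines per time window; return windows over the threshold.

    Mirrors the Log Analytics query above: filter on 'ERROR', bin timestamps
    into 5-minute buckets, keep buckets whose count exceeds the threshold.
    """
    window = timedelta(minutes=window_minutes)
    counts = Counter()
    for ts, message in logs:
        if "ERROR" in message:
            # Align the timestamp to the start of its window, like KQL's bin().
            bucket = datetime.min + (ts - datetime.min) // window * window
            counts[bucket] += 1
    return {bucket: n for bucket, n in counts.items() if n > threshold}

# 12 errors within one 5-minute window crosses the threshold of 10.
base = datetime(2024, 1, 1, 12, 0)
logs = [(base + timedelta(seconds=10 * i), "ERROR: timeout") for i in range(12)]
print(error_bins(logs))  # one bucket at 12:00 with 12 errors
```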
Configuration via Azure Portal
Navigate to Container Instance in Azure Portal
Select “Alerts” from left menu
Click “New alert rule”
Configure signal, condition, and action group
Save alert rule
Azure alerts support action groups for notifications (email, SMS, webhook, Logic Apps, Azure Functions) and automated remediation.
Error Handling
Common Errors
Unauthorized (401):
{
  "error": "No active GCP account found in session"
}
Solution: Ensure GCP account is authenticated with valid refresh token.
Bad Request (400):
{
  "error": "Required fields: alertName, metricType, threshold"
}
Solution: Provide all required parameters in request body.
Forbidden (403):
{
  "error": "Permission denied on resource project my-project"
}
Solution: Grant the Monitoring Admin (roles/monitoring.admin) or Monitoring AlertPolicy Editor (roles/monitoring.alertPolicyEditor) role.
Server Error (500):
{
  "error": "Error while creating alert: Invalid metric type"
}
Solution: Verify metric type is valid for Cloud Run.
Monitoring Alert Health
Alert Testing
Test alerts before relying on them in production:
Trigger threshold artificially: Generate load to exceed threshold
Verify notification delivery: Confirm all channels receive alerts
Test escalation paths: Ensure on-call rotations work correctly
Measure response time: Track time from alert to mitigation
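Step 1 can be scripted. A minimal load-generation sketch with an injectable request function, so it can be dry-run before pointing it at a real service (hypothetical helper, not part of the API above):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_load(send_request, total_requests=600, concurrency=20):
    """Fire requests concurrently to push a metric past its alert threshold.

    send_request is any zero-argument callable, e.g.
    lambda: urllib.request.urlopen(service_url).status for a real service.
    Returns the number of completed requests.
    """
    completed = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_request) for _ in range(total_requests)]
        for future in futures:
            future.result()  # propagate errors from failed requests
            completed += 1
    return completed

# Dry run with a stub in place of a real HTTP call.
print(generate_load(lambda: 200, total_requests=50))  # 50
```

Sustain the load longer than the alert's 300-second duration window, otherwise the condition clears before the policy fires.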
Alert Metrics
Track alert effectiveness:
True positive rate: Alerts that identified real issues
False positive rate: Alerts without actual problems
Mean time to acknowledge (MTTA): How quickly alerts are noticed
Mean time to resolve (MTTR): How quickly issues are fixed
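MTTA and MTTR can be computed directly from alert event timestamps. A minimal sketch (hypothetical record format; the implementation above does not track these):

```python
from datetime import datetime, timedelta

def alert_response_metrics(events):
    """Compute MTTA and MTTR from (fired, acknowledged, resolved) timestamps."""
    ack_times = [ack - fired for fired, ack, _ in events]
    fix_times = [resolved - fired for fired, _, resolved in events]
    mtta = sum(ack_times, timedelta()) / len(events)
    mttr = sum(fix_times, timedelta()) / len(events)
    return mtta, mttr

fired = datetime(2024, 1, 1, 12, 0)
events = [
    (fired, fired + timedelta(minutes=4), fired + timedelta(minutes=30)),
    (fired, fired + timedelta(minutes=6), fired + timedelta(minutes=50)),
]
mtta, mttr = alert_response_metrics(events)
print(mtta, mttr)  # 0:05:00 0:40:00
```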
Regular Maintenance
Weekly: Review triggered alerts and response actions
Monthly: Adjust thresholds based on traffic patterns
Quarterly: Audit all alert policies for relevance
After incidents: Update alerts to catch similar issues earlier