Monitor the CGIAR Risk Intelligence Tool’s Lambda functions, CloudWatch logs, and system metrics to ensure reliable operation.

CloudWatch Log Groups

All Lambda functions automatically create CloudWatch log groups with structured logging:

API Lambda

Log Group: /aws/lambda/alliance-risk-api
  • HTTP request/response logs
  • Authentication events (Cognito token verification)
  • Database connection status
  • Error stack traces

Worker Lambda

Log Group: /aws/lambda/alliance-risk-worker
  • Job processing lifecycle (PENDING → PROCESSING → COMPLETED/FAILED)
  • Bedrock model invocations
  • SQL execution logs (migrations via run-sql action)
  • Job retry attempts and failures

Accessing CloudWatch Logs

AWS Console

  1. Navigate to CloudWatch → Log Groups
  2. Select /aws/lambda/alliance-risk-api or /aws/lambda/alliance-risk-worker
  3. Click Search all log streams
  4. Use filter patterns (see below)

AWS CLI

# Tail API Lambda logs (last 10 minutes)
aws logs tail /aws/lambda/alliance-risk-api --follow --since 10m

# Tail Worker Lambda logs with filter
aws logs tail /aws/lambda/alliance-risk-worker --follow \
  --filter-pattern "ERROR"

# Get logs for specific time range
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-api \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern '{ $.statusCode >= 500 }'

Log Filter Patterns

Error Detection

# All errors (NestJS Logger)
[level=error]

# HTTP 5xx errors (JSON filter syntax)
{ $.statusCode >= 500 }

# Specific error types
"PrismaClientKnownRequestError"
"ThrottlingException"
"CircuitOpenError"

Authentication Issues

# Cognito errors (matches either term)
?"NotAuthorizedException" ?"UserNotFoundException"

# Token verification failures
"JwtAuthGuard" "denied"

# Admin permission failures
"AdminGuard" "403"

Performance Monitoring

# Slow Bedrock calls (>10s, JSON filter syntax)
{ $.processingTime > 10000 }

# Database connection issues (matches either term)
?"Database connected" ?"Database disconnected"

# Lambda cold starts
"INIT_START"

Key Metrics to Monitor

Lambda Metrics (CloudWatch)

Metric                 Threshold                      Action
Invocations            Baseline                       Track request volume trends
Errors                 >1% of invocations             Investigate error logs
Duration               >25s (API), >14min (Worker)    Optimize or increase timeout
Throttles              >0                             Increase concurrency limit
ConcurrentExecutions   Near account limit             Request limit increase
IteratorAge            N/A (not using streams)        -

API Lambda Alerts

Metric: Errors
Threshold: Sum > 10 in 5 minutes
Action: SNS notification to operations team

Metric: Duration
Threshold: Average > 25000ms (approaching the 29s timeout)
Action: Investigate slow queries/Bedrock calls

Worker Lambda Alerts

Metric: Errors
Threshold: Sum > 5 in 15 minutes
Action: Check job queue, review Bedrock throttling

Metric: Throttles
Threshold: Sum > 0
Action: Increase reserved concurrency

Log Analysis Examples

Find Failed Jobs

aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-worker \
  --filter-pattern '"Job" "failed"' \
  --query 'events[*].[timestamp,message]' \
  --output text

Track Bedrock Usage

# Count Bedrock invocations by model
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-worker \
  --filter-pattern '"Invoking Bedrock model"' \
  --start-time $(date -d '1 day ago' +%s)000 | \
  jq -r '.events[].message' | \
  grep -oP 'Invoking Bedrock model: \K\S+' | \
  sort | uniq -c

Identify Prisma Connection Issues

aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-api \
  --filter-pattern '"Prisma error"' \
  --max-items 50

Structured Logging

The application uses NestJS Logger with structured output:

Log Levels

Level   Usage                                     CloudWatch Filter
error   Unhandled exceptions, critical failures   [level=error]
warn    HTTP 4xx, retryable failures              [level=warn]
log     HTTP requests, job lifecycle              [level=log]
debug   Detailed diagnostics (dev only)           [level=debug]

Sample Log Entries

Successful Request:
{
  "level": "log",
  "timestamp": "2026-03-04T14:23:45.123Z",
  "context": "RouterExplorer",
  "message": "GET /api/prompts/section/parser 200"
}
Prisma Error:
{
  "level": "error",
  "timestamp": "2026-03-04T14:23:45.123Z",
  "context": "PromptsService",
  "message": "Prisma error code=P2002 meta={\"target\":[\"slug\"]}"
}
Bedrock Invocation:
{
  "level": "log",
  "timestamp": "2026-03-04T14:23:45.123Z",
  "context": "BedrockService",
  "message": "Invoking Bedrock model: anthropic.claude-3-5-sonnet-20241022-v2:0"
}
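Entries in this shape can be produced by a small formatting helper. The application itself uses the NestJS Logger; the sketch below (`formatEntry` is an illustrative name, not the app's API) only shows how the fields fit together:

```typescript
// Sketch of a helper emitting the structured log format shown above.
// Illustrative only — the real application relies on the NestJS Logger.
type Level = 'error' | 'warn' | 'log' | 'debug';

export function formatEntry(
  level: Level,
  context: string,
  message: string,
  timestamp: Date = new Date(),
): string {
  // One JSON object per line keeps CloudWatch filter patterns simple.
  return JSON.stringify({
    level,
    timestamp: timestamp.toISOString(),
    context,
    message,
  });
}
```

Because each entry is a single JSON object, the `[level=error]` and `{ $.field ... }` filter patterns above can target individual fields.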

Setting Up CloudWatch Alarms

Create Error Rate Alarm (AWS CLI)

aws cloudwatch put-metric-alarm \
  --alarm-name alliance-risk-api-errors \
  --alarm-description "Alert when API Lambda error rate exceeds 1%" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --dimensions Name=FunctionName,Value=alliance-risk-api \
  --treat-missing-data notBreaching

Create Duration Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name alliance-risk-api-slow \
  --alarm-description "Alert when API Lambda approaches timeout" \
  --metric-name Duration \
  --namespace AWS/Lambda \
  --statistic Average \
  --period 300 \
  --threshold 25000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=FunctionName,Value=alliance-risk-api

Lambda Cold Starts

Identifying Cold Starts

Look for INIT_START messages in logs:
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-api \
  --filter-pattern "INIT_START" \
  --start-time $(date -d '1 hour ago' +%s)000

Cold Start Duration

Cold starts typically add 2-5 seconds to the first invocation:
  • NestJS module initialization (~1-2s)
  • Prisma client generation (~500ms-1s)
  • Database connection pool setup (~500ms)

Mitigation Strategies

  1. Provisioned Concurrency (costs $$$):
    aws lambda put-provisioned-concurrency-config \
      --function-name alliance-risk-api \
      --provisioned-concurrent-executions 2
    
  2. Keep-Warm Schedule (EventBridge):
    • Invoke Lambda every 5 minutes with warmup event
    • Filter in code: if (event.source === 'warmup') return
  3. Accept Cold Starts (recommended for MVP):
    • Cold starts are infrequent with moderate traffic
    • First request after inactivity will be slower
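The keep-warm filter in option 2 can be sketched as a guard at the top of the handler. The event shape (`{ source: 'warmup' }`) is whatever payload the EventBridge rule is configured to send; the field name and return values here are illustrative:

```typescript
// Sketch of a warmup short-circuit for the Lambda handler.
// Assumes the EventBridge keep-warm rule sends { "source": "warmup" };
// the event shape is an assumption, not the app's actual contract.
interface WarmupEvent {
  source?: string;
}

export function isWarmupEvent(event: WarmupEvent): boolean {
  return event.source === 'warmup';
}

export async function handler(event: WarmupEvent): Promise<unknown> {
  if (isWarmupEvent(event)) {
    // Return immediately so warm pings skip all real work.
    return { warmed: true };
  }
  // ...normal request handling would follow here...
  return { statusCode: 200 };
}
```

Returning before any database or Bedrock work keeps the warm ping essentially free while still keeping the execution environment alive.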

Database Monitoring

RDS CloudWatch Metrics

Metric                Threshold        Action
CPUUtilization        >80% sustained   Upgrade instance type
FreeableMemory        <100MB           Upgrade instance type
DatabaseConnections   >80% of max      Investigate connection leaks
ReadLatency           >100ms           Add indexes, optimize queries
WriteLatency          >100ms           Check disk I/O, upgrade storage

Connection Pool Monitoring

Prisma uses a connection pool per Lambda instance. Check logs for:
"Database connected"  → Successful pool creation
"Database disconnected" → Clean shutdown
Warning Signs:
  • Multiple "Database connected" messages per invocation (connection leak)
  • Lambda timeouts with no error (event loop not draining)
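A common cause of repeated "Database connected" lines is constructing the client inside the handler. A minimal sketch of the module-scope pattern that avoids this (`createClient` stands in for `new PrismaClient()` plus `$connect()` here; the counter exists only to make the behavior visible):

```typescript
// Sketch: create the client once at module scope so warm invocations
// reuse it instead of reconnecting. `createClient` is a stand-in for
// the real Prisma client construction.
let connectCount = 0;

function createClient(): { id: number } {
  connectCount++; // in the real app: new PrismaClient() + $connect()
  return { id: connectCount };
}

// Module scope: runs once per cold start, shared across warm invocations.
const client = createClient();

export function handler(): { clientId: number; connects: number } {
  // Reusing `client` here means warm invocations emit no new
  // "Database connected" log line.
  return { clientId: client.id, connects: connectCount };
}
```

If the connect call lived inside `handler`, every invocation would open a fresh pool, which is exactly the leak pattern the warning above describes.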

Bedrock Monitoring

Throttling Detection

Bedrock has model-specific rate limits. Check for:
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-worker \
  --filter-pattern "ThrottlingException"

Circuit Breaker Status

The app uses a circuit breaker for Bedrock calls:
  • CLOSED: Normal operation
  • OPEN: 3+ consecutive failures, all requests fail fast
  • HALF_OPEN: Testing recovery, 2 successes → CLOSED
Look for: "Circuit breaker is open" in logs
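The state machine above can be sketched as a small class. The 3-failure and 2-success thresholds come from the description; the cooldown/probe mechanics and all names are illustrative, not the application's actual implementation:

```typescript
// Minimal circuit-breaker state machine matching the states above:
// 3 consecutive failures open the circuit; a probe after cooldown moves
// it to HALF_OPEN; 2 successes there close it again. Names and the
// probe mechanism are assumptions for illustration.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

export class CircuitBreaker {
  private state: State = 'CLOSED';
  private failures = 0;
  private halfOpenSuccesses = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly successThreshold = 2,
  ) {}

  getState(): State {
    return this.state;
  }

  recordSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      if (++this.halfOpenSuccesses >= this.successThreshold) {
        this.state = 'CLOSED';
        this.failures = 0;
      }
    } else {
      this.failures = 0; // any success resets the failure streak
    }
  }

  recordFailure(): void {
    if (this.state === 'HALF_OPEN') {
      this.state = 'OPEN'; // a failed probe re-opens immediately
    } else if (++this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }

  // Called when the cooldown elapses and one probe request is allowed.
  allowProbe(): void {
    if (this.state === 'OPEN') {
      this.state = 'HALF_OPEN';
      this.halfOpenSuccesses = 0;
    }
  }
}
```

While OPEN, callers fail fast instead of waiting on Bedrock timeouts, which is what produces the "Circuit breaker is open" log line.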

Retry Behavior

Bedrock calls retry 3 times with exponential backoff:
  • Base delay: 200ms
  • Max delay: 5s
  • Retries: ThrottlingException, ServiceUnavailableException
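The schedule above can be sketched as a delay function. Only the 200ms base, 5s cap, and 3-retry budget come from this document; the doubling factor and function name are assumptions (and the real implementation may add jitter):

```typescript
// Sketch of exponential backoff with the documented base (200 ms) and
// cap (5 s). The doubling factor is an assumption for illustration.
export function backoffDelayMs(
  attempt: number, // 0-based retry attempt
  baseMs = 200,
  maxMs = 5000,
): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

Under these assumptions the three retries would wait roughly 200ms, 400ms, and 800ms, staying well under the 5s cap unless more attempts were allowed.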
