Diagnose and resolve common issues in the CGIAR Risk Intelligence Tool deployment.

Lambda Issues

Lambda Returns HTTP 500 (CORS Error on Frontend)

Symptoms:
  • Frontend shows CORS error or “Network Error”
  • API Gateway returns 502 Bad Gateway
  • CloudWatch shows Lambda error
Diagnosis: Check CloudWatch logs:
aws logs tail /aws/lambda/alliance-risk-api --since 5m
Common Causes:

1. Missing Module After esbuild Bundle

Error:
Cannot find module '@prisma/adapter-pg'
Solution: The package wasn’t included in EXTERNALS in scripts/deploy-api.sh. Add it:
EXTERNALS=(
  "@prisma/client"
  ".prisma/client"
  "@prisma/adapter-pg"  # ← Already included
  "your-missing-package"  # ← Add here
)
Redeploy:
bash scripts/deploy-api.sh

2. ESM/CJS Interop Error

Error:
(0, express_1.default) is not a function
Solution: Ensure packages/api/tsconfig.json has:
{
  "compilerOptions": {
    "esModuleInterop": true
  }
}
Rebuild and redeploy:
pnpm --filter @alliance-risk/api build
bash scripts/deploy-api.sh

3. Missing Environment Variables

Error:
Cannot read property 'userPoolId' of undefined
Solution: Verify Lambda environment variables:
aws lambda get-function-configuration \
  --function-name alliance-risk-api \
  --query 'Environment.Variables'
Required variables:
  • COGNITO_USER_POOL_ID
  • COGNITO_CLIENT_ID
  • DATABASE_URL
  • S3_BUCKET_NAME
  • WORKER_FUNCTION_NAME
  • ENVIRONMENT
Update via CloudFormation or AWS Console → Lambda → Configuration → Environment variables.
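A missing variable can also be caught at cold start rather than at first use. The following is a minimal sketch, assuming the variable names in the list above; `findMissingEnvVars` and `assertRequiredEnv` are hypothetical helpers, not functions from the codebase.

```typescript
// Names taken from the "Required variables" list above.
const REQUIRED_ENV_VARS = [
  "COGNITO_USER_POOL_ID",
  "COGNITO_CLIENT_ID",
  "DATABASE_URL",
  "S3_BUCKET_NAME",
  "WORKER_FUNCTION_NAME",
  "ENVIRONMENT",
] as const;

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the names of required variables that are
// unset or blank, so all of them can be reported in one log line.
function findMissingEnvVars(env: Env): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Hypothetical helper: throws one descriptive error instead of a later
// "Cannot read property ... of undefined".
function assertRequiredEnv(env: Env): void {
  const missing = findMissingEnvVars(env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

In a Lambda handler module this would typically be called once with `process.env` before the application is bootstrapped.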

Lambda Timeout (29s for API, 15min for Worker)

Symptoms:
  • Request hangs then fails
  • CloudWatch shows Task timed out after 29.00 seconds
Diagnosis: Check duration before timeout:
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-api \
  --filter-pattern "Task timed out" \
  --max-items 10
Common Causes:

1. Event Loop Not Draining (Prisma Connections)

Symptom: Lambda hangs for full timeout even after handler completes.
Solution: Verify context.callbackWaitsForEmptyEventLoop = false in both Lambda handlers.
API Lambda (src/lambda.ts:29):
export const handler = async (event: any, context: any, callback: any) => {
  context.callbackWaitsForEmptyEventLoop = false; // ← Must be present
  // ...
}
Worker Lambda (src/worker.ts:26):
export const handler = async (event: WorkerEvent, context: any) => {
  context.callbackWaitsForEmptyEventLoop = false; // ← Must be present
  // ...
}

2. Slow Bedrock Calls

Symptom: Worker Lambda times out processing large documents.
Solution: Check Bedrock processing time:
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-worker \
  --filter-pattern '"processingTime"' | \
  jq -r '.events[].message' | grep processingTime
If processingTime > 600000 (10 minutes):
  1. Increase Worker timeout:
    aws lambda update-function-configuration \
      --function-name alliance-risk-worker \
      --timeout 900  # 15 minutes (the Lambda maximum)
    
  2. Split large documents into chunks (application logic change required).
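The chunking in step 2 could look roughly like the sketch below. It is an illustration only: `chunkText` is a hypothetical helper, it measures size in characters rather than model tokens, and it prefers to break at blank lines.

```typescript
// Hypothetical helper: splits text into chunks of at most maxChars characters,
// breaking at paragraph boundaries (blank lines) where possible.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }
  const chunks: string[] = [];
  let current = "";
  for (const paragraph of text.split(/\n\s*\n/)) {
    // A paragraph longer than the limit is hard-split into fixed-size pieces.
    const pieces: string[] = [];
    for (let i = 0; i < paragraph.length; i += maxChars) {
      pieces.push(paragraph.slice(i, i + maxChars));
    }
    for (const piece of pieces) {
      const candidate = current === "" ? piece : current + "\n\n" + piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits in the current chunk
      } else {
        chunks.push(current); // flush and start a new chunk
        current = piece;
      }
    }
  }
  if (current !== "") {
    chunks.push(current);
  }
  return chunks;
}
```

Each chunk would then be sent to Bedrock as its own request, which keeps any single call well below the Worker timeout.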

3. Database Query Slow

Symptom: API Lambda times out on specific endpoints.
Check RDS performance:
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=<your-db-id> \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average
Solutions:
  • Add database indexes
  • Optimize Prisma queries (avoid N+1)
  • Upgrade RDS instance type

Prisma Issues

Error: PrismaClientKnownRequestError with Empty Message

Symptoms:
  • CloudWatch shows PrismaClientKnownRequestError but message is empty
  • Error details missing in logs
Cause: esbuild bundles strip Prisma error messages.
Solution: Always log .code and .meta instead of .message:
catch (error) {
  if (error && typeof error === 'object' && 'code' in error) {
    logger.error(`Prisma error code=${(error as any).code} meta=${JSON.stringify((error as any).meta)}`);
  }
}
This pattern is already implemented in the codebase (see packages/api/CLAUDE.md:143).

Common Prisma Error Codes

Code | Meaning | Solution
P2002 | Unique constraint violation | Check for duplicate slug/email
P2025 | Record not found | Verify ID exists before update/delete
P2003 | Foreign key constraint failed | Ensure referenced record exists
P2024 | Connection timeout | Check RDS status, security groups
P1001 | Can’t reach database | Verify DATABASE_URL env var
P1017 | Server closed connection | RDS restarted, retry query
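Because only `.code` and `.meta` survive bundling, it can help to attach the table's hint to the log line. A minimal sketch, assuming the codes and hints listed above; `describePrismaError` is a hypothetical helper and does not depend on the Prisma client.

```typescript
// Hints copied from the error-code table above.
const PRISMA_ERROR_HINTS: Record<string, string> = {
  P2002: "Unique constraint violation: check for duplicate slug/email",
  P2025: "Record not found: verify ID exists before update/delete",
  P2003: "Foreign key constraint failed: ensure referenced record exists",
  P2024: "Connection timeout: check RDS status, security groups",
  P1001: "Can't reach database: verify DATABASE_URL env var",
  P1017: "Server closed connection: RDS restarted, retry query",
};

// Hypothetical helper: builds one log line from .code and .meta (never
// .message), or returns null when the value does not look like a Prisma error.
function describePrismaError(error: unknown): string | null {
  if (typeof error !== "object" || error === null || !("code" in error)) {
    return null;
  }
  const { code, meta } = error as { code: unknown; meta?: unknown };
  if (typeof code !== "string") {
    return null;
  }
  const known = Object.prototype.hasOwnProperty.call(PRISMA_ERROR_HINTS, code);
  const hint = known ? PRISMA_ERROR_HINTS[code] : "Unknown Prisma error code";
  return `Prisma error code=${code} meta=${JSON.stringify(meta)} hint=${hint}`;
}
```

Inside a `catch` block, the result can be passed to the logger when it is not null, and the original error rethrown otherwise.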

Database Connection Failures

Error:
Prisma error code=P1001 meta={"target":"localhost:5432"}
Cause: DATABASE_URL not set or incorrect.
Diagnosis:
aws lambda get-function-configuration \
  --function-name alliance-risk-api \
  --query 'Environment.Variables.DATABASE_URL'
Expected format:
postgresql://postgres:<password>@<rds-endpoint>:5432/alliance_risk
Solution:
  1. Get DB password from Secrets Manager:
    aws secretsmanager get-secret-value \
      --secret-id alliance-risk/db-credentials \
      --query SecretString --output text | jq -r .password
    
  2. Get RDS endpoint:
    aws rds describe-db-instances \
      --query 'DBInstances[0].Endpoint.Address' \
      --output text
    
  3. Update Lambda environment variable via CloudFormation or Console.
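Before updating the variable, the assembled URL can be checked against the expected format. The sketch below is an assumption-laden illustration: `parseDatabaseUrl` and `databaseUrlProblems` are hypothetical helpers, and the regular expression only covers the simple `user:password@host:port/database` shape shown above (a password containing `@` or `/` must be percent-encoded).

```typescript
interface ParsedDatabaseUrl {
  user: string;
  host: string;
  port: number;
  database: string;
}

// Hypothetical helper: parses postgresql://user:password@host:port/database.
// Returns null when the string does not match that shape.
function parseDatabaseUrl(url: string): ParsedDatabaseUrl | null {
  const pattern = /^postgres(?:ql)?:\/\/([^:@\/]+):([^@\/]*)@([^:\/@]+):(\d+)\/([^?\/]+)/;
  const match = pattern.exec(url);
  if (match === null) {
    return null;
  }
  return {
    user: match[1],
    host: match[3],
    port: Number(match[4]),
    database: match[5],
  };
}

// Hypothetical helper: lists likely reasons for a P1001 error.
function databaseUrlProblems(url: string | undefined): string[] {
  if (url === undefined || url === "") {
    return ["DATABASE_URL is not set"];
  }
  const parsed = parseDatabaseUrl(url);
  if (parsed === null) {
    return ["DATABASE_URL does not match postgresql://user:password@host:port/database"];
  }
  const problems: string[] = [];
  if (parsed.host === "localhost" || parsed.host === "127.0.0.1") {
    problems.push("host points at localhost, which is not reachable from Lambda");
  }
  return problems;
}
```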

Migration Failed via Worker Lambda

Error:
{"success":false,"executed":5,"error":"duplicate key value violates unique constraint"}
Cause: Migration already partially applied.
Solution:
  1. Check migration status:
    aws lambda invoke --function-name alliance-risk-worker \
      --payload '{"action":"run-sql","sql":"SELECT migration_name, finished_at FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5"}' \
      /tmp/result.json && cat /tmp/result.json
    
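  2. Manually mark failed migration as rolled back or completed (the Prisma CLI offers prisma migrate resolve --rolled-back <migration-name> and --applied <migration-name> for this, assuming the database is reachable from where the CLI runs).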
  3. Re-run:
    pnpm migrate:remote
    

Cognito Issues

Error: NotAuthorizedException

Symptoms:
  • Login fails with “Invalid credentials”
  • CloudWatch shows NotAuthorizedException
Common Causes:
Error Name | HTTP Status | Meaning
NotAuthorizedException | 401 | Wrong email/password
UserNotFoundException | 404 | User doesn’t exist
UserNotConfirmedException | 403 | Email not verified
CodeMismatchException | 400 | Invalid verification code
ExpiredCodeException | 400 | Verification code expired
LimitExceededException | 429 | Too many login attempts
See packages/api/src/common/exceptions/cognito.exception.ts for full mapping.
Solutions:
  1. Wrong Credentials:
    • User resets password via “Forgot Password” flow
    • Admin resets via:
      aws cognito-idp admin-set-user-password \
        --user-pool-id <pool-id> \
        --username <user-email> \
        --password NewPassword123 \
        --permanent
      
  2. User Not Confirmed:
    • Admin confirms user:
      aws cognito-idp admin-confirm-sign-up \
        --user-pool-id <pool-id> \
        --username <user-email>
      
  3. Rate Limited:
    • Wait 15-30 minutes
    • User attempts password reset
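The error table in this section amounts to a lookup from Cognito error name to HTTP status. The sketch below illustrates that idea only; it is not the contents of cognito.exception.ts, `cognitoErrorToStatus` is a hypothetical helper, and the 500 fallback for unlisted errors is an assumption.

```typescript
// Statuses copied from the error table in this section.
const COGNITO_ERROR_STATUS: Record<string, number> = {
  NotAuthorizedException: 401,
  UserNotFoundException: 404,
  UserNotConfirmedException: 403,
  CodeMismatchException: 400,
  ExpiredCodeException: 400,
  LimitExceededException: 429,
};

// Hypothetical helper: maps an error's `name` to an HTTP status.
// Unlisted errors fall back to 500 (an assumption, not documented behavior).
function cognitoErrorToStatus(error: unknown): number {
  if (typeof error !== "object" || error === null) {
    return 500;
  }
  const name = (error as { name?: unknown }).name;
  if (
    typeof name === "string" &&
    Object.prototype.hasOwnProperty.call(COGNITO_ERROR_STATUS, name)
  ) {
    return COGNITO_ERROR_STATUS[name];
  }
  return 500;
}
```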

Error: InvalidParameterException (Token Verification)

Symptoms:
  • All authenticated requests fail with 401
  • CloudWatch shows InvalidParameterException: Access Token does not contain openid scope
Cause: Cognito User Pool Client not configured for ID tokens.
Solution: Verify User Pool Client has openid scope:
aws cognito-idp describe-user-pool-client \
  --user-pool-id <pool-id> \
  --client-id <client-id> \
  --query 'UserPoolClient.AllowedOAuthScopes'
Update in CloudFormation template if missing.

Bedrock Issues

Error: ThrottlingException

Symptoms:
  • Worker Lambda logs show ThrottlingException
  • Jobs fail after 3 retries
Cause: Bedrock model rate limit exceeded.
Current Limits (us-east-1):
  • Claude 3.5 Sonnet: 400 requests/minute, 160,000 tokens/minute
Solutions:
  1. Request Quota Increase:
    • AWS Console → Service Quotas → Amazon Bedrock
    • Request increase for specific model
  2. Add Backoff: Already implemented with exponential backoff (200ms → 5s).
  3. Reduce Concurrency:
    aws lambda put-function-concurrency \
      --function-name alliance-risk-worker \
      --reserved-concurrent-executions 5
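For reference, a backoff schedule of the kind described in solution 2 can be sketched as follows. This is not the codebase's implementation: it assumes the delay doubles from 200 ms and is capped at 5 s, it omits jitter, and `backoffDelayMs`, `isThrottlingError`, and `withBackoff` are hypothetical names. The `sleep` function is injected so the sketch stays independent of any runtime timer API.

```typescript
// Delay before the retry that follows failed attempt `attempt` (0-based):
// 200, 400, 800, 1600, 3200, then capped at 5000 ms. Doubling is an assumption.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical check: treats errors named "ThrottlingException" as retryable.
function isThrottlingError(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { name?: unknown }).name === "ThrottlingException"
  );
}

// Runs `operation` up to maxAttempts times, sleeping between retryable failures.
async function withBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void>,
  isRetryable: (error: unknown) => boolean = isThrottlingError,
): Promise<T> {
  if (maxAttempts < 1) {
    throw new Error("maxAttempts must be at least 1");
  }
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const isLastAttempt = attempt === maxAttempts - 1;
      if (isLastAttempt || !isRetryable(error)) {
        throw error; // give up: out of attempts or not a throttling error
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With `maxAttempts` set to 3, a job that is throttled on every call fails after the third attempt, which matches the "Jobs fail after 3 retries" symptom above.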
    

Error: ValidationException: Malformed input request

Symptoms:
  • Bedrock call fails immediately
  • No retry attempts
Cause: Invalid prompt structure or missing required fields.
Diagnosis: Check CloudWatch for logged request body:
aws logs filter-log-events \
  --log-group-name /aws/lambda/alliance-risk-worker \
  --filter-pattern '"Invoking Bedrock model"' | \
  jq -r '.events[-1].message'
Common Issues:
  • Missing anthropic_version: bedrock-2023-05-31
  • Empty system or messages array
  • Invalid max_tokens (must be 1-4096)
Solution: Verify prompt templates in database:
SELECT section, system_prompt, user_prompt_template 
FROM "Prompt" 
WHERE is_active = true;
Ensure templates contain valid placeholders and content.
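The common issues listed above can also be checked before a request is sent. The sketch below assumes the request body uses the field names mentioned in that list and applies the limits exactly as stated there; `validateBedrockRequest` is a hypothetical helper, not part of the AWS SDK or the codebase.

```typescript
// Shape assumed from the field names in the "Common Issues" list above.
interface BedrockClaudeRequest {
  anthropic_version?: string;
  max_tokens?: number;
  system?: string;
  messages?: Array<{ role: string; content: unknown }>;
}

// Hypothetical helper: returns a list of problems; an empty list means the
// body passed these basic checks (it may still be rejected for other reasons).
function validateBedrockRequest(body: BedrockClaudeRequest): string[] {
  const problems: string[] = [];
  if (body.anthropic_version !== "bedrock-2023-05-31") {
    problems.push("anthropic_version must be bedrock-2023-05-31");
  }
  if (body.system === undefined || body.system.trim() === "") {
    problems.push("system prompt is empty");
  }
  if (body.messages === undefined || body.messages.length === 0) {
    problems.push("messages array is empty");
  }
  const maxTokens = body.max_tokens;
  const isWholeNumber = typeof maxTokens === "number" && Math.floor(maxTokens) === maxTokens;
  // Range taken from the list above (1-4096).
  if (!isWholeNumber || (maxTokens as number) < 1 || (maxTokens as number) > 4096) {
    problems.push("max_tokens must be an integer between 1 and 4096");
  }
  return problems;
}
```

Logging the returned problems next to the job ID makes a malformed prompt template easier to trace than the generic ValidationException.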

Circuit Breaker Open

Error:
CircuitOpenError: Circuit breaker is open
Cause: 3+ consecutive Bedrock failures triggered circuit breaker.
Solution: Circuit breaker auto-resets after 60 seconds. Check Bedrock status:
# Check AWS Health Dashboard
aws health describe-events \
  --filter services=BEDROCK \
  --query 'events[?eventTypeCategory==`issue`]'
If Bedrock is healthy, investigate underlying cause (throttling, invalid requests).
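To make the behavior above concrete, here is a simplified circuit breaker sketch using the numbers from this section (3 consecutive failures, 60-second reset). It is an illustration rather than the codebase's implementation: the class and method names are hypothetical, there is no half-open state, and the clock is injected so the logic can be exercised without waiting.

```typescript
class CircuitOpenError extends Error {
  constructor() {
    super("Circuit breaker is open");
    this.name = "CircuitOpenError";
  }
}

// Simplified sketch: opens after `failureThreshold` consecutive failures and
// closes again once `resetAfterMs` has elapsed.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold = 3,
    private readonly resetAfterMs = 60000,
    private readonly now: () => number = Date.now,
  ) {}

  // Call before each Bedrock request; throws while the circuit is open.
  beforeCall(): void {
    if (this.openedAt === null) {
      return;
    }
    if (this.now() - this.openedAt >= this.resetAfterMs) {
      // Reset window elapsed: close the circuit and allow the call.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return;
    }
    throw new CircuitOpenError();
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

The sketch shows why repeated throttling or validation errors surface as CircuitOpenError: any three failures in a row open the circuit, regardless of their cause.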

Job Processing Issues

Job Stuck in PENDING Status

Symptoms:
  • Job created but never processed
  • Polling shows status: PENDING indefinitely
Diagnosis:
  1. Check if Worker Lambda was invoked:
    aws logs tail /aws/lambda/alliance-risk-worker --since 10m | grep "Processing job"
    
  2. Verify WORKER_FUNCTION_NAME env var on API Lambda:
    aws lambda get-function-configuration \
      --function-name alliance-risk-api \
      --query 'Environment.Variables.WORKER_FUNCTION_NAME'
    
Solutions:
  • Missing env var: Update CloudFormation template
  • IAM permission denied: Add lambda:InvokeFunction to API Lambda role
  • Worker Lambda disabled: Check Lambda console
Manual retry:
JOB_ID="<job-id-from-database>"
aws lambda invoke --function-name alliance-risk-worker \
  --invocation-type Event \
  --payload "{\"jobId\":\"$JOB_ID\"}" \
  /tmp/response.json

Job Fails After Max Attempts

Symptoms:
  • Job status transitions PENDING → PROCESSING → PENDING → FAILED
  • attempts field equals maxAttempts (default 3)
Diagnosis: Check job error message:
SELECT id, type, status, attempts, error 
FROM "Job" 
WHERE status = 'FAILED' 
ORDER BY created_at DESC 
LIMIT 10;
Common Errors:
  • Bedrock throttling (see Bedrock section)
  • Document parsing timeout (increase Worker timeout)
  • Invalid job input (check input field)
Solution: Fix underlying issue, then reset job:
UPDATE "Job" 
SET status = 'PENDING', attempts = 0, error = NULL 
WHERE id = '<job-id>';
Manually invoke Worker Lambda (see above).
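The status transitions described in this section can be modeled as a small state machine. The sketch below is an assumption about how attempts are counted (incremented when processing starts), not the Worker's actual code; the success path is omitted, and `startAttempt`, `failAttempt`, and `resetJob` are hypothetical names.

```typescript
type JobStatus = "PENDING" | "PROCESSING" | "FAILED";

interface JobState {
  status: JobStatus;
  attempts: number;
  maxAttempts: number; // default 3, per the symptoms above
  error: string | null;
}

// PENDING → PROCESSING; assumes the attempt counter is incremented here.
function startAttempt(job: JobState): JobState {
  return { ...job, status: "PROCESSING", attempts: job.attempts + 1 };
}

// PROCESSING → PENDING while attempts remain, otherwise → FAILED.
function failAttempt(job: JobState, error: string): JobState {
  const exhausted = job.attempts >= job.maxAttempts;
  return { ...job, status: exhausted ? "FAILED" : "PENDING", error };
}

// Mirrors the UPDATE statement above: back to PENDING with a clean slate.
function resetJob(job: JobState): JobState {
  return { ...job, status: "PENDING", attempts: 0, error: null };
}
```

Under these assumptions a job with maxAttempts of 3 returns to PENDING twice and becomes FAILED on the third failure, which is the pattern listed under Symptoms.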

Deployment Issues

deploy-api.sh Fails: “Could not describe stack”

Error:
✗ Could not describe stack 'AllianceRiskStack'
Cause: CloudFormation stack doesn’t exist or wrong region.
Solution:
  1. Verify stack exists:
    aws cloudformation describe-stacks --stack-name AllianceRiskStack
    
  2. If missing, deploy infrastructure first:
    pnpm --filter @alliance-risk/infra cfn:deploy dev
    
  3. Check AWS region:
    aws configure get region  # Should match stack region
    

Lambda Update Fails: Package Too Large

Error:
Unzipped size must be smaller than 262144000 bytes
Cause: Lambda deployment package exceeds 250MB unzipped limit.
Solution: Review EXTERNALS list in scripts/deploy-api.sh. Possible causes:
  • Dev dependencies accidentally copied
  • Large files in node_modules (docs, tests)
  • Multiple versions of same package
Check bundle size:
du -sh /tmp/tmp.*/node_modules
Expected: ~15-20MB

CloudFront Serves Stale Content After Deployment

Symptoms:
  • Web deployment succeeds
  • Frontend shows old code
Cause: CloudFront cache not invalidated.
Solution:
  1. Check if invalidation was created:
    aws cloudfront list-invalidations \
      --distribution-id <distribution-id> \
      --query 'InvalidationList.Items[0]'
    
  2. Manually invalidate:
    DISTRIBUTION_ID=$(aws cloudformation describe-stacks \
      --stack-name AllianceRiskStack \
      --query 'Stacks[0].Outputs[?OutputKey==`CloudFrontDistributionId`].OutputValue' \
      --output text)
    
    aws cloudfront create-invalidation \
      --distribution-id $DISTRIBUTION_ID \
      --paths "/*"
    
  3. Wait 2-5 minutes for propagation.

Diagnostic Checklist

When investigating an issue:
  • Check CloudWatch logs for both API and Worker Lambdas
  • Verify all environment variables are set correctly
  • Confirm RDS is running and accessible
  • Check Cognito User Pool and Client configuration
  • Review recent deployments (code changes, infrastructure updates)
  • Check AWS Health Dashboard for service issues
  • Verify IAM permissions for Lambda execution roles
  • Check S3 bucket permissions and CORS configuration

Getting Help

If issues persist:
  1. Gather diagnostics:
    # Export last 1 hour of logs
    aws logs filter-log-events \
      --log-group-name /aws/lambda/alliance-risk-api \
      --start-time $(date -d '1 hour ago' +%s)000 \
      > api-logs.json
    
  2. Document:
    • Steps to reproduce
    • Error messages from CloudWatch
    • Recent code/infrastructure changes
    • Environment (dev/staging/production)
  3. Contact the development team with collected information.
