Overview

When a deployment causes issues in production, you may need to roll back to a previous working version. This guide covers rollback strategies and emergency procedures. The fastest way to roll back a deployment is Kubernetes’ built-in rollout functionality.

Roll Back a Deployment

Roll back to the previous revision:
kubectl rollout undo deployment/<deployment-name>
Roll back to a specific revision:
# View revision history
kubectl rollout history deployment/<deployment-name>

# Rollback to specific revision
kubectl rollout undo deployment/<deployment-name> --to-revision=<revision-number>
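When the history is long, the target revision number can be extracted mechanically. A minimal sketch, using a hypothetical sample of `kubectl rollout history` output (in practice, pipe the real command's output through the same filter):

```shell
# Hypothetical sample output from `kubectl rollout history deployment/<name>`
history='REVISION  CHANGE-CAUSE
1         initial deploy
2         add caching layer
3         bad release'

# Pick the second-to-last revision number: the one to roll back to
prev=$(printf '%s\n' "$history" | awk 'NR > 1 { revs[NR] = $1 } END { print revs[NR-1] }')
echo "rollback target: revision $prev"   # → rollback target: revision 2
```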

Check Rollback Status

Monitor the rollback progress:
# Watch rollout status
kubectl rollout status deployment/<deployment-name>

# View current pods
kubectl get pods -l app=<app-name>

# Check pod logs
kubectl logs <pod-name>
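The pod listing can also be checked programmatically rather than by eye. A sketch under the assumption of the sample output below (substitute real `kubectl get pods` output in practice):

```shell
# Hypothetical `kubectl get pods -l app=<app-name>` output
pods='NAME              READY  STATUS    RESTARTS
my-app-abc123     1/1    Running   0
my-app-def456     0/1    Pending   0'

# Count pods that are not yet Running (header line skipped)
not_ready=$(printf '%s\n' "$pods" | awk 'NR > 1 && $3 != "Running" { n++ } END { print n + 0 }')
echo "$not_ready pod(s) not Running"   # → 1 pod(s) not Running
```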

Rollback via Re-deployment

For a more controlled rollback, you can redeploy a previous version through the CI/CD pipeline.

Method 1: Redeploy Previous Commit

  1. Find the last working commit:
    git log --oneline
    
  2. Create a new commit that reverts changes:
    git revert <bad-commit-sha>
    git push origin master
    
    This triggers a new deployment with the reverted code.
  3. Monitor the deployment:
    • Watch GitHub Actions workflow
    • Check pod status in Kubernetes
    • Verify application functionality
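The revert step can be rehearsed safely before touching the real repository. A self-contained sketch in a throwaway repo (all file and commit names hypothetical) showing that `git revert` restores the previous content as a new commit:

```shell
set -eu
# Throwaway repo standing in for the real project
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name dev
git config user.email dev@example.com

echo "stable" > app.conf
git add app.conf
git commit -qm "working release"

echo "broken" > app.conf
git add app.conf
git commit -qm "bad release"

# Revert the bad commit; history keeps both commits plus the revert
git revert --no-edit HEAD >/dev/null
cat app.conf   # → stable
```

Because the revert is a new commit on top of history (not a rewrite), pushing it to master triggers the CI/CD pipeline normally.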

Method 2: Update Image Tag

If you know the previous working Docker image tag:
  1. Update the Kittyhawk config in k8s/main.ts:
    new DjangoApplication(this, 'django-asgi', {
      deployment: {
        image: 'pennlabs/my-app-backend',
        tag: 'previous-working-sha', // Specify old tag
        // ... other config
      },
    });
    
  2. Generate and apply manifests:
    cd k8s
    export GIT_SHA=previous-working-sha
    export RELEASE_NAME=my-app
    yarn build
    
    # Apply to cluster
    kubectl apply -f dist/
    
  3. Commit the change to document the rollback:
    git add k8s/main.ts
    git commit -m "Rollback to previous version due to <issue>"
    git push origin master
    

Emergency Procedures

Critical Service Down

If a critical service is completely down:
  1. Check pod status immediately:
    kubectl get pods -l app.kubernetes.io/part-of=<release-name>
    
  2. View recent events:
    kubectl get events --sort-by='.lastTimestamp' | head -20
    
  3. Check pod logs for errors:
    kubectl logs <pod-name> --previous  # Logs from crashed container
    kubectl logs <pod-name>              # Current logs
    
  4. Immediate rollback:
    kubectl rollout undo deployment/<deployment-name>
    
  5. Verify service recovery:
    # Check pods are running
    kubectl get pods -l app=<app-name>
    
    # Test the service
    curl https://<app-domain>.pennlabs.org/health
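The verification step can be wrapped in a retry loop so recovery is detected automatically. A hedged sketch: `check_recovered` is a hypothetical helper, and the real probe would be the `curl` command above rather than the `true` used here for illustration:

```shell
# Retry a probe command until it succeeds or attempts run out
check_recovered() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "service healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "service still failing after $attempts attempts" >&2
  return 1
}

# Real usage would be, e.g.:
#   check_recovered 30 curl -fsS https://<app-domain>.pennlabs.org/health
check_recovered 3 true   # → service healthy after 1 attempt(s)
```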
    

Database Migration Issues

If a database migration caused problems:
⚠️ WARNING: Rolling back database migrations is complex and risky.
  1. Roll back the application first to stop new traffic:
    kubectl rollout undo deployment/<deployment-name>
    
  2. Assess database state:
    • Check which migrations were applied
    • Determine if data was modified or just schema
    • Review migration code for reversibility
  3. For Django applications:
    # Connect to a pod
    kubectl exec -it <pod-name> -- bash
    
    # Check migration status
    python manage.py showmigrations
    
    # Rollback migration (if safe)
    python manage.py migrate <app_name> <previous_migration>
    
  4. If migration can’t be rolled back:
    • Restore from database backup (see Database Backups)
    • Write a new forward migration to fix the issue
    • Consult with the team before taking action
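Finding the right rollback target in `showmigrations` output can be scripted. A sketch using hypothetical output (pipe the real `python manage.py showmigrations <app_name>` output through the same filter; migration names are invented):

```shell
# Hypothetical `showmigrations` output: [X] = applied
migrations='[X] 0001_initial
[X] 0002_add_profile_field
[X] 0003_bad_migration'

# Migration immediately before the bad one: the target for `manage.py migrate <app_name> <name>`
target=$(printf '%s\n' "$migrations" | awk '$2 == "0003_bad_migration" { print prev; exit } { prev = $2 }')
echo "$target"   # → 0002_add_profile_field
```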

Bad Configuration

If a ConfigMap or Secret change caused issues:
  1. Identify the problematic config:
    kubectl get configmap <name> -o yaml
    kubectl get secret <name> -o yaml
    
  2. Restore previous version:
    # If you have the previous YAML
    kubectl apply -f previous-config.yaml
    
    # Or edit directly
    kubectl edit configmap <name>
    kubectl edit secret <name>
    
  3. Restart affected pods:
    kubectl rollout restart deployment/<deployment-name>
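A habit that makes the restore step trivial: snapshot the object before editing it. A sketch with local files standing in for the live object (against a real cluster, `kubectl get configmap <name> -o yaml > configmap-backup.yaml` produces the snapshot):

```shell
# Local files stand in for the live and edited ConfigMap
printf 'DEBUG: "false"\n' > configmap-backup.yaml   # snapshot taken before editing
printf 'DEBUG: "true"\n'  > configmap-live.yaml     # the change that broke things

# Show what changed; `diff` exits 1 when files differ, so don't abort on it
diff configmap-backup.yaml configmap-live.yaml || true

# Restoring is then just: kubectl apply -f configmap-backup.yaml
```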
    

Memory/CPU Issues

If pods are being OOMKilled or CPU-throttled:
  1. Check resource usage:
    kubectl top pods
    kubectl describe pod <pod-name> | grep -A5 "Limits\|Requests"
    
  2. Temporary fix - scale down:
    kubectl scale deployment/<deployment-name> --replicas=1
    
  3. Increase resource limits in Kittyhawk config:
    new DjangoApplication(this, 'django-asgi', {
      deployment: {
        // ... other config
        resources: {
          limits: {
            memory: '2Gi',  // Increased from 1Gi
            cpu: '1000m',   // Increased from 500m
          },
        },
      },
    });
    
  4. Redeploy with the new limits (commit and push to master, or regenerate and apply the manifests as in Method 2 above).
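When bumping limits under pressure, a small helper avoids arithmetic slips. A hypothetical sketch that doubles a memory limit written with a two-character unit suffix (`Mi` or `Gi`; CPU values like `500m` would need different handling):

```shell
# Double a memory limit like "1Gi" or "512Mi" (two-character unit suffix assumed)
double_limit() {
  value=${1%??}         # strip the unit: "512Mi" -> "512"
  unit=${1#"${1%??}"}   # keep the unit:  "512Mi" -> "Mi"
  echo "$((value * 2))${unit}"
}

double_limit 1Gi     # → 2Gi
double_limit 512Mi   # → 1024Mi
```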

Accessing the Cluster

To perform manual rollbacks, you need kubectl access to the cluster.

Using AWS IAM Role

The kubectl IAM role allows cluster access:
  1. Assume the kubectl role:
    aws sts assume-role \
      --role-arn arn:aws:iam::<AWS_ACCOUNT_ID>:role/kubectl \
      --role-session-name rollback-session
    
  2. Configure kubectl:
    aws eks --region us-east-1 update-kubeconfig \
      --name production \
      --role-arn arn:aws:iam::<AWS_ACCOUNT_ID>:role/kubectl
    
  3. Verify access:
    kubectl get nodes
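`assume-role` prints JSON credentials that must be exported before subsequent AWS calls can use them. A sketch parsing a hypothetical response (the values are invented, but the field names match the real STS response shape):

```shell
# Hypothetical response from `aws sts assume-role` (values invented)
creds='{"Credentials":{"AccessKeyId":"AKIAEXAMPLE","SecretAccessKey":"example-secret","SessionToken":"example-token"}}'

json_field() {
  printf '%s' "$creds" | python3 -c "import json, sys; print(json.load(sys.stdin)['Credentials']['$1'])"
}

export AWS_ACCESS_KEY_ID=$(json_field AccessKeyId)
export AWS_SECRET_ACCESS_KEY=$(json_field SecretAccessKey)
export AWS_SESSION_TOKEN=$(json_field SessionToken)
echo "$AWS_ACCESS_KEY_ID"   # → AKIAEXAMPLE
```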
    

Using Bastion Host

The bastion instance has kubectl pre-configured:
  1. SSH to bastion (requires SSH key)
  2. kubectl is available and configured
  3. Run rollback commands as needed

Via GitHub Actions

For automated rollbacks, you can trigger a rerun of a previous successful workflow:
  1. Go to GitHub Actions tab
  2. Find the last successful deployment
  3. Click “Re-run all jobs”
This redeploys the previous version.

Rollback Checklist

Before Rolling Back

  • Identify the issue and confirm rollback is necessary
  • Notify team members (Slack, etc.)
  • Document the issue and reason for rollback
  • Check if data was modified (database migrations, etc.)
  • Determine which version to roll back to

During Rollback

  • Execute rollback command
  • Monitor rollout status
  • Check pod logs for errors
  • Verify application functionality
  • Check monitoring dashboards (Grafana)
  • Test critical user flows

After Rollback

  • Confirm service is stable
  • Update incident documentation
  • Create issue to track the bug
  • Investigate root cause
  • Plan fix for the issue
  • Add tests to prevent regression
  • Communicate status to stakeholders

Prevention Strategies

Enable Gradual Rollouts

Consider using gradual rollout strategies:
new DjangoApplication(this, 'django-asgi', {
  deployment: {
    // ... other config
    strategy: {
      type: 'RollingUpdate',
      rollingUpdate: {
        maxSurge: 1,        // Max pods above desired count
        maxUnavailable: 0,  // Keep all pods available during update
      },
    },
  },
});
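Concretely, with these settings a 3-replica deployment never drops below full capacity during an update. A quick sanity check of the arithmetic:

```shell
replicas=3 maxSurge=1 maxUnavailable=0

echo "max pods during rollout: $((replicas + maxSurge))"            # → max pods during rollout: 4
echo "min available during rollout: $((replicas - maxUnavailable))" # → min available during rollout: 3
```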

Implement Health Checks

Ensure applications have proper health checks:
new DjangoApplication(this, 'django-asgi', {
  deployment: {
    // ... other config
    readinessProbe: {
      httpGet: {
        path: '/api/health/',
        port: 8000,
      },
      initialDelaySeconds: 10,
      periodSeconds: 5,
    },
    livenessProbe: {
      httpGet: {
        path: '/api/health/',
        port: 8000,
      },
      initialDelaySeconds: 30,
      periodSeconds: 10,
    },
  },
});
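These numbers determine how long a hung pod can linger before Kubernetes restarts it. With the default liveness `failureThreshold` of 3, a rough worst case for the probe settings above:

```shell
initialDelaySeconds=30 periodSeconds=10 failureThreshold=3   # failureThreshold: Kubernetes default

echo "worst-case seconds before restart: $((initialDelaySeconds + failureThreshold * periodSeconds))"
# → worst-case seconds before restart: 60
```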

Test Before Deploying

  • Run full test suite locally
  • Test database migrations on a copy of production data
  • Use staging environment if available
  • Review deployment diff carefully
  • Have another team member review changes

Monitor Deployments

  • Watch Grafana dashboards during deployment
  • Set up alerts for critical metrics
  • Check error rates in Datadog
  • Monitor application logs
  • Test critical paths after deployment

Common Rollback Scenarios

Scenario 1: New feature causes 500 errors

# Quick rollback
kubectl rollout undo deployment/my-app-django-asgi

# Monitor
kubectl rollout status deployment/my-app-django-asgi

# Verify
curl https://myapp.pennlabs.org/api/health/

Scenario 2: Database migration failed

# Rollback app first
kubectl rollout undo deployment/my-app-django-asgi

# Then fix migration
kubectl exec -it <pod-name> -- python manage.py migrate <app> <previous-migration>

# Or restore from backup if needed
# (See Database Backups documentation)

Scenario 3: Wrong environment variable

# Fix the secret
kubectl edit secret my-app

# Restart pods to pick up change
kubectl rollout restart deployment/my-app-django-asgi
