Overview
When a deployment causes issues in production, you may need to roll back to a previous working version. This guide covers different rollback strategies and emergency procedures.

Quick Rollback (Recommended)
The fastest way to roll back a deployment is to use Kubernetes' built-in rollout functionality.

Rollback a Deployment
Roll back to the previous revision with kubectl rollout undo.

Check Rollback Status
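Both the rollback and the status check can be sketched as follows; the Deployment name `web` and the namespace are placeholders for your own resources:

```shell
# Roll back to the previous revision of the Deployment
kubectl rollout undo deployment/web

# Or inspect the revision history and roll back to a specific revision
kubectl rollout history deployment/web
kubectl rollout undo deployment/web --to-revision=2

# Monitor the rollback until the rollout completes
kubectl rollout status deployment/web
```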
Monitor the rollback progress with kubectl rollout status until the rollout completes.

Rollback via Re-deployment
For a more controlled rollback, you can redeploy a previous version through the CI/CD pipeline.

Method 1: Redeploy Previous Commit
- Find the last working commit:
- Create a new commit that reverts the changes:
  This triggers a new deployment with the reverted code.
- Monitor the deployment:
  - Watch the GitHub Actions workflow
  - Check pod status in Kubernetes
  - Verify application functionality
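The git steps above can be sketched as follows; the deploy branch `main` is an assumption, and `<bad-commit-sha>` is a placeholder for the SHA you identify:

```shell
# Find the last working commit by inspecting recent history
git log --oneline -10

# Create a revert commit for the bad change without rewriting history
git revert <bad-commit-sha>

# Push to the deploy branch to trigger the CI/CD pipeline
git push origin main
```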
Method 2: Update Image Tag
If you know the previous working Docker image tag:

- Update the Kittyhawk config in k8s/main.ts:
- Generate and apply manifests:
- Commit the change to document the rollback:
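A sketch of the flow after editing the image tag; the build script name and manifest output directory are assumptions about the Kittyhawk setup, so check k8s/package.json for the actual commands:

```shell
# Regenerate the Kubernetes manifests from the Kittyhawk config
# (script name is an assumption; check k8s/package.json)
cd k8s && yarn build

# Apply the regenerated manifests (output path is an assumption)
kubectl apply -f dist/

# Commit the tag change so the rollback is recorded in history
git add k8s/main.ts
git commit -m "Roll back image tag to previous working version"
```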
Emergency Procedures
Critical Service Down
If a critical service is completely down:

- Check pod status immediately:
- View recent events:
- Check pod logs for errors:
- Immediate rollback:
- Verify service recovery:
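The five steps above can be sketched as kubectl commands; the namespace, pod, and deployment names are placeholders:

```shell
# 1. Check pod status immediately
kubectl get pods -n <namespace>

# 2. View recent events, newest last
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp

# 3. Check logs for errors (--previous shows the prior container's logs
#    if the pod is crash-looping)
kubectl logs <pod-name> -n <namespace> --previous

# 4. Immediate rollback
kubectl rollout undo deployment/<deployment-name> -n <namespace>

# 5. Verify service recovery
kubectl rollout status deployment/<deployment-name> -n <namespace>
kubectl get pods -n <namespace>
```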
Database Migration Issues
If a database migration caused problems:

⚠️ WARNING: Rolling back database migrations is complex and risky.

- Roll back the application first to stop new traffic:
- Assess the database state:
  - Check which migrations were applied
  - Determine if data was modified or just the schema
  - Review the migration code for reversibility
- For Django applications:
- If the migration can't be rolled back:
  - Restore from a database backup (see Database Backups)
  - Write a new forward migration to fix the issue
  - Consult with the team before taking action
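For the Django step above, a sketch of reversing a migration; the app and migration names are placeholders, and this only works if the migration's operations are reversible:

```shell
# See which migrations have been applied
python manage.py showmigrations <app_name>

# Migrate back to the last good migration; Django runs the reverse
# operations for every migration applied after it
python manage.py migrate <app_name> <previous_migration_name>
```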
Bad Configuration
If a ConfigMap or Secret change caused issues:

- Identify the problematic config:
- Restore the previous version:
- Restart affected pods:
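The three steps can be sketched as follows; resource names are placeholders, and this assumes the config manifest is tracked in git:

```shell
# 1. Identify the problematic config
kubectl get configmap <name> -o yaml
kubectl describe configmap <name>

# 2. Restore the previous version by reverting the change and re-applying
git revert <config-change-sha>
kubectl apply -f <path/to/configmap.yaml>

# 3. Restart affected pods so they pick up the restored config
kubectl rollout restart deployment/<deployment-name>
```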
Memory/CPU Issues
If pods are being OOMKilled or CPU throttled:

- Check resource usage:
- Temporary fix: scale down:
- Increase resource limits in the Kittyhawk config:
- Redeploy with the new limits
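The diagnostic and mitigation commands can be sketched as follows; `kubectl top` requires metrics-server, and names and replica counts are placeholders:

```shell
# Check per-pod CPU and memory usage
kubectl top pods

# Look for OOMKilled containers in the pod's status
kubectl describe pod <pod-name>

# Temporary fix: scale down to relieve resource pressure
kubectl scale deployment/<deployment-name> --replicas=1
```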
Accessing the Cluster
To perform manual rollbacks, you need kubectl access to the cluster.

Using AWS IAM Role
The kubectl IAM role allows cluster access:

- Assume the kubectl role:
- Configure kubectl:
- Verify access:
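A sketch of the access flow; the role ARN, account ID, cluster name, and region are placeholders for your environment:

```shell
# 1. Assume the kubectl IAM role
aws sts assume-role \
  --role-arn arn:aws:iam::<account-id>:role/kubectl \
  --role-session-name rollback
# Export the returned AccessKeyId, SecretAccessKey, and SessionToken as
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.

# 2. Configure kubectl for the cluster
aws eks update-kubeconfig --name <cluster-name> --region <region>

# 3. Verify access
kubectl get nodes
```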
Using Bastion Host
The bastion instance has kubectl pre-configured:

- SSH to the bastion (requires an SSH key)
- kubectl is available and configured
- Run rollback commands as needed
Via GitHub Actions
For automated rollbacks, you can trigger a rerun of a previous successful workflow:

- Go to the GitHub Actions tab
- Find the last successful deployment
- Click "Re-run all jobs"
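If the GitHub CLI is installed, the same rerun can be triggered from a terminal; the workflow file name is an assumption, and the run ID is a placeholder:

```shell
# List recent runs of the deploy workflow (workflow name is an assumption)
gh run list --workflow=deploy.yml

# Re-run a previous successful run by its ID
gh run rerun <run-id>
```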
Rollback Checklist
Before Rolling Back
- Identify the issue and confirm rollback is necessary
- Notify team members (Slack, etc.)
- Document the issue and reason for rollback
- Check if data was modified (database migrations, etc.)
- Determine which version to rollback to
During Rollback
- Execute rollback command
- Monitor rollout status
- Check pod logs for errors
- Verify application functionality
- Check monitoring dashboards (Grafana)
- Test critical user flows
After Rollback
- Confirm service is stable
- Update incident documentation
- Create issue to track the bug
- Investigate root cause
- Plan fix for the issue
- Add tests to prevent regression
- Communicate status to stakeholders
Prevention Strategies
Enable Gradual Rollouts
Consider using gradual rollout strategies, such as canary or blue-green deployments, so problems surface before a release reaches all traffic.

Implement Health Checks
Ensure applications have proper health checks (liveness and readiness probes) so Kubernetes can detect and replace unhealthy pods.

Test Before Deploying
- Run full test suite locally
- Test database migrations on a copy of production data
- Use staging environment if available
- Review deployment diff carefully
- Have another team member review changes
Monitor Deployments
- Watch Grafana dashboards during deployment
- Set up alerts for critical metrics
- Check error rates in Datadog
- Monitor application logs
- Test critical paths after deployment