
Overview

The production EKS cluster requires periodic upgrades to maintain security, stability, and access to new Kubernetes features. This guide covers the upgrade process for both the EKS control plane and worker nodes.

Current Configuration

Production cluster configuration (/home/daytona/workspace/source/terraform/eks.tf:1-26):
module "eks-production" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "18.4.0"
  cluster_name    = local.k8s_cluster_name
  cluster_version = "1.23"
  subnet_ids      = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id
  eks_managed_node_groups = {
    spot = {
      desired_size = local.k8s_cluster_size
      max_size     = local.k8s_cluster_size
      min_size     = local.k8s_cluster_size

      create_launch_template = false
      launch_template_name   = ""
      disk_size              = 50
      instance_types         = ["r5d.xlarge"]
      capacity_type          = "SPOT"
    }
  }
  tags = {
    created-by = "terraform"
  }
}
Key details:
  • Current version: Kubernetes 1.23
  • Terraform module: terraform-aws-modules/eks/aws v18.4.0
  • Node type: Spot instances (r5d.xlarge)
  • Capacity: Fixed size (desired = max = min)

Upgrade Planning

Version Support Policy

AWS supports the latest three Kubernetes minor versions. Clusters must be upgraded at least once per year to stay within support.
EKS version support:
  • Active support: Latest 3 minor versions
  • Extended support: Additional 12 months (with fees)
  • End of support: No security patches or support

Pre-Upgrade Checklist

  1. Review Kubernetes changelog for breaking changes
  2. Check API deprecations - Some APIs are removed in new versions
  3. Test in staging environment if available
  4. Backup cluster state - Export critical resources
  5. Review addon compatibility - cert-manager, ingress controllers, etc.
  6. Schedule maintenance window - Coordinate with team
  7. Verify node capacity - Ensure enough nodes for workload during upgrade
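Checklist item 2 can be partly automated: the kube-apiserver exposes an `apiserver_requested_deprecated_apis` metric that records which deprecated group/version/resource combinations have actually been requested. A minimal sketch, assuming `kubectl` is configured against the cluster:

```shell
# Surface deprecated API usage recorded by the API server. Each matching
# metric line names a deprecated group/version that something still calls.
kubectl get --raw /metrics \
  | grep '^apiserver_requested_deprecated_apis' \
  || echo "no deprecated API usage recorded"
```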

Version Skipping

EKS does not support skipping minor versions. You must upgrade sequentially (1.23 → 1.24 → 1.25).
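A small guard in an upgrade script can enforce this. A sketch, assuming versions are passed as `MAJOR.MINOR` strings (the helper name is hypothetical):

```shell
# Hypothetical helper: allow only a one-minor-version bump (e.g. 1.23 -> 1.24).
is_sequential_upgrade() {
  local current_minor="${1#*.}" target_minor="${2#*.}"
  [ "$((target_minor - current_minor))" -eq 1 ]
}

is_sequential_upgrade "1.23" "1.24" && echo "ok: 1.23 -> 1.24 is sequential"
is_sequential_upgrade "1.23" "1.25" || echo "blocked: 1.23 -> 1.25 skips 1.24"
```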

Control Plane Upgrade

Procedure

The control plane upgrade is managed by AWS and happens automatically when you update the Terraform configuration.

1. Update Terraform Configuration

Edit terraform/eks.tf:
module "eks-production" {
  ...
  cluster_version = "1.24"  # Increment by one minor version
  ...
}

2. Review Terraform Plan

cd terraform
terraform plan -target=module.eks-production
Verify:
  • Only cluster version is changing
  • No node group recreations
  • No unexpected resource deletions

3. Apply Upgrade

terraform apply -target=module.eks-production
Expected behavior:
  • Control plane upgrade takes 20-45 minutes
  • API server may have brief interruptions
  • Running pods are not affected

4. Verify Control Plane

kubectl version --short
kubectl get nodes
The control plane should now report the new version; nodes will continue to report the old version until they are upgraded.
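With many nodes, counting nodes per kubelet version is quicker to read than scanning the full list (the VERSION column is field 5 of `kubectl get nodes`):

```shell
# Count nodes per kubelet version; after a control-plane-only upgrade these
# should all still show the previous minor (e.g. v1.23.x).
kubectl get nodes | awk 'NR>1 {print $5}' | sort | uniq -c
```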

Control Plane Upgrade Notes

  • Zero downtime: Multiple API servers upgraded one at a time
  • API compatibility: Old kubelet versions (N-2) still supported
  • Automatic rollback: AWS automatically rolls back on failure
  • No workload impact: Pods continue running during upgrade

Node Upgrade

Option 1: Rolling Upgrade with Temporary Capacity

The cluster uses fixed-size node groups (min = desired = max), so upgrading nodes without downtime requires a temporary capacity increase: new nodes come up on the new AMI before old ones are drained.

1. Increase Node Group Capacity

Edit terraform/eks.tf to temporarily increase capacity:
eks_managed_node_groups = {
  spot = {
    desired_size = local.k8s_cluster_size * 2  # Double capacity
    max_size     = local.k8s_cluster_size * 2
    min_size     = local.k8s_cluster_size
    ...
  }
}
Apply:
terraform apply -target=module.eks-production

2. Update Node AMI Version

The EKS module selects the AMI release that matches the cluster version, so once cluster_version is bumped, the next apply rolls the managed node group onto the new AMI. Optionally control how many nodes are replaced in parallel:
eks_managed_node_groups = {
  spot = {
    ...
    # Limits how many nodes may be unavailable at once during the update
    update_config = {
      max_unavailable_percentage = 50
    }
  }
}

3. Apply Node Upgrade

terraform apply -target=module.eks-production
AWS will:
  • Launch new nodes with updated AMI
  • Wait for new nodes to be Ready
  • Cordon old nodes
  • Drain pods from old nodes
  • Terminate old nodes

4. Monitor Node Replacement

# Watch node status
kubectl get nodes -w

# Check pod distribution
kubectl get pods -o wide --all-namespaces
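Rather than watching by eye, a small loop can block until every node reports the target kubelet minor. A sketch; TARGET is a placeholder for the version you are upgrading to:

```shell
# Block until every node's VERSION column starts with the target minor.
TARGET="v1.24"
while kubectl get nodes | awk 'NR>1 {print $5}' | grep -qv "^${TARGET}"; do
  echo "some nodes still on an older version; rechecking in 30s..."
  sleep 30
done
echo "all nodes report ${TARGET}"
```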

5. Restore Original Capacity

After all nodes are upgraded:
eks_managed_node_groups = {
  spot = {
    desired_size = local.k8s_cluster_size  # Restore original size
    max_size     = local.k8s_cluster_size
    min_size     = local.k8s_cluster_size
    ...
  }
}
terraform apply -target=module.eks-production

Option 2: Blue/Green Node Group Upgrade

Create an entirely new node group on the new version, shift workloads onto it, then remove the old group.

1. Create New Node Group

eks_managed_node_groups = {
  spot = { ... }  # Existing nodes
  
  spot-new = {
    desired_size = local.k8s_cluster_size
    max_size     = local.k8s_cluster_size
    min_size     = local.k8s_cluster_size
    disk_size    = 50
    instance_types = ["r5d.xlarge"]
    capacity_type  = "SPOT"
  }
}

2. Apply and Verify New Nodes

terraform apply -target=module.eks-production
kubectl get nodes

3. Drain Old Nodes

for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=spot -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
done

4. Remove Old Node Group

eks_managed_node_groups = {
  spot-new = { ... }  # Rename to 'spot' if desired
}
terraform apply -target=module.eks-production

Post-Upgrade Verification

1. Verify Cluster Health

# Check node status
kubectl get nodes

# Verify all nodes are Ready and on new version
kubectl get nodes -o wide

# Check system pods
kubectl get pods -n kube-system

2. Verify Workloads

# Check all pods are running
kubectl get pods --all-namespaces | grep -v Running

# Verify services are accessible
kubectl get svc --all-namespaces

3. Test Application Functionality

  • Run smoke tests on critical applications
  • Verify ingress and load balancers
  • Check certificate renewals
  • Test database connections

4. Update Addons

After cluster upgrade, update cluster addons:
# Check addon versions
aws eks describe-addon-versions --kubernetes-version 1.24

# Update kube-proxy
aws eks update-addon --cluster-name <cluster-name> --addon-name kube-proxy --addon-version <version>

# Update vpc-cni
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni --addon-version <version>

# Update coredns
aws eks update-addon --cluster-name <cluster-name> --addon-name coredns --addon-version <version>
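Addon updates are asynchronous, so scripts should wait for the addon's status to return to ACTIVE before moving on. A generic poll helper (a sketch; the helper name is hypothetical, the AWS commands are as above):

```shell
# Sketch: retry a status-printing command until it outputs ACTIVE.
# Usage, with placeholders as in the commands above:
#   wait_until_active aws eks describe-addon --cluster-name <cluster-name> \
#     --addon-name kube-proxy --query 'addon.status' --output text
wait_until_active() {
  local tries=0 status
  while [ "$tries" -lt 60 ]; do
    status=$("$@" 2>/dev/null)
    if [ "$status" = "ACTIVE" ]; then
      echo "ACTIVE"
      return 0
    fi
    sleep 10
    tries=$((tries + 1))
  done
  echo "timed out waiting for ACTIVE" >&2
  return 1
}
```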

Spot Instance Considerations

The cluster runs on Spot instances, which AWS can reclaim with a two-minute interruption notice. Account for this during upgrades.

Spot Interruption Handling

The cluster has AWS Node Termination Handler installed (/home/daytona/workspace/source/terraform/eks.tf:77-85):
resource "helm_release" "aws-node-termination-handler" {
  name       = "aws-node-termination-handler"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-node-termination-handler"
  version    = "0.13.0"
  namespace  = "kube-system"
  values     = [file("helm/aws-node-termination-handler.yaml")]
}
This handler:
  • Monitors for Spot interruption notices
  • Cordons nodes before termination
  • Drains pods gracefully (120s default)
  • Prevents new pods from scheduling on interrupted nodes

Best Practices for Spot

  1. Upgrade during low traffic - Reduces impact of spot interruptions
  2. Monitor spot availability - Check instance availability in region
  3. Use multiple instance types - Increase spot capacity pool
  4. Set pod disruption budgets - Ensure minimum replicas during disruptions
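Best practice 4 in concrete terms: a PodDisruptionBudget keeps a minimum number of replicas schedulable while nodes are drained. An illustrative example for a hypothetical app=api Deployment (name, namespace, and counts are placeholders for your workload):

```shell
# Illustrative PDB: keep at least 2 "app=api" pods available during drains
# and Spot interruptions.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
EOF
```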

Rollback Procedures

Control Plane Rollback

EKS does not support control plane downgrades. Rollback requires cluster recreation.
If control plane upgrade fails:
  1. AWS automatic rollback - Control plane auto-reverts on failure
  2. Manual intervention - Contact AWS Support if stuck

Node Rollback

If nodes fail after upgrade:

Quick Rollback (if old nodes still exist)

  1. Uncordon old nodes:
    kubectl uncordon <old-node-name>
    
  2. Drain new nodes:
    kubectl drain <new-node-name> --ignore-daemonsets --delete-emptydir-data
    
  3. Revert Terraform:
    git revert <commit-hash>
    terraform apply -target=module.eks-production
    

Troubleshooting

Nodes Not Joining Cluster

Check node logs:
aws ssm start-session --target <instance-id>
sudo journalctl -u kubelet -n 100
Common issues:
  • IAM role trust relationship incorrect
  • Security group blocking kubelet communication
  • aws-auth ConfigMap not updated

Pods Not Scheduling

Check node taints:
kubectl describe node <node-name> | grep Taints
Check pod events:
kubectl describe pod <pod-name>

API Server Unavailable

Check EKS cluster status:
aws eks describe-cluster --name <cluster-name> --query 'cluster.status'
If the status is “UPDATING”, wait for the upgrade to complete (up to 45 minutes). If it is “FAILED”, AWS rolls back automatically; contact AWS Support if needed.

Best Practices

  1. Stay current - Upgrade within 2 minor versions of latest
  2. Test thoroughly - Use staging environment for validation
  3. Upgrade incrementally - One minor version at a time
  4. Schedule wisely - Low traffic periods, avoid Fridays
  5. Monitor closely - Watch metrics during and after upgrade
  6. Document changes - Note any configuration changes required
  7. Communicate - Inform team of maintenance windows
  8. Backup first - Export critical resources before upgrade
  9. PodDisruptionBudgets - Ensure critical apps have PDBs
  10. Addon compatibility - Verify all addons support new version

Emergency Contacts

  • AWS Support: Use AWS Support Console for EKS issues
  • On-call engineer: Check internal documentation
  • Escalation: Follow incident response procedures
