Mastering Amazon EKS Upgrades: The Ultimate Senior-Level Guide

A step-by-step blueprint to upgrading EKS clusters with zero downtime, full automation, and enterprise-grade best practices.

Ismail Kovvuru
June 15, 2025

Upgrading Amazon EKS involves pre-planning, automation, validation, high availability, zero-downtime strategies, rollback handling, and continuous improvement. For DevOps engineers, this post provides a structured, enterprise-grade approach in three phases: Pre-Upgrade Planning, High Availability & Execution, and Post-Upgrade Validation & Automation.

1. What is an EKS Upgrade?

Amazon Elastic Kubernetes Service (EKS) is a managed service that runs Kubernetes clusters on AWS. An EKS upgrade is the process of updating the Kubernetes control plane and worker nodes to a newer Kubernetes version.

📌 Includes:

- Control Plane Upgrade (managed by AWS)
- Managed Node Groups Upgrade (your responsibility)
- Add-ons Upgrade: CoreDNS, kube-proxy, CNI, etc.
- Compatibility Validation: APIs, CRDs, Helm charts

2. Why Are EKS Upgrades Critical?

Reason	Impact
Security Fixes	Patches critical vulnerabilities
Performance Gains	Leverages improvements and optimizations
Deprecated APIs	Prevents application failures
Support Lifecycle	Each EKS version gets about 14 months of standard support, then paid extended support
Add-on Compatibility	Ensures VPC CNI, CoreDNS work properly

Failing to upgrade can result in service downtime, security exposure, or a cluster that falls out of support.

3. When to Perform an Upgrade

Every 3-6 months or when:
- AWS announces EKS deprecation
- Kubernetes releases a new minor version
- Helm/CRDs show compatibility issues

Example lifecycle (illustrative version numbers; check the EKS release calendar for the current state):

v1.27 → GA (now)
v1.26 → Supported
v1.25 → Supported
v1.24 → Deprecated → Must upgrade

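The table below maps each EKS version to its Kubernetes version and compatible add-on versions. Treat the add-on columns as illustrative and confirm exact versions against the EKS documentation.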

EKS Version	K8s Version	CoreDNS	kube-proxy	VPC CNI
1.27	1.27.x	1.10.x	v1.27.x	v1.12.x
1.28	1.28.x	1.11.x	v1.28.x	v1.13.x

A mapping like this helps engineers quickly check whether a cluster's add-ons are compatible before upgrading.
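One way to confirm add-on versions for a specific Kubernetes version is to ask EKS directly. The sketch below wraps `aws eks describe-addon-versions`; the function name `list_addon_versions` is illustrative, and it assumes the AWS CLI is installed and authenticated.

```shell
# List the add-on versions EKS publishes for a given Kubernetes version.
# Usage: list_addon_versions <addon-name> <kubernetes-version>
# (Function name is hypothetical; the aws subcommand and flags are real.)
list_addon_versions() {
  aws eks describe-addon-versions \
    --addon-name "$1" \
    --kubernetes-version "$2" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text
}
```

For example, `list_addon_versions coredns 1.28` would print the CoreDNS builds that EKS offers for 1.28.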

4. Upgrade Overview & Architecture

📍 Control Plane: AWS-managed (upgrade manually triggered)
📍 Node Groups: You manage, must recreate or upgrade
📍 Workloads: Run tests in staging before prod
📍 Add-ons: Upgrade to latest compatible version

⚠️ Important: Always upgrade Control Plane → Node Groups → Add-ons → Workloads
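The ordering also interacts with version hops: an EKS control plane moves up one minor version per upgrade, so a cluster that is several versions behind needs the full sequence repeated for each hop. A minimal sketch for planning those hops, assuming plain `major.minor` version strings (the function name `upgrade_path` is illustrative):

```shell
# Print each minor version to step through, one per line, going from the
# current version to the target. Assumes "major.minor" input such as 1.25.
upgrade_path() {
  major=${1%%.*}            # "1.25" -> "1"
  current_minor=${1#*.}     # "1.25" -> "25"
  target_minor=${2#*.}      # "1.28" -> "28"
  next=$((current_minor + 1))
  while [ "$next" -le "$target_minor" ]; do
    echo "${major}.${next}"
    next=$((next + 1))
  done
}

# Plan the hops from 1.25 to 1.28: prints 1.26, 1.27, 1.28 on separate lines.
upgrade_path 1.25 1.28
```

Each printed version is one full pass through Control Plane → Node Groups → Add-ons → Workloads.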

5. Pre-Upgrade Planning, Backup & Validation

🧠 Why It’s Crucial

To ensure data integrity, rollback capability, and compatibility.

✅ Key Steps:

🔹 1. Backup Everything

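# Back up Kubernetes API resources with Velero (EKS does not expose etcd directly)
# Recent Velero releases also need --plugins velero/velero-plugin-for-aws:<tag> and credentials (--secret-file or --no-secret)
velero install --provider aws --bucket eks-backups --backup-location-config region=us-east-1
velero backup create pre-upgrade-backup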

# Snapshot EBS volumes (optional for PVs)
aws ec2 create-snapshot --volume-id vol-xyz

Encrypt the S3 bucket and enable cross-region replication.
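Before moving on, it helps to confirm that the Velero backup actually finished. The sketch below reads the backup's phase from its custom resource; it assumes Velero runs in its default `velero` namespace, and the function name `backup_completed` is illustrative.

```shell
# Succeed only when the named Velero backup reports phase "Completed".
# Assumes Velero is installed in the "velero" namespace (its default).
backup_completed() {
  phase=$(kubectl -n velero get backups.velero.io "$1" \
    -o jsonpath='{.status.phase}')
  [ "$phase" = "Completed" ]
}
```

A pipeline could gate the upgrade with `backup_completed pre-upgrade-backup || exit 1`, so that a `PartiallyFailed` or `Failed` backup stops the run.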

🔹 2. Review Release Notes

# EKS & Kubernetes version notes
https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html

🔹 3. Scan for Deprecated APIs

# kubent: Detect removed/deprecated APIs
kubent

# Pluto: Scan Helm releases
pluto detect-helm -o markdown
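Both scanners can also act as a hard gate rather than a report. The sketch below assumes `kubent` and `pluto` are on the PATH and relies on their non-zero exit codes when findings exist (`--exit-error` for kubent; pluto exits non-zero on deprecated or removed APIs). The wrapper name `scan_deprecated_apis` is illustrative.

```shell
# Fail if either scanner finds APIs deprecated or removed in the target version.
# Usage: scan_deprecated_apis <target-version>, for example 1.27.0
scan_deprecated_apis() {
  # kubent scans live cluster objects; --exit-error returns non-zero on findings
  kubent --target-version "$1" --exit-error || return 1
  # pluto scans deployed Helm releases against the same target version
  pluto detect-helm --target-versions "k8s=v$1" -o wide || return 1
}
```

Running `scan_deprecated_apis 1.27.0` as an early pipeline step stops the upgrade before any cluster change is made.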

🔹 4. Test in a Staging Cluster

eksctl create cluster --name staging-cluster --version 1.27 --region us-east-1 \
--nodegroup-name staging-nodes --nodes 2

Deploy current workloads, test integrations, validate metrics, simulate DR.

🔹 5. Define a Change Plan

- Stakeholder communication
- Rollback procedures
- Maintenance window

6. High Availability & Zero Downtime Upgrade

🧠 Why It’s Crucial

To prevent outages, data loss, and SLA violations.

✅ Step-by-Step

🔸 1. Ensure HA Architecture

- Multi-AZ node groups
- Multi-AZ ALB/NLB

🔸 2. Harden Workloads

# Set PodDisruptionBudgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Add pod anti-affinity rules and readiness/liveness probes.
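A PodDisruptionBudget with `minAvailable` only lets a node drain evict pods while more than that many replicas are healthy. If the Deployment runs exactly `minAvailable` replicas, the drain blocks. A small arithmetic sketch of that rule (the function name `allowed_disruptions` is illustrative):

```shell
# Number of pods a drain may evict at once under a minAvailable PDB.
# Usage: allowed_disruptions <healthy-replicas> <min-available>
allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # prints 1: one pod can be evicted at a time
allowed_disruptions 2 2   # prints 0: the drain will wait indefinitely
```

The live value appears in the ALLOWED DISRUPTIONS column of `kubectl get pdb`, which is worth checking before the node group upgrade starts.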

🔸 3. Upgrade Control Plane

eksctl upgrade cluster --name prod-cluster --version 1.27 --approve
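After the control plane upgrade is triggered, a check like the following can confirm it finished before node groups are touched. This is a sketch that assumes an authenticated AWS CLI; `verify_control_plane` is an illustrative name.

```shell
# Block until the cluster is ACTIVE again, then compare the reported version.
# Usage: verify_control_plane <cluster-name> <expected-version>
verify_control_plane() {
  aws eks wait cluster-active --name "$1" || return 1
  actual=$(aws eks describe-cluster --name "$1" \
    --query 'cluster.version' --output text)
  [ "$actual" = "$2" ]
}
```

For the command above, that would be `verify_control_plane prod-cluster 1.27`.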

🔸 4. Upgrade Node Groups

eksctl upgrade nodegroup --name ng-app --cluster prod-cluster --kubernetes-version 1.27

Use rolling updates with maxUnavailable=1 so that only one node is replaced at a time.
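To confirm the rollout reached every node, the kubelet version each node reports can be compared with the target. The sketch below splits this into a fetch step and a pure filter; both function names are illustrative, and it assumes `kubectl` points at the upgraded cluster.

```shell
# Print "name<TAB>kubeletVersion" for every node in the cluster.
list_node_versions() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
}

# Read those lines on stdin and print the nodes whose kubelet version does
# not start with the expected minor version (for example 1.27).
stale_nodes() {
  awk -v want="v$1." 'index($2, want) != 1 { print $1 }'
}
```

`list_node_versions | stale_nodes 1.27` prints nothing once every node runs a 1.27 kubelet; any name it does print is a node still on the old version.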

🔸 5. Upgrade Add-ons

# CoreDNS
eksctl update addon --name coredns --cluster prod-cluster --force

# VPC CNI
eksctl update addon --name vpc-cni --cluster prod-cluster --force

# kube-proxy
eksctl update addon --name kube-proxy --cluster prod-cluster --force
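Add-on updates run asynchronously, so it is worth confirming that each one settled. A minimal sketch using `aws eks describe-addon`, assuming the add-ons are EKS-managed; the function names are illustrative.

```shell
# Succeed when the given EKS add-on reports status ACTIVE.
# Usage: addon_active <cluster-name> <addon-name>
addon_active() {
  status=$(aws eks describe-addon --cluster-name "$1" --addon-name "$2" \
    --query 'addon.status' --output text)
  [ "$status" = "ACTIVE" ]
}

# Check several add-ons and stop at the first one that is not ACTIVE.
# Usage: check_addons <cluster-name> <addon-name>...
check_addons() {
  cluster=$1
  shift
  for addon in "$@"; do
    if ! addon_active "$cluster" "$addon"; then
      echo "$addon is not ACTIVE" >&2
      return 1
    fi
  done
}
```

For the three add-ons above: `check_addons prod-cluster coredns vpc-cni kube-proxy`.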

🔸 6. Monitor Everything

- Use CloudWatch, Prometheus, Grafana, and Loki
- Set alert rules for application health, latency, and error rate

7. Post-Upgrade Validation, Rollback & Automation

Health checks, logging, and rollback triggers.

✅ 1. Validate Workloads

kubectl get pods -A
kubectl logs -n app-namespace my-app-xxxx
kubectl top pod -n app-namespace

- Validate horizontal/vertical pod autoscaler behavior
- Validate ingress, DNS, metrics, and dashboards
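Scanning `kubectl get pods -A` by eye does not scale past a few namespaces. The filter below is a sketch that reads the default column layout (NAMESPACE, NAME, READY, STATUS, ...) and prints only pods that are not fully ready; the function name `not_ready_pods` is illustrative.

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "namespace/name status" for pods that are not Running with all
# containers ready. Completed pods (finished Jobs) are skipped.
not_ready_pods() {
  awk '{
    if ($4 == "Completed") next
    split($3, ready, "/")          # READY column, e.g. "1/2"
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2 " " $4
  }'
}
```

Usage: `kubectl get pods -A --no-headers | not_ready_pods`. Empty output means every pod is either fully ready or completed.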

✅ 2. Rollback Strategy

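# Helm rollback to revision 1 (run `helm history my-release` to pick the revision)
helm rollback my-release 1

# Revert GitOps manifest
git revert HEAD
kubectl apply -f manifests/

# Recreate the node group on the previous version
# (the EKS control plane itself cannot be downgraded)
eksctl create nodegroup --version 1.26 ...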
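A rollback trigger can be as simple as comparing an observed error rate against a threshold. The sketch below assumes the error rate has already been fetched from the monitoring stack as a percentage; how it is fetched is out of scope here, and the function name `should_rollback` is illustrative.

```shell
# Succeed when the observed error rate exceeds the allowed maximum.
# Both arguments are percentages and may be fractional, e.g. 7.5 and 5.
should_rollback() {
  awk -v rate="$1" -v max="$2" 'BEGIN { exit !(rate + 0 > max + 0) }'
}
```

It could then guard the Helm step, for example `should_rollback "$error_rate" 5 && helm rollback my-release`, where `$error_rate` is a hypothetical variable holding the measured value. Omitting the revision makes Helm roll back to the previous release.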

✅ 3. Post-Upgrade Hygiene

- Delete old node groups
- Clean up deprecated CRDs
- Update runbooks, CI/CD, and dashboards
- Document lessons learned

8. Automation Example: GitHub Actions

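name: EKS Upgrade Pipeline
on:
  workflow_dispatch:

# Required for OIDC-based AWS authentication
permissions:
  id-token: write
  contents: read

jobs:
  upgrade:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # Assumes an IAM role ARN stored in a repository secret named AWS_ROLE_ARN
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Install eksctl
        run: |
          curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz" | tar -xz -C /tmp
          sudo mv /tmp/eksctl /usr/local/bin

      - name: Upgrade Control Plane
        run: |
          eksctl upgrade cluster --name prod --version 1.27 --approve

      - name: Upgrade Addons
        run: |
          eksctl update addon --name coredns --cluster prod --force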

9. EKS Upgrade Process Flowchart

+------------------------+
|  Pre-Upgrade Planning  |
+------------------------+
           |
           v
+-----------------------------+
|  Backup & Compatibility     |
|  - EBS snapshots            |
|  - Velero backup            |
|  - kubent/pluto scan        |
+-----------------------------+
           |
           v
+-----------------------------+
|  Stage & Validate in Staging|
|  - Mirror prod workloads    |
|  - Test DR & compatibility  |
+-----------------------------+
           |
           v
+----------------------------+
|  Control Plane Upgrade     |
|  - eksctl upgrade cluster  |
+----------------------------+
           |
           v
+------------------------------+
|  Managed Node Group Upgrade  |
|  - Rolling updates           |
|  - PodDisruptionBudgets      |
+------------------------------+
           |
           v
+-----------------------------+
|  Add-ons Upgrade            |
|  - CoreDNS, kube-proxy      |
|  - VPC CNI, CSI drivers     |
+-----------------------------+
           |
           v
+-----------------------------+
|  Post-Validation & Cleanup  |
|  - Check logs, probes       |
|  - Remove old node groups   |
+-----------------------------+
           |
           v
+-----------------------------+
|  Automation & Observability |
|  - CI/CD, GitOps            |
|  - Monitoring & alerts      |
+-----------------------------+

10. Final Thoughts & Best Practices

Practice	Why It Matters
Always test in staging	Prevents breaking prod
Automate with GitOps	Repeatability and rollback
Monitor aggressively	Detect regressions fast
Document the upgrade	For compliance & future use
Follow AWS upgrade announcements	Stay ahead of deprecations

Upgrading EKS is more than just a version bump. It is a controlled, observable, secure, and team-aligned DevOps lifecycle that should be automated, reproducible, and rollback-safe. Follow these practices, and you'll not only ace upgrades but also scale your Kubernetes architecture with confidence.
