Karpenter migration from Cluster Autoscaler NodePool EC2NodeClass tutorial GitHub

By Nishaant Dixit

You've got an EKS cluster running Cluster Autoscaler. It works. Mostly. But those node groups feel like straitjackets: you're paying for instances you don't need, waiting minutes for nodes to spin up, and your SRE team keeps patching ASG configurations by hand.

I've been there. In 2023, I watched a team burn three weeks tuning Cluster Autoscaler for a Spark workload. They never got it right. Switched to Karpenter in three days. Their bill dropped 32%.

Here's the thing: a Karpenter migration isn't a search query, it's a playbook. This article is that playbook. We'll cover the real mechanics, the gotchas that'll bite you (like that nasty port 8443 error), and the exact commands and manifests you need.

By the end, you'll know how to migrate a production cluster without a prolonged outage, what EC2NodeClass configuration actually matters, and why Karpenter's approach to NodePool scheduling will change how you think about capacity.

Let's cut the theory. This is a migration tutorial, not a science project.


What Is Karpenter? (And Why Cluster Autoscaler Isn't Enough)

Karpenter is an open-source node lifecycle manager built by AWS. It launched in 2021 as a direct response to Cluster Autoscaler's limitations. The fundamental difference: Cluster Autoscaler manages node groups; Karpenter manages nodes directly.

Think of Cluster Autoscaler as a property manager who can only add or remove entire buildings (ASGs). Karpenter is a contractor who builds rooms on demand, exactly to your specs, and demolishes them when empty.

According to Karpenter's documentation, the key distinction is that Karpenter "launches the right compute resources for pods that are pending in the cluster." It's not scaling groups — it's scheduling nodes.

What that means in practice:

| Capability | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Instance diversity | Fixed ASG types | Any EC2 type on demand |
| Scale-up speed | 3-5 minutes | 30-90 seconds |
| Spot integration | Manual ASG config | Built-in, per-pod |
| Node termination | Based on ASG metrics | Instant, intelligent |
| Cost optimization | None built-in | Automatic via instance selection |

Cluster Autoscaler isn't broken. It's just old. It was designed when Kubernetes clusters had a dozen nodes. Karpenter was built for the 2025 reality where clusters run hundreds of node types across multiple instance families.


The Architecture Shift: NodePool vs. Auto Scaling Groups

This is where most people get confused during the migration. Let me clear it up.

Cluster Autoscaler works with Auto Scaling Groups (ASGs). You have an ASG for your general compute, another for memory-optimized, maybe one for GPUs. Each ASG is a bucket of nodes with identical specs.

Karpenter replaces that with NodePools and EC2NodeClasses:

  • NodePool: The scheduling policy. Defines what pods can schedule on nodes provisioned by this pool, taints, labels, and consolidation settings.
  • EC2NodeClass: The hardware spec. Defines AMI family, subnet tags, security groups, instance profile, and block device mappings.

The genius here is separation. Several NodePools can share one EC2NodeClass, while each NodePool references exactly one through nodeClassRef. Your "GPU workloads" NodePool can pick from p3, p4, or g5 instances based on what's available and cheapest.

Most people think you need one NodePool per workload type. You don't. Start with one. Add complexity only when the single pool can't handle your constraints.


Prerequisites for Migration

Before you touch anything, you need:

  • kubectl access to your EKS cluster
  • Helm 3.8+ installed locally
  • AWS CLI with permissions to create IAM roles, SQS queues, and EC2 resources
  • Existing Cluster Autoscaler deployment — don't remove it until Karpenter is proven
  • Instance family quotas that cover your peak demand

Here's a checklist I run before every migration:

  1. List all current ASGs and their instance types
  2. Document pod scheduling constraints (nodeSelectors, affinity rules)
  3. Check IRSA/IAM roles for node instances
  4. Verify subnet and security group names/tags
  5. Enable OIDC provider for IAM roles for service accounts
  6. Snapshot your Cluster Autoscaler config

Missing step 4 is the most common failure point I've seen. Karpenter discovers subnets and security groups by tag, not by name. If your VPC doesn't have consistent tag conventions, fix that first.
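
To check step 4 before installing anything, here's a minimal sketch that lists what a tag selector would actually match. It assumes the AWS CLI is configured and that you tag with karpenter.sh/discovery (the key used in the EC2NodeClass later in this article); the helper names are my own, not part of any tool.

```bash
# Sketch: list the subnets and security groups that carry a given tag.
# Helper names are illustrative; the aws subcommands are standard EC2 calls.

# Print IDs of subnets tagged <key>=<value>, one per line.
tagged_subnets() {
  aws ec2 describe-subnets \
    --filters "Name=tag:$1,Values=$2" \
    --query 'Subnets[].SubnetId' --output text | tr '\t' '\n'
}

# Print IDs of security groups tagged <key>=<value>, one per line.
tagged_security_groups() {
  aws ec2 describe-security-groups \
    --filters "Name=tag:$1,Values=$2" \
    --query 'SecurityGroups[].GroupId' --output text | tr '\t' '\n'
}

# Succeed only if at least one subnet and one security group match.
check_discovery_tags() {
  subnets=$(tagged_subnets "$1" "$2")
  sgs=$(tagged_security_groups "$1" "$2")
  [ -n "$subnets" ] && [ -n "$sgs" ]
}
```

Run it as `check_discovery_tags karpenter.sh/discovery "$CLUSTER_NAME" || echo "fix your tags first"`. If either list comes back empty, Karpenter would have nothing to discover either.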


Step-by-Step Migration: The SIVARO Playbook

Step 1: Install Karpenter via Helm

```bash
# Karpenter 1.x charts are published as OCI artifacts, so no `helm repo add` is needed

# Set variables
export CLUSTER_NAME="your-eks-cluster"
export CLUSTER_ENDPOINT=$(aws eks describe-cluster --name "$CLUSTER_NAME" --query "cluster.endpoint" --output text)
export KARPENTER_VERSION="1.2.3"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)

# Create the Karpenter controller IAM role (assumes KarpenterControllerPolicy already exists)
eksctl create iamserviceaccount \
  --cluster "$CLUSTER_NAME" \
  --name karpenter \
  --namespace karpenter \
  --role-name "karpenter-controller-role" \
  --attach-policy-arn "arn:aws:iam::$AWS_ACCOUNT_ID:policy/KarpenterControllerPolicy" \
  --role-only \
  --approve

# Install Karpenter. In 1.x the chart values live under settings.*, not settings.aws.*,
# and the node role is set on the EC2NodeClass (Step 2) instead of a default instance profile.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "$KARPENTER_VERSION" \
  --namespace karpenter \
  --create-namespace \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::$AWS_ACCOUNT_ID:role/karpenter-controller-role" \
  --set settings.clusterName="$CLUSTER_NAME" \
  --set settings.clusterEndpoint="$CLUSTER_ENDPOINT" \
  --set settings.interruptionQueue="$CLUSTER_NAME"
```

The port 8443 gotcha: You might hit an error like "no service port 8443 found for service 'karpenter'". This is a known issue documented in the Karpenter provider-aws GitHub. It typically happens when your webhook port configuration doesn't match the service definition. The fix is to explicitly set the webhook port in your Helm values:

```yaml
# values-override.yaml
webhook:
  port: 8443
hostNetwork: true
```

Apply it with --values values-override.yaml.
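
Before and after applying the override, it helps to look at what the Service really exposes. A small sketch, assuming the release and namespace are both named karpenter as in the Helm install above; `service_has_port` is a hypothetical helper name.

```bash
# Sketch: confirm a Service exposes a given port.
# $1 = namespace, $2 = service name, $3 = port number
service_has_port() {
  # jsonpath prints the Service's ports as a space-separated list
  ports=$(kubectl get service "$2" -n "$1" -o jsonpath='{.spec.ports[*].port}')
  for p in $ports; do
    [ "$p" = "$3" ] && return 0
  done
  return 1
}
```

For example: `service_has_port karpenter karpenter 8443 || echo "webhook port missing from the Service"`.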

Step 2: Create Your EC2NodeClass

This is the most critical resource you'll write in the whole migration. Get it wrong, and nodes won't come up. I've debugged this for teams where subnets were tagged incorrectly and Karpenter spent 10 minutes trying to provision in a private subnet with no route to the internet.

```yaml
# ${CLUSTER_NAME} is not expanded by kubectl; render it first (for example with envsubst)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest   # required by the v1 API
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        kubernetes.io/cluster/${CLUSTER_NAME}: owned
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo "Custom setup script here"
    --BOUNDARY--
```

Key things I've learned the hard way:

  • Don't use subnet IDs directly. Use tags. Karpenter refreshes available subnets every few minutes — if you hardcode an ID and that subnet disappears, you're dead in the water.
  • Set volumeSize explicitly. The default is 20GB. That's not enough for most container images.
  • Your security group selector should match the tag your cluster uses. EKS creates tags like kubernetes.io/cluster/your-cluster: owned.
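
Once the EC2NodeClass is applied, you can check what its selectors resolved to instead of guessing. This is a sketch that assumes the v1 API, where the resolved subnets, security groups, and AMIs are reported under the resource's status; the function names are hypothetical.

```bash
# Print what an EC2NodeClass selector resolved to, read from its status.
# $1 = EC2NodeClass name, $2 = status field: subnets, securityGroups or amis
nodeclass_resolved() {
  kubectl get ec2nodeclass "$1" -o jsonpath="{.status.$2[*].id}"
}

# Fail loudly if any of the three lookups came back empty.
nodeclass_ready() {
  for field in subnets securityGroups amis; do
    if [ -z "$(nodeclass_resolved "$1" "$field")" ]; then
      echo "EC2NodeClass $1: no $field resolved" >&2
      return 1
    fi
  done
}
```

`nodeclass_ready default` returning non-zero usually points straight at the tag or AMI selector that matched nothing.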

Step 3: Define Your NodePool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h   # in the v1 API this lives under template.spec
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name for this policy
    consolidateAfter: 1m   # illustrative value; set it explicitly
```

Notice what's not here: instance types. Karpenter dynamically picks them. The requirements block acts as a filter. I start broad — m, c, r families — and narrow only if specific workloads need GPU or ARM.

The consolidation policy (WhenEmptyOrUnderutilized in the v1 API) is magic. Karpenter will look for nodes running below capacity and either consolidate pods onto fewer nodes or replace a node with a cheaper one. According to a Reddit discussion, teams frequently see 15-25% cost reduction from this setting alone.

Step 4: Validate Karpenter Is Working

Before you disable Cluster Autoscaler, prove Karpenter can launch nodes.

```bash
# Deploy a test pod that forces Karpenter to provision
# (kubectl run no longer accepts --requests, so the resource requests go in --overrides)
kubectl run test-karpenter --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"test-karpenter","image":"nginx","resources":{"requests":{"cpu":"1","memory":"2Gi"}}}]}}'

# Watch for node creation
kubectl get nodes -w

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
```

You should see a new node appear within 90 seconds. If nothing launches (the situation described in the GitHub issue "Trying to install karpenter on an existing EKS cluster"), it's usually one of three things: IAM role misconfiguration, subnet tags not matching, or the EC2NodeClass role missing permission to launch instances.
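
If you'd rather script the check than stare at a watch, here's a sketch of a polling loop. It assumes Karpenter v1, where nodes it launches carry the karpenter.sh/nodepool label; the function name and defaults are my own.

```bash
# Poll until a node with the karpenter.sh/nodepool label exists, or give up.
# $1 = timeout in seconds (default 120), $2 = poll interval in seconds (default 5)
wait_for_karpenter_node() {
  timeout=${1:-120}
  interval=${2:-5}
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -l with a bare key selects nodes that have the label, whatever its value
    nodes=$(kubectl get nodes -l karpenter.sh/nodepool -o name)
    if [ -n "$nodes" ]; then
      echo "$nodes"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "no Karpenter node after ${timeout}s" >&2
  return 1
}
```

Call `wait_for_karpenter_node 120` right after creating the test pod. A non-zero exit is your cue to read the controller logs.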

Step 5: Drain Cluster Autoscaler Nodes Gradually

Now the scary part: migrating running workloads.

```bash
# Scale Cluster Autoscaler to 0
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=0

# Existing ASG nodes need no Karpenter label: Karpenter only manages nodes it launched.
# List nodes with the label Karpenter v1 sets (karpenter.sh/nodepool) to tell the two apart.
kubectl get nodes -L karpenter.sh/nodepool
```

Pause here. Don't delete your ASGs yet.

Wait 24-48 hours. Watch for issues. Monitor pod scheduling latency. Check instance utilization. The AWS documentation recommends this gradual approach, and I've learned it saves you from emergency rollbacks.

During this period, you can manually drain nodes from an ASG and let Karpenter replace them:

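```bash
# Cordon an ASG node
kubectl cordon <node-name>

# Drain it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# The node stays in its ASG; evicted pods that don't fit elsewhere go Pending,
# and Karpenter provisions capacity for them
```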

When Karpenter launches a new node for the evicted pods, you know it's working. Check the node's labels: if it has karpenter.sh/nodepool: default (the v1 successor to karpenter.sh/provisioner-name), you're golden.
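
The cordon-and-drain routine above can be looped over every node Karpenter did not launch. This is a rough sketch, assuming Karpenter v1 labels its own nodes with karpenter.sh/nodepool; the helper names and the fixed pause are illustrative, and a real rollout should check workload health between nodes instead of just sleeping.

```bash
# List nodes that Karpenter did not launch (no karpenter.sh/nodepool label).
asg_nodes() {
  kubectl get nodes -l '!karpenter.sh/nodepool' \
    -o jsonpath='{.items[*].metadata.name}'
}

# Cordon and drain each ASG node in turn, pausing between nodes.
# $1 = pause in seconds between nodes (default 300)
drain_asg_nodes() {
  pause=${1:-300}
  for node in $(asg_nodes); do
    echo "draining $node"
    kubectl cordon "$node" || return 1
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    sleep "$pause"
  done
}
```

One caution: if the Karpenter controller itself runs on an ASG node, leave that node for last, or the drain will evict the thing doing the provisioning.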

Step 6: Remove Cluster Autoscaler and ASGs

Once you're confident, clean up. Delete the Cluster Autoscaler deployment. Delete your ASGs (but keep the launch templates for 30 days as insurance).

```bash
# Remove Cluster Autoscaler
kubectl delete deployment cluster-autoscaler -n kube-system

# Delete ASGs via AWS CLI
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name your-asg-name --force-delete
```


Handling Edge Cases That Will Break Your Migration

1. DaemonSets and Node Replacement

If you run DaemonSets (like Datadog, Calico, or AWS VPC CNI), Karpenter's consolidation might replace nodes faster than your DaemonSets can drain. The result: pods stuck in Terminating because the DaemonSet pod on the old node refuses to die.

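Fix: Set terminationGracePeriodSeconds to 30-60 seconds in the DaemonSet's pod template (spec.template.spec), or cap how long Karpenter waits on a draining node with terminationGracePeriod under the NodePool's spec.template.spec. To keep Karpenter from voluntarily disrupting a node, annotate it (or a pod on it) with karpenter.sh/do-not-disrupt: "true", the v1 name for the older do-not-consolidate annotation.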

2. PVC Stuck on Pending

Karpenter provisions nodes for pending pods. But if a pod's PersistentVolume is pinned to a zone (through the volume's node affinity) where the NodePool can't launch instances, the pod stays pending.

Fix: Check your StorageClass's allowedTopologies and ensure Karpenter's instance selection includes the zones your EBS volumes are in.

3. The 8443 Webhook Issue

I've seen this more times than I'd like. After upgrading Karpenter from v0.37.7 to v1.1.2, one team at a fintech company hit the port 8443 error. The fix was in the Helm chart configuration. As documented in the GitHub issue, adding --set webhook.port=8443 resolves it.

4. Spot Instance Interruptions

Karpenter handles spot interruptions natively — it preemptively drains nodes when it receives a 2-minute spot termination notice. But if you're running stateful workloads on spot, you need to handle the interruption yourself. Use PodDisruptionBudgets. Karpenter respects them.
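
As a starting point for the PodDisruptionBudget advice, here's a small sketch. It assumes your pods carry an app=<name> label; the wrapper name and the "-pdb" suffix are my own conventions.

```bash
# Create a PodDisruptionBudget that keeps at least N pods of an app running.
# $1 = namespace, $2 = app label value, $3 = minimum available pods
protect_app() {
  kubectl create poddisruptionbudget "$2-pdb" \
    --namespace "$1" \
    --selector "app=$2" \
    --min-available "$3"
}
```

For example, `protect_app data kafka 2`. Keep in mind a PDB shapes how Karpenter drains the node; it cannot delay the Spot reclaim itself, so stateful workloads still need to shut down cleanly inside the two-minute window.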


Advanced NodePool Configuration

Once you're past the migration, you can get creative with NodePools.

Multi-Architecture Pool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: arm-amd-mix
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: arm-amd-class
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name for this policy
    consolidateAfter: 1m
```

This lets Karpenter pick between Graviton (arm64) and x86 instances based on price. In my tests, Graviton instances cost 20% less, so Karpenter prefers them. Only falls back to x86 when ARM isn't available.

GPU NodePool with GPU-Driven Consolidation

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p3", "p4d", "g5", "g6"]   # "p4" alone is not an instance family
        - key: karpenter.k8s.aws/instance-gpu-count   # node label; nvidia.com/gpu is a resource name
          operator: Exists
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-class
  limits:
    nvidia.com/gpu: 48
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
```

consolidationPolicy: WhenEmpty is critical for GPUs. You don't want Karpenter replacing a GPU node just because it has spare CPU. Only consolidate when the node is completely empty.


Monitoring and Observing Karpenter

You need to watch three things post-migration:

  1. Node creation latency — Should be under 120 seconds
  2. Spot interrupt rate — If >5% over a week, add more instance diversity
  3. Consolidation activity — Too many consolidations means back-and-forth with scheduled pods

I set up a dashboard with these metrics:

```bash
# Enable metrics scraping
kubectl annotate pod -n karpenter -l app.kubernetes.io/name=karpenter prometheus.io/scrape=true
kubectl annotate pod -n karpenter -l app.kubernetes.io/name=karpenter prometheus.io/port=8080
```

Karpenter exposes metrics on port 8080. Key metrics to alert on:

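  • karpenter_nodes_created_total: should increase when pods are pending
  • karpenter_cloudprovider_errors_total: if it keeps climbing for 5 minutes, something's wrong
  • karpenter_voluntary_disruption_decisions_total (karpenter_disruption_actions_performed_total before v1): tracks consolidation activity

Metric names have shifted between releases, so confirm them against your controller's /metrics output.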
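
For a quick spot check without a dashboard, you can total a counter straight from the metrics endpoint. A sketch, assuming the standard Prometheus text format with no trailing timestamps; `metric_sum` is a hypothetical helper.

```bash
# Sum every sample of one metric from Prometheus text-format input on stdin.
# $1 = metric name (label sets are ignored and their values added together)
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # drop the {label="..."} part
      if (metric == name) total += $NF  # the sample value is the last field
    }
    END { printf "%d\n", total }
  '
}
```

With a port-forward to the controller on 8080 in another terminal, `curl -s localhost:8080/metrics | metric_sum karpenter_cloudprovider_errors_total` prints the running total. Run it twice a few minutes apart and compare.
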

When NOT to Migrate

I'll be honest. Not every cluster needs this migration.

Don't migrate if:

  • Your cluster has fewer than 5 nodes
  • You have zero workload variation (all pods are identical)
  • You're on EKS 1.22 or earlier (Karpenter 1.x requires 1.23+)
  • You don't have time to test for 48 hours
  • Your team can't handle the learning curve right now

Cluster Autoscaler is fine for small, homogenous clusters. Karpenter shines when you have diverse workloads, spot instances, or cost optimization requirements.


FAQ

What happens to existing ASG nodes after migration?

They keep running until you terminate them. Karpenter won't manage or drain them. You must manually cordon and drain each node or delete the ASG.

Can I run Cluster Autoscaler and Karpenter simultaneously?

Technically yes, but don't. They'll fight over node scheduling. Karpenter might launch a node while Cluster Autoscaler tries to scale down. Scale CA to 0 first.

How long does the migration take?

One to three days for a proper rollout. Day one: install Karpenter and validate. Day two: drain nodes gradually. Day three: remove CA and ASGs.

Does Karpenter support Windows nodes?

As of Karpenter 1.x, Windows support is limited. Check the GitHub issue tracker for the latest status.

How does Karpenter handle spot interruption?

It polls an SQS queue that EventBridge feeds with EC2 Spot interruption warnings. When a termination notice arrives, it cordons the node and evicts pods with proper grace periods.

Do I need to update my pod specs?

Not usually. Karpenter respects nodeSelector, affinity, and tolerations. But if you used ASG-specific taints, you need to translate those to NodePool settings.

What's the cost of running Karpenter?

Karpenter itself is free (open source). You pay for two things: an SQS queue (pennies per month) and the EC2 instances it provisions. No licensing, no SaaS fees.


Final Thoughts

Migrating from Cluster Autoscaler to Karpenter isn't about following a checklist; it's about understanding how capacity works in your cluster. This tutorial gives you the blueprint, but you need to adapt it to your environment.

The teams I've seen succeed shared one trait: they tested Karpenter on non-critical workloads first. They let it prove itself. They didn't treat it as a "fire and forget" migration.

Karpenter changes your relationship with infrastructure. You stop thinking in terms of "node groups" and start thinking in terms of "workload requirements." That's liberating. It's also scary if you're used to controlling everything.

Start small. Use the default NodePool and EC2NodeClass. Tune after you see the metrics. And for god's sake, don't skip the 48-hour validation window.

The tool is ready. Your architecture might not be. That's okay — that's what migrations are for.


Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.


Sources

  1. Migrating from Cluster Autoscaler — Karpenter documentation
  2. No service port 8443 found for service "karpenter" — GitHub issue
  3. Trying to install karpenter on an existing EKS cluster — GitHub issue
  4. Migrating from EKS Cluster Autoscaler to Karpenter — dev.to article
  5. Ability to Scale Karpenter Provisioned Nodes — GitHub discussion
  6. Migrate Kubernetes cluster autoscaler to Karpenter — YouTube tutorial
  7. Migrating From Cluster Autoscaler to Karpenter v0.32 — The New Stack
  8. EKS Karpenter upgrade from 0.37.7 to 1.0.8 to 1.1.2 — Medium
  9. Docs on Migrating from Cluster Autoscaler to Karpenter — GitHub issue
  10. Mass migration to Karpenter — Reddit discussion
