Auto Nodes brings Karpenter-powered autoscaling to vCluster private nodes, enabling automatic node provisioning and deprovisioning based on workload demand. It works across public cloud, private cloud, hybrid environments, and bare metal infrastructure.
Introduced in v0.28: Auto Nodes is a Pro feature and requires a connection to vCluster Platform.

How It Works

Auto Nodes integrates Karpenter’s provisioning logic with vCluster private nodes. When pods can’t be scheduled due to resource constraints, Auto Nodes automatically provisions new machines and joins them to the virtual cluster:
┌─────────────────────────────────────────────────────────────┐
│               vCluster with Auto Nodes                      │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Control Plane                                       │   │
│  │  - API Server                                       │   │
│  │  - Karpenter Controller (watches unschedulable)    │   │
│  │  - Node Provider Integration                        │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          │ Detects unschedulable pods       │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Karpenter Provisioner                               │   │
│  │  - Evaluates pod requirements                       │   │
│  │  - Selects appropriate node type                    │   │
│  │  - Calls node provider API                          │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
└──────────────────────────┼──────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│           Node Provider (AWS/Azure/GCP/Bare Metal)          │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Create VM/   │  │ Install      │  │ Join vCluster│     │
│  │ Instance     │→ │ vCluster     │→ │ as node      │     │
│  └──────────────┘  └──────────────┘  └──────────────┘     │
│                                                             │
└─────────────────────────────────────────────────────────────┘


            New node automatically joins vCluster
               and pods are scheduled on it
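Any pending pod whose resource requests exceed current capacity is enough to trigger provisioning — no manual node management is required. A minimal sketch of such a pod (image and request sizes are illustrative):

```yaml
# Any unschedulable pod with resource requests can trigger provisioning.
# Karpenter picks an instance type large enough to fit the requests.
apiVersion: v1
kind: Pod
metadata:
  name: scale-up-trigger
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```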

Key Characteristics

Demand-Based: Provisions nodes only when needed
Multi-Cloud: Works across AWS, Azure, GCP, bare metal
Cost Optimization: Automatically deprovisions idle nodes
Workload-Aware: Selects node types based on pod requirements
GPU Support: Automatically provisions GPU nodes when requested
Fast Provisioning: Nodes ready in 2-5 minutes

Configuration

Basic Setup

Auto Nodes requires private nodes mode and a node provider:
auto-nodes.yaml
# Enable private nodes
privateNodes:
  enabled: true
  
  # Configure auto nodes
  autoNodes:
    - name: default-nodes
      enabled: true
      nodeProvider: aws  # or azure, gcp, hetzner, etc.
      
      # Node requirements
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3.medium
            - t3.large
      
      # Provisioner limits
      limits:
        cpu: 100
        memory: 400Gi

# Control plane must be exposed
controlPlane:
  service:
    spec:
      type: LoadBalancer

# Pod network
networking:
  podCIDR: "10.244.0.0/16"

# CNI and storage
deploy:
  cni:
    flannel:
      enabled: true
  localPathProvisioner:
    enabled: true

Node Provider Configuration

Each cloud provider requires specific configuration:
privateNodes:
  autoNodes:
    - name: aws-nodes
      enabled: true
      nodeProvider: aws
      
      # AWS-specific configuration
      providerConfig:
        region: us-east-1
        subnetSelector:
          karpenter.sh/discovery: my-cluster
        securityGroupSelector:
          karpenter.sh/discovery: my-cluster
        instanceProfile: KarpenterNodeInstanceProfile
        amiFamily: AL2  # Amazon Linux 2
      
      # Instance types
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3.medium
            - t3.large
            - t3.xlarge
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      
      # Limits
      limits:
        cpu: 1000
        memory: 4000Gi
IAM Permissions Required:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateTags",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeImages",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}

GPU Nodes

Configure auto-provisioning for GPU workloads:
privateNodes:
  autoNodes:
    - name: gpu-nodes
      enabled: true
      nodeProvider: aws
      
      # GPU instance types
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p3.2xlarge   # NVIDIA V100
            - p3.8xlarge
            - g4dn.xlarge  # NVIDIA T4
            - g4dn.4xlarge
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]  # GPU instances usually on-demand
      
      # GPU limits
      limits:
        nvidia.com/gpu: 16
        cpu: 256
        memory: 1024Gi
      
      # Consolidation settings
      consolidation:
        enabled: false  # Don't consolidate GPU nodes (expensive restarts)
Using GPU Nodes:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.0.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1  # Auto Nodes provisions GPU node
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Multiple Node Pools

Configure different pools for different workload types:
privateNodes:
  autoNodes:
    # General purpose nodes
    - name: general
      enabled: true
      nodeProvider: aws
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [t3.medium, t3.large]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      limits:
        cpu: 100
        memory: 400Gi
      weight: 100  # Prefer general nodes
    
    # High-memory nodes
    - name: high-memory
      enabled: true
      nodeProvider: aws
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [r5.xlarge, r5.2xlarge, r5.4xlarge]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      limits:
        cpu: 200
        memory: 1600Gi
      weight: 50
    
    # GPU nodes
    - name: gpu
      enabled: true
      nodeProvider: aws
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [g4dn.xlarge, g4dn.2xlarge]
      limits:
        nvidia.com/gpu: 8
        cpu: 64
        memory: 256Gi
      weight: 10  # Lowest priority (most expensive)
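Pods steer toward a pool through their resource requests and selectors. For example, a pod requesting more memory than the general pool's t3 instances offer can only be satisfied by the high-memory pool's r5 instances (the image and request sizes below are illustrative):

```yaml
# This pod's 24Gi request exceeds what a t3.large (8Gi) can offer,
# so only the high-memory pool's r5 instances can satisfy it.
apiVersion: v1
kind: Pod
metadata:
  name: memory-hungry
spec:
  containers:
    - name: cache
      image: redis:7
      resources:
        requests:
          memory: 24Gi
          cpu: "4"
```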

Advanced Configuration

Consolidation

Automatically consolidate underutilized nodes:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      nodeProvider: aws
      
      # Consolidation configuration
      consolidation:
        enabled: true
      
      # Disruption budget
      disruption:
        consolidationPolicy: WhenUnderutilized
        consolidateAfter: 30s
        expireAfter: 720h  # 30 days
        budgets:
          - nodes: "10%"  # Disrupt max 10% of nodes at once

Taints and Labels

Apply custom taints and labels to provisioned nodes:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      nodeProvider: aws
      
      # Node template
      template:
        metadata:
          labels:
            workload-type: general
            managed-by: karpenter
            environment: production
        spec:
          taints:
            - key: workload-type
              value: general
              effect: NoSchedule
          startupTaints:
            - key: node.kubernetes.io/not-ready
              value: "true"
              effect: NoSchedule

Node Lifecycle

Control node lifecycle behavior:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      nodeProvider: aws
      
      # Lifecycle settings
      ttlSecondsUntilExpired: 2592000  # 30 days
      ttlSecondsAfterEmpty: 30  # Delete after 30s if empty
      
      # Limits
      limits:
        cpu: 1000
        memory: 4000Gi
      
      # Weight for provisioner selection
      weight: 100
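For individual pods that must not be interrupted regardless of these TTLs, Karpenter recognizes a per-pod annotation that blocks voluntary disruption of the node the pod runs on (the pod spec below is illustrative):

```yaml
# The karpenter.sh/do-not-disrupt annotation tells Karpenter not to
# voluntarily disrupt (expire/consolidate) the node while this pod runs.
apiVersion: v1
kind: Pod
metadata:
  name: long-batch-job
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: job
      image: busybox:1.36
      command: ["sleep", "3600"]
```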

User Data (Cloud-Init)

Customize node initialization:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      nodeProvider: aws
      
      providerConfig:
        userData: |
          #!/bin/bash
          # Custom initialization
          echo "Initializing vCluster node..."
          
          # Install monitoring agent
          curl -o /tmp/agent.sh https://monitoring.example.com/install.sh
          bash /tmp/agent.sh
          
          # Configure logging
          mkdir -p /var/log/vcluster
          systemctl enable vcluster-logging

Use Cases

Cost-Optimized Development

Perfect for: Development environments, testing, CI/CD
privateNodes:
  autoNodes:
    - name: dev-spot
      enabled: true
      nodeProvider: aws
      
      # Prefer spot instances
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]  # Spot only for cost savings
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3.medium
            - t3.large
      
      # Aggressive consolidation
      consolidation:
        enabled: true
      
      # Delete nodes quickly when unused
      ttlSecondsAfterEmpty: 60
      
      limits:
        cpu: 50
        memory: 200Gi
Savings: spot instances typically cost 70-90% less than on-demand capacity

Production with Mixed Capacity

Perfect for: Production workloads with cost optimization
privateNodes:
  autoNodes:
    - name: on-demand
      enabled: true
      nodeProvider: aws
      
      # On-demand for critical workloads
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [m5.large, m5.xlarge, m5.2xlarge]
      
      limits:
        cpu: 100
        memory: 400Gi
      
      weight: 10  # Lowest priority (prefer spot)
    
    - name: spot
      enabled: true
      nodeProvider: aws
      
      # Spot for scalable workloads
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [m5.large, m5.xlarge, m5.2xlarge]
      
      limits:
        cpu: 400
        memory: 1600Gi
      
      weight: 100  # Highest priority
Configuration for workloads:
# Critical workload - require on-demand
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      containers:
        - name: app
          image: critical-app:latest

---
# Scalable workload - allow spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 10
  template:
    spec:
      # No node selector - can use spot
      containers:
        - name: app
          image: web-app:latest

AI/ML with GPU Autoscaling

Perfect for: ML training, inference workloads
privateNodes:
  autoNodes:
    # CPU nodes for supporting services
    - name: cpu
      enabled: true
      nodeProvider: aws
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [c5.2xlarge, c5.4xlarge]
      limits:
        cpu: 100
        memory: 400Gi
    
    # GPU nodes for training
    - name: gpu-training
      enabled: true
      nodeProvider: aws
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [p3.2xlarge, p3.8xlarge, p3.16xlarge]
      limits:
        nvidia.com/gpu: 32
        cpu: 512
        memory: 2048Gi
      consolidation:
        enabled: false  # Don't disrupt training jobs
    
    # GPU nodes for inference
    - name: gpu-inference
      enabled: true
      nodeProvider: aws
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [g4dn.xlarge, g4dn.2xlarge]
      limits:
        nvidia.com/gpu: 16
        cpu: 128
        memory: 512Gi
      consolidation:
        enabled: true  # Inference can be rescheduled
      ttlSecondsAfterEmpty: 300  # Keep warm for 5 minutes
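A training Job can target the training pool by combining a GPU limit with an instance-type selector — the GPU limit makes the pod unschedulable on CPU nodes, which triggers GPU provisioning. A sketch (the image name is illustrative):

```yaml
# Targets the gpu-training pool via instance type; the GPU limit
# prevents scheduling on the CPU pool's c5 nodes.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node.kubernetes.io/instance-type: p3.2xlarge
      containers:
        - name: trainer
          image: my-registry/trainer:latest  # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1
```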

Hybrid Cloud Bursting

Perfect for: On-prem primary, cloud burst for overflow
privateNodes:
  autoNodes:
    # On-premises nodes are joined manually and stay static;
    # Auto Nodes does not provision them
    
    # AWS burst capacity
    - name: aws-burst
      enabled: true
      nodeProvider: aws
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: [us-east-1a, us-east-1b]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [m5.large, m5.xlarge]
      limits:
        cpu: 200  # Limit cloud spend
        memory: 800Gi
      weight: 10  # Low priority (prefer on-prem)
Label on-prem nodes:
kubectl label nodes on-prem-1 on-prem-2 location=on-prem
Workload preference:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: location
                    operator: In
                    values: [on-prem]

Monitoring

Metrics

Auto Nodes exposes Prometheus metrics:
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vcluster-karpenter
spec:
  selector:
    matchLabels:
      app: vcluster-karpenter
  endpoints:
    - port: metrics
Key Metrics:
  • karpenter_nodes_total - Total nodes managed
  • karpenter_nodes_allocatable - Allocatable resources
  • karpenter_pods_state - Pod scheduling state
  • karpenter_interruption_received_messages - Spot interruption notices
  • karpenter_provisioner_scheduling_duration_seconds - Provisioning time
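These metrics can feed alerting as well as dashboards. A sketch of a PrometheusRule that fires on spot interruption notices, assuming the Prometheus Operator is installed (threshold, window, and labels are illustrative):

```yaml
# Fires when spot interruption notices arrive, giving workloads time
# to drain before the instance is reclaimed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vcluster-auto-nodes-alerts
spec:
  groups:
    - name: auto-nodes
      rules:
        - alert: SpotInterruptionReceived
          expr: increase(karpenter_interruption_received_messages[5m]) > 0
          labels:
            severity: warning
```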

Logs

View provisioning decisions:
# Karpenter controller logs
kubectl logs -n vcluster-my-vcluster <vcluster-pod> -c karpenter

# Look for:
# - "found provisionable pod(s)"
# - "computed new machine(s) to fit pod(s)"
# - "created machine"

Dashboard

Example Grafana dashboard:
{
  "dashboard": {
    "title": "vCluster Auto Nodes",
    "panels": [
      {
        "title": "Node Count",
        "targets": [
          {"expr": "sum(karpenter_nodes_total)"}
        ]
      },
      {
        "title": "CPU Utilization",
        "targets": [
          {"expr": "sum(karpenter_nodes_allocatable{resource_type='cpu'})"}
        ]
      },
      {
        "title": "Pending Pods",
        "targets": [
          {"expr": "sum(karpenter_pods_state{state='pending'})"}
        ]
      }
    ]
  }
}

Troubleshooting

Nodes Not Provisioning

Symptom: Pods stay pending, no new nodes created
# Check Karpenter logs
kubectl logs -n vcluster-my-vcluster <vcluster-pod> -c karpenter

# Common issues:
# 1. Limits reached
# 2. No matching requirements
# 3. Cloud provider API errors
# 4. Insufficient permissions
Solutions:
# Check limits
kubectl describe nodepool

# Verify cloud credentials
kubectl get secret -n vcluster-my-vcluster

# Test cloud provider API access
aws ec2 describe-instances  # AWS
az vm list  # Azure
gcloud compute instances list  # GCP

Slow Provisioning

Symptom: Nodes take too long to become ready
# Check provisioning time
kubectl get machine -o yaml | grep provisioningTime

# Typical times:
# AWS: 2-3 minutes
# Azure: 3-4 minutes
# GCP: 2-3 minutes
# Bare Metal: 5-10 minutes
Optimization:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      
      # Use AMI with pre-installed components
      providerConfig:
        amiFamily: Bottlerocket  # Optimized AMI
        
      # Reduce startup time
      template:
        spec:
          startupTaints: []  # Remove startup taints if safe

Excessive Churn

Symptom: Nodes constantly being created/deleted
# Check consolidation settings
kubectl describe nodepool

# Monitor events
kubectl get events --sort-by='.lastTimestamp' | grep -i node
Solution:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      
      # Reduce churn
      ttlSecondsAfterEmpty: 300  # Wait longer before deleting
      consolidation:
        enabled: true
        consolidateAfter: 300s  # Wait longer before consolidating

Cost Overruns

Symptom: Cloud bill higher than expected
# Check node count and types
kubectl get nodes -o custom-columns=NAME:.metadata.name,INSTANCE-TYPE:.metadata.labels.node\\.kubernetes\\.io/instance-type

# Check limits
kubectl get nodepool -o yaml | grep -A5 limits
Cost Controls:
privateNodes:
  autoNodes:
    - name: default
      enabled: true
      
      # Strict limits
      limits:
        cpu: 100  # Hard cap
        memory: 400Gi
      
      # Prefer cheap instances
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]  # Spot only
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [t3.small, t3.medium]  # Small instances only
      
      # Aggressive cleanup
      ttlSecondsAfterEmpty: 30

Best Practices

Always set limits to prevent runaway costs:
limits:
  cpu: 1000
  memory: 4000Gi
  nvidia.com/gpu: 16  # If using GPUs
Separate workload types for better optimization:
autoNodes:
  - name: general  # General workloads
  - name: compute  # CPU-intensive
  - name: memory   # Memory-intensive
  - name: gpu      # GPU workloads
Reduce costs by consolidating underutilized nodes:
consolidation:
  enabled: true
Set up alerts for provisioning failures:
# Alert when pods are pending too long
alert: PodsPendingTooLong
expr: karpenter_pods_state{state="pending"} > 0
for: 10m
Use spot/preemptible for non-critical workloads:
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
Handle spot instance interruptions:
  • Use a node termination handler
  • Set PodDisruptionBudgets for critical workloads
  • Run multiple replicas so a single interruption is tolerable
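Spreading replicas across nodes limits the blast radius of a single spot reclamation. A sketch using topology spread constraints (names and replica count are illustrative):

```yaml
# Spreads replicas across nodes so reclaiming one spot instance
# removes at most one more replica than any other node holds.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-app
      containers:
        - name: app
          image: web-app:latest
```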

Comparison

Auto Nodes vs Manual Provisioning

Aspect                  Manual     Auto Nodes
Setup Time              Hours      Minutes
Scaling                 Manual     Automatic
Cost Optimization       Limited    Excellent
Operational Overhead    High       Low
Flexibility             High       Very High
Complexity              Low        Medium

Auto Nodes vs Cluster Autoscaler

Aspect                Cluster Autoscaler    Auto Nodes
Provisioning Speed    Slower                Faster
Bin Packing           Basic                 Advanced
Cost Optimization     Good                  Better
GPU Support           Limited               Excellent
Multi-Cloud           Limited               Yes
Bare Metal            No                    Yes

Next Steps

Private Nodes

Learn about the foundation for Auto Nodes.

Cost Optimization

Advanced cost optimization strategies.

GPU Workloads

Running GPU workloads with auto-scaling.

Node Configuration

Complete Auto Nodes configuration reference.
