Auto Nodes brings Karpenter-powered autoscaling to vCluster private nodes, enabling automatic node provisioning and deprovisioning based on workload demand. It works across public cloud, private cloud, hybrid environments, and bare metal infrastructure.
Introduced in v0.28: Auto Nodes is a Pro feature available in vCluster Pro and requires a connection to vCluster Platform.
How It Works
Auto Nodes integrates Karpenter’s provisioning logic with vCluster private nodes. When pods can’t be scheduled due to resource constraints, Auto Nodes automatically provisions new machines and joins them to the virtual cluster:
┌─────────────────────────────────────────────────────────────┐
│ vCluster with Auto Nodes │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Control Plane │ │
│ │ - API Server │ │
│ │ - Karpenter Controller (watches unschedulable) │ │
│ │ - Node Provider Integration │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ │ Detects unschedulable pods │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Karpenter Provisioner │ │
│ │ - Evaluates pod requirements │ │
│ │ - Selects appropriate node type │ │
│ │ - Calls node provider API │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
└──────────────────────────┼──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Node Provider (AWS/Azure/GCP/Bare Metal) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Create VM/ │ │ Install │ │ Join vCluster│ │
│ │ Instance │→ │ vCluster │→ │ as node │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
New node automatically joins vCluster
and pods are scheduled on it
Key Characteristics
Demand-Based: Provisions nodes only when needed
Multi-Cloud: Works across AWS, Azure, GCP, and bare metal
Cost Optimization: Automatically deprovisions idle nodes
Workload-Aware: Selects node types based on pod requirements
GPU Support: Automatically provisions GPU nodes when requested
Fast Provisioning: Nodes ready in 2-5 minutes
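To see the demand-based behavior in action, deploying a workload whose aggregate requests exceed current capacity is enough to trigger provisioning. A minimal illustration (the image and resource figures are arbitrary placeholders, not requirements):

```yaml
# Requests more CPU than existing nodes can fit; Auto Nodes
# detects the resulting pending pods and provisions capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
      - name: app
        image: nginx:1.27
        resources:
          requests:
            cpu: "1"      # 20 replicas x 1 CPU forces new nodes
            memory: 1Gi
```

Once the new nodes join, the pending replicas are scheduled automatically; scaling the Deployment back down lets Auto Nodes deprovision the idle nodes.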
Configuration
Basic Setup
Auto Nodes requires private nodes mode and a node provider:
# Enable private nodes
privateNodes:
  enabled: true

  # Configure auto nodes
  autoNodes:
  - name: default-nodes
    enabled: true
    nodeProvider: aws # or azure, gcp, hetzner, etc.

    # Node requirements
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - t3.medium
      - t3.large

    # Provisioner limits
    limits:
      cpu: 100
      memory: 400Gi

# Control plane must be exposed
controlPlane:
  service:
    spec:
      type: LoadBalancer

# Pod network
networking:
  podCIDR: "10.244.0.0/16"

# CNI and storage
deploy:
  cni:
    flannel:
      enabled: true
  localPathProvisioner:
    enabled: true
Node Provider Configuration
Each cloud provider requires specific configuration:
AWS
Azure
GCP
Hetzner
Bare Metal
privateNodes:
  autoNodes:
  - name: aws-nodes
    enabled: true
    nodeProvider: aws

    # AWS-specific configuration
    providerConfig:
      region: us-east-1
      subnetSelector:
        karpenter.sh/discovery: my-cluster
      securityGroupSelector:
        karpenter.sh/discovery: my-cluster
      instanceProfile: KarpenterNodeInstanceProfile
      amiFamily: AL2 # Amazon Linux 2

    # Instance types
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand", "spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - t3.medium
      - t3.large
      - t3.xlarge
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]

    # Limits
    limits:
      cpu: 1000
      memory: 4000Gi
IAM Permissions Required:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateTags",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeImages",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
privateNodes:
  autoNodes:
  - name: azure-nodes
    enabled: true
    nodeProvider: azure

    # Azure-specific configuration
    providerConfig:
      location: eastus
      resourceGroup: my-vcluster-rg
      vnetName: my-vnet
      subnetName: my-subnet

    # VM sizes
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - Standard_D2s_v3
      - Standard_D4s_v3
      - Standard_D8s_v3
    limits:
      cpu: 1000
      memory: 4000Gi
Azure Service Principal Required:

az ad sp create-for-rbac --name vcluster-karpenter \
  --role Contributor \
  --scopes /subscriptions/{subscription-id}/resourceGroups/{resource-group}
privateNodes:
  autoNodes:
  - name: gcp-nodes
    enabled: true
    nodeProvider: gcp

    # GCP-specific configuration
    providerConfig:
      project: my-project
      zone: us-central1-a
      network: default
      subnetwork: default
      serviceAccount: [email protected]

    # Machine types
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - n1-standard-2
      - n1-standard-4
      - n1-standard-8
    limits:
      cpu: 1000
      memory: 4000Gi
privateNodes:
  autoNodes:
  - name: hetzner-nodes
    enabled: true
    nodeProvider: hetzner

    # Hetzner-specific configuration
    providerConfig:
      location: nbg1 # Nuremberg
      networkId: 123456

    # Server types
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - cx21 # 2 vCPU, 4GB RAM
      - cx31 # 2 vCPU, 8GB RAM
      - cx41 # 4 vCPU, 16GB RAM
    limits:
      cpu: 100
      memory: 400Gi
privateNodes:
  autoNodes:
  - name: bare-metal-nodes
    enabled: true
    nodeProvider: baremetal

    # Bare metal provisioning
    providerConfig:
      # Integration with tools like Tinkerbell, MAAS, or custom
      provisionerEndpoint: https://metal-provisioner.example.com
      machinePool: worker-pool
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - small  # 8 CPU, 32GB RAM
      - medium # 16 CPU, 64GB RAM
      - large  # 32 CPU, 128GB RAM
    limits:
      cpu: 512
      memory: 2048Gi
GPU Nodes
Configure auto-provisioning for GPU workloads:
privateNodes:
  autoNodes:
  - name: gpu-nodes
    enabled: true
    nodeProvider: aws

    # GPU instance types
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - p3.2xlarge # NVIDIA V100
      - p3.8xlarge
      - g4dn.xlarge # NVIDIA T4
      - g4dn.4xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"] # GPU instances usually on-demand

    # GPU limits
    limits:
      nvidia.com/gpu: 16
      cpu: 256
      memory: 1024Gi

    # Consolidation settings
    consolidation:
      enabled: false # Don't consolidate GPU nodes (expensive restarts)
Using GPU Nodes:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1 # Auto Nodes provisions a GPU node
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
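Because instance types surface as the standard `node.kubernetes.io/instance-type` node label, a workload can also pin a specific GPU model by selecting its instance type. A hedged sketch (the image tag and instance type are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-t4-workload
spec:
  # Pin the NVIDIA T4 instance type; Auto Nodes provisions
  # a matching node if none exists.
  nodeSelector:
    node.kubernetes.io/instance-type: g4dn.xlarge
  containers:
  - name: cuda
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```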
Multiple Node Pools
Configure different pools for different workload types:
privateNodes:
  autoNodes:
  # General purpose nodes
  - name: general
    enabled: true
    nodeProvider: aws
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [t3.medium, t3.large]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    limits:
      cpu: 100
      memory: 400Gi
    weight: 100 # Prefer general nodes

  # High-memory nodes
  - name: high-memory
    enabled: true
    nodeProvider: aws
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [r5.xlarge, r5.2xlarge, r5.4xlarge]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    limits:
      cpu: 200
      memory: 1600Gi
    weight: 50

  # GPU nodes
  - name: gpu
    enabled: true
    nodeProvider: aws
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [g4dn.xlarge, g4dn.2xlarge]
    limits:
      nvidia.com/gpu: 8
      cpu: 64
      memory: 256Gi
    weight: 10 # Lowest priority (most expensive)
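To steer a workload onto a particular pool, select an instance type that only that pool offers. A sketch (the workload name, image, and request figures are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: in-memory-cache
spec:
  replicas: 2
  selector:
    matchLabels:
      app: in-memory-cache
  template:
    metadata:
      labels:
        app: in-memory-cache
    spec:
      # r5 types are offered only by the high-memory pool,
      # so these pods land on (and can trigger) high-memory nodes.
      nodeSelector:
        node.kubernetes.io/instance-type: r5.2xlarge
      containers:
      - name: redis
        image: redis:7
        resources:
          requests:
            memory: 48Gi
```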
Advanced Configuration
Consolidation
Automatically consolidate underutilized nodes:
privateNodes:
  autoNodes:
  - name: default
    enabled: true
    nodeProvider: aws

    # Consolidation configuration
    consolidation:
      enabled: true

    # Disruption budget
    disruption:
      consolidationPolicy: WhenUnderutilized
      consolidateAfter: 30s
      expireAfter: 720h # 30 days
      budgets:
      - nodes: "10%" # Disrupt max 10% of nodes at once
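Consolidation drains nodes, which evicts the pods on them. Standard PodDisruptionBudgets are honored during eviction, so workloads can cap how many replicas are disrupted at once; a minimal sketch (label and replica count are illustrative):

```yaml
# Keep at least 2 replicas of the "web" app running
# while consolidation drains underutilized nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```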
Taints and Labels
Apply custom taints and labels to provisioned nodes:
privateNodes:
  autoNodes:
  - name: default
    enabled: true
    nodeProvider: aws

    # Node template
    template:
      metadata:
        labels:
          workload-type: general
          managed-by: karpenter
          environment: production
      spec:
        taints:
        - key: workload-type
          value: general
          effect: NoSchedule
        startupTaints:
        - key: node.kubernetes.io/not-ready
          value: "true"
          effect: NoSchedule
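Pods that should run on these tainted nodes need a matching toleration, or they will remain unschedulable there. For the `workload-type=general:NoSchedule` taint above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: general-workload
spec:
  # Tolerate the custom taint applied by the node template
  tolerations:
  - key: workload-type
    operator: Equal
    value: general
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.27
```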
Node Lifecycle
Control node lifecycle behavior:
privateNodes:
  autoNodes:
  - name: default
    enabled: true
    nodeProvider: aws

    # Lifecycle settings
    ttlSecondsUntilExpired: 2592000 # 30 days
    ttlSecondsAfterEmpty: 30 # Delete after 30s if empty

    # Limits
    limits:
      cpu: 1000
      memory: 4000Gi

    # Weight for provisioner selection
    weight: 100
User Data (Cloud-Init)
Customize node initialization:
privateNodes:
  autoNodes:
  - name: default
    enabled: true
    nodeProvider: aws
    providerConfig:
      userData: |
        #!/bin/bash
        # Custom initialization
        echo "Initializing vCluster node..."

        # Install monitoring agent
        curl -o /tmp/agent.sh https://monitoring.example.com/install.sh
        bash /tmp/agent.sh

        # Configure logging
        mkdir -p /var/log/vcluster
        systemctl enable vcluster-logging
Use Cases
Cost-Optimized Development
Perfect for: Development environments, testing, CI/CD
privateNodes:
  autoNodes:
  - name: dev-spot
    enabled: true
    nodeProvider: aws

    # Prefer spot instances
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"] # Spot only for cost savings
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - t3.medium
      - t3.large

    # Aggressive consolidation
    consolidation:
      enabled: true

    # Delete nodes quickly when unused
    ttlSecondsAfterEmpty: 60

    limits:
      cpu: 50
      memory: 200Gi
Savings: Spot instances can cut compute costs by 70-90% compared to on-demand pricing.
Production with Mixed Capacity
Perfect for: Production workloads with cost optimization
privateNodes:
  autoNodes:
  - name: on-demand
    enabled: true
    nodeProvider: aws

    # On-demand for critical workloads
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [m5.large, m5.xlarge, m5.2xlarge]
    limits:
      cpu: 100
      memory: 400Gi
    weight: 10 # Lowest priority (prefer spot)

  - name: spot
    enabled: true
    nodeProvider: aws

    # Spot for scalable workloads
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [m5.large, m5.xlarge, m5.2xlarge]
    limits:
      cpu: 400
      memory: 1600Gi
    weight: 100 # Highest priority
Configuration for workloads:
# Critical workload - require on-demand
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      containers:
      - name: app
        image: critical-app:latest
---
# Scalable workload - allow spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 10
  template:
    spec:
      # No node selector - can use spot
      containers:
      - name: app
        image: web-app:latest
AI/ML with GPU Autoscaling
Perfect for: ML training, inference workloads
privateNodes:
  autoNodes:
  # CPU nodes for supporting services
  - name: cpu
    enabled: true
    nodeProvider: aws
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [c5.2xlarge, c5.4xlarge]
    limits:
      cpu: 100
      memory: 400Gi

  # GPU nodes for training
  - name: gpu-training
    enabled: true
    nodeProvider: aws
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [p3.2xlarge, p3.8xlarge, p3.16xlarge]
    limits:
      nvidia.com/gpu: 32
      cpu: 512
      memory: 2048Gi
    consolidation:
      enabled: false # Don't disrupt training jobs

  # GPU nodes for inference
  - name: gpu-inference
    enabled: true
    nodeProvider: aws
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [g4dn.xlarge, g4dn.2xlarge]
    limits:
      nvidia.com/gpu: 16
      cpu: 128
      memory: 512Gi
    consolidation:
      enabled: true # Inference can be rescheduled
    ttlSecondsAfterEmpty: 300 # Keep warm for 5 minutes
Hybrid Cloud Bursting
Perfect for: On-prem primary, cloud burst for overflow
privateNodes:
  autoNodes:
  # On-premises nodes are joined manually and remain static;
  # Auto Nodes doesn't provision them.

  # AWS burst capacity
  - name: aws-burst
    enabled: true
    nodeProvider: aws
    requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values: [us-east-1a, us-east-1b]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [m5.large, m5.xlarge]
    limits:
      cpu: 200 # Limit cloud spend
      memory: 800Gi
    weight: 10 # Low priority (prefer on-prem)
Label on-prem nodes:
kubectl label nodes on-prem-1 on-prem-2 location=on-prem
Workload preference:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: location
                operator: In
                values: [on-prem]
Monitoring
Metrics
Auto Nodes exposes Prometheus metrics:
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vcluster-karpenter
spec:
  selector:
    matchLabels:
      app: vcluster-karpenter
  endpoints:
  - port: metrics
Key Metrics:
karpenter_nodes_total - Total nodes managed
karpenter_nodes_allocatable - Allocatable resources
karpenter_pods_state - Pod scheduling state
karpenter_interruption_received_messages - Spot interruption notices
karpenter_provisioner_scheduling_duration_seconds - Provisioning time
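These metrics can back alerting rules via the Prometheus Operator. A hedged sketch (the rule name, threshold, and exact metric label set are assumptions; adjust them to the metric names your installation actually exposes):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vcluster-karpenter-alerts
spec:
  groups:
  - name: autonodes
    rules:
    # Warn when spot interruption notices spike
    - alert: SpotInterruptionSpike
      expr: increase(karpenter_interruption_received_messages[5m]) > 3
      for: 1m
      labels:
        severity: warning
```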
Logs
View provisioning decisions:
# Karpenter controller logs
kubectl logs -n vcluster-my-vcluster <vcluster-pod> -c karpenter
# Look for:
# - "found provisionable pod(s)"
# - "computed new machine(s) to fit pod(s)"
# - "created machine"
Dashboard
Example Grafana dashboard:
{
  "dashboard": {
    "title": "vCluster Auto Nodes",
    "panels": [
      {
        "title": "Node Count",
        "targets": [
          { "expr": "sum(karpenter_nodes_total)" }
        ]
      },
      {
        "title": "CPU Utilization",
        "targets": [
          { "expr": "sum(karpenter_nodes_allocatable{resource_type='cpu'})" }
        ]
      },
      {
        "title": "Pending Pods",
        "targets": [
          { "expr": "sum(karpenter_pods_state{state='pending'})" }
        ]
      }
    ]
  }
}
Troubleshooting
Nodes Not Provisioning
Symptom: Pods stay pending, no new nodes created
# Check Karpenter logs
kubectl logs -n vcluster-my-vcluster <vcluster-pod> -c karpenter
# Common issues:
# 1. Limits reached
# 2. No matching requirements
# 3. Cloud provider API errors
# 4. Insufficient permissions
Solutions:
# Check limits
kubectl describe nodepool
# Verify cloud credentials
kubectl get secret -n vcluster-my-vcluster
# Test cloud provider API access
aws ec2 describe-instances # AWS
az vm list # Azure
gcloud compute instances list # GCP
Slow Provisioning
Symptom: Nodes take too long to become ready
# Check provisioning time
kubectl get machine -o yaml | grep provisioningTime
# Typical times:
# AWS: 2-3 minutes
# Azure: 3-4 minutes
# GCP: 2-3 minutes
# Bare Metal: 5-10 minutes
Optimization:
privateNodes:
  autoNodes:
  - name: default
    enabled: true

    # Use an AMI with pre-installed components
    providerConfig:
      amiFamily: Bottlerocket # Optimized AMI

    # Reduce startup time
    template:
      spec:
        startupTaints: [] # Remove startup taints if safe
Excessive Churn
Symptom: Nodes constantly being created/deleted
# Check consolidation settings
kubectl describe nodepool
# Monitor events
kubectl get events --sort-by='.lastTimestamp' | grep -i node
Solution:
privateNodes:
  autoNodes:
  - name: default
    enabled: true

    # Reduce churn
    ttlSecondsAfterEmpty: 300 # Wait longer before deleting
    consolidation:
      enabled: true
      consolidateAfter: 300s # Wait longer before consolidating
Cost Overruns
Symptom: Cloud bill higher than expected
# Check node count and types
kubectl get nodes -o custom-columns='NAME:.metadata.name,INSTANCE-TYPE:.metadata.labels.node\.kubernetes\.io/instance-type'
# Check limits
kubectl get nodepool -o yaml | grep -A5 limits
Cost Controls:
privateNodes:
  autoNodes:
  - name: default
    enabled: true

    # Strict limits
    limits:
      cpu: 100 # Hard cap
      memory: 400Gi

    # Prefer cheap instances
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"] # Spot only
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [t3.small, t3.medium] # Small instances only

    # Aggressive cleanup
    ttlSecondsAfterEmpty: 30
Best Practices
Always set limits to prevent runaway costs:

limits:
  cpu: 1000
  memory: 4000Gi
  nvidia.com/gpu: 16 # If using GPUs

Separate workload types for better optimization:

autoNodes:
- name: general # General workloads
- name: compute # CPU-intensive
- name: memory  # Memory-intensive
- name: gpu     # GPU workloads

Reduce costs by consolidating underutilized nodes:

consolidation:
  enabled: true

Set up alerts for provisioning failures:

# Alert when pods are pending too long
alert: PodsPendingTooLong
expr: karpenter_pods_state{state="pending"} > 0
for: 10m

Use spot/preemptible for non-critical workloads:

requirements:
- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot", "on-demand"]

Handle spot instance interruptions:

# Use a node termination handler
# Set PodDisruptionBudgets
# Run multiple replicas
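The multiple-replicas practice pairs well with topology spread constraints, so a single spot interruption never takes out a whole workload. A sketch (names and counts are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Spread replicas across nodes so losing one spot
      # node interrupts at most one replica per node.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: app
        image: web-app:latest
```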
Comparison
Auto Nodes vs Manual Provisioning
| Aspect | Manual | Auto Nodes |
|---|---|---|
| Setup Time | Hours | Minutes |
| Scaling | Manual | Automatic |
| Cost Optimization | Limited | Excellent |
| Operational Overhead | High | Low |
| Flexibility | High | Very High |
| Complexity | Low | Medium |
Auto Nodes vs Cluster Autoscaler
| Aspect | Cluster Autoscaler | Auto Nodes |
|---|---|---|
| Provisioning Speed | Slower | Faster |
| Bin Packing | Basic | Advanced |
| Cost Optimization | Good | Better |
| GPU Support | Limited | Excellent |
| Multi-Cloud | Limited | Yes |
| Bare Metal | No | Yes |
Next Steps
Private Nodes: Learn about the foundation for Auto Nodes.
Cost Optimization: Advanced cost optimization strategies.
GPU Workloads: Running GPU workloads with auto-scaling.
Node Configuration: Complete Auto Nodes configuration reference.