
Overview

The NVIDIA NIM Operator supports comprehensive resource management for NIM workloads, including traditional Kubernetes resource requests/limits and Dynamic Resource Allocation (DRA) for GPU resources.

Resource Requirements

Basic Resource Configuration

Configure CPU, memory, and GPU resources using the resources field in your NIMService spec:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-8b
  namespace: nim-service
spec:
  resources:
    requests:
      memory: "16Gi"
      cpu: "4"
      nvidia.com/gpu: 1
    limits:
      memory: "32Gi"
      cpu: "8"
      nvidia.com/gpu: 1
The resources field supports traditional Kubernetes resources (CPU, memory) and custom device plugin resources (nvidia.com/gpu). For DRA-based GPU allocation, use the draResources field instead.

GPU Resource Configuration

Device Plugin (Traditional)

The traditional approach uses the NVIDIA device plugin to allocate GPUs:
resources:
  limits:
    nvidia.com/gpu: 1

Dynamic Resource Allocation (DRA)

DRA provides fine-grained GPU resource allocation with attribute-based selection. This is the recommended approach for production deployments.
To have the operator create a DRA resource claim automatically, supply a minimal claimCreationSpec:
draResources:
- claimCreationSpec:
    devices:
    - name: gpu
      count: 1
      deviceClassName: gpu.nvidia.com
The operator will automatically generate a ResourceClaimTemplate with default settings.
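For reference, the generated template is roughly equivalent to a hand-written ResourceClaimTemplate like the one below. This is a sketch only: the resource.k8s.io API version depends on your Kubernetes release, and the operator chooses the actual object name and defaults.
```yaml
# Sketch of the auto-generated template; name and API version are assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: llama-3-8b-gpu   # illustrative; the operator picks the name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        count: 1
```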

DRA Attribute Selectors

DRA supports attribute-based device selection using various operators:
attributeSelectors[].key (string, required)
  Device attribute name (e.g., productName, memory, cuda.computeCapability).

attributeSelectors[].op (enum, default: Equal)
  Comparison operator: Equal, NotEqual, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual.

attributeSelectors[].value (object)
  Value to compare against. Supports:
  • boolValue: true/false
  • intValue: numeric value
  • stringValue: string value (max 64 chars)
  • versionValue: semantic version (semver 2.0.0)
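Putting these fields together, a selector list might look like the following. The attribute names and values are illustrative; the attributes actually available depend on the DRA driver:
```yaml
attributeSelectors:
- key: productName
  op: NotEqual
  value:
    stringValue: "NVIDIA-T4"   # illustrative: exclude a specific product
- key: memory
  op: GreaterThanOrEqual
  value:
    intValue: 40               # illustrative: at least 40 (GB) per device
```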

DRA Capacity Selectors

Filter devices by resource capacity:
capacitySelectors:
- key: memory
  op: GreaterThanOrEqual
  value: 80Gi
- key: gpu.nvidia.com/bandwidth
  op: GreaterThan
  value: 1TB
capacitySelectors[].key (string, required)
  Resource name (e.g., memory, gpu.nvidia.com/bandwidth).

capacitySelectors[].op (enum, default: Equal)
  Comparison operator (same operators as attributeSelectors[].op).

capacitySelectors[].value (Quantity, required)
  Kubernetes resource quantity (e.g., 80Gi, 1TB).
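Capacity selectors are applied per device entry in a claim. For example, a sketch following the draResources shape shown earlier (values are illustrative):
```yaml
draResources:
- claimCreationSpec:
    devices:
    - name: gpu
      count: 2
      deviceClassName: gpu.nvidia.com
      capacitySelectors:
      - key: memory
        op: GreaterThanOrEqual
        value: 80Gi   # illustrative: only GPUs with at least 80Gi of memory
```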

Workload Size Examples

Small Workload (1 GPU)

Ideal for testing and development:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-2-1b
spec:
  resources:
    requests:
      memory: 8Gi
      cpu: 2
    limits:
      nvidia.com/gpu: 1
      memory: 16Gi
      cpu: 4

Medium Workload (4 GPUs)

Production workload for medium-sized models:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-70b
spec:
  resources:
    requests:
      memory: 64Gi
      cpu: 16
    limits:
      nvidia.com/gpu: 4
      memory: 128Gi
      cpu: 32
  draResources:
  - claimCreationSpec:
      devices:
      - name: gpu
        count: 4
        attributeSelectors:
        - key: memory
          op: GreaterThanOrEqual
          value:
            intValue: 80

Large Workload (8 GPUs)

Large-scale production deployment:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-405b
spec:
  resources:
    requests:
      memory: 128Gi
      cpu: 32
    limits:
      nvidia.com/gpu: 8
      memory: 256Gi
      cpu: 64

Multi-Node Workload

For models requiring multiple nodes with tensor/pipeline parallelism. Here tensor=8 and pipeline=2, so each replica spans 16 GPUs across two 8-GPU nodes:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
spec:
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 8
    requests:
      nvidia.com/gpu: 8
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
    mpi:
      mpiStartTimeout: 6000
Multi-node NIMService requires NVLink-enabled nodes and cannot be used with autoscaling enabled.

Shared Memory Configuration

NIM containers require shared memory for fast model I/O. Configure the shared memory size:
storage:
  sharedMemorySizeLimit: 16Gi
storage.sharedMemorySizeLimit (Quantity)
  Maximum size of the shared memory volume (an emptyDir with medium: Memory). Recommended: 50% of total GPU memory.
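Under the hood, this maps to a memory-backed emptyDir volume mounted at /dev/shm. A rough sketch of the equivalent pod-level construct (the operator manages the actual volume name and mount; names below are assumed):
```yaml
# Approximate pod spec fragment the operator provisions for shared memory.
volumes:
- name: dshm           # assumed volume name
  emptyDir:
    medium: Memory     # backed by RAM, not node disk
    sizeLimit: 16Gi
containers:
- name: nim
  volumeMounts:
  - name: dshm
    mountPath: /dev/shm
```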

Best Practices

Resource Planning
  1. Memory Overhead: Add 20-30% overhead to model size for runtime requirements
  2. CPU Allocation: Allocate 4-8 CPU cores per GPU for optimal performance
  3. Shared Memory: Set to 50-70% of total GPU memory for model caching
  4. DRA for Production: Use DRA with attribute selectors for consistent GPU allocation
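Applying these rules of thumb to an 8B-parameter model (roughly 16 GiB of FP16 weights on one 80 GiB GPU) might yield a spec like the following. The numbers are illustrative, not official sizing guidance:
```yaml
resources:
  requests:
    memory: 20Gi   # ~16 GiB weights + ~25% runtime overhead
    cpu: 4         # lower end of the 4-8 cores-per-GPU guideline
  limits:
    nvidia.com/gpu: 1
    memory: 24Gi
    cpu: 8
storage:
  sharedMemorySizeLimit: 40Gi  # ~50% of 80 GiB GPU memory
```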

Resource Limits vs Requests

  • Requests: Guaranteed resources for scheduling
  • Limits: Maximum resources the container can use
resources:
  requests:  # Guaranteed minimum
    memory: 32Gi
    cpu: 8
  limits:    # Maximum allowed
    memory: 64Gi
    cpu: 16
    nvidia.com/gpu: 2

GPU Selection Strategies

attributeSelectors:
- key: productName
  op: Equal
  value:
    stringValue: "NVIDIA-A100-SXM4-80GB"
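Pinning an exact product name gives fully deterministic hardware, but a minimum-capability selector keeps the claim satisfiable across GPU generations. A hedged alternative (threshold is illustrative):
```yaml
attributeSelectors:
- key: cuda.computeCapability
  op: GreaterThanOrEqual
  value:
    versionValue: "8.0"   # illustrative: Ampere-class or newer
```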

Troubleshooting

Pod Pending Due to Insufficient Resources

Check resource availability:
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

DRA Claim Not Satisfied

View claim status:
kubectl get resourceclaim -n <namespace>
kubectl describe resourceclaim <claim-name> -n <namespace>

GPU Not Detected

Verify NVIDIA device plugin or DRA driver is running:
kubectl get pods -n nvidia-device-plugin
kubectl get pods -n nvidia-dra-driver
