
Overview

NIMService is the primary resource for deploying NVIDIA Inference Microservices (NIMs) on Kubernetes. It manages the lifecycle of NIM deployments, including scaling, networking, storage, and multi-node configurations.

API Group: apps.nvidia.com
API Version: v1alpha1
Kind: NIMService

Spec Fields

Image Configuration

spec.image
Image
required
Container image configuration for the NIM service.

Authentication

spec.authSecret
string
required
Name of an existing Kubernetes secret containing the NGC_API_KEY for authenticating with NGC
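The referenced secret is a standard Kubernetes Secret whose `NGC_API_KEY` key holds the NGC API key, as described above. A minimal sketch (the secret name and namespace are illustrative):

```yaml
# Secret consumed via spec.authSecret; the NGC_API_KEY key name
# follows the field description above.
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-key
  namespace: nim-service
type: Opaque
stringData:
  NGC_API_KEY: <your NGC API key>
```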

Container Overrides

spec.command
[]string
Override the container’s entrypoint command
spec.args
[]string
Arguments to pass to the container command
spec.env
[]EnvVar
Additional environment variables to set in the NIM container. Merged with standard environment variables.
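These three fields follow the usual Kubernetes container override semantics: `command` replaces the image entrypoint, `args` are appended to it, and `env` entries are merged with the operator-provided defaults. A hedged sketch (the command, flag, and variable values are illustrative, not defaults of any NIM image):

```yaml
spec:
  # Replaces the image entrypoint entirely (illustrative command)
  command: ["/opt/nim/start-server.sh"]
  # Passed to the command above (illustrative flag)
  args:
    - --log-format=json
  # Merged with the standard environment variables
  env:
    - name: LOG_LEVEL
      value: "INFO"
```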

Storage

spec.storage
NIMServiceStorage
Storage configuration for caching NIM models

Scheduling

spec.labels
map[string]string
Additional labels to apply to NIMService pods
spec.annotations
map[string]string
Additional annotations to apply to NIMService pods
spec.nodeSelector
map[string]string
Node selector labels for pod scheduling
spec.tolerations
[]Toleration
Tolerations for pod scheduling
spec.affinity
Affinity
Affinity rules for pod scheduling
spec.podAffinity
PodAffinity
Deprecated: Use spec.affinity instead
spec.schedulerName
string
Name of the scheduler to use for pod scheduling

Resources

spec.resources
ResourceRequirements
CPU, memory, and GPU resource requirements. Traditional resources and device plugin resources are supported here.
spec.draResources
[]DRAResource
Dynamic Resource Allocation (DRA) resource claims. This field is immutable after creation.
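As a rough illustration of how a DRA claim might be referenced (the `resourceClaimName` field name is an assumption about the DRAResource schema, not confirmed by this page):

```yaml
spec:
  draResources:
    # References a pre-existing ResourceClaim in the same namespace
    # (field name assumed; check the DRAResource type for the exact schema).
    - resourceClaimName: gpu-claim
```

Because this field is immutable after creation, the claim layout must be decided before the NIMService is applied.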

Networking

spec.expose
Expose
Service exposure configuration

Health Probes

spec.livenessProbe
Probe
Liveness probe configuration
spec.readinessProbe
Probe
Readiness probe configuration. Defaults to HTTP GET on /v1/health/ready
spec.startupProbe
Probe
Startup probe configuration. Defaults to HTTP GET on /v1/health/ready with 120 failure threshold

Scaling

spec.replicas
int32
Number of replicas. Minimum: 0. Cannot be set when spec.scale.enabled is true.
spec.scale
Autoscaling
Horizontal Pod Autoscaler configuration. Cannot be enabled when multiNode is set.
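A hedged sketch of enabling autoscaling instead of a fixed `replicas` count. The `hpa` sub-field is assumed to mirror the standard Kubernetes HorizontalPodAutoscalerSpec shape; verify against the Autoscaling type:

```yaml
spec:
  scale:
    enabled: true        # mutually exclusive with spec.replicas
    hpa:                 # assumed to follow the HorizontalPodAutoscalerSpec shape
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
```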

Monitoring

spec.metrics
Metrics
Metrics collection configuration

Security

spec.userID
int64
User ID to run the container as (default: 1000)
spec.groupID
int64
Group ID to run the container as (default: 2000)
spec.runtimeClassName
string
RuntimeClass to use for the pods

Proxy Configuration

spec.proxy
ProxySpec
HTTP/HTTPS proxy configuration
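A minimal sketch of a proxied deployment, for clusters that reach NGC through a corporate proxy (the field names are assumptions about ProxySpec; the proxy host is illustrative):

```yaml
spec:
  proxy:
    httpProxy: http://proxy.internal:3128
    httpsProxy: http://proxy.internal:3128
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
```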

Multi-Node Configuration

spec.multiNode
NimServiceMultiNodeConfig
Multi-node deployment configuration using LeaderWorkerSet. Cannot be used with autoscaling.
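As a rough sketch only (the `size` field name is an assumption about NimServiceMultiNodeConfig; consult the type definition for the real schema):

```yaml
spec:
  multiNode:
    size: 2   # pods per model replica across nodes (assumed field name)
  # spec.scale must remain disabled: multi-node and autoscaling are mutually exclusive
```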

Inference Platform

spec.inferencePlatform
string
default:"standalone"
Inference platform to use. Valid values: standalone, kserve

Init and Sidecar Containers

spec.initContainers
[]NIMContainerSpec
Init containers to run before the main NIM container
spec.sidecarContainers
[]NIMContainerSpec
Sidecar containers to run alongside the main NIM container
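Assuming NIMContainerSpec mirrors the core Kubernetes container fields (name, image, command, and so on), init and sidecar containers might be declared like this; the container names, images, and commands are illustrative:

```yaml
spec:
  initContainers:
    - name: wait-for-storage      # runs to completion before the NIM container starts
      image: busybox:1.36
      command: ["sh", "-c", "until test -d /model-store; do sleep 2; done"]
  sidecarContainers:
    - name: log-forwarder         # runs alongside the NIM container
      image: fluent/fluent-bit:2.2
```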

Status Fields

status.conditions
[]Condition
Standard Kubernetes conditions for the NIMService
status.availableReplicas
int32
Number of available replicas
status.state
string
Current state of the NIMService. Values: Pending, NotReady, Ready, Failed
status.model
ModelStatus
Model endpoint information
status.draResourceStatuses
[]DRAResourceStatus
Status of DRA resources (list indexed by name)
status.computeDomainStatus
ComputeDomainStatus
Status of the ComputeDomain for multi-node deployments
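Taken together, a healthy single-node deployment might report a status along these lines (values are illustrative; the condition list follows standard Kubernetes condition conventions):

```yaml
status:
  state: Ready
  availableReplicas: 2
  conditions:
    - type: Ready
      status: "True"
      reason: Ready
```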

Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  # Image configuration
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: "1.0.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  # Authentication
  authSecret: ngc-api-key
  
  # Storage
  storage:
    pvc:
      create: true
      storageClass: standard
      size: 50Gi
      volumeAccessMode: ReadWriteOnce
    sharedMemorySizeLimit: 1Gi
  
  # Resources
  resources:
    requests:
      nvidia.com/gpu: "1"
      memory: 16Gi
      cpu: "4"
    limits:
      nvidia.com/gpu: "1"
      memory: 16Gi
  
  # Scheduling
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  
  # Networking
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      hostDomainName: example.com
      ingress:
        ingressClass: nginx
        tlsSecretName: nim-tls
  
  # Scaling
  replicas: 2
  
  # Health probes
  livenessProbe:
    enabled: true
  readinessProbe:
    enabled: true
  startupProbe:
    enabled: true
  
  # Monitoring
  metrics:
    enabled: true
    serviceMonitor:
      interval: 30s