
Overview

NIMService is the primary resource for deploying NVIDIA Inference Microservices (NIMs) on Kubernetes. It manages the lifecycle of NIM deployments, including scaling, networking, storage, and multi-node configurations.

API Group: apps.nvidia.com
API Version: v1alpha1
Kind: NIMService

Spec Fields

Image Configuration

spec.image
Image
required
Container image configuration for the NIM service.

Authentication

spec.authSecret
string
required
Name of an existing Kubernetes secret containing the NGC_API_KEY for authenticating with NGC
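The referenced secret is a standard Kubernetes Secret whose `NGC_API_KEY` key holds the NGC API key, as described above. A minimal sketch (the secret name and namespace are illustrative):

```yaml
# Secret consumed via spec.authSecret; the NGC_API_KEY key name
# follows the field description above.
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-key
  namespace: nim-service
type: Opaque
stringData:
  NGC_API_KEY: <your NGC API key>
```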

Container Overrides

spec.command
[]string
Override the container’s entrypoint command
spec.args
[]string
Arguments to pass to the container command
spec.env
[]EnvVar
Additional environment variables to set in the NIM container. Merged with standard environment variables.
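These three fields follow the usual Kubernetes container override semantics: `command` replaces the image entrypoint, `args` are appended to it, and `env` entries are merged with the operator-provided defaults. A hedged sketch (the command, flag, and variable values are illustrative, not defaults of any NIM image):

```yaml
spec:
  # Replaces the image entrypoint entirely (illustrative command)
  command: ["/opt/nim/start-server.sh"]
  # Passed to the command above (illustrative flag)
  args:
    - --log-format=json
  # Merged with the standard environment variables
  env:
    - name: LOG_LEVEL
      value: "INFO"
```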

Storage

spec.storage
NIMServiceStorage
Storage configuration for caching NIM models

Scheduling

spec.labels
map[string]string
Additional labels to apply to NIMService pods
spec.annotations
map[string]string
Additional annotations to apply to NIMService pods
spec.nodeSelector
map[string]string
Node selector labels for pod scheduling
spec.tolerations
[]Toleration
Tolerations for pod scheduling
spec.affinity
Affinity
Affinity rules for pod scheduling
spec.podAffinity
PodAffinity
Deprecated: Use spec.affinity instead
spec.schedulerName
string
Name of the scheduler to use for pod scheduling

Resources

spec.resources
ResourceRequirements
CPU, memory, and GPU resource requirements. Traditional resources and device plugin resources are supported here.
spec.draResources
[]DRAResource
Dynamic Resource Allocation (DRA) resource claims. This field is immutable after creation.
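As a rough illustration of how a DRA claim might be referenced (the `resourceClaimName` field name is an assumption about the DRAResource schema, not confirmed by this page):

```yaml
spec:
  draResources:
    # References a pre-existing ResourceClaim in the same namespace
    # (field name assumed; check the DRAResource type for the exact schema).
    - resourceClaimName: gpu-claim
```

Because this field is immutable after creation, the claim layout must be decided before the NIMService is applied.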

Networking

spec.expose
Expose
Service exposure configuration

Health Probes

spec.livenessProbe
Probe
Liveness probe configuration
spec.readinessProbe
Probe
Readiness probe configuration. Defaults to HTTP GET on /v1/health/ready
spec.startupProbe
Probe
Startup probe configuration. Defaults to HTTP GET on /v1/health/ready with 120 failure threshold

Scaling

spec.replicas
int32
Number of replicas. Minimum: 0. Cannot be set when spec.scale.enabled is true.
spec.scale
Autoscaling
Horizontal Pod Autoscaler configuration. Cannot be enabled when multiNode is set.
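A hedged sketch of enabling autoscaling instead of a fixed `replicas` count. The `hpa` sub-field is assumed to mirror the standard Kubernetes HorizontalPodAutoscalerSpec shape; verify against the Autoscaling type:

```yaml
spec:
  scale:
    enabled: true        # mutually exclusive with spec.replicas
    hpa:                 # assumed to follow the HorizontalPodAutoscalerSpec shape
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
```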

Monitoring

spec.metrics
Metrics
Metrics collection configuration

Security

spec.userID
int64
User ID to run the container as (default: 1000)
spec.groupID
int64
Group ID to run the container as (default: 2000)
spec.runtimeClassName
string
RuntimeClass to use for the pods

Proxy Configuration

spec.proxy
ProxySpec
HTTP/HTTPS proxy configuration
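A minimal sketch of a proxied deployment, for clusters that reach NGC through a corporate proxy (the field names are assumptions about ProxySpec; the proxy host is illustrative):

```yaml
spec:
  proxy:
    httpProxy: http://proxy.internal:3128
    httpsProxy: http://proxy.internal:3128
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
```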

Multi-Node Configuration

spec.multiNode
NimServiceMultiNodeConfig
Multi-node deployment configuration using LeaderWorkerSet. Cannot be used with autoscaling.
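As a rough sketch only (the `size` field name is an assumption about NimServiceMultiNodeConfig; consult the type definition for the real schema):

```yaml
spec:
  multiNode:
    size: 2   # pods per model replica across nodes (assumed field name)
  # spec.scale must remain disabled: multi-node and autoscaling are mutually exclusive
```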

Inference Platform

spec.inferencePlatform
string
default:"standalone"
Inference platform to use. Valid values: standalone, kserve

Init and Sidecar Containers

spec.initContainers
[]NIMContainerSpec
Init containers to run before the main NIM container
spec.sidecarContainers
[]NIMContainerSpec
Sidecar containers to run alongside the main NIM container
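Assuming NIMContainerSpec mirrors the core Kubernetes container fields (name, image, command, and so on), init and sidecar containers might be declared like this; the container names, images, and commands are illustrative:

```yaml
spec:
  initContainers:
    - name: wait-for-storage      # runs to completion before the NIM container starts
      image: busybox:1.36
      command: ["sh", "-c", "until test -d /model-store; do sleep 2; done"]
  sidecarContainers:
    - name: log-forwarder         # runs alongside the NIM container
      image: fluent/fluent-bit:2.2
```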

Status Fields

status.conditions
[]Condition
Standard Kubernetes conditions for the NIMService
status.availableReplicas
int32
Number of available replicas
status.state
string
Current state of the NIMService. Values: Pending, NotReady, Ready, Failed
status.model
ModelStatus
Model endpoint information
status.draResourceStatuses
[]DRAResourceStatus
Status of DRA resources (list indexed by name)
status.computeDomainStatus
ComputeDomainStatus
Status of the ComputeDomain for multi-node deployments
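Taken together, a healthy single-node deployment might report a status along these lines (values are illustrative; the condition list follows standard Kubernetes condition conventions):

```yaml
status:
  state: Ready
  availableReplicas: 2
  conditions:
    - type: Ready
      status: "True"
      reason: Ready
```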

Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  # Image configuration
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: "1.0.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  # Authentication
  authSecret: ngc-api-key
  
  # Storage
  storage:
    pvc:
      create: true
      storageClass: standard
      size: 50Gi
      volumeAccessMode: ReadWriteOnce
    sharedMemorySizeLimit: 1Gi
  
  # Resources
  resources:
    requests:
      nvidia.com/gpu: "1"
      memory: 16Gi
      cpu: "4"
    limits:
      nvidia.com/gpu: "1"
      memory: 16Gi
  
  # Scheduling
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  
  # Networking
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      hostDomainName: example.com
      ingress:
        ingressClass: nginx
        tlsSecretName: nim-tls
  
  # Scaling
  replicas: 2
  
  # Health probes
  livenessProbe:
    enabled: true
  readinessProbe:
    enabled: true
  startupProbe:
    enabled: true
  
  # Monitoring
  metrics:
    enabled: true
    serviceMonitor:
      interval: 30s