Overview
NIMService is the primary resource for deploying NVIDIA Inference Microservices (NIMs) on Kubernetes. It manages the lifecycle of NIM deployments, including scaling, networking, storage, and multi-node configurations.
API Group: apps.nvidia.com
API Version: v1alpha1
Kind: NIMService
Spec Fields
Image Configuration
spec.image
Container image configuration for the NIM service.
spec.image.repository
Container image repository (e.g., nvcr.io/nim/meta/llama3-8b-instruct)
spec.image.tag
Container image tag (e.g., "1.0.0")
spec.image.pullPolicy
Image pull policy. Valid values: Always, IfNotPresent, Never
spec.image.pullSecrets
Names of Kubernetes secrets for pulling private images
Authentication
spec.authSecret
Name of an existing Kubernetes secret containing the NGC_API_KEY for authenticating with NGC
Container Overrides
spec.command
Override the container's entrypoint command
spec.args
Arguments to pass to the container command
spec.env
Additional environment variables to set in the NIM container. Merged with the standard environment variables.
Storage
spec.storage
Storage configuration for caching NIM models.
spec.storage.nimCache
Reference to a NIMCache resource for model storage.
spec.storage.nimCache.name
Name of the NIMCache resource
spec.storage.nimCache.profile
Specific model profile to use from the NIMCache
spec.storage.pvc
PersistentVolumeClaim for model storage.
spec.storage.pvc.create
Whether to create a new PVC (true) or use an existing one (false)
spec.storage.pvc.name
Name of the PVC. Required if create is false.
spec.storage.pvc.storageClass
StorageClass to use for PVC creation
spec.storage.pvc.size
Size of the PVC (e.g., 50Gi)
spec.storage.pvc.volumeAccessMode
Volume access mode (e.g., ReadWriteOnce, ReadWriteMany)
Path inside the PVC to mount
spec.storage.pvc.annotations
Annotations to add to the PVC
Host path for model storage (deprecated, use PVC instead)
spec.storage.emptyDir
EmptyDir volume for ephemeral model storage.
spec.storage.emptyDir.sizeLimit
Size limit for the emptyDir volume
spec.storage.sharedMemorySizeLimit
Maximum size of the shared memory volume (e.g., 1Gi). Used for fast model I/O.
Whether to mount the storage volume as read-only
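The full example at the end of this page mounts a PVC; for completeness, here is a minimal sketch of the NIMCache-backed alternative. The cache name and profile value are illustrative placeholders, not values from this page:

```yaml
storage:
  nimCache:
    name: llama3-8b-cache   # illustrative: name of an existing NIMCache resource
    profile: ""             # optional: pin a specific model profile from the cache
  sharedMemorySizeLimit: 1Gi
```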
Scheduling
Additional labels to apply to NIMService pods
Additional annotations to apply to NIMService pods
spec.nodeSelector
Node selector labels for pod scheduling
spec.tolerations
Tolerations for pod scheduling
spec.affinity
Affinity rules for pod scheduling
Deprecated: Use spec.affinity instead
Name of the scheduler to use for pod scheduling
Resources
spec.resources
CPU, memory, and GPU resource requirements. Both traditional resources and device plugin resources are supported here.
spec.resources.requests
Minimum resource requirements (e.g., {"nvidia.com/gpu": "1", "memory": "16Gi"})
spec.resources.limits
Maximum resource limits
spec.draResources
Dynamic Resource Allocation (DRA) resource claims. This field is immutable after creation.
spec.draResources[].resourceClaimName
Name of an existing ResourceClaim in the same namespace. Mutually exclusive with resourceClaimTemplateName and claimCreationSpec. Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$
spec.draResources[].resourceClaimTemplateName
Name of a ResourceClaimTemplate to create claims from. Mutually exclusive with resourceClaimName and claimCreationSpec. Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$
spec.draResources[].claimCreationSpec
Spec to auto-generate a ResourceClaimTemplate. Mutually exclusive with resourceClaimName and resourceClaimTemplateName.
spec.draResources[].claimCreationSpec.generateName
Name prefix for generated ResourceClaimTemplate (1-16 characters)
spec.draResources[].claimCreationSpec.devices
Device specifications (minimum 1 device required)
devices[].name
Name of the device request (DNS_LABEL format)
devices[].count
Number of devices to request
devices[].deviceClassName
string
default: "gpu.nvidia.com"
DeviceClass to inherit configuration from
devices[].driverName
string
default: "gpu.nvidia.com"
DRA driver name (DNS subdomain format)
devices[].attributeSelectors
[]DRADeviceAttributeSelector
Attribute-based device selection criteria (max 20)
devices[].capacitySelectors
[]DRAResourceQuantitySelector
Capacity-based device selection criteria (max 12)
CEL expressions for device selection. Cannot be used with attributeSelectors or capacitySelectors.
spec.draResources[].requests
Subset of requests from the claim to make available. If empty, all requests are available.
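Since the three claim sources are mutually exclusive per entry, a hedged sketch of each variant may help; the claim and template names are illustrative, and only one of the three keys may be set on any one list item:

```yaml
draResources:
  # Variant A: bind to an existing ResourceClaim
  - resourceClaimName: shared-gpu-claim           # illustrative name
  # Variant B: generate claims from an existing template
  - resourceClaimTemplateName: gpu-claim-template # illustrative name
  # Variant C: auto-generate a ResourceClaimTemplate
  - claimCreationSpec:
      generateName: nim-gpu-            # 1-16 character prefix
      devices:
        - name: gpu                     # device request name (DNS_LABEL)
          count: 1
          deviceClassName: gpu.nvidia.com   # default
          driverName: gpu.nvidia.com        # default
```

In a real spec, each list entry would carry exactly one of the three variants.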
Networking
spec.expose
Service exposure configuration.
spec.expose.service
Service configuration.
spec.expose.service.type
Service type (e.g., ClusterIP, NodePort, LoadBalancer)
spec.expose.service.name
Override the default service name
spec.expose.service.port
Main API serving port (1-65535)
spec.expose.service.grpcPort
GRPC serving port for Triton-based NIMs (1-65535)
spec.expose.service.metricsPort
Metrics endpoint port for Triton-based NIMs (1-65535)
spec.expose.service.annotations
Service annotations
spec.expose.router
Router configuration for ingress or gateway.
spec.expose.router.hostDomainName
Domain name for constructing hostnames. Pattern: ^(([a-z0-9][a-z0-9\-]*[a-z0-9])|[a-z0-9]+\.)*([a-z]+|xn\-\-[a-z0-9]+)\.?$
spec.expose.router.annotations
Router annotations
spec.expose.router.ingress
Ingress controller configuration. Mutually exclusive with gateway.
spec.expose.router.ingress.ingressClass
Ingress class to use
spec.expose.router.ingress.tlsSecretName
Name of the TLS certificate secret
spec.expose.router.gateway
Gateway API configuration. Mutually exclusive with ingress.
spec.expose.router.gateway.namespace
Gateway namespace
spec.expose.router.gateway.name
Gateway name
spec.expose.router.gateway.httpRoutesEnabled
Enable HTTPRoutes
spec.expose.router.gateway.grpcRoutesEnabled
Enable GRPCRoutes
spec.expose.router.gateway.backendRef
Backend reference to forward requests to
spec.expose.router.eppConfig
Endpoint Picker Extension configuration (standalone platform only)
Deprecated: Use spec.expose.router.ingress instead
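The example at the end of this page shows the ingress variant; as a sketch under stated assumptions, a Gateway API variant might look as follows. The gateway name, namespace, and the gRPC/metrics port numbers are illustrative, not values confirmed by this page:

```yaml
expose:
  service:
    type: ClusterIP
    port: 8000
    grpcPort: 8001      # Triton-based NIMs only; illustrative port
    metricsPort: 8002   # Triton-based NIMs only; illustrative port
  router:
    hostDomainName: example.com
    gateway:
      namespace: gateway-system   # illustrative
      name: shared-gateway        # illustrative
      httpRoutesEnabled: true
      grpcRoutesEnabled: false
```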
Health Probes
spec.livenessProbe
Liveness probe configuration.
spec.livenessProbe.enabled
Whether to enable the probe (default: true)
Kubernetes probe configuration. If not specified, defaults to HTTP GET on /v1/health/live
spec.readinessProbe
Readiness probe configuration. Defaults to HTTP GET on /v1/health/ready
spec.startupProbe
Startup probe configuration. Defaults to HTTP GET on /v1/health/ready with a failure threshold of 120
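Overriding a default probe might look like the sketch below; the nested probe key (a standard corev1.Probe) is an assumption based on the field listing above, and the timing values are illustrative:

```yaml
livenessProbe:
  enabled: true
  probe:                   # assumed field name for the corev1.Probe override
    httpGet:
      path: /v1/health/live
      port: 8000
    initialDelaySeconds: 15
    periodSeconds: 10
```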
Scaling
spec.replicas
Number of replicas. Minimum: 0. Cannot be set when spec.scale.enabled is true.
spec.scale
Horizontal Pod Autoscaler configuration. Cannot be enabled when multiNode is set.
spec.scale.hpa
HorizontalPodAutoscalerSpec
HPA specification.
spec.scale.hpa.minReplicas
Minimum number of replicas (minimum: 1)
spec.scale.hpa.maxReplicas
Maximum number of replicas
spec.scale.hpa.metrics
Metrics to use for scaling decisions
spec.scale.hpa.behavior
HorizontalPodAutoscalerBehavior
Scaling behavior configuration
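A hedged autoscaling sketch, assuming the metrics list follows the standard autoscaling/v2 shape; the replica bounds and utilization target are illustrative values:

```yaml
scale:
  enabled: true          # replicas must then be left unset
  hpa:
    minReplicas: 1
    maxReplicas: 4
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
```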
Monitoring
spec.metrics
Metrics collection configuration.
spec.metrics.enabled
Enable ServiceMonitor creation for Prometheus Operator
spec.metrics.serviceMonitor
ServiceMonitor configuration.
spec.metrics.serviceMonitor.additionalLabels
Additional labels for the ServiceMonitor
spec.metrics.serviceMonitor.annotations
ServiceMonitor annotations
spec.metrics.serviceMonitor.interval
Scrape interval
spec.metrics.serviceMonitor.scrapeTimeout
Scrape timeout
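A minimal monitoring sketch using the fields above; the release label is an illustrative example of matching a Prometheus Operator selector:

```yaml
metrics:
  enabled: true
  serviceMonitor:
    additionalLabels:
      release: prometheus   # illustrative: match your Prometheus Operator's selector
    interval: 30s
    scrapeTimeout: 10s
```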
Security
User ID to run the container as (default: 1000)
Group ID to run the container as (default: 2000)
RuntimeClass to use for the pods
Proxy Configuration
spec.proxy
HTTP/HTTPS proxy configuration.
spec.proxy.noProxy
Comma-separated list of hosts to exclude from proxying
Name of ConfigMap containing custom CA certificates
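As a sketch only: the field names below (proxy, httpProxy, httpsProxy, noProxy, certConfigMap) are assumptions inferred from the descriptions above, not confirmed by this page, and the proxy addresses are placeholders:

```yaml
proxy:
  httpProxy: http://proxy.internal:3128    # assumed field name; placeholder address
  httpsProxy: http://proxy.internal:3128   # assumed field name; placeholder address
  noProxy: localhost,127.0.0.1,.cluster.local
  certConfigMap: custom-ca-bundle          # ConfigMap with custom CA certificates
```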
Multi-Node Configuration
spec.multiNode
NimServiceMultiNodeConfig
Multi-node deployment configuration using LeaderWorkerSet. Cannot be used with autoscaling.
spec.multiNode.backendType
Backend type for multi-node deployment. Valid values: lws
spec.multiNode.parallelism
Parallelism configuration.
spec.multiNode.parallelism.pipeline
Pipeline parallelism size (minimum: 1)
spec.multiNode.parallelism.tensor
Tensor parallelism size (minimum: 1)
spec.multiNode.mpi
MPI configuration for LeaderWorkerSet.
spec.multiNode.mpi.mpiStartTimeout
Timeout in seconds for starting the MPI cluster
spec.multiNode.computeDomain
ComputeDomain configuration for NVLink-enabled nodes.
spec.multiNode.computeDomain.create
Whether to create a new ComputeDomain. If false, name must be specified.
spec.multiNode.computeDomain.name
Name of an existing ComputeDomain (required if create is false)
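A hedged multi-node sketch using the fields above; the parallelism sizes and timeout are illustrative values, not recommendations:

```yaml
multiNode:
  backendType: lws
  parallelism:
    pipeline: 2          # illustrative pipeline parallelism size
    tensor: 8            # illustrative tensor parallelism size
  mpi:
    mpiStartTimeout: 300 # seconds; illustrative
  computeDomain:
    create: true         # or set create: false and name an existing ComputeDomain
```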
spec.inferencePlatform
string
default: "standalone"
Inference platform to use. Valid values: standalone, kserve
Init and Sidecar Containers
spec.initContainers
Init containers to run before the main NIM container.
spec.initContainers[].name
Container name
spec.initContainers[].image
Container image
spec.initContainers[].command
Container command
spec.initContainers[].args
Container arguments
spec.initContainers[].env
Environment variables
spec.initContainers[].workingDir
Working directory
Sidecar containers to run alongside the main NIM container
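An init-container sketch using the fields above; the container name, image, and wait logic are illustrative placeholders:

```yaml
initContainers:
  - name: wait-for-model      # illustrative
    image: busybox:1.36       # illustrative
    command: ["sh", "-c"]
    args: ["until [ -d /model-store ]; do sleep 5; done"]  # illustrative wait loop
    env:
      - name: LOG_LEVEL
        value: info
    workingDir: /
```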
Status Fields
Standard Kubernetes conditions for the NIMService
Number of available replicas
Current state of the NIMService. Values: Pending, NotReady, Ready, Failed
status.model
Model endpoint information.
status.model.clusterEndpoint
Internal cluster endpoint for the model
status.model.externalEndpoint
External endpoint for the model
status.draResourceStatuses
Status of DRA resources (list indexed by name).
status.draResourceStatuses[].name
Pod claim name referenced in the pod spec
status.draResourceStatuses[].resourceClaimStatus
DRAResourceClaimStatusInfo
Status of a directly referenced ResourceClaim.
Name of the ResourceClaim
State: pending, deleted, allocated, or reserved
status.draResourceStatuses[].resourceClaimTemplateStatus
DRAResourceClaimTemplateStatusInfo
Status of a ResourceClaimTemplate.
Name of the ResourceClaimTemplate
resourceClaimStatuses
[]DRAResourceClaimStatusInfo
Statuses of generated ResourceClaims
status.computeDomainStatus
Status of the ComputeDomain for multi-node deployments.
status.computeDomainStatus.name
ComputeDomain name
status.computeDomainStatus.status
ComputeDomain status
status.computeDomainStatus.nodes
[]ComputeDomainNodeStatus
Status of nodes in the ComputeDomain. Each node reports an IMEX daemon status of Ready or NotReady.
Example
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  # Image configuration
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: "1.0.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  # Authentication
  authSecret: ngc-api-key
  # Storage
  storage:
    pvc:
      create: true
      storageClass: standard
      size: 50Gi
      volumeAccessMode: ReadWriteOnce
    sharedMemorySizeLimit: 1Gi
  # Resources
  resources:
    requests:
      nvidia.com/gpu: "1"
      memory: 16Gi
      cpu: "4"
    limits:
      nvidia.com/gpu: "1"
      memory: 16Gi
  # Scheduling
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Networking
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      hostDomainName: example.com
      ingress:
        ingressClass: nginx
        tlsSecretName: nim-tls
  # Scaling
  replicas: 2
  # Health probes
  livenessProbe:
    enabled: true
  readinessProbe:
    enabled: true
  startupProbe:
    enabled: true
  # Monitoring
  metrics:
    enabled: true
    serviceMonitor:
      interval: 30s