Overview
The NIMService custom resource is the primary resource for deploying NVIDIA NIM (NVIDIA Inference Microservices) on Kubernetes. It provides a declarative way to configure model serving workloads with support for autoscaling, multi-platform deployment, and advanced GPU configurations.
What is NIMService?
NIMService manages the complete lifecycle of a NIM deployment, including:
- Container image and runtime configuration
- Model storage and caching via NIMCache integration
- Resource allocation (CPU, memory, GPU)
- Service exposure (ClusterIP, LoadBalancer, Ingress)
- Horizontal pod autoscaling
- Health checks and probes
- Multi-node GPU deployments
Basic Example
Here’s a minimal NIMService configuration for deploying a Llama model:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Core Configuration Fields
Image Configuration
Container image configuration for the NIM service (spec.image):
- repository: container image repository (e.g., nvcr.io/nim/meta/llama-3.2-1b-instruct)
- tag: image tag or version (e.g., "1.12.0")
- pullPolicy (string, default "IfNotPresent"): image pull policy: Always, IfNotPresent, or Never
- pullSecrets: list of image pull secret names for accessing private registries
Authentication
authSecret: name of the Kubernetes secret containing NGC_API_KEY for authenticating with NVIDIA NGC.
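As an illustration, a sketch of the NGC API key secret referenced by authSecret; the secret name matches the examples on this page, but the key value is a placeholder you must substitute:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret
  namespace: nim-service
type: Opaque
stringData:
  # Placeholder -- substitute your actual NGC API key
  NGC_API_KEY: <your-ngc-api-key>
```

The pull secret (ngc-secret in the examples) is a separate, standard kubernetes.io/dockerconfigjson secret for authenticating image pulls from nvcr.io.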
Storage Configuration
Storage configuration for model caching and runtime data (spec.storage):
- nimCache: reference to a NIMCache resource for pre-cached models
  - name: name of the NIMCache resource
  - profile: specific model profile to use from the cache
- pvc: PersistentVolumeClaim configuration for model storage: whether to create a new PVC, the name of an existing or new PVC, its size (e.g., "50Gi"), and the access mode (ReadWriteOnce, ReadWriteMany, or ReadOnlyMany)
- hostPath: host path for model storage (not recommended for production)
- emptyDir: EmptyDir volume configuration for ephemeral storage
- Shared memory size limit (e.g., "1Gi"), used for fast model runtime I/O
- Option to mount the storage volume as read-only
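As a sketch, backing the model store with a PVC instead of a NIMCache might look like the following; the pvc sub-field names (create, name, size, volumeAccessMode) are assumptions based on the options described above, so verify them against your operator version's CRD:

```yaml
spec:
  storage:
    # Field names assumed from the PVC options above -- check your CRD
    pvc:
      create: true
      name: llama-model-store
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```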
Resource Requirements
CPU, memory, and GPU resource requirements (spec.resources):
- limits: maximum resources allowed. Example: {"nvidia.com/gpu": 1, "memory": "32Gi"}
- requests: minimum resources required. Example: {"nvidia.com/gpu": 1, "cpu": "4"}
For DRA (Dynamic Resource Allocation) GPU claims, use spec.draResources instead of traditional resource requests.
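For instance, a complete resources block combining the limits and requests shown above:

```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      cpu: "4"
```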
Service Exposure
Configuration for exposing the NIM service (spec.expose):
- service: Kubernetes Service configuration
  - type (string, default "ClusterIP"): service type: ClusterIP, NodePort, or LoadBalancer
  - port: main API serving port (1-65535)
  - gRPC serving port for Triton-based NIMs
  - Custom annotations for the Service resource
- router: router configuration for Ingress or Gateway API
  - ingress: Ingress controller configuration
    - ingressClass: ingress class to use (e.g., nginx, traefik)
    - tlsSecretName: secret containing the TLS certificate
  - gateway: Gateway API configuration for HTTPRoute/GRPCRoute: namespace and name of the Gateway resource, plus flags to enable HTTPRoute and GRPCRoute creation
  - hostDomainName: domain name for constructing hostnames (e.g., example.com yields service.namespace.example.com)
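As a sketch, exposing the service through a cloud load balancer with a Service annotation; the AWS-specific annotation shown is an illustrative assumption, not a required value:

```yaml
spec:
  expose:
    service:
      type: LoadBalancer
      port: 8000
      annotations:
        # Example AWS annotation; adjust for your cloud provider
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
```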
Replicas and Scaling
replicas: number of pod replicas. Cannot be set when autoscaling is enabled.
Advanced Examples
With Autoscaling
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 2
      metrics:
        - type: Object
          object:
            metric:
              name: gpu_cache_usage_perc
            describedObject:
              apiVersion: v1
              kind: Service
              name: meta-llama-3-2-1b-instruct
            target:
              type: Value
              value: "0.5"

Alternatively, scale on CPU utilization instead of the GPU cache metric:

  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 3
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
Autoscaling requires Prometheus to be deployed for GPU-based metrics. For CPU/memory metrics, the metrics-server is required.
With Ingress
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      hostDomainName: example.com
      ingress:
        ingressClass: nginx
        tlsSecretName: nim-tls-cert
This creates an Ingress with the hostname meta-llama-3-2-1b-instruct.nim-service.example.com.
Deployment Platforms
The default platform deploys NIM as a standard Kubernetes Deployment.

spec:
  inferencePlatform: standalone # default, can be omitted
  # ... rest of configuration
Deploy NIM as a KServe InferenceService for advanced inference features.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  inferencePlatform: kserve
  annotations:
    serving.kserve.io/deploymentMode: 'Standard'
  labels:
    networking.kserve.io/visibility: "exposed"
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
      cpu: "12"
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      cpu: "4"
      memory: 6Gi
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 3
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
KServe must be installed in your cluster before deploying NIMService with inferencePlatform: kserve.
Autoscaling Configuration
Horizontal Pod Autoscaler configuration (spec.scale):
- enabled: enable autoscaling for the NIMService
- hpa: HPA specification
  - minReplicas: minimum number of replicas (≥1)
  - maxReplicas: maximum number of replicas
  - metrics: list of metrics to use for scaling decisions; supports Resource, Object, Pods, and External metric types
  - Scaling behavior policies for scale up/down
- Annotations for the HPA resource

Notes:
- Autoscaling cannot be enabled when spec.multiNode is configured.
- When autoscaling is enabled, spec.replicas cannot be set.
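A sketch of the scaling behavior policies mentioned above, using standard autoscaling/v2 semantics; the assumption here is that spec.scale.hpa.behavior passes through to the generated HPA, so verify against your operator version:

```yaml
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      # Standard autoscaling/v2 behavior block: wait 5 minutes before
      # scaling down, and remove at most one pod per minute
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 60
```

Conservative scale-down policies like this help avoid thrashing, since tearing down a NIM pod discards its loaded model and GPU state.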
Health Probes
NIMService supports customizable liveness, readiness, and startup probes.
Default Probes
The default readiness probe is shown below; the liveness and startup probes follow the same structure.

readinessProbe:
  probe:
    httpGet:
      path: /v1/health/ready
      port: api
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 1
    successThreshold: 1
    failureThreshold: 3
Custom Probe Example
spec:
  readinessProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 15
      periodSeconds: 10
  startupProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 900 # 15 minutes for large models
      periodSeconds: 10
      failureThreshold: 100
For large models that take significant time to load, increase the startupProbe.failureThreshold and initialDelaySeconds.
Additional Configuration
Environment Variables
Custom environment variables for the NIM container.
spec:
  env:
    - name: NIM_USE_SGLANG
      value: "1"
    - name: HF_HOME
      value: /model-store/huggingface/hub
    - name: NIM_TRUST_CUSTOM_CODE
      value: "1"
Node Scheduling
Node Selector

spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.12xlarge
    topology.kubernetes.io/zone: us-west-2a

Tolerations

spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Affinity

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
Runtime Configuration
- Runtime class for the pods (e.g., nvidia for GPU containers)
- Custom scheduler name for pod scheduling
- User ID for the container process
- Group ID for the container process
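A sketch of these settings together; the field names used here (runtimeClassName, schedulerName, userID, groupID) are assumptions based on the descriptions above, so verify them against your operator version's CRD:

```yaml
spec:
  # Field names assumed -- check your NIMService CRD
  runtimeClassName: nvidia
  schedulerName: custom-scheduler
  userID: 1000
  groupID: 2000
```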
Proxy Configuration
spec:
  proxy:
    httpProxy: http://proxy.example.com:8080
    httpsProxy: https://proxy.example.com:8443
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
    certConfigMap: custom-ca-bundle
Status Fields
The NIMService status provides information about the deployment state:
status:
  state: Ready # Pending, NotReady, Ready, or Failed
  availableReplicas: 1
  conditions:
    - type: NIM_SERVICE_READY
      status: "True"
      lastTransitionTime: "2024-03-03T10:15:30Z"
  model:
    name: meta-llama-3-2-1b-instruct
    clusterEndpoint: meta-llama-3-2-1b-instruct.nim-service.svc.cluster.local:8000
    externalEndpoint: meta-llama-3-2-1b-instruct.nim-service.example.com
Best Practices
Pre-cache Models
Always use NIMCache resources to pre-download and cache models before deploying NIMService. This significantly reduces startup time.
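A minimal NIMCache sketch for the model used throughout this page; the field layout is an assumption based on the NIM Operator's NIMCache API, so verify it against your operator version before use:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      # Image used to pull and stage the model from NGC
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      size: "50Gi"
      # ReadWriteMany lets multiple NIMService replicas share the cache
      volumeAccessMode: ReadWriteMany
```

The NIMService examples above then reference this cache via spec.storage.nimCache.name.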
Right-size Resources
Allocate appropriate CPU, memory, and GPU resources based on your model size and expected throughput. Monitor resource usage and adjust accordingly.
Configure Health Probes
Customize startup probes for large models that require extended initialization time. Use readiness probes to ensure traffic is only sent to ready pods.
Use Persistent Storage
For production deployments, use PersistentVolumeClaims with appropriate access modes (ReadWriteMany for multi-node) instead of hostPath or emptyDir.
Enable Monitoring
Configure metrics and ServiceMonitor for production observability. Use custom metrics for intelligent autoscaling decisions.