
Overview

The NIMService custom resource is the primary resource for deploying NVIDIA NIM (NVIDIA Inference Microservices) on Kubernetes. It provides a declarative way to configure model serving workloads with support for autoscaling, multi-platform deployment, and advanced GPU configurations.

What is NIMService?

NIMService manages the complete lifecycle of a NIM deployment, including:
  • Container image and runtime configuration
  • Model storage and caching via NIMCache integration
  • Resource allocation (CPU, memory, GPU)
  • Service exposure (ClusterIP, LoadBalancer, Ingress)
  • Horizontal pod autoscaling
  • Health checks and probes
  • Multi-node GPU deployments

Basic Example

Here’s a minimal NIMService configuration for deploying a Llama model:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
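Once the service is Ready, the pod serves an OpenAI-compatible HTTP API on the exposed port (8000 above), reachable in-cluster or via kubectl port-forward. A minimal sketch of building a chat-completion request body for that endpoint — assuming the standard OpenAI `/v1/chat/completions` schema that NIM containers expose; the helper name is illustrative:

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> str:
    """Build an OpenAI-style chat-completion payload for a NIM endpoint.

    Assumes the NIM container serves the OpenAI-compatible
    /v1/chat/completions API on the exposed service port.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

# POST this body to http://<service>:8000/v1/chat/completions
body = build_chat_request("meta/llama-3.2-1b-instruct", "Hello!")
print(body)
```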

Core Configuration Fields

Image Configuration

spec.image (object, required)
Container image configuration for the NIM service.

Authentication

spec.authSecret (string, required)
Name of the Kubernetes secret containing NGC_API_KEY for authenticating with NVIDIA NGC.
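A sketch of the referenced secret, assuming a plain Opaque secret whose data key is NGC_API_KEY as stated above (substitute your own NGC credential for the placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret
  namespace: nim-service
type: Opaque
stringData:
  NGC_API_KEY: <your-ngc-api-key>
```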

Storage Configuration

spec.storage (object)
Storage configuration for model caching and runtime data.

Resource Requirements

spec.resources (object)
CPU, memory, and GPU resource requirements.
For DRA (Dynamic Resource Allocation) GPU claims, use spec.draResources instead of traditional resource requests.

Service Exposure

spec.expose (object)
Configuration for exposing the NIM service.

Replicas and Scaling

spec.replicas (integer, default: 1)
Number of pod replicas. Cannot be set when autoscaling is enabled.

Advanced Examples

With Autoscaling

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 2
      metrics:
      - type: Object
        object:
          metric:
            name: gpu_cache_usage_perc
          describedObject:
            apiVersion: v1
            kind: Service
            name: meta-llama-3-2-1b-instruct
          target:
            type: Value
            value: "0.5"
Autoscaling requires Prometheus to be deployed for GPU-based metrics. For CPU/memory metrics, the metrics-server is required.
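The HPA above scales using the standard Kubernetes formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue). A quick illustration against the gpu_cache_usage_perc target of 0.5 (the helper is illustrative, not part of the operator):

```python
import math

def desired_replicas(current_replicas: int, current_value: float, target: float) -> int:
    """Standard Kubernetes HPA scaling formula:
    desired = ceil(current_replicas * current_value / target)."""
    return math.ceil(current_replicas * current_value / target)

# With GPU KV-cache usage at 0.8 against a target of 0.5,
# one replica scales out to two (then capped by maxReplicas: 2 above).
print(desired_replicas(1, 0.8, 0.5))
```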

With Ingress

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      hostDomainName: example.com
      ingress:
        ingressClass: nginx
        tlsSecretName: nim-tls-cert
This creates an Ingress with the hostname meta-llama-3-2-1b-instruct.nim-service.example.com.
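As the example shows, the generated hostname follows the pattern <service name>.<namespace>.<hostDomainName>. A small sketch of that derivation (the helper is illustrative, not part of the operator):

```python
def ingress_hostname(name: str, namespace: str, host_domain: str) -> str:
    """Derive the ingress hostname the operator generates:
    <name>.<namespace>.<hostDomainName>."""
    return f"{name}.{namespace}.{host_domain}"

print(ingress_hostname("meta-llama-3-2-1b-instruct", "nim-service", "example.com"))
# meta-llama-3-2-1b-instruct.nim-service.example.com
```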

Inference Platforms

Standalone Platform (Default)

The default platform deploys NIM as a standard Kubernetes Deployment.
spec:
  inferencePlatform: standalone  # default, can be omitted
  # ... rest of configuration

KServe Platform

Deploy NIM as a KServe InferenceService for advanced inference features.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  inferencePlatform: kserve
  annotations:
    serving.kserve.io/deploymentMode: 'Standard'
  labels:
    networking.kserve.io/visibility: "exposed"
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
      cpu: "12"
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      cpu: "4"
      memory: 6Gi
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
KServe must be installed in your cluster before deploying NIMService with inferencePlatform: kserve.

Autoscaling Configuration

spec.scale (object)
Horizontal Pod Autoscaler configuration.
  • Autoscaling cannot be enabled when spec.multiNode is configured.
  • When autoscaling is enabled, spec.replicas cannot be set.

Health Probes

NIMService supports customizable liveness, readiness, and startup probes.

Default Probes

readinessProbe:
  probe:
    httpGet:
      path: /v1/health/ready
      port: api
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 1
    successThreshold: 1
    failureThreshold: 3

Custom Probe Example

spec:
  readinessProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 15
      periodSeconds: 10
  startupProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 900  # 15 minutes for large models
      periodSeconds: 10
      failureThreshold: 100
For large models that take significant time to load, increase the startupProbe.failureThreshold and initialDelaySeconds.
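The total window a pod has to finish loading is roughly initialDelaySeconds + failureThreshold × periodSeconds. A small helper for sizing these values (illustrative, not part of the CRD):

```python
def startup_window_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time before the kubelet gives up on a startup probe:
    initialDelaySeconds + failureThreshold * periodSeconds."""
    return initial_delay + failure_threshold * period

# The example above: 900s delay + 100 failures x 10s period = 1900s (~31 min).
print(startup_window_seconds(900, 10, 100))
```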

Additional Configuration

Environment Variables

spec.env (array)
Custom environment variables for the NIM container.
spec:
  env:
  - name: NIM_USE_SGLANG
    value: "1"
  - name: HF_HOME
    value: /model-store/huggingface/hub
  - name: NIM_TRUST_CUSTOM_CODE
    value: "1"

Node Scheduling

spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.12xlarge
    topology.kubernetes.io/zone: us-west-2a

Runtime Configuration

spec.runtimeClassName (string)
Runtime class for the pods (e.g., nvidia for GPU containers).

spec.schedulerName (string)
Custom scheduler name for pod scheduling.

spec.userID (integer, default: 1000)
User ID for the container process.

spec.groupID (integer, default: 2000)
Group ID for the container process.

Proxy Configuration

spec:
  proxy:
    httpProxy: http://proxy.example.com:8080
    httpsProxy: https://proxy.example.com:8443
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
    certConfigMap: custom-ca-bundle

Status Fields

The NIMService status provides information about the deployment state:
status:
  state: Ready  # Pending, NotReady, Ready, or Failed
  availableReplicas: 1
  conditions:
  - type: NIM_SERVICE_READY
    status: "True"
    lastTransitionTime: "2024-03-03T10:15:30Z"
  model:
    name: meta-llama-3-2-1b-instruct
    clusterEndpoint: meta-llama-3-2-1b-instruct.nim-service.svc.cluster.local:8000
    externalEndpoint: meta-llama-3-2-1b-instruct.nim-service.example.com

Best Practices

1. Pre-cache Models: Always use NIMCache resources to pre-download and cache models before deploying NIMService. This significantly reduces startup time.
2. Right-size Resources: Allocate appropriate CPU, memory, and GPU resources based on your model size and expected throughput. Monitor resource usage and adjust accordingly.
3. Configure Health Probes: Customize startup probes for large models that require extended initialization time. Use readiness probes to ensure traffic is only sent to ready pods.
4. Use Persistent Storage: For production deployments, use PersistentVolumeClaims with appropriate access modes (ReadWriteMany for multi-node) instead of hostPath or emptyDir.
5. Enable Monitoring: Configure metrics and a ServiceMonitor for production observability. Use custom metrics for intelligent autoscaling decisions.
