Overview
The NIMService custom resource is the primary resource for deploying NVIDIA NIM (NVIDIA Inference Microservices) on Kubernetes. It provides a declarative way to configure model serving workloads with support for autoscaling, multi-platform deployment, and advanced GPU configurations.
What is NIMService?
NIMService manages the complete lifecycle of a NIM deployment, including:
- Container image and runtime configuration
- Model storage and caching via NIMCache integration
- Resource allocation (CPU, memory, GPU)
- Service exposure (ClusterIP, LoadBalancer, Ingress)
- Horizontal pod autoscaling
- Health checks and probes
- Multi-node GPU deployments
Basic Example
Here’s a minimal NIMService configuration for deploying a Llama model:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Core Configuration Fields
Image Configuration
Container image configuration for the NIM service (spec.image):
- repository: container image repository (e.g., nvcr.io/nim/meta/llama-3.2-1b-instruct)
- tag: image tag or version (e.g., "1.12.0")
- pullPolicy (string, default "IfNotPresent"): image pull policy: Always, IfNotPresent, or Never
- pullSecrets: list of image pull secret names for accessing private registries
Authentication
authSecret: name of the Kubernetes secret containing NGC_API_KEY for authenticating with NVIDIA NGC.
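As an illustration, a sketch of the NGC API key secret referenced by authSecret; the secret name matches the examples on this page, but the key value is a placeholder you must substitute:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret
  namespace: nim-service
type: Opaque
stringData:
  # Placeholder -- substitute your actual NGC API key
  NGC_API_KEY: <your-ngc-api-key>
```

The pull secret (ngc-secret in the examples) is a separate, standard kubernetes.io/dockerconfigjson secret for authenticating image pulls from nvcr.io.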
Storage Configuration
Storage configuration for model caching and runtime data (spec.storage):
- nimCache: reference to a NIMCache resource for pre-cached models
  - name: name of the NIMCache resource
  - profile: specific model profile to use from the cache
- pvc: PersistentVolumeClaim configuration for model storage: whether to create a new PVC, the name of an existing or new PVC, its size (e.g., "50Gi"), and the access mode (ReadWriteOnce, ReadWriteMany, or ReadOnlyMany)
- hostPath: host path for model storage (not recommended for production)
- emptyDir: EmptyDir volume configuration for ephemeral storage
- Shared memory size limit (e.g., "1Gi"), used for fast model runtime I/O
- Option to mount the storage volume as read-only
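As a sketch, backing the model store with a PVC instead of a NIMCache might look like the following; the pvc sub-field names (create, name, size, volumeAccessMode) are assumptions based on the options described above, so verify them against your operator version's CRD:

```yaml
spec:
  storage:
    # Field names assumed from the PVC options above -- check your CRD
    pvc:
      create: true
      name: llama-model-store
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```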
Resource Requirements
CPU, memory, and GPU resource requirements (spec.resources):
- limits: maximum resources allowed. Example: {"nvidia.com/gpu": 1, "memory": "32Gi"}
- requests: minimum resources required. Example: {"nvidia.com/gpu": 1, "cpu": "4"}
For DRA (Dynamic Resource Allocation) GPU claims, use spec.draResources instead of traditional resource requests.
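For instance, a complete resources block combining the limits and requests shown above:

```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      cpu: "4"
```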
Service Exposure
Configuration for exposing the NIM service (spec.expose):
- service: Kubernetes Service configuration
  - type (string, default "ClusterIP"): service type: ClusterIP, NodePort, or LoadBalancer
  - port: main API serving port (1-65535)
  - gRPC serving port for Triton-based NIMs
  - Custom annotations for the Service resource
- router: router configuration for Ingress or Gateway API
  - ingress: Ingress controller configuration
    - ingressClass: ingress class to use (e.g., nginx, traefik)
    - tlsSecretName: secret containing the TLS certificate
  - gateway: Gateway API configuration for HTTPRoute/GRPCRoute: namespace and name of the Gateway resource, plus flags to enable HTTPRoute and GRPCRoute creation
  - hostDomainName: domain name for constructing hostnames (e.g., example.com yields service.namespace.example.com)
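As a sketch, exposing the service through a cloud load balancer with a Service annotation; the AWS-specific annotation shown is an illustrative assumption, not a required value:

```yaml
spec:
  expose:
    service:
      type: LoadBalancer
      port: 8000
      annotations:
        # Example AWS annotation; adjust for your cloud provider
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
```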
Replicas and Scaling
replicas: number of pod replicas. Cannot be set when autoscaling is enabled.
Advanced Examples
With Autoscaling
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 2
      metrics:
        - type: Object
          object:
            metric:
              name: gpu_cache_usage_perc
            describedObject:
              apiVersion: v1
              kind: Service
              name: meta-llama-3-2-1b-instruct
            target:
              type: Value
              value: "0.5"

Alternatively, scale on CPU utilization instead of the GPU cache metric:

  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 3
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
Autoscaling requires Prometheus to be deployed for GPU-based metrics. For CPU/memory metrics, the metrics-server is required.
With Ingress
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      hostDomainName: example.com
      ingress:
        ingressClass: nginx
        tlsSecretName: nim-tls-cert
This creates an Ingress with the hostname meta-llama-3-2-1b-instruct.nim-service.example.com.
Deployment Platforms
The default platform deploys NIM as a standard Kubernetes Deployment.

spec:
  inferencePlatform: standalone # default, can be omitted
  # ... rest of configuration
Deploy NIM as a KServe InferenceService for advanced inference features.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  inferencePlatform: kserve
  annotations:
    serving.kserve.io/deploymentMode: 'Standard'
  labels:
    networking.kserve.io/visibility: "exposed"
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
      cpu: "12"
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      cpu: "4"
      memory: 6Gi
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 3
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
KServe must be installed in your cluster before deploying NIMService with inferencePlatform: kserve.
Autoscaling Configuration
Horizontal Pod Autoscaler configuration (spec.scale):
- enabled: enable autoscaling for the NIMService
- hpa: HPA specification
  - minReplicas: minimum number of replicas (≥1)
  - maxReplicas: maximum number of replicas
  - metrics: list of metrics to use for scaling decisions; supports Resource, Object, Pods, and External metric types
  - Scaling behavior policies for scale up/down
- Annotations for the HPA resource

Notes:
- Autoscaling cannot be enabled when spec.multiNode is configured.
- When autoscaling is enabled, spec.replicas cannot be set.
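A sketch of the scaling behavior policies mentioned above, using standard autoscaling/v2 semantics; the assumption here is that spec.scale.hpa.behavior passes through to the generated HPA, so verify against your operator version:

```yaml
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      # Standard autoscaling/v2 behavior block: wait 5 minutes before
      # scaling down, and remove at most one pod per minute
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 60
```

Conservative scale-down policies like this help avoid thrashing, since tearing down a NIM pod discards its loaded model and GPU state.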
Health Probes
NIMService supports customizable liveness, readiness, and startup probes.
Default Probes
The default readiness probe is shown below; the liveness and startup probes follow the same structure.

readinessProbe:
  probe:
    httpGet:
      path: /v1/health/ready
      port: api
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 1
    successThreshold: 1
    failureThreshold: 3
Custom Probe Example
spec:
  readinessProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 15
      periodSeconds: 10
  startupProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 900 # 15 minutes for large models
      periodSeconds: 10
      failureThreshold: 100
For large models that take significant time to load, increase the startupProbe.failureThreshold and initialDelaySeconds.
Additional Configuration
Environment Variables
Custom environment variables for the NIM container.
spec:
  env:
    - name: NIM_USE_SGLANG
      value: "1"
    - name: HF_HOME
      value: /model-store/huggingface/hub
    - name: NIM_TRUST_CUSTOM_CODE
      value: "1"
Node Scheduling
Node Selector

spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.12xlarge
    topology.kubernetes.io/zone: us-west-2a

Tolerations

spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Affinity

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
Runtime Configuration
- Runtime class for the pods (e.g., nvidia for GPU containers)
- Custom scheduler name for pod scheduling
- User ID for the container process
- Group ID for the container process
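A sketch of these settings together; the field names used here (runtimeClassName, schedulerName, userID, groupID) are assumptions based on the descriptions above, so verify them against your operator version's CRD:

```yaml
spec:
  # Field names assumed -- check your NIMService CRD
  runtimeClassName: nvidia
  schedulerName: custom-scheduler
  userID: 1000
  groupID: 2000
```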
Proxy Configuration
spec:
  proxy:
    httpProxy: http://proxy.example.com:8080
    httpsProxy: https://proxy.example.com:8443
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
    certConfigMap: custom-ca-bundle
Status Fields
The NIMService status provides information about the deployment state:
status:
  state: Ready # Pending, NotReady, Ready, or Failed
  availableReplicas: 1
  conditions:
    - type: NIM_SERVICE_READY
      status: "True"
      lastTransitionTime: "2024-03-03T10:15:30Z"
  model:
    name: meta-llama-3-2-1b-instruct
    clusterEndpoint: meta-llama-3-2-1b-instruct.nim-service.svc.cluster.local:8000
    externalEndpoint: meta-llama-3-2-1b-instruct.nim-service.example.com
Best Practices
Pre-cache Models
Always use NIMCache resources to pre-download and cache models before deploying NIMService. This significantly reduces startup time.
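A minimal NIMCache sketch for the model used throughout this page; the field layout is an assumption based on the NIM Operator's NIMCache API, so verify it against your operator version before use:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      # Image used to pull and stage the model from NGC
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      size: "50Gi"
      # ReadWriteMany lets multiple NIMService replicas share the cache
      volumeAccessMode: ReadWriteMany
```

The NIMService examples above then reference this cache via spec.storage.nimCache.name.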
Right-size Resources
Allocate appropriate CPU, memory, and GPU resources based on your model size and expected throughput. Monitor resource usage and adjust accordingly.
Configure Health Probes
Customize startup probes for large models that require extended initialization time. Use readiness probes to ensure traffic is only sent to ready pods.
Use Persistent Storage
For production deployments, use PersistentVolumeClaims with appropriate access modes (ReadWriteMany for multi-node) instead of hostPath or emptyDir.
Enable Monitoring
Configure metrics and ServiceMonitor for production observability. Use custom metrics for intelligent autoscaling decisions.