Overview
The NIMCache custom resource enables pre-downloading and caching of NVIDIA NIM models to persistent storage. By caching models ahead of time, you significantly reduce NIMService startup time and enable efficient model sharing across multiple deployments.
What is NIMCache?
NIMCache manages the model caching lifecycle:
- Downloads models from NGC, NeMo DataStore, or HuggingFace Hub
- Stores models in PersistentVolumeClaims or host paths
- Supports optimized NIM profiles and universal NIM models
- Validates and reports cached model profiles
- Enables read-only model sharing across services
Model Source Types
NIMCache supports three model sources (exactly one must be specified):
- NGC: NVIDIA NGC Catalog, for optimized NIM models
- DataStore: NeMo DataStore, an enterprise model repository
- HuggingFace: HuggingFace Hub, for community models
Basic Examples
NGC Model Caching
Optimized NIM:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```

With Profile Selection:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        profiles:
          - "tensorrt_llm-h100-fp16-tp1-throughput"
          - "tensorrt_llm-a100-fp16-tp1-latency"
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```
NeMo DataStore Source
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-1b-instruct
  namespace: nim-service
spec:
  source:
    dataStore:
      endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
      namespace: default
      modelName: "llama-3-1b-instruct"
      authSecret: hf-auth
      modelPuller: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.06
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```
The namespace field for DataStore refers to the namespace within the DataStore service, not the Kubernetes namespace. Use default for models in the default DataStore namespace.
HuggingFace Hub Source
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nim-cache-multi-llm
  namespace: nim-service
spec:
  source:
    hf:
      endpoint: "https://huggingface.co"
      namespace: "meta-llama"  # HuggingFace organization/user
      modelName: "Llama-3.2-1B-Instruct"
      authSecret: hf-secret    # contains HF_TOKEN
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```
Source Configuration
NGC Source

Configuration for models from the NVIDIA NGC catalog:

- `modelPuller`: Container image that pulls the model (e.g., nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0). This field is immutable; create a new NIMCache resource if you need to change the model puller.
- `pullSecret`: Image pull secret for the modelPuller image.
- `authSecret`: Name of the Kubernetes secret containing NGC_API_KEY.
- `model`: Model profile selection criteria. Required for optimized NIMs.
  - `profiles`: Specific model profiles to cache (e.g., ["tensorrt_llm-h100-fp16-tp1-throughput"]). When provided, other profile selection parameters are ignored.
  - `precision`: Model quantization precision (e.g., fp16, fp8, int4).
  - `engine`: Backend engine: tensorrt_llm, vllm, trtllm, etc.
  - `tensorParallelism`: Number of GPUs for tensor parallelism (e.g., "1", "2", "4").
  - `qosProfile`: Quality of Service profile: throughput or latency.
  - `gpus`: GPU specifications for profile matching. `product` is the GPU product name (e.g., h100, a100, l40s); `ids` lists device IDs for specific GPU SKUs.
  - `lora`: Whether this is a LoRA fine-tuned model.
  - `buildable`: Whether to use generic buildable profiles that can be optimized for any GPU.
- `endpoint`: Model endpoint for a Universal NIM (mutually exclusive with `model`).
DataStore Source

Configuration for models from NeMo DataStore:

- `endpoint`: HuggingFace-compatible endpoint of the NeMo DataStore. Must match the pattern ^https?://.*/v1/hf/?$.
- `namespace`: Namespace within NeMo DataStore.
- `modelName`: Name of the model to cache (mutually exclusive with `datasetName`).
- `datasetName`: Name of the dataset to cache (mutually exclusive with `modelName`).
- `authSecret`: Secret containing HF_TOKEN for authentication.
- `modelPuller`: Container image that provides huggingface-cli (e.g., nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.06).
- `pullSecret`: Image pull secret for the modelPuller image.
- `revision`: Revision to cache (commit hash, branch name, or tag).
HuggingFace Hub Source

Configuration for models from HuggingFace Hub:

- `endpoint`: HuggingFace endpoint URL. Must match the pattern ^https?://.*$ (e.g., https://huggingface.co).
- `namespace`: HuggingFace organization or user namespace (e.g., meta-llama, mistralai).
- `modelName`: Name of the model to cache (mutually exclusive with `datasetName`).
- `datasetName`: Name of the dataset to cache (mutually exclusive with `modelName`).
- `authSecret`: Secret containing HF_TOKEN for authentication (minimum length: 1).
- `modelPuller`: Container image with huggingface-cli (minimum length: 1).
- `pullSecret`: Image pull secret name (minimum length: 1).
- `revision`: Specific revision to cache (commit hash, branch, or tag).
Exactly one of spec.source.ngc, spec.source.dataStore, or spec.source.hf must be specified. You cannot mix sources.
Storage Configuration

Storage configuration for cached models:

- `pvc`: PersistentVolumeClaim configuration (recommended).
  - `create`: Whether to create a new PVC.
  - `name`: Name of the PVC. If not specified, defaults to {nimcache-name}-pvc.
  - `storageClass`: Storage class name. Leave empty to use the cluster default.
  - `size`: Size of the PVC (e.g., "50Gi", "100Gi").
  - `volumeAccessMode`: Volume access mode:
    - ReadWriteOnce: single-node read-write (default)
    - ReadWriteMany: multi-node read-write (required for multi-node deployments)
    - ReadOnlyMany: multi-node read-only
  - `subPath`: Subdirectory within the PVC to use for caching.
  - `annotations`: Custom annotations for the PVC.
- `hostPath`: Host path for caching. Deprecated: use `pvc` instead. Not recommended for production use.
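A minimal sketch of the less common PVC options, assuming the field names `subPath` and `annotations` for the subdirectory and annotation settings described above (the annotation key is a placeholder):

```yaml
spec:
  storage:
    pvc:
      create: true
      storageClass: ""        # empty string selects the cluster default
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
      subPath: "llama-3-2"    # cache under a subdirectory of the PVC
      annotations:
        example.com/backup-policy: "none"  # hypothetical PVC annotation
```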
Resource Requirements

Minimum resources for the caching job:

- `cpu`: Minimum CPU (e.g., "2", "4000m").
- `memory`: Minimum memory (e.g., "4Gi", "8192Mi").
```yaml
spec:
  resources:
    cpu: "4"
    memory: "8Gi"
```
Job Configuration
Node Scheduling

- `nodeSelector`: Node selector for the caching job. Defaults to {"feature.node.kubernetes.io/pci-10de.present": "true"} (NVIDIA GPU nodes).
- `tolerations`: Tolerations that allow the caching job to run on tainted nodes.
```yaml
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
Security Context

- `userID`: User ID for the caching job process.
- `groupID`: Group ID for the caching job process.
- `runtimeClassName`: Runtime class for the caching job.
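A sketch of the security settings above, assuming the field names `userID`, `groupID`, and `runtimeClassName` for the user ID, group ID, and runtime class (the values shown are placeholders):

```yaml
spec:
  userID: 1000           # run the caching job as this user
  groupID: 2000          # and this group
  runtimeClassName: nvidia  # placeholder runtime class name
```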
Environment Variables

- `env`: Additional custom environment variables for the caching job.
```yaml
spec:
  env:
    - name: CUSTOM_VAR
      value: "custom-value"
```
Proxy and Certificate Configuration

Proxy configuration for accessing external model sources:

- `httpProxy` / `httpsProxy`: Proxy URLs for HTTP and HTTPS traffic.
- `noProxy`: Comma-separated list of hosts to exclude from proxying.
- `certConfigMap`: ConfigMap containing custom CA certificates.
```yaml
spec:
  proxy:
    httpProxy: http://proxy.example.com:8080
    httpsProxy: https://proxy.example.com:8443
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
    certConfigMap: custom-ca-bundle
```
Advanced Examples
Multi-Node Storage
For multi-node NIMService deployments, use ReadWriteMany access mode:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: deepseek-r1-nimcache
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model: {}
  storage:
    pvc:
      create: true
      storageClass: ''  # Use a storage class that supports RWX
      size: "100Gi"
      volumeAccessMode: ReadWriteMany
```
Ensure your storage class supports ReadWriteMany for multi-node deployments. Common options include NFS, CephFS, or cloud-native solutions like EFS (AWS) or Filestore (GCP).
GPU-Specific Profiles
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-optimized
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "1"
        qosProfile: throughput
        gpus:
          - product: h100
            ids:
              - "2331"  # H100 SXM device ID
  storage:
    pvc:
      create: true
      size: "100Gi"
      volumeAccessMode: ReadWriteOnce
```
With Custom Init Containers
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
  initContainers:
    - name: prepare-storage
      image:
        repository: busybox
        tag: latest
      command:
        - sh
        - -c
      args:
        - |
          mkdir -p /model-store/custom
          chmod 755 /model-store
```
Status and Monitoring
The NIMCache status provides information about the caching process:
```yaml
status:
  state: Ready  # NotReady, PVC-Created, Started, InProgress, Ready, Pending, or Failed
  pvc: meta-llama-3-2-1b-instruct-pvc
  profiles:
    - name: tensorrt_llm-h100-fp16-tp1-throughput
      model: meta-llama-3-2-1b-instruct
      release: 1.12.0
      config:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "1"
  conditions:
    - type: NIM_CACHE_JOB_COMPLETED
      status: "True"
      lastTransitionTime: "2024-03-03T10:15:30Z"
      reason: CachingCompleted
      message: Model caching completed successfully
```
Status States
- NotReady: Initial state when the NIMCache is created
- PVC-Created: The PersistentVolumeClaim has been created
- Started: The caching job has been created and started
- InProgress: Model download and caching is in progress
- Ready: Caching completed successfully; the cache is ready for use by a NIMService
- Pending: Waiting for resources or other dependencies
- Failed: The caching job failed (check conditions for details)
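To block deployment automation until a cache is usable, you can poll the status field. A sketch (the resource name matches the earlier examples; adjust it and the timeout to your cluster; `kubectl wait --for=jsonpath=...` requires kubectl 1.23 or later):

```shell
# List caches and their current states
kubectl get nimcache -n nim-service

# Block until the cache reaches Ready (large models can take a long time)
kubectl wait nimcache/meta-llama-3-2-1b-instruct -n nim-service \
  --for=jsonpath='{.status.state}'=Ready --timeout=60m
```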
Using Cached Models
Once a NIMCache is in Ready state, reference it in your NIMService:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct  # Reference to the NIMCache
      profile: ''  # Optional: specify a specific profile
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
```
Best Practices
Pre-cache Before Deployment: Create NIMCache resources before deploying NIMService to minimize startup time, and monitor the cache status to confirm it reaches Ready.

Use Appropriate Storage: Select an access mode based on your topology:

- Single-node: ReadWriteOnce (faster, cheaper)
- Multi-node: ReadWriteMany (required for shared access)

Size Your Storage: Allocate sufficient PVC capacity:

- Small models (1-3B): 50Gi
- Medium models (7-13B): 100-200Gi
- Large models (70B+): 300-500Gi

Reuse Caches: Share a single NIMCache across multiple NIMService instances to save storage and reduce duplication.
When using HuggingFace or DataStore sources, ensure you have the proper authentication tokens and network access to the endpoints.
Troubleshooting
Cache Job Fails
Check the caching job logs:
```shell
kubectl logs -n nim-service -l nimcache={nimcache-name}
```
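Besides the job logs, the resource's conditions usually name the failure reason. A sketch, using the resource name from the earlier examples:

```shell
# Inspect the NIMCache conditions and recent events
kubectl describe nimcache -n nim-service meta-llama-3-2-1b-instruct

# Locate the caching job and its pods
kubectl get jobs,pods -n nim-service
```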
PVC Not Created
Verify storage class exists and has available capacity:
```shell
kubectl get storageclass
kubectl get pvc -n nim-service
```
Authentication Issues
Ensure secrets contain the correct keys:

- NGC: NGC_API_KEY
- HuggingFace: HF_TOKEN

```shell
kubectl get secret -n nim-service ngc-api-secret -o yaml
kubectl get secret -n nim-service hf-secret -o yaml
```
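If a secret is missing or carries the wrong key, recreating it is straightforward. A sketch (substitute your own tokens; the secret names match the examples in this page):

```shell
# Image pull secret for nvcr.io (docker-registry type)
kubectl create secret docker-registry ngc-secret -n nim-service \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="<NGC_API_KEY>"

# NGC API key secret used by the model puller
kubectl create secret generic ngc-api-secret -n nim-service \
  --from-literal=NGC_API_KEY="<NGC_API_KEY>"

# HuggingFace token secret
kubectl create secret generic hf-secret -n nim-service \
  --from-literal=HF_TOKEN="<HF_TOKEN>"
```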