
Overview

The NIMCache custom resource enables pre-downloading and caching of NVIDIA NIM models to persistent storage. By caching models ahead of time, you significantly reduce NIMService startup time and enable efficient model sharing across multiple deployments.

What is NIMCache?

NIMCache manages the model caching lifecycle:
  • Downloads models from NGC, NeMo DataStore, or HuggingFace Hub
  • Stores models in PersistentVolumeClaims or host paths
  • Supports optimized NIM profiles and universal NIM models
  • Validates and reports cached model profiles
  • Enables read-only model sharing across services

Model Source Types

NIMCache supports three model sources (exactly one must be specified):

  • NGC - NVIDIA NGC Catalog (optimized NIM models)
  • DataStore - NeMo DataStore (enterprise model repository)
  • HuggingFace - HuggingFace Hub (community models)

Basic Examples

NGC Model Caching

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

NeMo DataStore Source

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-1b-instruct
  namespace: nim-service
spec:
  source:
    dataStore:
      endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
      namespace: default
      modelName: "llama-3-1b-instruct"
      authSecret: hf-auth
      modelPuller: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.06
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
The namespace field for DataStore refers to the namespace within the DataStore service, not the Kubernetes namespace. Use default for models in the default DataStore namespace.

HuggingFace Hub Source

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nim-cache-multi-llm
  namespace: nim-service
spec:
  source:
    hf:
      endpoint: "https://huggingface.co"
      namespace: "meta-llama"  # HuggingFace organization/user
      modelName: "Llama-3.2-1B-Instruct"
      authSecret: hf-secret  # contains HF_TOKEN
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

Source Configuration

NGC Source

spec.source.ngc (object)
Configuration for models from the NVIDIA NGC catalog.

DataStore Source

spec.source.dataStore (object)
Configuration for models from NeMo DataStore.

HuggingFace Hub Source

spec.source.hf (object)
Configuration for models from HuggingFace Hub.

Exactly one of spec.source.ngc, spec.source.dataStore, or spec.source.hf must be specified. You cannot mix sources.

Storage Configuration

spec.storage (object, required)
Storage configuration for cached models.
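Every example in this document creates a new PVC. If a suitable claim already exists (for example, one pre-provisioned by your storage team), the cache can reference it instead. The sketch below assumes the pvc block accepts a name field for pointing at an existing claim; verify against your operator version's CRD:

```yaml
spec:
  storage:
    pvc:
      create: false            # do not provision a new claim
      name: shared-model-pvc   # hypothetical pre-created PVC
```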

Resource Requirements

spec.resources (object)
Minimum resources for the caching job.
spec:
  resources:
    cpu: "4"
    memory: "8Gi"

Job Configuration

Node Scheduling

spec.nodeSelector (object)
Node selector for the caching job. Defaults to {"feature.node.kubernetes.io/pci-10de.present": "true"} (NVIDIA GPU nodes).
spec.tolerations (array)
Tolerations for the caching job to run on tainted nodes.
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Security Context

spec.userID (integer, default: 1000)
User ID for the caching job process.
spec.groupID (integer, default: 2000)
Group ID for the caching job process.
spec.runtimeClassName (string)
Runtime class for the caching job.
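Put together, the security settings above look like this in a spec. The values are illustrative; in particular, nvidia as the runtime class name is an assumption about your cluster's RuntimeClass configuration:

```yaml
spec:
  userID: 1000              # run the caching job process as this user
  groupID: 2000             # and this group
  runtimeClassName: nvidia  # assumed RuntimeClass; use whatever your cluster defines
```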

Environment Variables

spec.env (array)
Additional custom environment variables for the caching job.
spec:
  env:
  - name: CUSTOM_VAR
    value: "custom-value"

Proxy and Certificate Configuration

spec.proxy (object)
Proxy configuration for accessing external model sources.
spec:
  proxy:
    httpProxy: http://proxy.example.com:8080
    httpsProxy: https://proxy.example.com:8443
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
    certConfigMap: custom-ca-bundle

Advanced Examples

Multi-Node Storage

For multi-node NIMService deployments, use ReadWriteMany access mode:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: deepseek-r1-nimcache
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      storageClass: ''  # Use storage class that supports RWX
      size: "100Gi"
      volumeAccessMode: ReadWriteMany
Ensure your storage class supports ReadWriteMany for multi-node deployments. Common options include NFS, CephFS, or cloud-native solutions like EFS (AWS) or Filestore (GCP).

GPU-Specific Profiles

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-optimized
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "1"
        qosProfile: throughput
        gpus:
        - product: h100
          ids:
          - "2331"  # H100 SXM device ID
  storage:
    pvc:
      create: true
      size: "100Gi"
      volumeAccessMode: ReadWriteOnce

With Custom Init Containers

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
  initContainers:
  - name: prepare-storage
    image:
      repository: busybox
      tag: latest
    command:
    - sh
    - -c
    args:
    - |
      mkdir -p /model-store/custom
      chmod 755 /model-store

Status and Monitoring

The NIMCache status provides information about the caching process:
status:
  state: Ready  # NotReady, PVC-Created, Started, InProgress, Ready, Pending, or Failed
  pvc: meta-llama-3-2-1b-instruct-pvc
  profiles:
  - name: tensorrt_llm-h100-fp16-tp1-throughput
    model: meta-llama-3-2-1b-instruct
    release: 1.12.0
    config:
      engine: tensorrt_llm
      precision: fp16
      tensorParallelism: "1"
  conditions:
  - type: NIM_CACHE_JOB_COMPLETED
    status: "True"
    lastTransitionTime: "2024-03-03T10:15:30Z"
    reason: CachingCompleted
    message: Model caching completed successfully

Status States

1. NotReady - Initial state when the NIMCache is created
2. PVC-Created - The PersistentVolumeClaim has been created
3. Started - The caching job has been created and started
4. InProgress - Model download and caching is in progress
5. Ready - Caching completed successfully; ready for use by NIMService
6. Pending - Waiting for resources or other dependencies
7. Failed - The caching job failed (check conditions for details)

Using Cached Models

Once a NIMCache is in Ready state, reference it in your NIMService:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct  # Reference to NIMCache
      profile: ''  # Optional: pin a specific cached profile
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Best Practices

Pre-cache Before Deployment

Create NIMCache resources before deploying NIMService to minimize startup time. Monitor the cache status to ensure it’s Ready.
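One way to monitor this, assuming the NIMCache from the Basic Examples above and a kubectl context pointed at your cluster:

```shell
# Watch the NIMCache until its state reaches Ready
kubectl get nimcache -n nim-service meta-llama-3-2-1b-instruct -w

# Inspect conditions and cached profiles in detail
kubectl describe nimcache -n nim-service meta-llama-3-2-1b-instruct
```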

Use Appropriate Storage

Select storage class based on your needs:
  • Single-node: ReadWriteOnce (faster, cheaper)
  • Multi-node: ReadWriteMany (required for shared access)

Size Your Storage

Allocate sufficient PVC size:
  • Small models (1-3B): 50Gi
  • Medium models (7-13B): 100-200Gi
  • Large models (70B+): 300-500Gi

Reuse Caches

Share a single NIMCache across multiple NIMService instances to save storage and reduce duplication.
When using HuggingFace or DataStore sources, ensure you have the proper authentication tokens and network access to the endpoints.

Troubleshooting

Cache Job Fails

Check the caching job logs:
kubectl logs -n nim-service -l nimcache={nimcache-name}

PVC Not Created

Verify storage class exists and has available capacity:
kubectl get storageclass
kubectl get pvc -n nim-service

Authentication Issues

Ensure secrets contain the correct keys:
  • NGC: NGC_API_KEY
  • HuggingFace: HF_TOKEN
kubectl get secret -n nim-service ngc-api-secret -o yaml
kubectl get secret -n nim-service hf-secret -o yaml
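If a key is missing, recreate the secret. The commands below sketch the standard kubectl pattern; the key names match the list above, the $oauthtoken username for nvcr.io is NGC's documented convention, and the <...> placeholders stand in for your own credentials:

```shell
# API key secret for NGC model downloads
kubectl create secret generic ngc-api-secret -n nim-service \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>

# Token secret for HuggingFace Hub downloads
kubectl create secret generic hf-secret -n nim-service \
  --from-literal=HF_TOKEN=<your-hf-token>

# Image pull secret for nvcr.io
kubectl create secret docker-registry ngc-secret -n nim-service \
  --docker-server=nvcr.io --docker-username='$oauthtoken' \
  --docker-password=<your-ngc-api-key>
```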
