Overview
The NIMCache custom resource enables pre-downloading and caching of NVIDIA NIM models to persistent storage. By caching models ahead of time, you significantly reduce NIMService startup time and enable efficient model sharing across multiple deployments.
What is NIMCache?
NIMCache manages the model caching lifecycle:
- Downloads models from NGC, NeMo DataStore, or HuggingFace Hub
- Stores models in PersistentVolumeClaims or host paths
- Supports optimized NIM profiles and universal NIM models
- Validates and reports cached model profiles
- Enables read-only model sharing across services
Model Source Types
NIMCache supports three model sources (exactly one must be specified):
- NGC: NVIDIA NGC Catalog, for optimized NIM models
- DataStore: NeMo DataStore, an enterprise model repository
- HuggingFace: HuggingFace Hub, for community models
Basic Examples
NGC Model Caching
Optimized NIM:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```

With Profile Selection:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        profiles:
          - "tensorrt_llm-h100-fp16-tp1-throughput"
          - "tensorrt_llm-a100-fp16-tp1-latency"
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```
NeMo DataStore Source
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-1b-instruct
  namespace: nim-service
spec:
  source:
    dataStore:
      endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
      namespace: default
      modelName: "llama-3-1b-instruct"
      authSecret: hf-auth
      modelPuller: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.06
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```
The namespace field for DataStore refers to the namespace within the DataStore service, not the Kubernetes namespace. Use default for models in the default DataStore namespace.
HuggingFace Hub Source
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nim-cache-multi-llm
  namespace: nim-service
spec:
  source:
    hf:
      endpoint: "https://huggingface.co"
      namespace: "meta-llama"  # HuggingFace organization/user
      modelName: "Llama-3.2-1B-Instruct"
      authSecret: hf-secret    # contains HF_TOKEN
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
```
Source Configuration
NGC Source

Configuration for models from the NVIDIA NGC catalog:

- `modelPuller`: Container image that pulls the model (e.g., nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0). This field is immutable; create a new NIMCache resource if you need to change the model puller.
- `pullSecret`: Image pull secret for the modelPuller image.
- `authSecret`: Name of the Kubernetes secret containing NGC_API_KEY.
- `model`: Model profile selection criteria. Required for optimized NIMs.
  - `profiles`: Specific model profiles to cache (e.g., ["tensorrt_llm-h100-fp16-tp1-throughput"]). When provided, other profile selection parameters are ignored.
  - `precision`: Model quantization precision (e.g., fp16, fp8, int4).
  - `engine`: Backend engine: tensorrt_llm, vllm, trtllm, etc.
  - `tensorParallelism`: Number of GPUs for tensor parallelism (e.g., "1", "2", "4").
  - `qosProfile`: Quality of Service profile: throughput or latency.
  - `gpus`: GPU specifications for profile matching. `product` is the GPU product name (e.g., h100, a100, l40s); `ids` lists device IDs for specific GPU SKUs.
  - `lora`: Whether this is a LoRA fine-tuned model.
  - `buildable`: Whether to use generic buildable profiles that can be optimized for any GPU.
- `endpoint`: Model endpoint for a Universal NIM (mutually exclusive with `model`).
DataStore Source

Configuration for models from NeMo DataStore:

- `endpoint`: HuggingFace-compatible endpoint of the NeMo DataStore. Must match the pattern ^https?://.*/v1/hf/?$.
- `namespace`: Namespace within NeMo DataStore.
- `modelName`: Name of the model to cache (mutually exclusive with `datasetName`).
- `datasetName`: Name of the dataset to cache (mutually exclusive with `modelName`).
- `authSecret`: Secret containing HF_TOKEN for authentication.
- `modelPuller`: Container image that provides huggingface-cli (e.g., nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.06).
- `pullSecret`: Image pull secret for the modelPuller image.
- `revision`: Revision to cache (commit hash, branch name, or tag).
HuggingFace Hub Source

Configuration for models from HuggingFace Hub:

- `endpoint`: HuggingFace endpoint URL. Must match the pattern ^https?://.*$ (e.g., https://huggingface.co).
- `namespace`: HuggingFace organization or user namespace (e.g., meta-llama, mistralai).
- `modelName`: Name of the model to cache (mutually exclusive with `datasetName`).
- `datasetName`: Name of the dataset to cache (mutually exclusive with `modelName`).
- `authSecret`: Secret containing HF_TOKEN for authentication (minimum length: 1).
- `modelPuller`: Container image with huggingface-cli (minimum length: 1).
- `pullSecret`: Image pull secret name (minimum length: 1).
- `revision`: Specific revision to cache (commit hash, branch, or tag).
Exactly one of spec.source.ngc, spec.source.dataStore, or spec.source.hf must be specified. You cannot mix sources.
Storage Configuration

Storage configuration for cached models:

- `pvc`: PersistentVolumeClaim configuration (recommended).
  - `create`: Whether to create a new PVC.
  - `name`: Name of the PVC. If not specified, defaults to {nimcache-name}-pvc.
  - `storageClass`: Storage class name. Leave empty to use the cluster default.
  - `size`: Size of the PVC (e.g., "50Gi", "100Gi").
  - `volumeAccessMode`: Volume access mode:
    - ReadWriteOnce: single-node read-write (default)
    - ReadWriteMany: multi-node read-write (required for multi-node deployments)
    - ReadOnlyMany: multi-node read-only
  - `subPath`: Subdirectory within the PVC to use for caching.
  - `annotations`: Custom annotations for the PVC.
- `hostPath`: Host path for caching. Deprecated: use `pvc` instead. Not recommended for production use.
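A minimal sketch of the less common PVC options, assuming the field names `subPath` and `annotations` for the subdirectory and annotation settings described above (the annotation key is a placeholder):

```yaml
spec:
  storage:
    pvc:
      create: true
      storageClass: ""        # empty string selects the cluster default
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
      subPath: "llama-3-2"    # cache under a subdirectory of the PVC
      annotations:
        example.com/backup-policy: "none"  # hypothetical PVC annotation
```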
Resource Requirements

Minimum resources for the caching job:

- `cpu`: Minimum CPU (e.g., "2", "4000m").
- `memory`: Minimum memory (e.g., "4Gi", "8192Mi").
```yaml
spec:
  resources:
    cpu: "4"
    memory: "8Gi"
```
Job Configuration
Node Scheduling

- `nodeSelector`: Node selector for the caching job. Defaults to {"feature.node.kubernetes.io/pci-10de.present": "true"} (NVIDIA GPU nodes).
- `tolerations`: Tolerations that allow the caching job to run on tainted nodes.
```yaml
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
Security Context

- `userID`: User ID for the caching job process.
- `groupID`: Group ID for the caching job process.
- `runtimeClassName`: Runtime class for the caching job.
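A sketch of the security settings above, assuming the field names `userID`, `groupID`, and `runtimeClassName` for the user ID, group ID, and runtime class (the values shown are placeholders):

```yaml
spec:
  userID: 1000           # run the caching job as this user
  groupID: 2000          # and this group
  runtimeClassName: nvidia  # placeholder runtime class name
```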
Environment Variables

- `env`: Additional custom environment variables for the caching job.
```yaml
spec:
  env:
    - name: CUSTOM_VAR
      value: "custom-value"
```
Proxy and Certificate Configuration

Proxy configuration for accessing external model sources:

- `httpProxy` / `httpsProxy`: Proxy URLs for HTTP and HTTPS traffic.
- `noProxy`: Comma-separated list of hosts to exclude from proxying.
- `certConfigMap`: ConfigMap containing custom CA certificates.
```yaml
spec:
  proxy:
    httpProxy: http://proxy.example.com:8080
    httpsProxy: https://proxy.example.com:8443
    noProxy: localhost,127.0.0.1,.svc,.cluster.local
    certConfigMap: custom-ca-bundle
```
Advanced Examples
Multi-Node Storage
For multi-node NIMService deployments, use ReadWriteMany access mode:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: deepseek-r1-nimcache
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model: {}
  storage:
    pvc:
      create: true
      storageClass: ''  # Use a storage class that supports RWX
      size: "100Gi"
      volumeAccessMode: ReadWriteMany
```
Ensure your storage class supports ReadWriteMany for multi-node deployments. Common options include NFS, CephFS, or cloud-native solutions like EFS (AWS) or Filestore (GCP).
GPU-Specific Profiles
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-optimized
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "1"
        qosProfile: throughput
        gpus:
          - product: h100
            ids:
              - "2331"  # H100 SXM device ID
  storage:
    pvc:
      create: true
      size: "100Gi"
      volumeAccessMode: ReadWriteOnce
```
With Custom Init Containers
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
  initContainers:
    - name: prepare-storage
      image:
        repository: busybox
        tag: latest
      command:
        - sh
        - -c
      args:
        - |
          mkdir -p /model-store/custom
          chmod 755 /model-store
```
Status and Monitoring
The NIMCache status provides information about the caching process:
```yaml
status:
  state: Ready  # NotReady, PVC-Created, Started, InProgress, Ready, Pending, or Failed
  pvc: meta-llama-3-2-1b-instruct-pvc
  profiles:
    - name: tensorrt_llm-h100-fp16-tp1-throughput
      model: meta-llama-3-2-1b-instruct
      release: 1.12.0
      config:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "1"
  conditions:
    - type: NIM_CACHE_JOB_COMPLETED
      status: "True"
      lastTransitionTime: "2024-03-03T10:15:30Z"
      reason: CachingCompleted
      message: Model caching completed successfully
```
Status States
- NotReady: Initial state when the NIMCache is created
- PVC-Created: The PersistentVolumeClaim has been created
- Started: The caching job has been created and started
- InProgress: Model download and caching is in progress
- Ready: Caching completed successfully; the cache is ready for use by a NIMService
- Pending: Waiting for resources or other dependencies
- Failed: The caching job failed (check conditions for details)
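To block deployment automation until a cache is usable, you can poll the status field. A sketch (the resource name matches the earlier examples; adjust it and the timeout to your cluster; `kubectl wait --for=jsonpath=...` requires kubectl 1.23 or later):

```shell
# List caches and their current states
kubectl get nimcache -n nim-service

# Block until the cache reaches Ready (large models can take a long time)
kubectl wait nimcache/meta-llama-3-2-1b-instruct -n nim-service \
  --for=jsonpath='{.status.state}'=Ready --timeout=60m
```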
Using Cached Models
Once a NIMCache is in Ready state, reference it in your NIMService:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct  # Reference to the NIMCache
      profile: ''  # Optional: specify a specific profile
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
```
Best Practices
Pre-cache Before Deployment: Create NIMCache resources before deploying NIMService to minimize startup time, and monitor the cache status to confirm it reaches Ready.

Use Appropriate Storage: Select an access mode based on your topology:

- Single-node: ReadWriteOnce (faster, cheaper)
- Multi-node: ReadWriteMany (required for shared access)

Size Your Storage: Allocate sufficient PVC capacity:

- Small models (1-3B): 50Gi
- Medium models (7-13B): 100-200Gi
- Large models (70B+): 300-500Gi

Reuse Caches: Share a single NIMCache across multiple NIMService instances to save storage and reduce duplication.
When using HuggingFace or DataStore sources, ensure you have the proper authentication tokens and network access to the endpoints.
Troubleshooting
Cache Job Fails
Check the caching job logs:
```shell
kubectl logs -n nim-service -l nimcache={nimcache-name}
```
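Besides the job logs, the resource's conditions usually name the failure reason. A sketch, using the resource name from the earlier examples:

```shell
# Inspect the NIMCache conditions and recent events
kubectl describe nimcache -n nim-service meta-llama-3-2-1b-instruct

# Locate the caching job and its pods
kubectl get jobs,pods -n nim-service
```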
PVC Not Created
Verify storage class exists and has available capacity:
```shell
kubectl get storageclass
kubectl get pvc -n nim-service
```
Authentication Issues
Ensure secrets contain the correct keys:

- NGC: NGC_API_KEY
- HuggingFace: HF_TOKEN

```shell
kubectl get secret -n nim-service ngc-api-secret -o yaml
kubectl get secret -n nim-service hf-secret -o yaml
```
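If a secret is missing or carries the wrong key, recreating it is straightforward. A sketch (substitute your own tokens; the secret names match the examples in this page):

```shell
# Image pull secret for nvcr.io (docker-registry type)
kubectl create secret docker-registry ngc-secret -n nim-service \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="<NGC_API_KEY>"

# NGC API key secret used by the model puller
kubectl create secret generic ngc-api-secret -n nim-service \
  --from-literal=NGC_API_KEY="<NGC_API_KEY>"

# HuggingFace token secret
kubectl create secret generic hf-secret -n nim-service \
  --from-literal=HF_TOKEN="<HF_TOKEN>"
```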