Quick start
This guide walks you through deploying your first NIM microservice using the NVIDIA NIM Operator.
Prerequisites
Before you begin, ensure you have:
Kubernetes cluster
A Kubernetes cluster running version 1.28 or higher with GPU nodes.
NVIDIA GPU Operator
The NVIDIA GPU Operator installed to provide GPU device plugins and drivers.
NGC API key
An NGC API key from NVIDIA NGC. This is required to pull NIM container images and model artifacts.
Storage class
A StorageClass configured in your cluster for persistent volume claims (for model caching).
Each NIM requires at least one GPU. Ensure your cluster has available GPU resources.
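The prerequisites above can be sanity-checked from the command line. This is a sketch that assumes kubectl is already configured for the target cluster; each command falls back to a hint if it fails, so you can inspect the output step by step.

```shell
# Check the GPU capacity advertised by the GPU Operator's device plugin on each node.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
  || echo "could not list nodes: is kubectl pointed at the right cluster?"

# Model caching needs a StorageClass that can provision PersistentVolumes.
kubectl get storageclass \
  || echo "could not list storage classes: model caching needs a StorageClass"
```

If the GPUS column is empty or `<none>`, the GPU Operator has not labeled that node with allocatable GPUs yet.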
Step 1: Create a namespace
Create a dedicated namespace for your NIM deployments:
kubectl create namespace nim-service
Step 2: Create NGC secrets
Create Kubernetes secrets containing your NGC credentials:
First, create the NGC API secret that NIM uses at runtime:
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY=<your-ngc-api-key> \
  -n nim-service
Then create the image pull secret used to pull NIM images from nvcr.io:
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<your-ngc-api-key> \
  -n nim-service
Replace <your-ngc-api-key> with your actual NGC API key. The manifests in the next step reference both secrets: ngc-api-secret for NGC authentication and ngc-secret for image pulls.
Step 3: Deploy a NIM microservice
Deploy a Llama 3.2 1B Instruct model using NIMCache and NIMService resources:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Save this to a file (e.g., llama-nim.yaml) and apply it:
kubectl apply -f llama-nim.yaml
Model caching can take several minutes depending on your network speed and the model size. The NIMCache job downloads and processes the model artifacts.
Step 4: Monitor the deployment
Watch the NIMCache job complete:
kubectl get nimcache -n nim-service
kubectl get jobs -n nim-service
kubectl logs -n nim-service job/meta-llama-3-2-1b-instruct -f
Once the cache is ready, check the NIMService status:
kubectl get nimservice -n nim-service
kubectl describe nimservice meta-llama-3-2-1b-instruct -n nim-service
Expected output when ready:
NAME STATUS AGE
meta-llama-3-2-1b-instruct Ready 5m
Step 5: Verify the deployment
Check that the NIMService pod is running:
kubectl get pods -n nim-service
You should see a pod with status Running:
NAME READY STATUS RESTARTS AGE
meta-llama-3-2-1b-instruct-7d9f8c5b6d-x9k2p 1/1 Running 0 3m
Check the pod logs:
kubectl logs -n nim-service -l app=meta-llama-3-2-1b-instruct
Step 6: Test the inference endpoint
Port-forward the service to your local machine:
kubectl port-forward -n nim-service svc/meta-llama-3-2-1b-instruct 8000:8000
Test the health endpoint:
curl http://localhost:8000/v1/health/ready
Send an inference request:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama-3-2-1b-instruct",
"prompt": "Once upon a time",
"max_tokens": 50
}'
The NIM microservices expose an OpenAI-compatible API, making them easy to integrate with existing applications.
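Because the API is OpenAI-compatible, the chat route works alongside the raw completions route shown above. A sketch, assuming the port-forward from Step 6 is still running; the messages array replaces the prompt field:

```shell
# Build the chat request body (OpenAI-style messages array).
PAYLOAD='{
  "model": "meta-llama-3-2-1b-instruct",
  "messages": [{"role": "user", "content": "Tell me a short story"}],
  "max_tokens": 50
}'

# POST it to the chat completions route exposed by the NIM container.
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "request failed: is the port-forward from Step 6 still running?"
```

The response mirrors the OpenAI chat schema, so existing OpenAI client libraries can be pointed at the service by changing only the base URL.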
Understanding the deployment
Let’s break down what we deployed:
NIMCache resource
The NIMCache resource handles model artifact caching:
source.ngc - Specifies NGC as the model source
modelPuller - Container image that downloads the model
pullSecret - Docker registry credentials for pulling images
authSecret - NGC API key for authentication
model.engine - Inference engine (tensorrt_llm)
model.tensorParallelism - Number of GPUs for tensor parallelism
storage.pvc - Persistent volume claim configuration
NIMService resource
The NIMService resource deploys the inference service:
image - NIM container image and tag
authSecret - NGC API key (required at runtime)
storage.nimCache - References the NIMCache for model artifacts
replicas - Number of pod replicas
resources.limits - GPU resources required
expose.service - Service type and port configuration
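For example, the expose block can publish a NodePort instead of a cluster-internal address. A sketch; only this fragment of the NIMService spec changes from the manifest above:

```yaml
expose:
  service:
    type: NodePort   # reachable on every node's IP at an allocated port
    port: 8000
```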
Next steps
Configure autoscaling Enable horizontal pod autoscaling for dynamic scaling
Expose via Ingress Configure Ingress or Gateway API for external access
Multi-model pipelines Orchestrate RAG pipelines with multiple models
Production deployment Best practices for production deployments
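As a starting point for the autoscaling item above, a standard Kubernetes HorizontalPodAutoscaler can target the Deployment that the operator creates for the NIMService. This is a sketch: the Deployment name is assumed to match the NIMService name (verify with kubectl get deploy -n nim-service), and the NIM Operator also has its own scaling options, so consult its documentation before combining the two.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meta-llama-3-2-1b-instruct   # assumed to match the NIMService name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Note that each added replica needs its own GPU, so maxReplicas is bounded by the cluster's available GPU capacity.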
Troubleshooting
NIMCache job fails
If the caching job fails, check:
NGC credentials are correct
Storage class exists and can provision volumes
Sufficient disk space is available
kubectl describe nimcache meta-llama-3-2-1b-instruct -n nim-service
kubectl logs -n nim-service job/meta-llama-3-2-1b-instruct
NIMService pod not starting
If the pod doesn’t start, verify:
GPU resources are available: kubectl describe nodes
Image pull secrets are configured correctly
NIMCache completed successfully
kubectl describe pod -n nim-service -l app=meta-llama-3-2-1b-instruct
Pod is running but not ready
If the pod is running but not passing readiness checks:
Check the startup probe timeout (default 20 minutes)
Verify the model loaded correctly from cache
Check GPU allocation
kubectl logs -n nim-service -l app=meta-llama-3-2-1b-instruct --tail=100
Clean up
To remove the deployment:
kubectl delete nimservice meta-llama-3-2-1b-instruct -n nim-service
kubectl delete nimcache meta-llama-3-2-1b-instruct -n nim-service
kubectl delete namespace nim-service
Deleting the NIMCache will also delete the persistent volume claim and cached model artifacts.