Quick start
This guide walks you through deploying your first NIM microservice using the NVIDIA NIM Operator.
Prerequisites
Before you begin, ensure you have:
Kubernetes cluster
A Kubernetes cluster running version 1.28 or higher with GPU nodes.
NVIDIA GPU Operator
The NVIDIA GPU Operator installed to provide GPU device plugins and drivers.
NGC API key
An NGC API key from NVIDIA NGC. This is required to pull NIM container images and model artifacts.
Storage class
A StorageClass configured in your cluster for persistent volume claims (for model caching).
Each NIM requires at least one GPU. Ensure your cluster has available GPU resources.
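The prerequisites above can be sanity-checked from the command line. This is a sketch that assumes kubectl is already configured for the target cluster; each command falls back to a hint if it fails, so you can inspect the output step by step.

```shell
# Check the GPU capacity advertised by the GPU Operator's device plugin on each node.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
  || echo "could not list nodes: is kubectl pointed at the right cluster?"

# Model caching needs a StorageClass that can provision PersistentVolumes.
kubectl get storageclass \
  || echo "could not list storage classes: model caching needs a StorageClass"
```

If the GPUS column is empty or `<none>`, the GPU Operator has not labeled that node with allocatable GPUs yet.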
Step 1: Create a namespace
Create a dedicated namespace for your NIM deployments:
kubectl create namespace nim-service
Step 2: Create NGC secrets
Create Kubernetes secrets containing your NGC credentials:
First, create the NGC API secret that NIM uses at runtime:
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY=<your-ngc-api-key> \
  -n nim-service
Then create the image pull secret used to pull NIM images from nvcr.io:
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<your-ngc-api-key> \
  -n nim-service
Replace <your-ngc-api-key> with your actual NGC API key. The manifests in the next step reference both secrets: ngc-api-secret for NGC authentication and ngc-secret for image pulls.
Step 3: Deploy a NIM microservice
Deploy a Llama 3.2 1B Instruct model using NIMCache and NIMService resources:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Save this to a file (e.g., llama-nim.yaml) and apply it:
kubectl apply -f llama-nim.yaml
Model caching can take several minutes depending on your network speed and the model size. The NIMCache job downloads and processes the model artifacts.
Step 4: Monitor the deployment
Watch the NIMCache job complete:
kubectl get nimcache -n nim-service
kubectl get jobs -n nim-service
kubectl logs -n nim-service job/meta-llama-3-2-1b-instruct -f
Once the cache is ready, check the NIMService status:
kubectl get nimservice -n nim-service
kubectl describe nimservice meta-llama-3-2-1b-instruct -n nim-service
Expected output when ready:
NAME STATUS AGE
meta-llama-3-2-1b-instruct Ready 5m
Step 5: Verify the deployment
Check that the NIMService pod is running:
kubectl get pods -n nim-service
You should see a pod with status Running:
NAME READY STATUS RESTARTS AGE
meta-llama-3-2-1b-instruct-7d9f8c5b6d-x9k2p 1/1 Running 0 3m
Check the pod logs:
kubectl logs -n nim-service -l app=meta-llama-3-2-1b-instruct
Step 6: Test the inference endpoint
Port-forward the service to your local machine:
kubectl port-forward -n nim-service svc/meta-llama-3-2-1b-instruct 8000:8000
Test the health endpoint:
curl http://localhost:8000/v1/health/ready
Send an inference request:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama-3-2-1b-instruct",
"prompt": "Once upon a time",
"max_tokens": 50
}'
The NIM microservices expose an OpenAI-compatible API, making them easy to integrate with existing applications.
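Because the API is OpenAI-compatible, the chat route works alongside the raw completions route shown above. A sketch, assuming the port-forward from Step 6 is still running; the messages array replaces the prompt field:

```shell
# Build the chat request body (OpenAI-style messages array).
PAYLOAD='{
  "model": "meta-llama-3-2-1b-instruct",
  "messages": [{"role": "user", "content": "Tell me a short story"}],
  "max_tokens": 50
}'

# POST it to the chat completions route exposed by the NIM container.
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "request failed: is the port-forward from Step 6 still running?"
```

The response mirrors the OpenAI chat schema, so existing OpenAI client libraries can be pointed at the service by changing only the base URL.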
Understanding the deployment
Let’s break down what we deployed:
NIMCache resource
The NIMCache resource handles model artifact caching:
source.ngc - Specifies NGC as the model source
modelPuller - Container image that downloads the model
pullSecret - Docker registry credentials for pulling images
authSecret - NGC API key for authentication
model.engine - Inference engine (tensorrt_llm)
model.tensorParallelism - Number of GPUs for tensor parallelism
storage.pvc - Persistent volume claim configuration
NIMService resource
The NIMService resource deploys the inference service:
image - NIM container image and tag
authSecret - NGC API key (required at runtime)
storage.nimCache - References the NIMCache for model artifacts
replicas - Number of pod replicas
resources.limits - GPU resources required
expose.service - Service type and port configuration
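For example, the expose block can publish a NodePort instead of a cluster-internal address. A sketch; only this fragment of the NIMService spec changes from the manifest above:

```yaml
expose:
  service:
    type: NodePort   # reachable on every node's IP at an allocated port
    port: 8000
```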
Next steps
Configure autoscaling Enable horizontal pod autoscaling for dynamic scaling
Expose via Ingress Configure Ingress or Gateway API for external access
Multi-model pipelines Orchestrate RAG pipelines with multiple models
Production deployment Best practices for production deployments
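As a starting point for the autoscaling item above, a standard Kubernetes HorizontalPodAutoscaler can target the Deployment that the operator creates for the NIMService. This is a sketch: the Deployment name is assumed to match the NIMService name (verify with kubectl get deploy -n nim-service), and the NIM Operator also has its own scaling options, so consult its documentation before combining the two.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meta-llama-3-2-1b-instruct   # assumed to match the NIMService name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Note that each added replica needs its own GPU, so maxReplicas is bounded by the cluster's available GPU capacity.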
Troubleshooting
NIMCache job fails
If the caching job fails, check:
NGC credentials are correct
Storage class exists and can provision volumes
Sufficient disk space is available
kubectl describe nimcache meta-llama-3-2-1b-instruct -n nim-service
kubectl logs -n nim-service job/meta-llama-3-2-1b-instruct
NIMService pod not starting
If the pod doesn’t start, verify:
GPU resources are available: kubectl describe nodes
Image pull secrets are configured correctly
NIMCache completed successfully
kubectl describe pod -n nim-service -l app=meta-llama-3-2-1b-instruct
Pod is running but not ready
If the pod is running but not passing readiness checks:
Check the startup probe timeout (default 20 minutes)
Verify the model loaded correctly from cache
Check GPU allocation
kubectl logs -n nim-service -l app=meta-llama-3-2-1b-instruct --tail=100
Clean up
To remove the deployment:
kubectl delete nimservice meta-llama-3-2-1b-instruct -n nim-service
kubectl delete nimcache meta-llama-3-2-1b-instruct -n nim-service
kubectl delete namespace nim-service
Deleting the NIMCache will also delete the persistent volume claim and cached model artifacts.