NIMBuild resource

The NIMBuild resource allows you to build optimized TensorRT-LLM (TRT-LLM) engines from model weights that have been cached using NIMCache. Building custom engines can significantly improve inference performance by optimizing for your specific GPU hardware and deployment configuration.

Overview

NIMBuild creates a Kubernetes Job that:
  1. References a NIMCache resource containing model weights
  2. Builds optimized TensorRT-LLM engines for the specified profile
  3. Stores the built engine alongside the original model weights
  4. Makes the optimized engine available for NIMService deployment
NIMBuild requires that the NIMCache resource is in a Ready state with buildable profiles available.

Basic example

Here’s a basic NIMBuild configuration that builds an optimized engine from a cached model:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMBuild
metadata:
  name: llama-3-8b-engine
  namespace: default
spec:
  nimCache:
    name: llama-3-8b-cache
    profile: 8b-tp1-pp1-h100-fp16
  modelName: llama-3-8b-optimized
  image:
    repository: nvcr.io/nvidia/nim-llm
    tag: "1.2.0"
    pullSecrets:
      - ngc-secret
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: 64Gi
    requests:
      nvidia.com/gpu: 1
      memory: 32Gi
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

When to use NIMBuild

Use NIMBuild when you need:
  • Maximum inference performance - Build engines optimized for your specific GPU hardware
  • Custom model configurations - Fine-tune tensor parallelism and other engine parameters
  • Reduced startup time - Pre-built engines eliminate engine compilation overhead at service startup
  • Production deployments - Consistent performance with optimized engines
Building TensorRT-LLM engines is a resource-intensive operation that can take 30 minutes to several hours depending on model size and GPU availability. Plan accordingly.

Configuration

NIMCache reference

The nimCache field references the NIMCache resource containing the source model weights:
nimCache.name (string, required)
Name of the NIMCache resource containing the model weights.
nimCache.profile (string)
Specific profile to build from the NIMCache. If omitted and only one buildable profile exists, that profile is used automatically. If multiple buildable profiles exist, you must specify which one to build.

Model name

modelName (string)
Name for the built engine model. If not specified, defaults to the NIMBuild resource name. This name is used in the manifest and can be referenced by NIMService.

Image configuration

image (object, required)
Container image used for building the TRT-LLM engine.

Resource requirements

resources (object)
Resource requests and limits for the build job.
Resource recommendations by model size:
  • Small models (under 7B): 1 GPU, 32Gi memory
  • Medium models (7B-70B): 1-2 GPUs, 64Gi memory
  • Large models (70B+): 2-8 GPUs, 128Gi memory
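Applying this guidance, a build for a 70B-class model might be sized as follows. The exact figures are illustrative, not prescribed values; tune them to your model and hardware:

```yaml
# Illustrative sizing for a 70B-class model build; adjust to your cluster.
resources:
  requests:
    nvidia.com/gpu: 4
    memory: 64Gi
  limits:
    nvidia.com/gpu: 4
    memory: 128Gi
```

Setting requests below limits lets the scheduler place the job sooner while still allowing the build to use the full limit if available.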

Scheduling

nodeSelector (object)
Node selector labels to schedule the build job on specific nodes. Use this to target nodes with specific GPU types.
tolerations (array)
Tolerations that allow the build job to run on tainted nodes.

Additional configuration

env (array)
Additional environment variables for the build container.
labels (object)
Additional labels to apply to the build job.
annotations (object)
Additional annotations to apply to the build job.
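A sketch of how these fields fit together in a NIMBuild spec. The environment variable, label, and annotation values are placeholders, not required settings:

```yaml
spec:
  env:
    - name: LOG_LEVEL          # placeholder; use variables your build image supports
      value: "INFO"
  labels:
    team: inference-platform   # illustrative label
  annotations:
    example.com/build-owner: ml-infra   # illustrative annotation
```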

Status monitoring

Monitor the NIMBuild status to track the build progress:
kubectl get nimbuild llama-3-8b-engine -o jsonpath='{.status.state}'

Status states

  • Pending - Waiting for NIMCache to be ready or for resources
  • Started - Build job has been created
  • InProgress - Engine build is in progress
  • Ready - Engine build completed successfully
  • Failed - Build failed (check pod logs for details)
  • NotReady - Build job not yet ready

Checking build progress

View detailed status:
kubectl describe nimbuild llama-3-8b-engine
Check build pod logs:
kubectl logs -l nimbuild=llama-3-8b-engine -f

Using built engines with NIMService

Once the NIMBuild is Ready, reference it in your NIMService:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-8b-service
spec:
  image:
    repository: nvcr.io/nvidia/nim-llm
    tag: "1.2.0"
  storage:
    nimCache:
      name: llama-3-8b-cache
      profile: llama-3-8b-optimized  # Use the built engine
  resources:
    limits:
      nvidia.com/gpu: 1

Complete example

Here’s a complete example showing how NIMCache, NIMBuild, and NIMService work together, starting with the NIMCache that caches a buildable profile:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-8b-cache
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nvidia/nim-llm:1.2.0
      pullSecret: ngc-secret
      model:
        profiles:
          - llama-3-8b-base
        precision: fp16
        tensorParallelism: 1
        pipelineParallelism: 1
        gpus:
          - product: NVIDIA-H100-80GB-HBM3
        buildable: true
  storage:
    pvc:
      create: true
      storageClass: local-path
      size: 100Gi
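The NIMBuild and NIMService then tie into this cache. This sketch reuses the resources from the earlier examples, with the NIMBuild pointed at the llama-3-8b-base profile cached above and the NIMService consuming the built engine by its modelName:

```yaml
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMBuild
metadata:
  name: llama-3-8b-engine
spec:
  nimCache:
    name: llama-3-8b-cache
    profile: llama-3-8b-base       # the buildable profile cached above
  modelName: llama-3-8b-optimized
  image:
    repository: nvcr.io/nvidia/nim-llm
    tag: "1.2.0"
    pullSecrets:
      - ngc-secret
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: 64Gi
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-8b-service
spec:
  image:
    repository: nvcr.io/nvidia/nim-llm
    tag: "1.2.0"
  storage:
    nimCache:
      name: llama-3-8b-cache
      profile: llama-3-8b-optimized   # the built engine, referenced by modelName
  resources:
    limits:
      nvidia.com/gpu: 1
```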

Troubleshooting

Build fails with “NIMCache not found”

Ensure the NIMCache resource exists and is in the same namespace:
kubectl get nimcache -n <namespace>

Build fails with “Multiple buildable profiles found”

Specify the profile field in the nimCache reference to select which profile to build.
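For example, pin the build to one of the cache's buildable profiles (the profile name here is illustrative):

```yaml
spec:
  nimCache:
    name: llama-3-8b-cache
    profile: 8b-tp1-pp1-h100-fp16   # pick one of the cache's buildable profiles
```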

Build pod stays in Pending

Check for resource constraints:
kubectl describe pod -l nimbuild=<name>
Common causes:
  • Insufficient GPU nodes
  • Resource requests too high
  • Node selector doesn’t match any nodes

Build fails during execution

Check the build pod logs:
kubectl logs -l nimbuild=<name>
Common issues:
  • Insufficient memory (increase memory limits)
  • Invalid model configuration
  • GPU incompatibility

Best practices

  1. Use specific GPU selectors - Target specific GPU types with nodeSelector for consistent builds
  2. Allocate sufficient resources - Building large models requires significant memory and GPU resources
  3. Monitor build time - Track build duration to optimize resource allocation
  4. Store built engines - Use persistent storage to avoid rebuilding engines
  5. Test before production - Validate built engines with test workloads before deploying to production
