NemoEvaluator provides a comprehensive framework for assessing LLM performance using industry-standard benchmarks, custom evaluation tasks, and automated workflows. It integrates with Argo Workflows for scalable evaluation orchestration.
Overview
NemoEvaluator enables you to:
Run standard benchmarks (MMLU, HumanEval, MT-Bench, etc.)
Evaluate custom tasks and datasets
Compare model performance across versions
Assess RAG system quality
Evaluate function calling capabilities
Track evaluation history and metrics
When to Use NemoEvaluator
Model Selection: Compare different models or versions to choose the best performer.
Fine-Tuning Validation: Assess whether fine-tuned models improve over their base models.
Regression Testing: Ensure new versions don't degrade performance.
Benchmark Reporting: Generate standardized performance reports for stakeholders.
Architecture
NemoEvaluator orchestrates evaluation workflows using Argo Workflows: the Evaluator service accepts evaluation requests over its API, submits Argo workflows that run the configured benchmark containers against the target model endpoint, and stores the results in PostgreSQL.
Configuration
Complete Example
apiVersion: apps.nvidia.com/v1alpha1
kind: NemoEvaluator
metadata:
  name: nemoevaluator-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/evaluator
    tag: "25.06"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  # Replica configuration
  replicas: 1
  # Argo Workflows configuration
  argoWorkflows:
    endpoint: https://argo-workflows-server.nemo.svc.cluster.local:2746
    serviceAccount: argo-workflows-executor
  # Vector database for storing embeddings
  vectorDB:
    endpoint: http://milvus.nemo.svc.cluster.local:19530
  # NeMo DataStore for datasets
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
  # NeMo EntityStore for model adapters
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  # PostgreSQL database for evaluation results
  databaseConfig:
    host: evaluator-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: evaldb
    credentials:
      user: evaluser
      secretName: evaluator-pg-existing-secret
      passwordKey: password
  # OpenTelemetry tracing
  otel:
    enabled: true
    exporterOtlpEndpoint: http://evaluator-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
    logLevel: INFO
    excludedUrls:
      - health
  # Logging configuration
  evalLogLevel: INFO
  logHandlers: console
  consoleLogLevel: INFO
  # Enable validation jobs
  enableValidation: true
  # Evaluation framework images
  evaluationImages:
    bigcodeEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-bigcode:0.12.21"
    lmEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-lm-eval-harness:0.12.21"
    similarityMetrics: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-custom-eval:0.12.21"
    llmAsJudge: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
    mtBench: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
    retriever: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-retriever:0.12.21"
    rag: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-rag:0.12.21"
    bfcl: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-bfcl:25.6.1"
    agenticEval: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-agentic-eval:25.6.1"
Key Configuration Fields
argoWorkflows.endpoint: Argo Workflows server URL used for orchestrating evaluation jobs (typically HTTPS on port 2746).
argoWorkflows.serviceAccount: ServiceAccount used by Argo to execute workflow pods.
vectorDB.endpoint: Vector database endpoint for storing and querying embeddings during RAG evaluation.
datastore.endpoint: NemoDatastore endpoint for accessing evaluation datasets.
entitystore.endpoint: NemoEntitystore endpoint for accessing model adapters to evaluate.
evaluationImages: Container images for the different evaluation frameworks and benchmarks.
enableValidation: Enable dataset validation before running evaluations.
evalLogLevel: Evaluation job log level. Options: INFO, DEBUG.
Evaluation Frameworks
NemoEvaluator supports multiple evaluation frameworks:
LM Eval Harness
BigCode Eval
MT-Bench
RAG Evaluation
Function Calling
LM Eval Harness
Standard benchmarks for language models:
MMLU (Massive Multitask Language Understanding)
HellaSwag
ARC (AI2 Reasoning Challenge)
TruthfulQA
And 60+ other tasks
evaluationImages:
  lmEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-lm-eval-harness:0.12.21"
BigCode Eval
Code generation benchmarks:
HumanEval
MBPP (Mostly Basic Python Problems)
CodeContests
evaluationImages:
  bigcodeEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-bigcode:0.12.21"
MT-Bench
Multi-turn conversation quality:
Conversation coherence
Multi-turn understanding
LLM-as-a-judge evaluation
evaluationImages:
  mtBench: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
RAG Evaluation
Retrieval-augmented generation quality:
Answer relevance
Faithfulness to sources
Retrieval precision
evaluationImages:
  rag: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-rag:0.12.21"
  retriever: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-retriever:0.12.21"
Function Calling
Function calling and tool use:
Berkeley Function Calling Leaderboard (BFCL)
Tool selection accuracy
Parameter extraction
evaluationImages:
  bfcl: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-bfcl:25.6.1"
Prerequisites
Argo Workflows
Install Argo Workflows in your cluster:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml
Create a ServiceAccount and RBAC permissions for the workflow executor:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflows-executor
  namespace: nemo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflows-executor
  namespace: nemo
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "watch", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-workflows-executor
  namespace: nemo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-workflows-executor
subjects:
  - kind: ServiceAccount
    name: argo-workflows-executor
    namespace: nemo
Vector Database (Milvus)
For RAG evaluations, deploy Milvus:
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm install milvus milvus/milvus -n nemo
API Usage
Create Evaluation Job
curl -X POST http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations \
-H "Content-Type: application/json" \
-d '{
"name": "llama3-mmlu-eval",
"model_endpoint": "http://meta-llama-3-1-8b-instruct.nemo.svc.cluster.local:8000/v1",
"framework": "lm-eval-harness",
"tasks": ["mmlu"],
"num_fewshot": 5
}'
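The same request can be submitted from Python. A minimal sketch, assuming the request body fields shown in the curl example above; the helper names are illustrative and not part of any SDK:

```python
"""Submit an evaluation job to the NemoEvaluator API (sketch)."""

EVALUATOR_URL = "http://nemoevaluator-sample.nemo.svc.cluster.local:8000"


def build_eval_request(name, model_endpoint, tasks,
                       framework="lm-eval-harness", num_fewshot=5):
    """Assemble the request body used by POST /v1/evaluations."""
    return {
        "name": name,
        "model_endpoint": model_endpoint,
        "framework": framework,
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,
    }


def submit_evaluation(body, base_url=EVALUATOR_URL):
    """POST the body to the evaluations endpoint and return the parsed response."""
    import requests  # third-party HTTP client

    resp = requests.post(f"{base_url}/v1/evaluations", json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

`submit_evaluation(build_eval_request(...))` mirrors the curl call; adjust the fields to match the API reference for your release.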
Monitor Evaluation
curl http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations/llama3-mmlu-eval
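For unattended runs, the status endpoint can be polled until the job reaches a terminal state. A standard-library sketch; the status values here are assumptions, so check the API reference for the exact set returned by your release:

```python
import json
import time
import urllib.request

# Assumed terminal status values; verify against the API reference.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def is_terminal(status):
    """True once a job status means no further polling is useful."""
    return status.lower() in TERMINAL_STATUSES


def wait_for_evaluation(base_url, name, interval=30.0, timeout=3600.0):
    """Poll GET /v1/evaluations/{name} until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/v1/evaluations/{name}") as resp:
            job = json.load(resp)
        if is_terminal(job.get("status", "")):
            return job
        time.sleep(interval)
    raise TimeoutError(f"evaluation {name} did not finish within {timeout}s")
```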
Retrieve Results
curl http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations/llama3-mmlu-eval/results
Compare Models
curl -X POST http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/comparisons \
-H "Content-Type: application/json" \
-d '{
"evaluations": [
"base-model-mmlu-eval",
"finetuned-model-mmlu-eval"
]
}'
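Once both evaluations finish, their scores can also be compared client-side. A minimal sketch that assumes each results payload can be reduced to a flat {task: score} mapping; the actual response schema may differ:

```python
def diff_scores(base, candidate):
    """Per-task score delta (candidate minus base) over the shared tasks.

    Positive values mean the candidate improved on that task; tasks present
    in only one payload are ignored.
    """
    shared = set(base) & set(candidate)
    return {task: candidate[task] - base[task] for task in sorted(shared)}
```

This is useful for flagging task-specific regressions that an aggregate score would hide.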
Integration with Services
NemoDatastore Integration
Fetch evaluation datasets from NemoDatastore:
datastore:
  endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
Datasets can be stored in HuggingFace format or custom formats.
NemoEntitystore Integration
Evaluate fine-tuned adapters:
entitystore:
  endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
Retrieve adapters by name or ID for evaluation.
Vector Database
Required for RAG evaluations:
vectorDB:
  endpoint: http://milvus.nemo.svc.cluster.local:19530
Best Practices
Benchmark Selection
Choose benchmarks relevant to your use case
Use multiple metrics for comprehensive assessment
Include domain-specific custom evaluations
Track evaluation results over time to spot trends
Resource Management
Evaluation jobs can be resource-intensive
Use appropriate node selectors for GPU jobs
Set reasonable timeouts to prevent runaway jobs
Clean up completed workflows regularly
Interpreting Results
Compare against baseline models
Look for consistent improvements across tasks
Watch for task-specific regressions
Validate results with human evaluation
Operations
Run evaluations in a dedicated namespace
Use a separate database for evaluation results
Enable monitoring and alerting
Archive old evaluation results
Custom Evaluations
Create custom evaluation tasks:
# custom_eval.py
from typing import Dict, List

import requests


def compute_score(prediction: str, expected: str) -> float:
    """Simple exact-match scorer; replace with task-specific logic."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def evaluate_custom_task(
    model_endpoint: str,
    test_cases: List[Dict],
) -> Dict:
    """Custom evaluation logic."""
    results = []
    for case in test_cases:
        response = requests.post(
            f"{model_endpoint}/chat/completions",
            json={
                "messages": [{"role": "user", "content": case["input"]}]
            },
        )
        response.raise_for_status()
        prediction = response.json()["choices"][0]["message"]["content"]
        results.append(compute_score(prediction, case["expected"]))
    return {
        "accuracy": sum(results) / len(results),
        "num_samples": len(test_cases),
    }
Troubleshooting
Workflows Not Starting
Check:
Argo server is running and accessible
ServiceAccount has proper permissions
Workflow templates are valid
Pod logs for specific error messages
Slow or Timing-Out Evaluations
Solutions:
Increase timeout values in evaluation config
Reduce batch size or number of samples
Use faster storage for datasets
Check network connectivity to model endpoint
Database Connection Issues
Verify:
PostgreSQL is accessible
Database and user exist
Credentials are correct
Init container completed successfully
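A quick way to separate network problems from credential problems is a plain TCP reachability check against the PostgreSQL host before debugging authentication. A standard-library sketch:

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the configuration above, `check_tcp("evaluator-pg-postgresql.nemo.svc.cluster.local", 5432)` returning False points at DNS, NetworkPolicy, or the database pod itself rather than credentials.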
Next Steps
Deploy Models: Set up models to evaluate
Custom Metrics: Build custom evaluation tasks
Argo Workflows: Learn more about Argo Workflows
API Reference: Detailed API documentation