NemoEvaluator provides a comprehensive framework for assessing LLM performance using industry-standard benchmarks, custom evaluation tasks, and automated workflows. It integrates with Argo Workflows for scalable evaluation orchestration.

Overview

NemoEvaluator enables you to:
  • Run standard benchmarks (MMLU, HumanEval, MT-Bench, etc.)
  • Evaluate custom tasks and datasets
  • Compare model performance across versions
  • Assess RAG system quality
  • Evaluate function calling capabilities
  • Track evaluation history and metrics

When to Use NemoEvaluator

Model Selection

Compare different models or versions to choose the best performer

Fine-Tuning Validation

Assess if fine-tuned models improve over base models

Regression Testing

Ensure new versions don’t degrade performance

Benchmark Reporting

Generate standardized performance reports for stakeholders

Architecture

NemoEvaluator orchestrates evaluation workflows using Argo Workflows: the evaluator service submits each evaluation as a workflow to the Argo server, which runs the benchmark containers as pods in the cluster.

Configuration

Complete Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoEvaluator
metadata:
  name: nemoevaluator-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/evaluator
    tag: "25.06"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  
  # Replica configuration
  replicas: 1
  
  # Argo Workflows configuration
  argoWorkflows:
    endpoint: https://argo-workflows-server.nemo.svc.cluster.local:2746
    serviceAccount: argo-workflows-executor
  
  # Vector database for storing embeddings
  vectorDB:
    endpoint: http://milvus.nemo.svc.cluster.local:19530
  
  # NeMo DataStore for datasets
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
  
  # NeMo EntityStore for model adapters
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  
  # PostgreSQL database for evaluation results
  databaseConfig:
    host: evaluator-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: evaldb
    credentials:
      user: evaluser
      secretName: evaluator-pg-existing-secret
      passwordKey: password
  
  # OpenTelemetry tracing
  otel:
    enabled: true
    exporterOtlpEndpoint: http://evaluator-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
    logLevel: INFO
    excludedUrls:
      - health
  
  # Logging configuration
  evalLogLevel: INFO
  logHandlers: console
  consoleLogLevel: INFO
  
  # Enable validation jobs
  enableValidation: true
  
  # Evaluation framework images
  evaluationImages:
    bigcodeEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-bigcode:0.12.21"
    lmEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-lm-eval-harness:0.12.21"
    similarityMetrics: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-custom-eval:0.12.21"
    llmAsJudge: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
    mtBench: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
    retriever: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-retriever:0.12.21"
    rag: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-rag:0.12.21"
    bfcl: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-bfcl:25.6.1"
    agenticEval: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-agentic-eval:25.6.1"

Key Configuration Fields

spec.argoWorkflows (object, required)
Argo Workflows server configuration for orchestrating evaluation jobs.
  • endpoint (string, required): Argo Workflows server URL (typically HTTPS on port 2746).
  • serviceAccount (string, required): ServiceAccount used by Argo to execute workflow pods.
spec.vectorDB (object, required)
Vector database endpoint for storing and querying embeddings during RAG evaluation.
spec.datastore (object, required)
NemoDatastore endpoint for accessing evaluation datasets.
spec.entitystore (object, required)
NemoEntitystore endpoint for accessing model adapters to evaluate.
spec.evaluationImages (object, required)
Container images for the different evaluation frameworks and benchmarks.
spec.enableValidation (boolean, default: true)
Enable dataset validation before running evaluations.
spec.evalLogLevel (string, default: INFO)
Evaluation job log level. Options: INFO, DEBUG.

Evaluation Frameworks

NemoEvaluator supports multiple evaluation frameworks. The LM Evaluation Harness, for example, provides standard benchmarks for language models:
  • MMLU (Massive Multitask Language Understanding)
  • HellaSwag
  • ARC (AI2 Reasoning Challenge)
  • TruthfulQA
  • And 60+ other tasks
Configure the corresponding framework image under evaluationImages:
evaluationImages:
  lmEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-lm-eval-harness:0.12.21"

Prerequisites

Argo Workflows

Install Argo Workflows in your cluster:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml
Create a ServiceAccount for the workflow executor:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflows-executor
  namespace: nemo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflows-executor
  namespace: nemo
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "watch", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-workflows-executor
  namespace: nemo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-workflows-executor
subjects:
  - kind: ServiceAccount
    name: argo-workflows-executor
    namespace: nemo

Vector Database (Milvus)

For RAG evaluations, deploy Milvus:
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm install milvus milvus/milvus -n nemo

API Usage

Create Evaluation Job

curl -X POST http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3-mmlu-eval",
    "model_endpoint": "http://meta-llama-3-1-8b-instruct.nemo.svc.cluster.local:8000/v1",
    "framework": "lm-eval-harness",
    "tasks": ["mmlu"],
    "num_fewshot": 5
  }'
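The same request can be issued from Python. A minimal client sketch using requests, assuming the service URL and request fields shown in the curl example above; `build_eval_request` and `create_evaluation` are illustrative helpers, not part of the Evaluator API:

```python
import requests

EVALUATOR_URL = "http://nemoevaluator-sample.nemo.svc.cluster.local:8000"


def build_eval_request(name, model_endpoint, tasks,
                       framework="lm-eval-harness", num_fewshot=5):
    """Build the JSON body for POST /v1/evaluations (fields as in the curl example)."""
    return {
        "name": name,
        "model_endpoint": model_endpoint,
        "framework": framework,
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,
    }


def create_evaluation(payload, base_url=EVALUATOR_URL):
    """Submit the evaluation job and return the parsed JSON response."""
    resp = requests.post(f"{base_url}/v1/evaluations", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

For example, `create_evaluation(build_eval_request("llama3-mmlu-eval", "http://meta-llama-3-1-8b-instruct.nemo.svc.cluster.local:8000/v1", ["mmlu"]))` submits the same job as the curl command above.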

Monitor Evaluation

curl http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations/llama3-mmlu-eval

Retrieve Results

curl http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations/llama3-mmlu-eval/results
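Results can be post-processed client-side for dashboards or reports. A small helper sketch that flattens per-task metrics into "task/metric" keys; the response shape (a top-level "results" map of task names to metric dicts) is an assumption and may differ by Evaluator version:

```python
def summarize_results(results: dict) -> dict:
    """Flatten per-task metric dicts into 'task/metric' keys.

    Assumes a response shaped like {"results": {"mmlu": {"acc": 0.66}}}.
    Non-numeric entries (e.g. aliases) are skipped.
    """
    summary = {}
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                summary[f"{task}/{metric}"] = round(float(value), 4)
    return summary
```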

Compare Models

curl -X POST http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/comparisons \
  -H "Content-Type: application/json" \
  -d '{
    "evaluations": [
      "base-model-mmlu-eval",
      "finetuned-model-mmlu-eval"
    ]
  }'
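Comparison results can also be computed client-side from two retrieved result sets. A sketch that reports per-metric deltas (candidate minus baseline), assuming both inputs are flat metric dicts such as {"mmlu/acc": 0.66}; this is an illustrative helper, not the /v1/comparisons response format:

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-metric deltas (candidate - baseline) for metrics present in both runs."""
    deltas = {}
    for key, base_val in baseline.items():
        if key in candidate:
            deltas[key] = round(candidate[key] - base_val, 4)
    return deltas
```

A positive delta on a metric like accuracy indicates the fine-tuned model improved over the base model on that task.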

Integration with Services

NemoDatastore Integration

Fetch evaluation datasets from NemoDatastore:
datastore:
  endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
Datasets can be stored in Hugging Face format or custom formats.

NemoEntitystore Integration

Evaluate fine-tuned adapters:
entitystore:
  endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
Retrieve adapters by name or ID for evaluation.

Vector Database

Required for RAG evaluations:
vectorDB:
  endpoint: http://milvus.nemo.svc.cluster.local:19530

Best Practices

Benchmark selection:
  • Choose benchmarks relevant to your use case
  • Use multiple metrics for comprehensive assessment
  • Include domain-specific custom evaluations
  • Track evaluation results over time to identify trends
Resource management:
  • Plan capacity for evaluation jobs, which can be resource-intensive
  • Use appropriate node selectors for GPU jobs
  • Set reasonable timeouts to prevent runaway jobs
  • Clean up completed workflows regularly
Interpreting results:
  • Compare against baseline models
  • Look for consistent improvements across tasks
  • Watch for task-specific regressions
  • Validate results with human evaluation
Operations:
  • Run evaluations in a dedicated namespace
  • Use a separate database for evaluation results
  • Enable monitoring and alerting
  • Archive old evaluation results
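Cleaning up completed workflows can be automated with Argo's built-in garbage collection. A sketch of the relevant fields in a Workflow spec (these belong in your Argo workflow templates, not in the NemoEvaluator resource; retention periods are illustrative):

```yaml
spec:
  # Delete the Workflow object after it finishes
  ttlStrategy:
    secondsAfterCompletion: 86400   # keep completed workflows for 1 day
    secondsAfterFailure: 259200     # keep failures longer for debugging
  # Delete workflow pods once the workflow succeeds
  podGC:
    strategy: OnWorkflowSuccess
```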

Custom Evaluations

Create custom evaluation tasks:
# custom_eval.py
from typing import Dict, List

import requests


def compute_score(prediction: str, expected: str) -> float:
    """Placeholder scorer (exact match). Replace with task-specific logic."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def evaluate_custom_task(
    model_endpoint: str,
    test_cases: List[Dict],
) -> Dict:
    """Run each test case against an OpenAI-compatible chat endpoint and score it."""
    results = []

    for case in test_cases:
        response = requests.post(
            f"{model_endpoint}/chat/completions",
            json={
                "messages": [{"role": "user", "content": case["input"]}]
            },
            timeout=120,
        )
        response.raise_for_status()

        prediction = response.json()["choices"][0]["message"]["content"]
        results.append(compute_score(prediction, case["expected"]))

    return {
        "accuracy": sum(results) / len(results),
        "num_samples": len(test_cases),
    }

Troubleshooting

Workflow fails to start. Check:
  • The Argo server is running and accessible
  • The ServiceAccount has the required permissions
  • Workflow templates are valid
  • Pod logs for specific error messages
Evaluation runs slowly or times out. Solutions:
  • Increase timeout values in the evaluation config
  • Reduce the batch size or number of samples
  • Use faster storage for datasets
  • Check network connectivity to the model endpoint
Database connection errors. Verify:
  • PostgreSQL is accessible
  • The database and user exist
  • Credentials are correct
  • The init container completed successfully

Next Steps

Deploy Models

Set up models to evaluate

Custom Metrics

Build custom evaluation tasks

Argo Workflows

Learn Argo Workflows

API Reference

Detailed API documentation
