NemoEvaluator provides a comprehensive framework for assessing LLM performance using industry-standard benchmarks, custom evaluation tasks, and automated workflows. It integrates with Argo Workflows for scalable evaluation orchestration.
Overview
NemoEvaluator enables you to:
Run standard benchmarks (MMLU, HumanEval, MT-Bench, etc.)
Evaluate custom tasks and datasets
Compare model performance across versions
Assess RAG system quality
Evaluate function calling capabilities
Track evaluation history and metrics
When to Use NemoEvaluator
Model Selection: Compare different models or versions to choose the best performer.
Fine-Tuning Validation: Assess whether fine-tuned models improve over their base models.
Regression Testing: Ensure new versions don't degrade performance.
Benchmark Reporting: Generate standardized performance reports for stakeholders.
Architecture
NemoEvaluator orchestrates evaluation workflows using Argo Workflows: the Evaluator service accepts evaluation requests over its API, submits Argo workflows that run the configured benchmark containers against the target model endpoint, and stores the results in PostgreSQL.
Configuration
Complete Example
apiVersion: apps.nvidia.com/v1alpha1
kind: NemoEvaluator
metadata:
  name: nemoevaluator-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/evaluator
    tag: "25.06"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  # Replica configuration
  replicas: 1
  # Argo Workflows configuration
  argoWorkflows:
    endpoint: https://argo-workflows-server.nemo.svc.cluster.local:2746
    serviceAccount: argo-workflows-executor
  # Vector database for storing embeddings
  vectorDB:
    endpoint: http://milvus.nemo.svc.cluster.local:19530
  # NeMo DataStore for datasets
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
  # NeMo EntityStore for model adapters
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  # PostgreSQL database for evaluation results
  databaseConfig:
    host: evaluator-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: evaldb
    credentials:
      user: evaluser
      secretName: evaluator-pg-existing-secret
      passwordKey: password
  # OpenTelemetry tracing
  otel:
    enabled: true
    exporterOtlpEndpoint: http://evaluator-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
    logLevel: INFO
    excludedUrls:
      - health
  # Logging configuration
  evalLogLevel: INFO
  logHandlers: console
  consoleLogLevel: INFO
  # Enable validation jobs
  enableValidation: true
  # Evaluation framework images
  evaluationImages:
    bigcodeEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-bigcode:0.12.21"
    lmEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-lm-eval-harness:0.12.21"
    similarityMetrics: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-custom-eval:0.12.21"
    llmAsJudge: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
    mtBench: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
    retriever: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-retriever:0.12.21"
    rag: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-rag:0.12.21"
    bfcl: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-bfcl:25.6.1"
    agenticEval: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-agentic-eval:25.6.1"
Key Configuration Fields
argoWorkflows.endpoint: Argo Workflows server URL used for orchestrating evaluation jobs (typically HTTPS on port 2746).
argoWorkflows.serviceAccount: ServiceAccount used by Argo to execute workflow pods.
vectorDB.endpoint: Vector database endpoint for storing and querying embeddings during RAG evaluation.
datastore.endpoint: NemoDatastore endpoint for accessing evaluation datasets.
entitystore.endpoint: NemoEntitystore endpoint for accessing model adapters to evaluate.
evaluationImages: Container images for the different evaluation frameworks and benchmarks.
enableValidation: Enable dataset validation before running evaluations.
evalLogLevel: Evaluation job log level. Options: INFO, DEBUG.
Evaluation Frameworks
NemoEvaluator supports multiple evaluation frameworks:
LM Eval Harness
BigCode Eval
MT-Bench
RAG Evaluation
Function Calling
LM Eval Harness
Standard benchmarks for language models:
MMLU (Massive Multitask Language Understanding)
HellaSwag
ARC (AI2 Reasoning Challenge)
TruthfulQA
And 60+ other tasks
evaluationImages:
  lmEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-lm-eval-harness:0.12.21"
BigCode Eval
Code generation benchmarks:
HumanEval
MBPP (Mostly Basic Python Problems)
CodeContests
evaluationImages:
  bigcodeEvalHarness: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-bigcode:0.12.21"
MT-Bench
Multi-turn conversation quality:
Conversation coherence
Multi-turn understanding
LLM-as-a-judge evaluation
evaluationImages:
  mtBench: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-llm-as-a-judge:0.12.21"
RAG Evaluation
Retrieval-augmented generation quality:
Answer relevance
Faithfulness to sources
Retrieval precision
evaluationImages:
  rag: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-rag:0.12.21"
  retriever: "nvcr.io/nvidia/nemo-microservices/eval-tool-benchmark-retriever:0.12.21"
Function Calling
Function calling and tool use:
Berkeley Function Calling Leaderboard (BFCL)
Tool selection accuracy
Parameter extraction
evaluationImages:
  bfcl: "nvcr.io/nvidia/nemo-microservices/eval-factory-benchmark-bfcl:25.6.1"
Prerequisites
Argo Workflows
Install Argo Workflows in your cluster:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml
Create a ServiceAccount and RBAC permissions for the workflow executor:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflows-executor
  namespace: nemo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflows-executor
  namespace: nemo
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "watch", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-workflows-executor
  namespace: nemo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-workflows-executor
subjects:
  - kind: ServiceAccount
    name: argo-workflows-executor
    namespace: nemo
Vector Database (Milvus)
For RAG evaluations, deploy Milvus:
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm install milvus milvus/milvus -n nemo
API Usage
Create Evaluation Job
curl -X POST http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations \
-H "Content-Type: application/json" \
-d '{
"name": "llama3-mmlu-eval",
"model_endpoint": "http://meta-llama-3-1-8b-instruct.nemo.svc.cluster.local:8000/v1",
"framework": "lm-eval-harness",
"tasks": ["mmlu"],
"num_fewshot": 5
}'
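The same request can be submitted from Python. A minimal sketch, assuming the request body fields shown in the curl example above; the helper names are illustrative and not part of any SDK:

```python
"""Submit an evaluation job to the NemoEvaluator API (sketch)."""

EVALUATOR_URL = "http://nemoevaluator-sample.nemo.svc.cluster.local:8000"


def build_eval_request(name, model_endpoint, tasks,
                       framework="lm-eval-harness", num_fewshot=5):
    """Assemble the request body used by POST /v1/evaluations."""
    return {
        "name": name,
        "model_endpoint": model_endpoint,
        "framework": framework,
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,
    }


def submit_evaluation(body, base_url=EVALUATOR_URL):
    """POST the body to the evaluations endpoint and return the parsed response."""
    import requests  # third-party HTTP client

    resp = requests.post(f"{base_url}/v1/evaluations", json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

`submit_evaluation(build_eval_request(...))` mirrors the curl call; adjust the fields to match the API reference for your release.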
Monitor Evaluation
curl http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations/llama3-mmlu-eval
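For unattended runs, the status endpoint can be polled until the job reaches a terminal state. A standard-library sketch; the status values here are assumptions, so check the API reference for the exact set returned by your release:

```python
import json
import time
import urllib.request

# Assumed terminal status values; verify against the API reference.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def is_terminal(status):
    """True once a job status means no further polling is useful."""
    return status.lower() in TERMINAL_STATUSES


def wait_for_evaluation(base_url, name, interval=30.0, timeout=3600.0):
    """Poll GET /v1/evaluations/{name} until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/v1/evaluations/{name}") as resp:
            job = json.load(resp)
        if is_terminal(job.get("status", "")):
            return job
        time.sleep(interval)
    raise TimeoutError(f"evaluation {name} did not finish within {timeout}s")
```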
Retrieve Results
curl http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/evaluations/llama3-mmlu-eval/results
Compare Models
curl -X POST http://nemoevaluator-sample.nemo.svc.cluster.local:8000/v1/comparisons \
-H "Content-Type: application/json" \
-d '{
"evaluations": [
"base-model-mmlu-eval",
"finetuned-model-mmlu-eval"
]
}'
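Once both evaluations finish, their scores can also be compared client-side. A minimal sketch that assumes each results payload can be reduced to a flat {task: score} mapping; the actual response schema may differ:

```python
def diff_scores(base, candidate):
    """Per-task score delta (candidate minus base) over the shared tasks.

    Positive values mean the candidate improved on that task; tasks present
    in only one payload are ignored.
    """
    shared = set(base) & set(candidate)
    return {task: candidate[task] - base[task] for task in sorted(shared)}
```

This is useful for flagging task-specific regressions that an aggregate score would hide.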
Integration with Services
NemoDatastore Integration
Fetch evaluation datasets from NemoDatastore:
datastore:
  endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000/v1/hf
Datasets can be stored in HuggingFace format or custom formats.
NemoEntitystore Integration
Evaluate fine-tuned adapters:
entitystore:
  endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
Retrieve adapters by name or ID for evaluation.
Vector Database
Required for RAG evaluations:
vectorDB:
  endpoint: http://milvus.nemo.svc.cluster.local:19530
Best Practices
Benchmark Selection
Choose benchmarks relevant to your use case
Use multiple metrics for comprehensive assessment
Include domain-specific custom evaluations
Track evaluation results over time to spot trends
Resource Management
Evaluation jobs can be resource-intensive
Use appropriate node selectors for GPU jobs
Set reasonable timeouts to prevent runaway jobs
Clean up completed workflows regularly
Interpreting Results
Compare against baseline models
Look for consistent improvements across tasks
Watch for task-specific regressions
Validate results with human evaluation
Operations
Run evaluations in a dedicated namespace
Use a separate database for evaluation results
Enable monitoring and alerting
Archive old evaluation results
Custom Evaluations
Create custom evaluation tasks:
# custom_eval.py
from typing import Dict, List

import requests


def compute_score(prediction: str, expected: str) -> float:
    """Simple exact-match scorer; replace with task-specific logic."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def evaluate_custom_task(
    model_endpoint: str,
    test_cases: List[Dict],
) -> Dict:
    """Custom evaluation logic."""
    results = []
    for case in test_cases:
        response = requests.post(
            f"{model_endpoint}/chat/completions",
            json={
                "messages": [{"role": "user", "content": case["input"]}]
            },
        )
        response.raise_for_status()
        prediction = response.json()["choices"][0]["message"]["content"]
        results.append(compute_score(prediction, case["expected"]))
    return {
        "accuracy": sum(results) / len(results),
        "num_samples": len(test_cases),
    }
Troubleshooting
Workflows Not Starting
Check:
Argo server is running and accessible
ServiceAccount has proper permissions
Workflow templates are valid
Pod logs for specific error messages
Slow or Timing-Out Evaluations
Solutions:
Increase timeout values in evaluation config
Reduce batch size or number of samples
Use faster storage for datasets
Check network connectivity to model endpoint
Database Connection Issues
Verify:
PostgreSQL is accessible
Database and user exist
Credentials are correct
Init container completed successfully
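A quick way to separate network problems from credential problems is a plain TCP reachability check against the PostgreSQL host before debugging authentication. A standard-library sketch:

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the configuration above, `check_tcp("evaluator-pg-postgresql.nemo.svc.cluster.local", 5432)` returning False points at DNS, NetworkPolicy, or the database pod itself rather than credentials.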
Next Steps
Deploy Models: Set up models to evaluate
Custom Metrics: Build custom evaluation tasks
Argo Workflows: Learn more about Argo Workflows
API Reference: Detailed API documentation