Evaluation is essential for building reliable LLM applications. LangSmith provides a comprehensive framework for running evaluations, from simple assertions to complex LLM-as-judge patterns.
## What is evaluation?
Evaluation measures how well your application performs on specific tasks or datasets. In LangSmith, you:
- Define a target function (the system you’re testing)
- Create or use a dataset (test cases with expected inputs/outputs)
- Write evaluators (functions that score outputs)
- Run the evaluation and analyze results
## Evaluators
An evaluator is a function that takes a run and optionally an example, then returns a score or feedback.
### Basic evaluator structure
```python
from langsmith.schemas import Run, Example

def exact_match(run: Run, example: Example) -> dict:
    """Check if the output exactly matches the expected output."""
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    return {
        "key": "exact_match",
        "score": 1 if prediction == expected else 0,
    }
```
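Because an evaluator like this only reads `run.outputs` and `example.outputs`, you can smoke-test it locally without any LangSmith infrastructure. A minimal sketch using `SimpleNamespace` stand-ins for the real `Run`/`Example` objects (type hints dropped so no SDK import is needed):

```python
from types import SimpleNamespace

# The evaluator only touches .outputs, so lightweight stand-ins
# are enough for a local smoke test — no LangSmith objects required.
def exact_match(run, example) -> dict:
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    return {"key": "exact_match", "score": 1 if prediction == expected else 0}

run = SimpleNamespace(outputs={"answer": "Paris"})
example = SimpleNamespace(outputs={"answer": "Paris"})

assert exact_match(run, example)["score"] == 1
```

Testing evaluators this way before wiring them into an experiment catches shape bugs (missing keys, wrong types) cheaply.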
Evaluators return a result with these fields:
```typescript
interface EvaluationResult {
  // Required: unique identifier for this evaluation metric
  key: string;
  // Optional: numeric or boolean score
  score?: number | boolean;
  // Optional: additional value (can be any type)
  value?: any;
  // Optional: explanation or comment
  comment?: string;
  // Optional: suggested correction
  correction?: Record<string, unknown>;
  // Optional: ID of the run that produced this evaluation
  sourceRunId?: string;
  // Optional: ID of the run being evaluated (defaults to root run)
  targetRunId?: string;
}
```
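In the Python SDK the same shape is just a dictionary returned from your evaluator. A sketch of a fully populated result (the concrete values here are made up for illustration):

```python
# A fully populated evaluation result as a plain Python dict.
# Field names mirror the EvaluationResult interface above;
# the concrete values are illustrative only.
result = {
    "key": "exact_match",       # required: metric identifier
    "score": 1,                 # optional: numeric or boolean score
    "value": {"prediction": "Paris", "expected": "Paris"},  # optional payload
    "comment": "Prediction matched the expected answer exactly.",
}
```

Only `key` is required; every other field may be omitted.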
### Multiple scores from one evaluator
Return multiple metrics from a single evaluator:
```python
def comprehensive_eval(run: Run, example: Example) -> dict:
    """Return multiple evaluation metrics."""
    prediction = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    return {
        "results": [
            {
                "key": "exact_match",
                "score": 1 if prediction == expected else 0,
            },
            {
                "key": "length_difference",
                "score": abs(len(prediction) - len(expected)),
            },
            {
                "key": "has_content",
                "score": len(prediction.strip()) > 0,
            },
        ]
    }
```
## Running evaluations

### Using `client.evaluate()`

The `evaluate()` method runs your target function on a dataset and applies evaluators:
```python
from langsmith import Client

client = Client()

# Define your target function
def my_agent(inputs: dict) -> dict:
    # Your application logic
    question = inputs["question"]
    answer = process_question(question)
    return {"answer": answer}

# Run evaluation
results = client.evaluate(
    my_agent,
    data="my-dataset-name",  # or dataset_id
    evaluators=[exact_match, comprehensive_eval],
    experiment_prefix="agent-v1",
    metadata={"version": "1.0", "model": "gpt-4"},
)

# Access results
for result in results:
    print(f"Example {result['example'].id}: {result['evaluation_results']}")
```
### Evaluating existing runs
Evaluate runs that have already been traced:
```python
results = client.evaluate_existing(
    project_name="my-production-project",
    evaluators=[quality_check, safety_check],
    experiment_prefix="prod-eval",
)
```
## Evaluation patterns

### LLM-as-judge evaluators
Use an LLM to evaluate outputs:
```python
from langsmith import traceable
from langsmith.schemas import Run, Example
import openai

@traceable(run_type="llm")
def llm_judge(run: Run, example: Example) -> dict:
    """Use GPT-4 to evaluate response quality."""
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    prompt = f"""Compare the predicted answer to the expected answer.

Predicted: {prediction}
Expected: {expected}

Rate the prediction from 0-10, where 10 is perfect.
Respond with just the number."""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    score = float(response.choices[0].message.content.strip())
    return {
        "key": "llm_judge_score",
        "score": score,
        "comment": f"LLM rated this {score}/10",
    }
```
When using LLM-as-judge, wrap your evaluator with `@traceable` to trace the evaluation LLM calls separately.
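The judge above assumes the model replies with a bare number, and the `float(...)` call will raise if it adds any prose. A defensive parser (a hypothetical helper, not part of the LangSmith SDK) might extract the first number and clamp it to the expected range:

```python
import re

def parse_judge_score(text: str, low: float = 0.0, high: float = 10.0):
    """Extract a numeric rating from an LLM judge reply.

    Hypothetical helper: the prompt asks for a bare number, but models
    sometimes add prose, so fall back to the first number found and
    clamp it to [low, high]. Returns None if no number is present.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return None  # no number at all; caller can record a failed eval
    return min(high, max(low, float(match.group())))
```

Returning `None` (rather than raising) lets the evaluator record the failure as feedback instead of crashing the whole experiment.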
### Reference-free evaluators
Evaluators don’t always need expected outputs:
```python
def check_safety(run: Run, example: Example) -> dict:
    """Check if the response contains unsafe content."""
    response = run.outputs.get("answer", "")
    unsafe_keywords = ["dangerous", "harmful", "illegal"]
    is_safe = not any(keyword in response.lower() for keyword in unsafe_keywords)
    return {
        "key": "is_safe",
        "score": is_safe,
        "comment": "Response is safe" if is_safe else "Response contains unsafe content",
    }

def check_latency(run: Run, example: Example) -> dict:
    """Check if the response was generated quickly enough."""
    latency = (run.end_time - run.start_time).total_seconds()
    return {
        "key": "latency_ok",
        "score": latency < 5.0,  # under 5 seconds
        "value": latency,
        "comment": f"Response took {latency:.2f}s",
    }
```
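The latency evaluator depends only on the run's start and end timestamps. The core arithmetic in isolation, with made-up timestamps standing in for `run.start_time` and `run.end_time`:

```python
from datetime import datetime, timedelta

# Stand-ins for run.start_time / run.end_time (made-up values).
start_time = datetime(2024, 1, 1, 12, 0, 0)
end_time = start_time + timedelta(seconds=3, milliseconds=250)

latency = (end_time - start_time).total_seconds()  # 3.25
latency_ok = latency < 5.0  # same 5-second budget as the evaluator above
```

Note that timestamps on a `Run` may be timezone-aware; subtracting two `datetime` objects handles that as long as both carry the same awareness.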
## Experiment tracking
Each evaluation creates an experiment that tracks all runs and scores:
```python
# Run multiple experiments to compare approaches
results_v1 = client.evaluate(
    agent_v1,
    data="my-dataset",
    evaluators=[accuracy, latency],
    experiment_prefix="baseline",
)

results_v2 = client.evaluate(
    agent_v2,
    data="my-dataset",
    evaluators=[accuracy, latency],
    experiment_prefix="optimized",
)

# View results in the LangSmith UI to compare
```
Experiments help you:
- Track performance over time
- Compare different models or prompts
- Identify regressions
- Make data-driven decisions
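For a quick offline comparison between two experiments, you can average each metric across examples. The sketch below assumes you have collected per-example scores into plain dicts (the data is fabricated for illustration):

```python
from collections import defaultdict

def mean_scores(rows: list[dict]) -> dict[str, float]:
    """Average each metric's score across examples.

    Each row maps a metric key to its numeric score, mimicking the
    per-example feedback an experiment produces.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for row in rows:
        for key, score in row.items():
            totals[key] += score
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Fabricated per-example scores for two experiments.
baseline = [{"accuracy": 1, "latency_ok": 1}, {"accuracy": 0, "latency_ok": 1}]
optimized = [{"accuracy": 1, "latency_ok": 1}, {"accuracy": 1, "latency_ok": 0}]

# baseline accuracy averages 0.5 vs optimized 1.0
```

The LangSmith UI does this aggregation for you; a helper like this is only useful for scripting regression checks in CI.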
## Evaluation configuration
Customize evaluation behavior:
```python
results = client.evaluate(
    my_function,
    data="my-dataset",
    evaluators=[evaluator1, evaluator2],
    experiment_prefix="test-run",
    # Concurrency settings
    max_concurrency=5,  # run 5 examples in parallel
    # Metadata
    metadata={
        "version": "2.0",
        "git_commit": "abc123",
        "model": "gpt-4-turbo",
    },
    # Client override
    client=custom_client,
)
```
## Best practices
**Start simple, then iterate.** Begin with basic evaluators (exact match, keyword presence) before adding complex LLM-as-judge evaluators.

**Use multiple evaluators.** Evaluate different aspects: correctness, safety, latency, cost. No single metric tells the whole story.

**Version your evaluators.** As you improve evaluators, track which version was used for each experiment to ensure fair comparisons.
LLM-as-judge evaluators can be expensive and slow. Consider using them on a sample of your dataset first.
## Next steps
- Learn about datasets for organizing test cases
- Explore tracing to understand what evaluators receive
- Review evaluation results in the LangSmith UI