Evaluation is essential for building reliable LLM applications. LangSmith provides a comprehensive framework for running evaluations, from simple assertions to complex LLM-as-judge patterns.

What is evaluation?

Evaluation measures how well your application performs on specific tasks or datasets. In LangSmith, you:
  1. Define a target function (the system you’re testing)
  2. Create or use a dataset (test cases with expected inputs/outputs)
  3. Write evaluators (functions that score outputs)
  4. Run the evaluation and analyze results

Evaluators

An evaluator is a function that takes a run and optionally an example, then returns a score or feedback.

Basic evaluator structure

from langsmith.schemas import Run, Example

def exact_match(run: Run, example: Example) -> dict:
    """Check if output exactly matches expected output."""
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    
    return {
        "key": "exact_match",
        "score": 1 if prediction == expected else 0,
    }

Evaluation result format

Evaluators return a result with these fields:
interface EvaluationResult {
  // Required: unique identifier for this evaluation metric
  key: string;
  
  // Optional: numeric or boolean score
  score?: number | boolean;
  
  // Optional: additional value (can be any type)
  value?: any;
  
  // Optional: explanation or comment
  comment?: string;
  
  // Optional: suggested correction
  correction?: Record<string, unknown>;
  
  // Optional: ID of the run that produced this evaluation
  sourceRunId?: string;
  
  // Optional: ID of the run being evaluated (defaults to root run)
  targetRunId?: string;
}
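In Python, these fields are returned as dictionary keys. A minimal sketch of an evaluator that populates the optional fields (the metric name and comparison logic are illustrative):

```python
def graded_match(run, example) -> dict:
    """Sketch: an evaluator that fills in the optional result fields."""
    prediction = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")

    matched = prediction.strip().lower() == expected.strip().lower()
    return {
        "key": "graded_match",              # required: metric identifier
        "score": 1.0 if matched else 0.0,   # numeric score
        "value": prediction,                # extra payload, any type
        "comment": "case-insensitive comparison",
        # suggested correction only when the prediction was wrong
        "correction": None if matched else {"answer": expected},
    }
```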

Multiple scores from one evaluator

Return multiple metrics from a single evaluator:
def comprehensive_eval(run: Run, example: Example) -> dict:
    """Return multiple evaluation metrics."""
    prediction = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    
    return {
        "results": [
            {
                "key": "exact_match",
                "score": 1 if prediction == expected else 0,
            },
            {
                "key": "length_difference",
                "score": abs(len(prediction) - len(expected)),
            },
            {
                "key": "has_content",
                "score": len(prediction.strip()) > 0,
            },
        ]
    }

Running evaluations

Using client.evaluate()

The evaluate() method runs your target function on a dataset and applies evaluators:
from langsmith import Client

client = Client()

# Define your target function
def my_agent(inputs: dict) -> dict:
    # Your application logic
    question = inputs["question"]
    answer = process_question(question)
    return {"answer": answer}

# Run evaluation
results = client.evaluate(
    my_agent,
    data="my-dataset-name",  # or dataset_id
    evaluators=[exact_match, comprehensive_eval],
    experiment_prefix="agent-v1",
    metadata={"version": "1.0", "model": "gpt-4"},
)

# Access results
for result in results:
    print(f"Example {result['example'].id}: {result['evaluation_results']}")

Evaluating existing runs

Evaluate runs that have already been traced:
results = client.evaluate_existing(
    project_name="my-production-project",
    evaluators=[quality_check, safety_check],
    experiment_prefix="prod-eval",
)

Evaluation patterns

LLM-as-judge evaluators

Use an LLM to evaluate outputs:
from langsmith import traceable
import openai

@traceable(run_type="llm")
def llm_judge(run: Run, example: Example) -> dict:
    """Use GPT-4 to evaluate response quality."""
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    
    prompt = f"""Compare the predicted answer to the expected answer.
    
Predicted: {prediction}
Expected: {expected}

Rate the prediction from 0-10, where 10 is perfect.
Respond with just the number."""
    
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    
    score = float(response.choices[0].message.content.strip())
    
    return {
        "key": "llm_judge_score",
        "score": score,
        "comment": f"LLM rated this {score}/10",
    }
When using LLM-as-judge, wrap your evaluator with @traceable to trace the evaluation LLM calls separately.
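Note that the bare float(...) parse above will raise if the judge replies with anything other than a number. A small parsing helper (a sketch, not part of the LangSmith API) extracts the first number from the reply and clamps it to the expected range, so the evaluator never crashes:

```python
import re

def parse_judge_score(reply: str, lo: float = 0.0, hi: float = 10.0) -> float:
    """Extract the first number from an LLM judge reply and clamp it to [lo, hi].

    Falls back to lo when no number is found, so the evaluator never raises.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return lo
    return max(lo, min(hi, float(match.group())))
```

Inside llm_judge, you could use parse_judge_score(response.choices[0].message.content) in place of the bare float() call.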

Reference-free evaluators

Evaluators don’t always need expected outputs:
def check_safety(run: Run, example: Example) -> dict:
    """Check if response contains unsafe content."""
    response = run.outputs.get("answer", "")
    
    unsafe_keywords = ["dangerous", "harmful", "illegal"]
    is_safe = not any(keyword in response.lower() for keyword in unsafe_keywords)
    
    return {
        "key": "is_safe",
        "score": is_safe,
        "comment": "Response is safe" if is_safe else "Response contains unsafe content",
    }

def check_latency(run: Run, example: Example) -> dict:
    """Check if response was generated quickly enough."""
    latency = (run.end_time - run.start_time).total_seconds()
    
    return {
        "key": "latency_ok",
        "score": latency < 5.0,  # Under 5 seconds
        "value": latency,
        "comment": f"Response took {latency:.2f}s",
    }

Experiment tracking

Each evaluation creates an experiment that tracks all runs and scores:
# Run multiple experiments to compare approaches
results_v1 = client.evaluate(
    agent_v1,
    data="my-dataset",
    evaluators=[accuracy, latency],
    experiment_prefix="baseline",
)

results_v2 = client.evaluate(
    agent_v2,
    data="my-dataset",
    evaluators=[accuracy, latency],
    experiment_prefix="optimized",
)

# View results in the LangSmith UI to compare
Experiments help you:
  • Track performance over time
  • Compare different models or prompts
  • Identify regressions
  • Make data-driven decisions
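For a quick offline comparison, you can aggregate each experiment's per-example scores into per-metric means. This is a sketch: the nested row shape assumed here is illustrative, not the exact SDK return type.

```python
from collections import defaultdict

def mean_scores(rows: list[dict]) -> dict[str, float]:
    """Average each metric's score across examples.

    Assumes each row looks like:
    {"evaluation_results": [{"key": ..., "score": ...}, ...]}
    """
    totals: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        for res in row["evaluation_results"]:
            totals[res["key"]].append(float(res["score"]))
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}
```

Comparing mean_scores(...) for the baseline and optimized runs gives a first regression signal before you dig into individual examples in the UI.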

Evaluation configuration

Customize evaluation behavior:
results = client.evaluate(
    my_function,
    data="my-dataset",
    evaluators=[evaluator1, evaluator2],
    experiment_prefix="test-run",
    
    # Concurrency settings
    max_concurrency=5,  # Run 5 examples in parallel
    
    # Metadata
    metadata={
        "version": "2.0",
        "git_commit": "abc123",
        "model": "gpt-4-turbo",
    },
    
    # Client override
    client=custom_client,
)

Best practices

Start simple, then iterate. Begin with basic evaluators (exact match, keyword presence) before adding complex LLM-as-judge evaluators.
Use multiple evaluators. Evaluate different aspects: correctness, safety, latency, cost. No single metric tells the whole story.
Version your evaluators. As you improve evaluators, track which version was used for each experiment to ensure fair comparisons.
LLM-as-judge evaluators can be expensive and slow. Consider using them on a sample of your dataset first.
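One way to sample is to draw a reproducible random subset of examples before running the expensive judge. A sketch (the sample size and seed are illustrative):

```python
import random

def sample_examples(examples: list, k: int = 50, seed: int = 0) -> list:
    """Return a reproducible random sample of at most k examples."""
    rng = random.Random(seed)
    if len(examples) <= k:
        return list(examples)
    return rng.sample(examples, k)
```

You might then run cheap evaluators on the full dataset and the LLM-as-judge evaluator only on sample_examples(dataset); fixing the seed keeps the sample stable across experiments so scores stay comparable.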

Next steps

  • Learn about datasets for organizing test cases
  • Explore tracing to understand what evaluators receive
  • Review evaluation results in the LangSmith UI
