Evaluation is essential for building reliable LLM applications. LangSmith provides a comprehensive framework for running evaluations, from simple assertions to complex LLM-as-judge patterns.

What is evaluation?

Evaluation measures how well your application performs on specific tasks or datasets. In LangSmith, you:
  1. Define a target function (the system you’re testing)
  2. Create or use a dataset (test cases with expected inputs/outputs)
  3. Write evaluators (functions that score outputs)
  4. Run the evaluation and analyze results

Evaluators

An evaluator is a function that takes a run and optionally an example, then returns a score or feedback.

Basic evaluator structure

from langsmith.schemas import Run, Example

def exact_match(run: Run, example: Example) -> dict:
    """Check if output exactly matches expected output."""
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    
    return {
        "key": "exact_match",
        "score": 1 if prediction == expected else 0,
    }

Evaluation result format

Evaluators return a result with these fields:
interface EvaluationResult {
  // Required: unique identifier for this evaluation metric
  key: string;
  
  // Optional: numeric or boolean score
  score?: number | boolean;
  
  // Optional: additional value (can be any type)
  value?: any;
  
  // Optional: explanation or comment
  comment?: string;
  
  // Optional: suggested correction
  correction?: Record<string, unknown>;
  
  // Optional: ID of the run that produced this evaluation
  sourceRunId?: string;
  
  // Optional: ID of the run being evaluated (defaults to root run)
  targetRunId?: string;
}
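In Python, these fields are returned as dictionary keys. A minimal sketch of an evaluator that populates the optional fields (the metric name and comparison logic are illustrative):

```python
def graded_match(run, example) -> dict:
    """Sketch: an evaluator that fills in the optional result fields."""
    prediction = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")

    matched = prediction.strip().lower() == expected.strip().lower()
    return {
        "key": "graded_match",              # required: metric identifier
        "score": 1.0 if matched else 0.0,   # numeric score
        "value": prediction,                # extra payload, any type
        "comment": "case-insensitive comparison",
        # suggested correction only when the prediction was wrong
        "correction": None if matched else {"answer": expected},
    }
```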

Multiple scores from one evaluator

Return multiple metrics from a single evaluator:
def comprehensive_eval(run: Run, example: Example) -> dict:
    """Return multiple evaluation metrics."""
    prediction = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    
    return {
        "results": [
            {
                "key": "exact_match",
                "score": 1 if prediction == expected else 0,
            },
            {
                "key": "length_difference",
                "score": abs(len(prediction) - len(expected)),
            },
            {
                "key": "has_content",
                "score": len(prediction.strip()) > 0,
            },
        ]
    }

Running evaluations

Using client.evaluate()

The evaluate() method runs your target function on a dataset and applies evaluators:
from langsmith import Client

client = Client()

# Define your target function
def my_agent(inputs: dict) -> dict:
    # Your application logic
    question = inputs["question"]
    answer = process_question(question)
    return {"answer": answer}

# Run evaluation
results = client.evaluate(
    my_agent,
    data="my-dataset-name",  # or dataset_id
    evaluators=[exact_match, comprehensive_eval],
    experiment_prefix="agent-v1",
    metadata={"version": "1.0", "model": "gpt-4"},
)

# Access results
for result in results:
    print(f"Example {result['example'].id}: {result['evaluation_results']}")

Evaluating existing runs

Evaluate runs that have already been traced:
results = client.evaluate_existing(
    project_name="my-production-project",
    evaluators=[quality_check, safety_check],
    experiment_prefix="prod-eval",
)

Evaluation patterns

LLM-as-judge evaluators

Use an LLM to evaluate outputs:
from langsmith import traceable
import openai

@traceable(run_type="llm")
def llm_judge(run: Run, example: Example) -> dict:
    """Use GPT-4 to evaluate response quality."""
    prediction = run.outputs.get("answer")
    expected = example.outputs.get("answer")
    
    prompt = f"""Compare the predicted answer to the expected answer.
    
Predicted: {prediction}
Expected: {expected}

Rate the prediction from 0-10, where 10 is perfect.
Respond with just the number."""
    
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    
    score = float(response.choices[0].message.content.strip())
    
    return {
        "key": "llm_judge_score",
        "score": score,
        "comment": f"LLM rated this {score}/10",
    }
When using LLM-as-judge, wrap your evaluator with @traceable to trace the evaluation LLM calls separately.
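Note that the bare float(...) parse above will raise if the judge replies with anything other than a number. A small parsing helper (a sketch, not part of the LangSmith API) extracts the first number from the reply and clamps it to the expected range, so the evaluator never crashes:

```python
import re

def parse_judge_score(reply: str, lo: float = 0.0, hi: float = 10.0) -> float:
    """Extract the first number from an LLM judge reply and clamp it to [lo, hi].

    Falls back to lo when no number is found, so the evaluator never raises.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return lo
    return max(lo, min(hi, float(match.group())))
```

Inside llm_judge, you could use parse_judge_score(response.choices[0].message.content) in place of the bare float() call.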

Reference-free evaluators

Evaluators don’t always need expected outputs:
def check_safety(run: Run, example: Example) -> dict:
    """Check if response contains unsafe content."""
    response = run.outputs.get("answer", "")
    
    unsafe_keywords = ["dangerous", "harmful", "illegal"]
    is_safe = not any(keyword in response.lower() for keyword in unsafe_keywords)
    
    return {
        "key": "is_safe",
        "score": is_safe,
        "comment": "Response is safe" if is_safe else "Response contains unsafe content",
    }

def check_latency(run: Run, example: Example) -> dict:
    """Check if response was generated quickly enough."""
    latency = (run.end_time - run.start_time).total_seconds()
    
    return {
        "key": "latency_ok",
        "score": latency < 5.0,  # Under 5 seconds
        "value": latency,
        "comment": f"Response took {latency:.2f}s",
    }

Experiment tracking

Each evaluation creates an experiment that tracks all runs and scores:
# Run multiple experiments to compare approaches
results_v1 = client.evaluate(
    agent_v1,
    data="my-dataset",
    evaluators=[accuracy, latency],
    experiment_prefix="baseline",
)

results_v2 = client.evaluate(
    agent_v2,
    data="my-dataset",
    evaluators=[accuracy, latency],
    experiment_prefix="optimized",
)

# View results in the LangSmith UI to compare
Experiments help you:
  • Track performance over time
  • Compare different models or prompts
  • Identify regressions
  • Make data-driven decisions
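For a quick offline comparison, you can aggregate each experiment's per-example scores into per-metric means. This is a sketch: the nested row shape assumed here is illustrative, not the exact SDK return type.

```python
from collections import defaultdict

def mean_scores(rows: list[dict]) -> dict[str, float]:
    """Average each metric's score across examples.

    Assumes each row looks like:
    {"evaluation_results": [{"key": ..., "score": ...}, ...]}
    """
    totals: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        for res in row["evaluation_results"]:
            totals[res["key"]].append(float(res["score"]))
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}
```

Comparing mean_scores(...) for the baseline and optimized runs gives a first regression signal before you dig into individual examples in the UI.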

Evaluation configuration

Customize evaluation behavior:
results = client.evaluate(
    my_function,
    data="my-dataset",
    evaluators=[evaluator1, evaluator2],
    experiment_prefix="test-run",
    
    # Concurrency settings
    max_concurrency=5,  # Run 5 examples in parallel
    
    # Metadata
    metadata={
        "version": "2.0",
        "git_commit": "abc123",
        "model": "gpt-4-turbo",
    },
    
    # Client override
    client=custom_client,
)

Best practices

Start simple, then iterate. Begin with basic evaluators (exact match, keyword presence) before adding complex LLM-as-judge evaluators.
Use multiple evaluators. Evaluate different aspects: correctness, safety, latency, cost. No single metric tells the whole story.
Version your evaluators. As you improve evaluators, track which version was used for each experiment to ensure fair comparisons.
LLM-as-judge evaluators can be expensive and slow. Consider using them on a sample of your dataset first.
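One way to sample is to draw a reproducible random subset of examples before running the expensive judge. A sketch (the sample size and seed are illustrative):

```python
import random

def sample_examples(examples: list, k: int = 50, seed: int = 0) -> list:
    """Return a reproducible random sample of at most k examples."""
    rng = random.Random(seed)
    if len(examples) <= k:
        return list(examples)
    return rng.sample(examples, k)
```

You might then run cheap evaluators on the full dataset and the LLM-as-judge evaluator only on sample_examples(dataset); fixing the seed keeps the sample stable across experiments so scores stay comparable.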

Next steps

  • Learn about datasets for organizing test cases
  • Explore tracing to understand what evaluators receive
  • Review evaluation results in the LangSmith UI
