The LangSmith SDK provides powerful evaluation functions to test your LLM applications against datasets and compare different versions.

evaluate()

Run an evaluation experiment on a dataset.
import { evaluate } from "langsmith/evaluation";

const results = await evaluate(
  async (input) => {
    return await myApp(input.query);
  },
  {
    data: "my-dataset",
    evaluators: [myEvaluator],
    experimentPrefix: "my-experiment",
  }
);

Signature

function evaluate(
  target: TargetT,
  options: EvaluateOptions
): Promise<ExperimentResults>
target (TargetT, required)
The target function to evaluate. Can be:
  • An async function: (input: TInput, config?: TargetConfigT) => Promise<TOutput>
  • A sync function: (input: TInput, config?: TargetConfigT) => TOutput
  • An object with an invoke method

options (EvaluateOptions, required)
Evaluation options:

data (DataT, required)
The dataset to evaluate on. Can be:
  • A dataset name (string)
  • An array of examples
  • An async iterable of examples
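For quick local runs you can pass examples inline instead of referencing a saved dataset; a minimal sketch, assuming each example is an object with an `inputs` field (passed to the target) and an `outputs` field (used as reference outputs by evaluators):

```typescript
// Hypothetical inline dataset: each example carries `inputs` for the
// target function and `outputs` as the reference for evaluators.
const examples = [
  { inputs: { query: "capital of France" }, outputs: { answer: "Paris" } },
  { inputs: { query: "2 + 2" }, outputs: { answer: "4" } },
];

// Passed as `data` in place of a dataset name:
// await evaluate(target, { data: examples, evaluators: [myEvaluator] });
```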
evaluators (EvaluatorT[])
A list of evaluators to run on each example.

summaryEvaluators (SummaryEvaluatorT[])
A list of summary evaluators to run on the entire dataset.

experimentPrefix (string)
A prefix for your experiment name.

description (string)
A free-form description of the experiment.

metadata (KVMap)
Metadata to attach to the experiment.

maxConcurrency (number)
The maximum concurrency for predictions and evaluations.

targetConcurrency (number)
The maximum number of concurrent predictions. Defaults to maxConcurrency when set.

evaluationConcurrency (number)
The maximum number of concurrent evaluators. Defaults to maxConcurrency when set.

client (Client)
The LangSmith client to use.

numRepetitions (number, default: 1)
The number of repetitions to perform. Each example is run this many times.

includeAttachments (boolean, default: false)
Whether to use attachments for the experiment.
ExperimentResults (object)
Results of the evaluation:

experimentName (string)
The name of the experiment.

results (AsyncGenerator<ExperimentResultRow>)
An async generator of result rows.

summaryResults (object)
Summary evaluation results.

Example evaluators

Row-level evaluator

Evaluators run on each example and return feedback:
const exactMatchEvaluator = async ({ outputs, referenceOutputs }) => {
  return {
    key: "exact_match",
    // Optional chaining guards against examples without reference outputs
    score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
  };
};

const results = await evaluate(
  (input) => myApp(input.query),
  {
    data: "my-dataset",
    evaluators: [exactMatchEvaluator],
  }
);
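Because row-level evaluators are plain functions, you can sanity-check them with mock data before launching a full experiment; a self-contained sketch of the exact-match logic above, no client or network required:

```typescript
const exactMatchEvaluator = async (
  { outputs, referenceOutputs }: { outputs: any; referenceOutputs?: any }
) => ({
  key: "exact_match",
  score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
});

// Exercise both branches with mock data.
const hit = await exactMatchEvaluator({
  outputs: { answer: "42" },
  referenceOutputs: { answer: "42" },
});
const miss = await exactMatchEvaluator({
  outputs: { answer: "41" },
  referenceOutputs: { answer: "42" },
});
// hit.score === 1, miss.score === 0
```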

Summary evaluator

Summary evaluators run on the entire dataset:
const passRateEvaluator = ({ runs, examples }) => {
  const passCount = runs.filter(run => run.outputs?.passed).length;
  return {
    key: "pass_rate",
    score: passCount / runs.length,
  };
};

const results = await evaluate(
  (input) => myApp(input.query),
  {
    data: "my-dataset",
    summaryEvaluators: [passRateEvaluator],
  }
);
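The same holds for summary evaluators; a self-contained check of the pass-rate logic above against mock runs (the run shape here is a simplified assumption, not the SDK's full Run type):

```typescript
const passRateEvaluator = (
  { runs }: { runs: Array<{ outputs?: { passed?: boolean } }> }
) => {
  const passCount = runs.filter((run) => run.outputs?.passed).length;
  return { key: "pass_rate", score: passCount / runs.length };
};

// Mock runs: 2 of 4 passed.
const summary = passRateEvaluator({
  runs: [
    { outputs: { passed: true } },
    { outputs: { passed: false } },
    { outputs: { passed: true } },
    { outputs: {} },
  ],
});
// summary.score === 0.5
```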

Evaluator return types

Evaluators can return:
  1. Single evaluation result:
return {
  key: "accuracy",
  score: 0.95,
  comment: "Very accurate",
};
  2. Multiple results:
return {
  results: [
    { key: "accuracy", score: 0.95 },
    { key: "relevance", score: 0.88 },
  ],
};
  3. Array of results:
return [
  { key: "accuracy", score: 0.95 },
  { key: "relevance", score: 0.88 },
];
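If you post-process feedback yourself, the three shapes can be folded into one flat array; a hedged sketch (the union type below is an illustration, not the SDK's exact typing):

```typescript
type EvalResult = { key: string; score: number; comment?: string };
type EvaluatorReturn = EvalResult | { results: EvalResult[] } | EvalResult[];

// Normalize any of the three return shapes into a flat array of results.
function normalizeResults(ret: EvaluatorReturn): EvalResult[] {
  if (Array.isArray(ret)) return ret;          // shape 3: array of results
  if ("results" in ret) return ret.results;    // shape 2: { results: [...] }
  return [ret];                                // shape 1: single result
}
```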

evaluateComparative()

Compare multiple experiments or model versions.
import { evaluateComparative } from "langsmith/evaluation";

const results = await evaluateComparative(
  ["experiment-1", "experiment-2"],
  {
    evaluators: [comparativeEvaluator],
    experimentPrefix: "comparison",
  }
);

Comparative evaluator

const comparativeEvaluator = ({ outputs }) => {
  // outputs is a map of experiment name to that experiment's output
  const scores = {};

  // evaluateOutput is your own scoring function (not shown here)
  for (const [name, output] of Object.entries(outputs)) {
    scores[name] = evaluateOutput(output);
  }
  
  return {
    key: "quality",
    scores,
  };
};
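To make the pattern concrete, here is a self-contained sketch with a hypothetical `evaluateOutput` that simply prefers shorter answers (a stand-in for real scoring logic, not anything the SDK provides):

```typescript
type ComparativeArgs = { outputs: Record<string, { answer: string }> };

// Hypothetical scorer: shorter answers score higher. Replace with real logic.
const evaluateOutput = (output: { answer: string }) =>
  1 / (1 + output.answer.length);

const comparativeEvaluator = ({ outputs }: ComparativeArgs) => {
  const scores: Record<string, number> = {};
  for (const [name, output] of Object.entries(outputs)) {
    scores[name] = evaluateOutput(output);
  }
  return { key: "quality", scores };
};

// Mock outputs keyed by experiment name:
const { scores } = comparativeEvaluator({
  outputs: {
    "experiment-1": { answer: "42" },
    "experiment-2": { answer: "forty-two" },
  },
});
// experiment-1 scores higher (shorter answer)
```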

Complete example

import { evaluate } from "langsmith/evaluation";
import { Client } from "langsmith";

const client = new Client();

// Define your target
const myApp = async (input: { query: string }) => {
  // Your application logic
  return { answer: "42" };
};

// Define evaluators
const exactMatch = async ({ outputs, referenceOutputs }) => {
  return {
    key: "exact_match",
    score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
  };
};

const answerLength = async ({ outputs }) => {
  return {
    key: "answer_length",
    score: outputs.answer.length,
  };
};

const passRate = ({ runs }) => {
  const passed = runs.filter(r => !r.error).length;
  return {
    key: "pass_rate",
    score: passed / runs.length,
  };
};

// Run evaluation
const results = await evaluate(
  myApp,
  {
    data: "my-dataset",
    evaluators: [exactMatch, answerLength],
    summaryEvaluators: [passRate],
    experimentPrefix: "my-experiment",
    description: "Testing my app",
    metadata: { version: "1.0" },
    maxConcurrency: 10,
    client,
  }
);

// Stream results
for await (const result of results.results) {
  console.log(result);
}
