The LangSmith SDK provides powerful evaluation functions to test your LLM applications against datasets and compare different versions.

evaluate()

Run an evaluation experiment on a dataset.
import { evaluate } from "langsmith/evaluation";

const results = await evaluate(
  async (input) => {
    return await myApp(input.query);
  },
  {
    data: "my-dataset",
    evaluators: [myEvaluator],
    experimentPrefix: "my-experiment",
  }
);

Signature

function evaluate(
  target: TargetT,
  options: EvaluateOptions
): Promise<ExperimentResults>
target (TargetT, required)
The target function to evaluate. Can be:
  • An async function: (input: TInput, config?: TargetConfigT) => Promise<TOutput>
  • A sync function: (input: TInput, config?: TargetConfigT) => TOutput
  • An object with an invoke method

options (EvaluateOptions, required)
Evaluation options:

data (DataT, required)
The dataset to evaluate on. Can be:
  • A dataset name (string)
  • An array of examples
  • An async iterable of examples
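For quick local runs you can pass examples inline instead of referencing a saved dataset; a minimal sketch, assuming each example is an object with an `inputs` field (passed to the target) and an `outputs` field (used as reference outputs by evaluators):

```typescript
// Hypothetical inline dataset: each example carries `inputs` for the
// target function and `outputs` as the reference for evaluators.
const examples = [
  { inputs: { query: "capital of France" }, outputs: { answer: "Paris" } },
  { inputs: { query: "2 + 2" }, outputs: { answer: "4" } },
];

// Passed as `data` in place of a dataset name:
// await evaluate(target, { data: examples, evaluators: [myEvaluator] });
```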
evaluators (EvaluatorT[])
A list of evaluators to run on each example.

summaryEvaluators (SummaryEvaluatorT[])
A list of summary evaluators to run on the entire dataset.

experimentPrefix (string)
A prefix for your experiment name.

description (string)
A free-form description of the experiment.

metadata (KVMap)
Metadata to attach to the experiment.

maxConcurrency (number)
The maximum concurrency for predictions and evaluations.

targetConcurrency (number)
The maximum number of concurrent predictions. Defaults to maxConcurrency when set.

evaluationConcurrency (number)
The maximum number of concurrent evaluators. Defaults to maxConcurrency when set.

client (Client)
The LangSmith client to use.

numRepetitions (number, default: 1)
The number of repetitions to perform. Each example is run this many times.

includeAttachments (boolean, default: false)
Whether to use attachments for the experiment.
ExperimentResults (object)
Results of the evaluation:

experimentName (string)
The name of the experiment.

results (AsyncGenerator<ExperimentResultRow>)
An async generator of result rows.

summaryResults (object)
Summary evaluation results.

Example evaluators

Row-level evaluator

Evaluators run on each example and return feedback:
const exactMatchEvaluator = async ({ outputs, referenceOutputs }) => {
  return {
    key: "exact_match",
    // Optional chaining guards against examples without reference outputs
    score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
  };
};

const results = await evaluate(
  (input) => myApp(input.query),
  {
    data: "my-dataset",
    evaluators: [exactMatchEvaluator],
  }
);
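Because row-level evaluators are plain functions, you can sanity-check them with mock data before launching a full experiment; a self-contained sketch of the exact-match logic above, no client or network required:

```typescript
const exactMatchEvaluator = async (
  { outputs, referenceOutputs }: { outputs: any; referenceOutputs?: any }
) => ({
  key: "exact_match",
  score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
});

// Exercise both branches with mock data.
const hit = await exactMatchEvaluator({
  outputs: { answer: "42" },
  referenceOutputs: { answer: "42" },
});
const miss = await exactMatchEvaluator({
  outputs: { answer: "41" },
  referenceOutputs: { answer: "42" },
});
// hit.score === 1, miss.score === 0
```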

Summary evaluator

Summary evaluators run on the entire dataset:
const passRateEvaluator = ({ runs, examples }) => {
  const passCount = runs.filter(run => run.outputs?.passed).length;
  return {
    key: "pass_rate",
    score: passCount / runs.length,
  };
};

const results = await evaluate(
  (input) => myApp(input.query),
  {
    data: "my-dataset",
    summaryEvaluators: [passRateEvaluator],
  }
);
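The same holds for summary evaluators; a self-contained check of the pass-rate logic above against mock runs (the run shape here is a simplified assumption, not the SDK's full Run type):

```typescript
const passRateEvaluator = (
  { runs }: { runs: Array<{ outputs?: { passed?: boolean } }> }
) => {
  const passCount = runs.filter((run) => run.outputs?.passed).length;
  return { key: "pass_rate", score: passCount / runs.length };
};

// Mock runs: 2 of 4 passed.
const summary = passRateEvaluator({
  runs: [
    { outputs: { passed: true } },
    { outputs: { passed: false } },
    { outputs: { passed: true } },
    { outputs: {} },
  ],
});
// summary.score === 0.5
```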

Evaluator return types

Evaluators can return:
  1. Single evaluation result:
return {
  key: "accuracy",
  score: 0.95,
  comment: "Very accurate",
};
  2. Multiple results:
return {
  results: [
    { key: "accuracy", score: 0.95 },
    { key: "relevance", score: 0.88 },
  ],
};
  3. Array of results:
return [
  { key: "accuracy", score: 0.95 },
  { key: "relevance", score: 0.88 },
];
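If you post-process feedback yourself, the three shapes can be folded into one flat array; a hedged sketch (the union type below is an illustration, not the SDK's exact typing):

```typescript
type EvalResult = { key: string; score: number; comment?: string };
type EvaluatorReturn = EvalResult | { results: EvalResult[] } | EvalResult[];

// Normalize any of the three return shapes into a flat array of results.
function normalizeResults(ret: EvaluatorReturn): EvalResult[] {
  if (Array.isArray(ret)) return ret;          // shape 3: array of results
  if ("results" in ret) return ret.results;    // shape 2: { results: [...] }
  return [ret];                                // shape 1: single result
}
```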

evaluateComparative()

Compare multiple experiments or model versions.
import { evaluateComparative } from "langsmith/evaluation";

const results = await evaluateComparative(
  ["experiment-1", "experiment-2"],
  {
    evaluators: [comparativeEvaluator],
    experimentPrefix: "comparison",
  }
);

Comparative evaluator

const comparativeEvaluator = ({ outputs }) => {
  // outputs is a map of experiment name to that experiment's output
  const scores = {};

  // evaluateOutput is your own scoring function (not shown here)
  for (const [name, output] of Object.entries(outputs)) {
    scores[name] = evaluateOutput(output);
  }
  
  return {
    key: "quality",
    scores,
  };
};
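To make the pattern concrete, here is a self-contained sketch with a hypothetical `evaluateOutput` that simply prefers shorter answers (a stand-in for real scoring logic, not anything the SDK provides):

```typescript
type ComparativeArgs = { outputs: Record<string, { answer: string }> };

// Hypothetical scorer: shorter answers score higher. Replace with real logic.
const evaluateOutput = (output: { answer: string }) =>
  1 / (1 + output.answer.length);

const comparativeEvaluator = ({ outputs }: ComparativeArgs) => {
  const scores: Record<string, number> = {};
  for (const [name, output] of Object.entries(outputs)) {
    scores[name] = evaluateOutput(output);
  }
  return { key: "quality", scores };
};

// Mock outputs keyed by experiment name:
const { scores } = comparativeEvaluator({
  outputs: {
    "experiment-1": { answer: "42" },
    "experiment-2": { answer: "forty-two" },
  },
});
// experiment-1 scores higher (shorter answer)
```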

Complete example

import { evaluate } from "langsmith/evaluation";
import { Client } from "langsmith";

const client = new Client();

// Define your target
const myApp = async (input: { query: string }) => {
  // Your application logic
  return { answer: "42" };
};

// Define evaluators
const exactMatch = async ({ outputs, referenceOutputs }) => {
  return {
    key: "exact_match",
    score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
  };
};

const answerLength = async ({ outputs }) => {
  return {
    key: "answer_length",
    score: outputs.answer.length,
  };
};

const passRate = ({ runs }) => {
  const passed = runs.filter(r => !r.error).length;
  return {
    key: "pass_rate",
    score: passed / runs.length,
  };
};

// Run evaluation
const results = await evaluate(
  myApp,
  {
    data: "my-dataset",
    evaluators: [exactMatch, answerLength],
    summaryEvaluators: [passRate],
    experimentPrefix: "my-experiment",
    description: "Testing my app",
    metadata: { version: "1.0" },
    maxConcurrency: 10,
    client,
  }
);

// Stream results
for await (const result of results.results) {
  console.log(result);
}
