The LangSmith SDK provides several types for creating custom evaluators that can assess your LLM application’s outputs.

EvaluationResult

The result returned by an evaluator.
interface EvaluationResult {
  key: string;
  score?: number | boolean | null;
  value?: number | boolean | string | object | null;
  comment?: string;
  correction?: Record<string, unknown>;
  evaluatorInfo?: Record<string, unknown>;
  sourceRunId?: string;
  targetRunId?: string;
  feedbackConfig?: FeedbackConfig;
}
key
string
required
The name of the evaluation metric (e.g., “accuracy”, “relevance”).
score
number | boolean | null
The numeric or boolean score for the evaluation.
value
number | boolean | string | object | null
The value of the evaluation result. Can be used for non-numeric metrics.
comment
string
A comment or explanation for the evaluation.
correction
Record<string, unknown>
A correction record if the output should be modified.
evaluatorInfo
Record<string, unknown>
Information about the evaluator that produced this result.
sourceRunId
string
The source run ID of the evaluation result. If set, a link to the source run will be available in the UI.
targetRunId
string
The target run ID of the evaluation result. If not set, the target run ID is assumed to be the root of the trace.
feedbackConfig
FeedbackConfig
Configuration that defines how a feedback key should be interpreted.

EvaluationResults

Batched evaluation results, for evaluators that return multiple scores per run.
interface EvaluationResults {
  results: Array<EvaluationResult>;
}

RunEvaluator

Interface for evaluators that assess individual runs.
interface RunEvaluator {
  evaluateRun(
    run: Run,
    example?: Example,
    options?: Partial<RunTreeConfig>
  ): Promise<EvaluationResult | EvaluationResults>;
}

Creating a custom RunEvaluator

class MyCustomEvaluator implements RunEvaluator {
  async evaluateRun(
    run: Run,
    example?: Example
  ): Promise<EvaluationResult> {
    // Your evaluation logic
    const score = this.calculateScore(run.outputs, example?.outputs);
    
    return {
      key: "custom_metric",
      score,
      comment: "Evaluation complete",
    };
  }
  
  private calculateScore(
    outputs?: Record<string, unknown>,
    referenceOutputs?: Record<string, unknown>
  ): number {
    // Your scoring logic
    return 0.85;
  }
}

const evaluator = new MyCustomEvaluator();
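
To see the interface in action, here is a self-contained sketch. The `Run`, `Example`, and `EvaluationResult` shapes below are minimal local stand-ins for the SDK types (so the snippet runs without imports), and `ContainsReferenceEvaluator` is a hypothetical metric, not part of the SDK:

```typescript
// Minimal stand-ins for the SDK's Run, Example, and EvaluationResult
// types, so this sketch runs without importing langsmith.
interface Run {
  outputs?: Record<string, unknown>;
}
interface Example {
  outputs?: Record<string, unknown>;
}
interface EvaluationResult {
  key: string;
  score?: number | boolean | null;
  comment?: string;
}

// Hypothetical evaluator: scores 1 when the run's answer mentions the
// reference answer, 0 otherwise.
class ContainsReferenceEvaluator {
  async evaluateRun(run: Run, example?: Example): Promise<EvaluationResult> {
    const answer = String(run.outputs?.answer ?? "");
    const reference = String(example?.outputs?.answer ?? "");
    return {
      key: "contains_reference",
      score: reference !== "" && answer.includes(reference) ? 1 : 0,
    };
  }
}

const run: Run = { outputs: { answer: "Paris is the capital of France." } };
const example: Example = { outputs: { answer: "Paris" } };

new ContainsReferenceEvaluator()
  .evaluateRun(run, example)
  .then((result) => console.log(result.key, result.score)); // contains_reference 1
```

In a real experiment the SDK supplies the `Run` and `Example` objects; the evaluator only needs to implement `evaluateRun`.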

runEvaluator()

Wrapper to convert a function into a RunEvaluator.
import { runEvaluator } from "langsmith/evaluation";

const myEvaluator = runEvaluator(
  async ({ run, example, inputs, outputs, referenceOutputs }) => {
    // Your evaluation logic
    return {
      key: "accuracy",
      score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
    };
  }
);

Function evaluator signatures

Evaluators can accept arguments in multiple ways:
const evaluator = async ({
  run,
  example,
  inputs,
  outputs,
  referenceOutputs,
  attachments,
}) => {
  return {
    key: "metric",
    score: calculateScore(outputs, referenceOutputs),
  };
};

Legacy run/example parameters

const evaluator = async (run: Run, example?: Example) => {
  return {
    key: "metric",
    score: calculateScore(run.outputs, example?.outputs),
  };
};

Category

Represents a single category for categorical feedback.
interface Category {
  value?: number;
  label: string;
}

FeedbackConfig

Configuration for how feedback should be interpreted.
interface FeedbackConfig {
  type: "continuous" | "categorical" | "freeform";
  min?: number | null;
  max?: number | null;
  categories?: Category[] | null;
}

Continuous feedback

const result: EvaluationResult = {
  key: "accuracy",
  score: 0.85,
  feedbackConfig: {
    type: "continuous",
    min: 0,
    max: 1,
  },
};

Categorical feedback

const result: EvaluationResult = {
  key: "sentiment",
  score: 1,
  feedbackConfig: {
    type: "categorical",
    categories: [
      { value: 0, label: "negative" },
      { value: 1, label: "neutral" },
      { value: 2, label: "positive" },
    ],
  },
};

Freeform feedback

const result: EvaluationResult = {
  key: "comments",
  value: "This is a great response",
  feedbackConfig: {
    type: "freeform",
  },
};
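
To make the three types concrete, here is a hypothetical helper (not part of the SDK) that checks whether a numeric score is consistent with a given `FeedbackConfig`; the interfaces are restated locally so the sketch is self-contained:

```typescript
// Local copies of the Category and FeedbackConfig shapes from above.
interface Category {
  value?: number;
  label: string;
}
interface FeedbackConfig {
  type: "continuous" | "categorical" | "freeform";
  min?: number | null;
  max?: number | null;
  categories?: Category[] | null;
}

// Hypothetical validator illustrating how each feedback type is
// interpreted; this is a sketch, not SDK behavior.
function isScoreValid(score: number | null, config: FeedbackConfig): boolean {
  switch (config.type) {
    case "continuous":
      // Continuous scores must fall within [min, max] when bounds are set.
      if (score == null) return false;
      if (config.min != null && score < config.min) return false;
      if (config.max != null && score > config.max) return false;
      return true;
    case "categorical":
      // Categorical scores must match one of the declared category values.
      return (config.categories ?? []).some((c) => c.value === score);
    case "freeform":
      // Freeform feedback carries text in `value`, not a numeric score.
      return score == null;
  }
}

console.log(isScoreValid(0.85, { type: "continuous", min: 0, max: 1 })); // true
console.log(isScoreValid(1.5, { type: "continuous", min: 0, max: 1 })); // false
```
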

Example evaluators

Exact match evaluator

const exactMatch = async ({ outputs, referenceOutputs }) => {
  return {
    key: "exact_match",
    score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
    feedbackConfig: {
      type: "categorical",
      categories: [
        { value: 0, label: "incorrect" },
        { value: 1, label: "correct" },
      ],
    },
  };
};

LLM-as-judge evaluator

import OpenAI from "openai";

const llmJudge = async ({ inputs, outputs, referenceOutputs }) => {
  const openai = new OpenAI();
  
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: "Rate the quality of this answer from 0-10.",
      },
      {
        role: "user",
        content: `Question: ${inputs.query}\nAnswer: ${outputs.answer}\nReference: ${referenceOutputs?.answer}`,
      },
    ],
  });
  
  const content = response.choices[0].message.content ?? "";
  const score = parseInt(content || "0", 10);
  
  return {
    key: "llm_judge_quality",
    score: score / 10,
    comment: content || undefined,
    feedbackConfig: {
      type: "continuous",
      min: 0,
      max: 1,
    },
  };
};

Multi-metric evaluator

const multiMetric = async ({ outputs, referenceOutputs }) => {
  return {
    results: [
      {
        key: "exact_match",
        score: outputs.answer === referenceOutputs?.answer ? 1 : 0,
      },
      {
        key: "answer_length",
        score: outputs.answer.length,
      },
      {
        key: "has_sources",
        score: outputs.sources?.length > 0 ? 1 : 0,
      },
    ],
  };
};
