Overview

Evaluators are functions that assess agent runs based on specific criteria. They integrate with LangSmith’s evaluation framework to provide automated quality checks and comparative analysis.

Schema Validation

schemaBeforeQuery()

Verifies that the agent checks the database schema before executing data queries.

Parameters:
  • run (Run, required) - LangSmith run object containing the execution trace

Returns:
  • key (string) - Evaluator identifier: "schema_before_query"
  • score (number) - Binary score: 1 if the agent checked the schema first, 0 otherwise
  • comment (string) - Explanation of the evaluation result
import type { Run } from "langsmith/schemas";

export function schemaBeforeQuery(
  run: Run,
): { key: string; score: number; comment: string } {
  const toolCalls = extractToolCalls(run);
  const dbCalls = toolCalls.filter((tc) => tc.name === "query_database");

  // No database calls -- nothing to check
  if (dbCalls.length === 0) {
    return {
      key: "schema_before_query",
      score: 1,
      comment: "No query_database calls -- schema check not applicable",
    };
  }

  // Check if any schema query appears before the first non-schema data query
  let seenSchemaCheck = false;
  for (const tc of dbCalls) {
    const sql = tc.arguments ?? "";
    if (isSchemaQuery(sql)) {
      seenSchemaCheck = true;
    } else {
      // First real data query -- was there a schema check before it?
      if (!seenSchemaCheck) {
        return {
          key: "schema_before_query",
          score: 0,
          comment: `Agent queried data without checking schema first. First query: ${sql.slice(0, 200)}`,
        };
      }
      break;
    }
  }

  return { 
    key: "schema_before_query", 
    score: 1, 
    comment: "Agent checked schema before querying data" 
  };
}

Schema Query Detection

The evaluator identifies schema queries using regex patterns:
const SCHEMA_PATTERNS = [
  /PRAGMA\s+table_info/i,
  /SELECT\s+.*FROM\s+sqlite_master/i,
  /PRAGMA\s+database_list/i,
  /\.schema/i,
];

function isSchemaQuery(sql: string): boolean {
  return SCHEMA_PATTERNS.some((pattern) => pattern.test(sql));
}
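A quick spot check of these patterns (the sample queries below are illustrative inputs only):

```typescript
const SCHEMA_PATTERNS = [
  /PRAGMA\s+table_info/i,
  /SELECT\s+.*FROM\s+sqlite_master/i,
  /PRAGMA\s+database_list/i,
  /\.schema/i,
];

function isSchemaQuery(sql: string): boolean {
  return SCHEMA_PATTERNS.some((pattern) => pattern.test(sql));
}

// Schema introspection matches; plain data queries do not.
console.log(isSchemaQuery("PRAGMA table_info(orders)"));     // true
console.log(isSchemaQuery("SELECT sql FROM sqlite_master")); // true
console.log(isSchemaQuery("SELECT * FROM orders LIMIT 5"));  // false
```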

Tool Call Extraction

Helper function to extract tool calls from run outputs:
interface ToolCall {
  name: string;
  arguments: string;
}

function extractToolCalls(run: Run): ToolCall[] {
  const runOutputs = run.outputs ?? {};
  const messages: any[] = runOutputs.messages ?? [];

  const toolCalls: ToolCall[] = [];
  for (const msg of messages) {
    if (typeof msg === "object" && msg !== null) {
      for (const tc of msg.tool_calls ?? []) {
        const func = tc.function ?? {};
        toolCalls.push({
          name: func.name ?? "",
          arguments: func.arguments ?? "",
        });
      }
    }
  }
  return toolCalls;
}
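To see the extraction in action, a mock run can stand in for a real trace. Note that the `outputs.messages` shape below is an assumption based on the OpenAI-style `tool_calls` format the helper expects, not a guaranteed LangSmith payload, and the run parameter is typed loosely here to keep the sketch self-contained:

```typescript
interface ToolCall {
  name: string;
  arguments: string;
}

// Same logic as above, with a loose run type instead of the langsmith Run import.
function extractToolCalls(run: { outputs?: Record<string, any> }): ToolCall[] {
  const messages: any[] = (run.outputs ?? {}).messages ?? [];
  const toolCalls: ToolCall[] = [];
  for (const msg of messages) {
    if (typeof msg === "object" && msg !== null) {
      for (const tc of msg.tool_calls ?? []) {
        const func = tc.function ?? {};
        toolCalls.push({ name: func.name ?? "", arguments: func.arguments ?? "" });
      }
    }
  }
  return toolCalls;
}

// Mock run with one OpenAI-style tool call in its outputs.
const mockRun = {
  outputs: {
    messages: [
      {
        role: "assistant",
        tool_calls: [
          { function: { name: "query_database", arguments: '{"sql": "PRAGMA table_info(users)"}' } },
        ],
      },
    ],
  },
};

console.log(extractToolCalls(mockRun)[0].name); // "query_database"
```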

Usage Example

import { evaluate } from "langsmith/evaluation";
import { schemaBeforeQuery } from "./eval_schema_check";

await evaluate(myAgent, {
  data: "customer-support-dataset",
  evaluators: [schemaBeforeQuery],
});

Pairwise Comparison

concisenessEvaluator()

Compares two agent responses for conciseness using LLM-as-a-judge.
Parameters:
  • inputs (Record<string, any>, required) - Test case inputs containing the question
  • runs (Run[], required) - Array of exactly two runs to compare

Returns:
  • key (string) - Evaluator identifier: "conciseness"
  • scores (Record<string, number>) - Map of run IDs to scores (winner gets 1, loser gets 0)
import OpenAI from "openai";
import type { Run } from "langsmith/schemas";

const openai = new OpenAI();

const CONCISENESS_PROMPT = `You are evaluating two responses to the same customer question.
Determine which response is MORE CONCISE while still providing all crucial information.

**Conciseness** means getting straight to the point, avoiding filler, and not repeating information.
**Crucial information** includes direct answers, necessary context, and required next steps.

A shorter response is NOT automatically better if it omits crucial information.

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

Output your verdict as a single number:
1 if Response A is more concise while preserving crucial information
2 if Response B is more concise while preserving crucial information
0 if they are roughly equal`;

export async function concisenessEvaluator({
  inputs,
  runs,
}: {
  inputs: Record<string, any>;
  runs: Run[];
}) {
  const [runA, runB] = runs;
  const scores: Record<string, number> = {};

  const response = await openai.chat.completions.create({
    model: "gpt-5-nano",
    messages: [
      {
        role: "system",
        content: "You are a conciseness evaluator. Respond with only a single number: 0, 1, or 2.",
      },
      {
        role: "user",
        content: CONCISENESS_PROMPT
          .replace("{question}", inputs.question)
          .replace("{response_a}", runA?.outputs?.answer ?? "N/A")
          .replace("{response_b}", runB?.outputs?.answer ?? "N/A"),
      },
    ],
  });

  const preference = parseInt(
    response.choices[0].message.content?.trim() ?? "0",
    10
  );

  if (preference === 1) {
    scores[runA.id] = 1;
    scores[runB.id] = 0;
  } else if (preference === 2) {
    scores[runA.id] = 0;
    scores[runB.id] = 1;
  } else {
    scores[runA.id] = 0;
    scores[runB.id] = 0;
  }

  return { key: "conciseness", scores };
}

Running Pairwise Evaluation

import { evaluate } from "langsmith/evaluation";
import { concisenessEvaluator } from "./eval_conciseness_pairwise";

await evaluate(
  ["agent-v4-experiment", "agent-v5-experiment"],
  {
    evaluators: [concisenessEvaluator],
    randomizeOrder: true,
  }
);

Simple Evaluators

mentionsOfficeflow()

Example code-based evaluator that checks whether the response mentions the company name.

Parameters:
  • outputs (Record<string, any>, required) - Run outputs containing the response

Returns:
  • key (string) - Evaluator identifier: "mentions_officeflow"
  • score (boolean) - True if the response mentions "officeflow" (case-insensitive)
import type { EvaluationResult } from "langsmith/evaluation";

const mentionsOfficeflow = async ({
  outputs,
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  // Coerce to a boolean so a missing response scores false instead of undefined
  const score = outputs?.response?.toLowerCase().includes("officeflow") ?? false;
  return { key: "mentions_officeflow", score };
};

Usage Example

import { evaluate } from "langsmith/evaluation";

await evaluate(dummyApp, {
  data: "officeflow-dataset",
  evaluators: [mentionsOfficeflow],
});

Evaluation Types

Code-Based Evaluators

Deterministic functions that check specific conditions:
type CodeBasedEvaluator = (params: {
  run?: Run;
  example?: Example;
  inputs?: Record<string, any>;
  outputs?: Record<string, any>;
}) => EvaluationResult | Promise<EvaluationResult>;
Characteristics:
  • Fast and cheap to run
  • Deterministic results
  • Good for structural checks (schema validation, format checking)
  • Limited to rule-based logic
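As a further illustration, a structural check for valid JSON output fits this signature (a hypothetical evaluator named `validJson`, not part of this API):

```typescript
// Hypothetical code-based evaluator: passes if the response parses as JSON.
const validJson = ({ outputs }: { outputs?: Record<string, any> }) => {
  try {
    JSON.parse(outputs?.response ?? "");
    return { key: "valid_json", score: true };
  } catch {
    return { key: "valid_json", score: false };
  }
};

console.log(validJson({ outputs: { response: '{"ok": true}' } }).score); // true
console.log(validJson({ outputs: { response: "not json" } }).score);     // false
```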

LLM-as-Judge Evaluators

Use language models to assess quality:
const response = await openai.chat.completions.create({
  model: "gpt-5-nano",
  messages: [
    { role: "system", content: "You are an evaluator..." },
    { role: "user", content: promptWithContext },
  ],
});
Characteristics:
  • Can assess subjective qualities (tone, helpfulness, conciseness)
  • More expensive and slower
  • Non-deterministic (may vary between runs)
  • Requires careful prompt engineering

Pairwise Evaluators

Compare two runs directly:
type PairwiseEvaluator = (params: {
  runs: [Run, Run];
  inputs: Record<string, any>;
}) => Promise<{ key: string; scores: Record<string, number> }>;
Characteristics:
  • Better for A/B testing agent versions
  • More reliable than absolute scoring
  • Requires running experiments on both variants
  • Use randomizeOrder: true to avoid position bias
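One way to picture what order randomization buys you (a hand-rolled sketch with a deterministic judge stub; the real randomizeOrder flag handles this internally): swap the pair before judging, then map scores back by stable run ID so the presentation order never leaks into the results.

```typescript
interface MiniRun { id: string; answer: string; }

// Judge stub: prefers the shorter answer (stands in for an LLM verdict).
function judge(a: MiniRun, b: MiniRun): 0 | 1 | 2 {
  if (a.answer.length < b.answer.length) return 1;
  if (b.answer.length < a.answer.length) return 2;
  return 0;
}

// Present the pair in either order; score by stable run ID either way.
function comparePair(runA: MiniRun, runB: MiniRun, swap: boolean): Record<string, number> {
  const [first, second] = swap ? [runB, runA] : [runA, runB];
  const verdict = judge(first, second);
  const scores: Record<string, number> = { [runA.id]: 0, [runB.id]: 0 };
  if (verdict === 1) scores[first.id] = 1;
  if (verdict === 2) scores[second.id] = 1;
  return scores;
}

const a = { id: "run-a", answer: "Short." };
const b = { id: "run-b", answer: "A much longer, wordier answer." };

// The winner is the same regardless of presentation order.
console.log(comparePair(a, b, false)); // run-a wins
console.log(comparePair(a, b, true));  // run-a still wins
```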

Integration with LangSmith

Running Evaluations

import { evaluate } from "langsmith/evaluation";

// Evaluate single agent
await evaluate(myAgent, {
  data: "my-dataset",
  evaluators: [schemaBeforeQuery, mentionsOfficeflow],
  experimentPrefix: "agent-v1",
});

// Compare two agent versions
await evaluate(
  ["experiment-1", "experiment-2"],
  {
    evaluators: [concisenessEvaluator],
    randomizeOrder: true,
  }
);

Evaluation Results

Results are logged to LangSmith and include:
  • Individual test case scores
  • Aggregate statistics
  • Comparison charts (for pairwise)
  • Trace links for debugging failures

Best Practices

Code-Based Evaluators

  1. Be specific in comments - Help developers understand failures
  2. Handle edge cases - Return appropriate scores for N/A cases
  3. Keep them fast - They run on every test case

LLM-as-Judge Evaluators

  1. Use clear rubrics - Define exactly what you’re measuring
  2. Request structured output - Numbers or specific formats
  3. Test your prompts - Run on sample data first
  4. Consider cost - LLM calls add up on large datasets
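On the "request structured output" point, it also helps to validate the judge's reply defensively before scoring. A small sketch (`parseVerdict` is a hypothetical helper, not part of LangSmith) that accepts only a bare 0, 1, or 2 and treats anything else as a tie:

```typescript
// Hypothetical guard: accept only a bare 0, 1, or 2; fall back to a tie (0).
function parseVerdict(raw: string | null | undefined): 0 | 1 | 2 {
  const match = (raw ?? "").trim().match(/^[012]$/);
  return match ? (Number(match[0]) as 0 | 1 | 2) : 0;
}

console.log(parseVerdict("2"));          // 2
console.log(parseVerdict(" 1 "));        // 1
console.log(parseVerdict("Response A")); // 0 (unparseable -> tie)
console.log(parseVerdict(undefined));    // 0
```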

Pairwise Evaluators

  1. Randomize order - Prevent position bias
  2. Handle ties - Score both as 0 for equal performance
  3. Provide context - Include the original question in prompts
  4. Be consistent - Use same criteria across all comparisons
