Overview
Evaluators are functions that assess agent runs based on specific criteria. They integrate with LangSmith’s evaluation framework to provide automated quality checks and comparative analysis.
Schema Validation
schemaBeforeQuery()
Verifies that the agent checks the database schema before executing data queries.
Parameters:
- run (Run, required): LangSmith run object containing the execution trace
Returns:
- key: Evaluator identifier: "schema_before_query"
- score: Binary score: 1 if the agent checked the schema first, 0 otherwise
- comment: Explanation of the evaluation result
import type { Run } from "langsmith/schemas";

export function schemaBeforeQuery(
  run: Run,
): { key: string; score: number; comment: string } {
  const toolCalls = extractToolCalls(run);
  const dbCalls = toolCalls.filter((tc) => tc.name === "query_database");

  // No database calls -- nothing to check
  if (dbCalls.length === 0) {
    return {
      key: "schema_before_query",
      score: 1,
      comment: "No query_database calls -- schema check not applicable",
    };
  }

  // Check if any schema query appears before the first non-schema data query
  let seenSchemaCheck = false;
  for (const tc of dbCalls) {
    const sql = tc.arguments ?? "";
    if (isSchemaQuery(sql)) {
      seenSchemaCheck = true;
    } else {
      // First real data query -- was there a schema check before it?
      if (!seenSchemaCheck) {
        return {
          key: "schema_before_query",
          score: 0,
          comment: `Agent queried data without checking schema first. First query: ${sql.slice(0, 200)}`,
        };
      }
      break;
    }
  }

  return {
    key: "schema_before_query",
    score: 1,
    comment: "Agent checked schema before querying data",
  };
}
Schema Query Detection
The evaluator identifies schema queries using regex patterns:
const SCHEMA_PATTERNS = [
  /PRAGMA\s+table_info/i,
  /SELECT\s+.*FROM\s+sqlite_master/i,
  /PRAGMA\s+database_list/i,
  /\.schema/i,
];

function isSchemaQuery(sql: string): boolean {
  return SCHEMA_PATTERNS.some((pattern) => pattern.test(sql));
}
Helper function to extract tool calls from run outputs:
interface ToolCall {
  name: string;
  arguments: string;
}

function extractToolCalls(run: Run): ToolCall[] {
  const runOutputs = run.outputs ?? {};
  const messages: any[] = runOutputs.messages ?? [];
  const toolCalls: ToolCall[] = [];
  for (const msg of messages) {
    if (typeof msg === "object" && msg !== null) {
      for (const tc of msg.tool_calls ?? []) {
        const func = tc.function ?? {};
        toolCalls.push({
          name: func.name ?? "",
          arguments: func.arguments ?? "",
        });
      }
    }
  }
  return toolCalls;
}
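To see the evaluator's verdicts end to end, here is a self-contained sketch that substitutes a minimal stand-in for the langsmith `Run` type and trims the pattern list; the mock traces and their message shapes are illustrative assumptions, not LangSmith's canonical format:

```typescript
// Minimal stand-in for the langsmith Run type so the sketch runs standalone
interface MiniRun {
  outputs?: { messages?: any[] };
}

interface ToolCall {
  name: string;
  arguments: string;
}

const SCHEMA_PATTERNS = [/PRAGMA\s+table_info/i, /SELECT\s+.*FROM\s+sqlite_master/i];
const isSchemaQuery = (sql: string) => SCHEMA_PATTERNS.some((p) => p.test(sql));

function extractToolCalls(run: MiniRun): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const msg of run.outputs?.messages ?? []) {
    for (const tc of msg?.tool_calls ?? []) {
      calls.push({
        name: tc.function?.name ?? "",
        arguments: tc.function?.arguments ?? "",
      });
    }
  }
  return calls;
}

function schemaBeforeQuery(run: MiniRun) {
  const dbCalls = extractToolCalls(run).filter((tc) => tc.name === "query_database");
  if (dbCalls.length === 0) {
    return { key: "schema_before_query", score: 1, comment: "not applicable" };
  }
  let seenSchemaCheck = false;
  for (const tc of dbCalls) {
    if (isSchemaQuery(tc.arguments)) {
      seenSchemaCheck = true;
    } else if (!seenSchemaCheck) {
      return { key: "schema_before_query", score: 0, comment: "data query before schema check" };
    } else {
      break;
    }
  }
  return { key: "schema_before_query", score: 1, comment: "schema checked first" };
}

// Mock trace: the agent inspects the schema, then queries data
const goodRun: MiniRun = {
  outputs: {
    messages: [
      { tool_calls: [{ function: { name: "query_database", arguments: "PRAGMA table_info(users)" } }] },
      { tool_calls: [{ function: { name: "query_database", arguments: "SELECT * FROM users" } }] },
    ],
  },
};

// Mock trace: the agent queries data immediately
const badRun: MiniRun = {
  outputs: {
    messages: [
      { tool_calls: [{ function: { name: "query_database", arguments: "SELECT * FROM users" } }] },
    ],
  },
};
```

Running `schemaBeforeQuery` on the two mock traces yields a score of 1 for the first and 0 for the second.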
Usage Example
import { evaluate } from "langsmith/evaluation";
import { schemaBeforeQuery } from "./eval_schema_check";
await evaluate(myAgent, {
  data: "customer-support-dataset",
  evaluators: [schemaBeforeQuery],
});
Pairwise Comparison
concisenessEvaluator()
Compares two agent responses for conciseness using LLM-as-a-judge.
Parameters:
- inputs (Record<string, any>, required): Test case inputs containing the question
- runs (Run[], required): Array of exactly two runs to compare
Returns:
- key: Evaluator identifier: "conciseness"
- scores: Map of run IDs to scores (winner gets 1, loser gets 0)
import OpenAI from "openai";
import type { Run } from "langsmith/schemas";

const openai = new OpenAI();

const CONCISENESS_PROMPT = `You are evaluating two responses to the same customer question.
Determine which response is MORE CONCISE while still providing all crucial information.
**Conciseness** means getting straight to the point, avoiding filler, and not repeating information.
**Crucial information** includes direct answers, necessary context, and required next steps.
A shorter response is NOT automatically better if it omits crucial information.
**Question:** {question}
**Response A:**
{response_a}
**Response B:**
{response_b}
Output your verdict as a single number:
1 if Response A is more concise while preserving crucial information
2 if Response B is more concise while preserving crucial information
0 if they are roughly equal`;

export async function concisenessEvaluator({
  inputs,
  runs,
}: {
  inputs: Record<string, any>;
  runs: Run[];
}) {
  const [runA, runB] = runs;
  const scores: Record<string, number> = {};

  const response = await openai.chat.completions.create({
    model: "gpt-5-nano",
    messages: [
      {
        role: "system",
        content:
          "You are a conciseness evaluator. Respond with only a single number: 0, 1, or 2.",
      },
      {
        role: "user",
        content: CONCISENESS_PROMPT
          .replace("{question}", inputs.question)
          .replace("{response_a}", runA?.outputs?.answer ?? "N/A")
          .replace("{response_b}", runB?.outputs?.answer ?? "N/A"),
      },
    ],
  });

  const preference = parseInt(
    response.choices[0].message.content?.trim() ?? "0",
    10,
  );

  if (preference === 1) {
    scores[runA.id] = 1;
    scores[runB.id] = 0;
  } else if (preference === 2) {
    scores[runA.id] = 0;
    scores[runB.id] = 1;
  } else {
    // Tie or unparseable verdict: score both runs as 0
    scores[runA.id] = 0;
    scores[runB.id] = 0;
  }

  return { key: "conciseness", scores };
}
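Judge models sometimes wrap the verdict in extra text despite the system prompt, and `parseInt` on a reply like "Verdict: 2" returns NaN. A defensive parser (a hypothetical helper, not part of the evaluator above) can extract the first standalone 0/1/2 and fall back to a tie otherwise:

```typescript
// Hypothetical helper: pull a 0/1/2 verdict out of the judge's reply,
// treating anything unparseable as a tie (0)
function parseVerdict(raw: string | null | undefined): 0 | 1 | 2 {
  const match = (raw ?? "").trim().match(/\b([012])\b/);
  if (!match) return 0;
  return Number(match[1]) as 0 | 1 | 2;
}

const a = parseVerdict("1");
const b = parseVerdict("Verdict: 2");
const c = parseVerdict("Both responses are equally concise.");
```

The first call yields 1, the second recovers the 2 from the wrapped reply, and the third falls back to a tie.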
Running Pairwise Evaluation
import { evaluate } from "langsmith/evaluation";
import { concisenessEvaluator } from "./eval_conciseness_pairwise";
await evaluate(
  ["agent-v4-experiment", "agent-v5-experiment"],
  {
    evaluators: [concisenessEvaluator],
    randomizeOrder: true,
  },
);
Simple Evaluators
mentionsOfficeflow()
Example code-based evaluator that checks whether the response mentions the company name.
Parameters:
- outputs (Record<string, any>, required): Run outputs containing the response
Returns:
- key: Evaluator identifier: "mentions_officeflow"
- score: True if the response mentions "officeflow" (case-insensitive)
import type { EvaluationResult } from "langsmith/evaluation";

const mentionsOfficeflow = async ({
  outputs,
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  // Default to false so a missing response never yields an undefined score
  const score = outputs?.response?.toLowerCase().includes("officeflow") ?? false;
  return { key: "mentions_officeflow", score };
};
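The evaluator's behavior can be checked directly with mock outputs. This standalone sketch inlines a minimal `EvaluationResult` shape and a synchronous variant so it runs without the langsmith package; the sample responses are made up:

```typescript
// Minimal inline stand-in for langsmith's EvaluationResult type
interface EvaluationResult {
  key: string;
  score: number | boolean;
}

const mentionsOfficeflow = ({
  outputs,
}: {
  outputs: Record<string, any>;
}): EvaluationResult => {
  // Default to false so a missing response never yields an undefined score
  const score = outputs?.response?.toLowerCase().includes("officeflow") ?? false;
  return { key: "mentions_officeflow", score };
};

const hit = mentionsOfficeflow({
  outputs: { response: "Welcome to OfficeFlow support!" },
});
const miss = mentionsOfficeflow({ outputs: {} });
```

The match is case-insensitive, so "OfficeFlow" scores true, while outputs with no response field score false.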
Usage Example
import { evaluate } from "langsmith/evaluation";
await evaluate(dummyApp, {
  data: "officeflow-dataset",
  evaluators: [mentionsOfficeflow],
});
Evaluation Types
Code-Based Evaluators
Deterministic functions that check specific conditions:
type CodeBasedEvaluator = (params: {
  run?: Run;
  example?: Example;
  inputs?: Record<string, any>;
  outputs?: Record<string, any>;
}) => EvaluationResult | Promise<EvaluationResult>;
Characteristics:
- Fast and cheap to run
- Deterministic results
- Good for structural checks (schema validation, format checking)
- Limited to rule-based logic
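As an illustration of a structural check, the sketch below (not from the original evaluators; the field names are assumptions) scores whether the agent's response parses as JSON:

```typescript
// Illustrative code-based evaluator: checks that the response is valid JSON
function validJsonEvaluator({ outputs }: { outputs: Record<string, any> }) {
  try {
    JSON.parse(outputs?.response ?? "");
    return { key: "valid_json", score: 1, comment: "Response parsed as JSON" };
  } catch {
    return { key: "valid_json", score: 0, comment: "Response is not valid JSON" };
  }
}

const ok = validJsonEvaluator({ outputs: { response: '{"status": "done"}' } });
const bad = validJsonEvaluator({ outputs: { response: "Sure, here you go!" } });
```

Checks like this are deterministic and cheap, which is exactly where code-based evaluators excel.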
LLM-as-Judge Evaluators
Use language models to assess quality:
const response = await openai.chat.completions.create({
  model: "gpt-5-nano",
  messages: [
    { role: "system", content: "You are an evaluator..." },
    { role: "user", content: promptWithContext },
  ],
});
Characteristics:
- Can assess subjective qualities (tone, helpfulness, conciseness)
- More expensive and slower
- Non-deterministic (may vary between runs)
- Requires careful prompt engineering
Pairwise Evaluators
Compare two runs directly:
type PairwiseEvaluator = (params: {
  runs: [Run, Run];
  inputs: Record<string, any>;
}) => Promise<{ key: string; scores: Record<string, number> }>;
Characteristics:
- Better for A/B testing agent versions
- More reliable than absolute scoring
- Requires running experiments on both variants
- Use randomizeOrder: true to avoid position bias
Integration with LangSmith
Running Evaluations
import { evaluate } from "langsmith/evaluation";

// Evaluate single agent
await evaluate(myAgent, {
  data: "my-dataset",
  evaluators: [schemaBeforeQuery, mentionsOfficeflow],
  experimentPrefix: "agent-v1",
});

// Compare two agent versions
await evaluate(
  ["experiment-1", "experiment-2"],
  {
    evaluators: [concisenessEvaluator],
    randomizeOrder: true,
  },
);
Evaluation Results
Results are logged to LangSmith and include:
- Individual test case scores
- Aggregate statistics
- Comparison charts (for pairwise)
- Trace links for debugging failures
Best Practices
Code-Based Evaluators
- Be specific in comments - Help developers understand failures
- Handle edge cases - Return appropriate scores for N/A cases
- Keep them fast - They run on every test case
LLM-as-Judge Evaluators
- Use clear rubrics - Define exactly what you’re measuring
- Request structured output - Numbers or specific formats
- Test your prompts - Run on sample data first
- Consider cost - LLM calls add up on large datasets
Pairwise Evaluators
- Randomize order - Prevent position bias
- Handle ties - Score both as 0 for equal performance
- Provide context - Include the original question in prompts
- Be consistent - Use same criteria across all comparisons