Use language models to evaluate subjective criteria that are hard to check with code
LLM-as-judge evaluation uses a language model to assess subjective qualities in agent outputs: helpfulness, tone, coherence, and factual accuracy. This approach bridges the gap between deterministic code checks and human judgment.
LLM-as-judge is powerful but slower and more expensive than code-based evaluation. Use it for subjective criteria where code-based checks aren’t sufficient.
```python
from openai import OpenAI

client = OpenAI()

def is_helpful_evaluator(outputs: dict, expected: dict) -> dict:
    """Use GPT to judge if response is helpful."""
    prompt = f"""
    You are evaluating a customer support response for helpfulness.

    Question: {expected['question']}
    Response: {outputs['answer']}

    Is this response helpful to the customer? Answer with just "yes" or "no".
    """
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    verdict = response.choices[0].message.content.strip().lower()
    return {
        "score": 1 if verdict == "yes" else 0,
        "comment": f"LLM judged response as {'helpful' if verdict == 'yes' else 'not helpful'}"
    }
```
```python
import json

from openai import OpenAI

client = OpenAI()

EVALUATION_PROMPT = """You are evaluating customer support responses for quality.

Evaluate the response on these criteria:
1. **Accuracy**: Does it correctly answer the question?
2. **Completeness**: Does it provide all necessary information?
3. **Tone**: Is it professional and empathetic?
4. **Clarity**: Is it easy to understand?

**Question:** {question}
**Response:** {response}
**Expected Answer (reference):** {expected}

Provide your evaluation as a JSON object:
{{
  "score": 0-100,
  "reasoning": "Brief explanation of the score"
}}
"""

def comprehensive_evaluator(outputs: dict, expected: dict) -> dict:
    """Multi-criteria LLM evaluation."""
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You are an expert evaluator. Respond only with valid JSON."},
            {"role": "user", "content": EVALUATION_PROMPT.format(
                question=expected["question"],
                response=outputs["answer"],
                expected=expected.get("answer", "N/A")
            )}
        ],
        temperature=0,
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return {
        "score": result["score"] / 100,  # Normalize to 0-1
        "comment": result["reasoning"]
    }
```
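Even when asked for a 0-100 score, judges occasionally return a value outside that range or something that isn't a number at all. A small defensive sketch (the `normalize_judge_score` helper below is hypothetical, not part of any library) can guard the normalization step:

```python
def normalize_judge_score(raw) -> float:
    """Clamp a judge-reported 0-100 score into the 0-1 range.

    Returns 0.0 for anything that isn't a number, so a malformed
    judge reply fails the evaluation instead of crashing it.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    return max(0.0, min(100.0, value)) / 100.0
```

For example, `normalize_judge_score(85)` yields `0.85`, while an out-of-range `150` is clamped to `1.0`.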
Evaluate if the agent’s answer is factually correct:
eval_correctness.py
```python
from openai import OpenAI
from langsmith import evaluate

client = OpenAI()

CORRECTNESS_PROMPT = """You are evaluating whether an agent's response correctly answers the customer's question.

Compare the agent's response to the reference answer. The agent's response doesn't need to match word-for-word, but it should contain the same key information.

**Question:** {question}
**Reference Answer:** {reference}
**Agent Response:** {response}

Is the agent's response correct?
Respond with ONLY: "correct" or "incorrect"
"""

def correctness_evaluator(outputs: dict, expected: dict) -> dict:
    """Check if agent response is factually correct."""
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{
            "role": "system",
            "content": "You are a precise evaluator. Respond with only 'correct' or 'incorrect'."
        }, {
            "role": "user",
            "content": CORRECTNESS_PROMPT.format(
                question=expected.get("question", "N/A"),
                reference=expected.get("answer", "N/A"),
                response=outputs["answer"]
            )
        }],
        temperature=0
    )
    verdict = response.choices[0].message.content.strip().lower()
    # Exact match: a substring check like `"correct" in verdict`
    # would also match "incorrect"
    is_correct = verdict == "correct"
    return {
        "score": 1 if is_correct else 0,
        "comment": f"Response is {verdict}"
    }

# Use in experiment
results = evaluate(
    your_agent,
    data="officeflow-dataset",
    evaluators=[correctness_evaluator]
)
```
Using temperature=0 makes LLM evaluations more deterministic and reproducible.
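Even at temperature 0, judge outputs can still vary across runs or model versions. One common mitigation, sketched below with a hypothetical `majority_verdict` helper (not shown in the evaluators above), is to sample the judge several times and take a majority vote:

```python
from collections import Counter

def majority_verdict(verdicts: list[str]) -> str:
    """Pick the most common verdict across repeated judge runs.

    Verdicts are normalized (stripped, lowercased) before counting,
    so "Yes " and "yes" count as the same answer.
    """
    counts = Counter(v.strip().lower() for v in verdicts)
    return counts.most_common(1)[0][0]
```

In an evaluator, you would call the judge three or five times and score against `majority_verdict` of the replies instead of a single reply.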
```python
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use built-in evaluators
results = evaluate(
    your_agent,
    data="officeflow-dataset",
    evaluators=[
        LangChainStringEvaluator("qa"),           # Question-answering correctness
        LangChainStringEvaluator("helpfulness"),  # How helpful is the response
        LangChainStringEvaluator("relevance"),    # Is response relevant to question
    ],
)
```
Improve evaluation quality by asking the LLM to reason step-by-step:
cot_evaluation.py
```python
from openai import OpenAI

client = OpenAI()

COT_PROMPT = """You are evaluating a customer support response.

**Question:** {question}
**Response:** {response}

Evaluate this response step-by-step:

1. **Identify the customer's core need**: What is the customer really asking for?
2. **Check completeness**: Does the response address all parts of the question?
3. **Assess accuracy**: Is the information provided correct?
4. **Evaluate tone**: Is the response professional and empathetic?
5. **Final verdict**: Based on the above, is this a good response?

Provide your analysis in this format:

1. Core need: ...
2. Completeness: ...
3. Accuracy: ...
4. Tone: ...
5. Final verdict: "good" or "bad"
"""
```
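Whatever output format you request, the judge's step-by-step analysis still needs to be reduced to a verdict. A minimal parsing sketch, assuming the verdict appears on the last line that mentions "good" or "bad" (the `extract_verdict` helper is hypothetical):

```python
def extract_verdict(analysis: str) -> str:
    """Scan a step-by-step analysis bottom-up for a good/bad verdict.

    Searching from the last line means intermediate reasoning that
    happens to mention "good" doesn't override the final verdict.
    """
    for line in reversed(analysis.strip().splitlines()):
        low = line.lower()
        if "good" in low:
            return "good"
        if "bad" in low:
            return "bad"
    return "unknown"
```

Returning `"unknown"` for unparseable output lets the evaluator flag the run for manual review instead of silently scoring it.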
Provide concrete criteria rather than vague instructions:
rubric_example.py
```python
GOOD_PROMPT = """Evaluate if the response is helpful:

- Does it directly answer the question? (yes/no)
- Does it provide actionable next steps? (yes/no)
- Is it concise (under 100 words)? (yes/no)

The response is helpful if all three criteria are met."""

VAGUE_PROMPT = """Is this response good? Answer yes or no."""
```
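A side benefit of concrete criteria is that the judge's reply can be scored mechanically. A sketch, assuming the judge answers one yes/no per criterion (the `score_rubric` helper is hypothetical, not from any library):

```python
import re

def score_rubric(judge_reply: str, n_criteria: int = 3) -> dict:
    """Turn per-criterion yes/no answers into a fractional score.

    Extracts the first `n_criteria` standalone "yes"/"no" tokens and
    scores the response as the fraction of criteria met.
    """
    verdicts = re.findall(r"\b(yes|no)\b", judge_reply.lower())[:n_criteria]
    met = sum(v == "yes" for v in verdicts)
    return {
        "score": met / n_criteria,
        "comment": f"{met}/{n_criteria} rubric criteria met",
    }
```

A partial-credit score like this is often more informative than a single pass/fail bit, since it shows which responses were close to acceptable.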