Execute evaluation experiments to measure agent performance systematically
Experiments connect your agent, dataset, and evaluators to produce quantitative measurements of performance. Every experiment creates a snapshot you can compare against future versions.
Here’s a minimal example from the OfficeFlow agent:
```python
from dotenv import load_dotenv
from langsmith import evaluate

load_dotenv()

# Target: Your agent function
def dummy_app(inputs: dict) -> dict:
    return {
        "response": "Sure! In OfficeFlow, you can reset your password from the settings page."
    }

# Evaluator: Check if response mentions brand
def mentions_officeflow(outputs: dict) -> bool:
    return "officeflow" in outputs["response"].lower()

# Run experiment
results = evaluate(
    dummy_app,
    data="officeflow-dataset",
    evaluators=[mentions_officeflow],
)
```
The evaluate function automatically runs your agent on every example in the dataset and applies all evaluators to the outputs.
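Evaluators are not limited to booleans: they can also return a numeric score, or a dict naming the metric. The exact return shapes accepted vary by SDK version, so treat this as a sketch; `keyword_coverage` is a hypothetical evaluator invented for illustration, not part of the LangSmith SDK.

```python
# Sketch of a scored evaluator. keyword_coverage is a hypothetical
# example metric, not a built-in LangSmith evaluator.
def keyword_coverage(outputs: dict) -> dict:
    """Score the fraction of expected keywords present in the response."""
    keywords = ["officeflow", "password", "settings"]
    response = outputs["response"].lower()
    hits = sum(1 for kw in keywords if kw in response)
    # Returning a dict lets you name the metric shown in the results
    return {"key": "keyword_coverage", "score": hits / len(keywords)}
```

You would pass it alongside other evaluators, e.g. `evaluators=[mentions_officeflow, keyword_coverage]`.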
You can attach evaluators directly to datasets in the LangSmith UI. These run automatically:
auto_evaluators.py
```python
from langsmith import aevaluate

# Evaluators bound to the dataset in UI will run automatically
results = await aevaluate(
    chat_wrapper,
    data="officeflow-dataset",
    # No evaluators specified - uses dataset's bound evaluators
)
```
Dataset-bound evaluators are useful for organization-wide standards that should apply to all experiments on that dataset.
If your agent holds state between calls, reset it inside the target function so every example starts from a clean slate:

```python
import uuid

def run_agent(inputs: dict) -> dict:
    # Create fresh state for each run
    thread_id = str(uuid.uuid4())
    # Reset any global state
    agent.reset_state()
    return agent.chat(inputs["question"], thread_id=thread_id)
```
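To see why the fresh `thread_id` matters, here is a minimal, self-contained sketch; the `StubAgent` class is invented for illustration and stands in for any agent that keys conversation state by thread:

```python
import uuid

class StubAgent:
    """Hypothetical stand-in for a stateful agent keyed by thread_id."""
    def __init__(self):
        self.threads = {}

    def chat(self, question: str, thread_id: str) -> dict:
        # Each thread_id accumulates its own history, so runs never share state
        history = self.threads.setdefault(thread_id, [])
        history.append(question)
        return {"response": f"turn {len(history)}: {question}"}

agent = StubAgent()

def run_agent(inputs: dict) -> dict:
    # A fresh thread_id per run keeps experiment examples independent
    thread_id = str(uuid.uuid4())
    return agent.chat(inputs["question"], thread_id=thread_id)

a = run_agent({"question": "How do I reset my password?"})
b = run_agent({"question": "How do I reset my password?"})
# Both runs see an empty history, because each got its own thread
```

Without the per-run `thread_id`, the second run would observe the first run's conversation history and evaluation results would depend on example ordering.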