Experiments connect your agent, dataset, and evaluators to produce quantitative measurements of performance. Every experiment creates a snapshot you can compare against future versions.

Anatomy of an Experiment

An experiment requires three components:
  1. Target: The agent or function to evaluate
  2. Dataset: Test cases with inputs (and optionally expected outputs)
  3. Evaluators: Functions that score the agent’s outputs
experiment_components.py
from langsmith import evaluate

# 1. Your agent (target)
def your_agent(inputs: dict) -> dict:
    question = inputs["question"]
    # ... agent logic produces `response` ...
    return {"answer": response}

# 2. Dataset (by name)
dataset_name = "officeflow-dataset"

# 3. Evaluators
def check_mentions_product(outputs: dict) -> bool:
    return "officeflow" in outputs["answer"].lower()

# Run experiment
results = evaluate(
    your_agent,
    data=dataset_name,
    evaluators=[check_mentions_product]
)
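Boolean evaluators like check_mentions_product get a default metric name. To control the name shown in the UI, an evaluator can instead return a dict with an explicit key and score (a minimal sketch; the SDK accepts several evaluator signatures, so check its reference for the full list):

```python
# Returning a dict lets you name the metric explicitly in the LangSmith UI
def brand_mention(outputs: dict) -> dict:
    mentioned = "officeflow" in outputs["answer"].lower()
    return {"key": "brand_mention", "score": 1 if mentioned else 0}

print(brand_mention({"answer": "OfficeFlow supports SSO."}))
# {'key': 'brand_mention', 'score': 1}
```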

Basic Experiment

Here’s a minimal example from the OfficeFlow agent:
from dotenv import load_dotenv
from langsmith import evaluate

load_dotenv()

# Target: Your agent function
def dummy_app(inputs: dict) -> dict:
    return {
        "response": "Sure! In OfficeFlow, you can reset your password from the settings page."
    }

# Evaluator: Check if response mentions brand
def mentions_officeflow(outputs: dict) -> bool:
    return "officeflow" in outputs["response"].lower()

# Run experiment
results = evaluate(
    dummy_app,
    data="officeflow-dataset",
    evaluators=[mentions_officeflow]
)
The evaluate function automatically runs your agent on every example in the dataset and applies all evaluators to the outputs.
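Because the target and evaluators are plain Python functions, you can sanity-check them together locally before launching an experiment; a quick sketch reusing the example above:

```python
# Dummy target and evaluator from the example above
def dummy_app(inputs: dict) -> dict:
    return {
        "response": "Sure! In OfficeFlow, you can reset your password from the settings page."
    }

def mentions_officeflow(outputs: dict) -> bool:
    return "officeflow" in outputs["response"].lower()

# Run one example end to end without calling evaluate
outputs = dummy_app({"question": "How do I reset my password?"})
print(mentions_officeflow(outputs))  # True
```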

Async Agents

For agents using async/await, use aevaluate:
async_experiment.py
import asyncio
from langsmith import aevaluate
from agent_v5 import chat, load_knowledge_base

async def chat_wrapper(inputs: dict) -> dict:
    question = inputs.get("question", "")
    result = await chat(question)
    return {"answer": result["output"], "messages": result["messages"]}

async def main():
    # Load any required resources
    await load_knowledge_base(kb_dir="./knowledge_base")
    
    # Run evaluation
    results = await aevaluate(
        chat_wrapper,
        data="officeflow-dataset"
    )
    return results

if __name__ == "__main__":
    asyncio.run(main())
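The wrapper's job is just to normalize your agent's return value into the dict shape your evaluators expect. That step can be exercised on its own; here chat is a toy stand-in for agent_v5.chat (assumed to return "output" and "messages" keys):

```python
import asyncio

# Toy stand-in for agent_v5.chat (assumption: returns "output" and "messages")
async def chat(question: str) -> dict:
    return {
        "output": f"Echo: {question}",
        "messages": [{"role": "user", "content": question}],
    }

async def chat_wrapper(inputs: dict) -> dict:
    question = inputs.get("question", "")
    result = await chat(question)
    # Rename keys to match what the evaluators read
    return {"answer": result["output"], "messages": result["messages"]}

out = asyncio.run(chat_wrapper({"question": "hi"}))
print(sorted(out))  # ['answer', 'messages']
```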

Experiment Naming

Experiments are automatically named with timestamps. Use experiment_prefix to organize results:
named_experiments.py
from langsmith import evaluate

results = evaluate(
    agent_v5,
    data="officeflow-dataset",
    evaluators=[schema_before_query],
    experiment_prefix="schema-check-v5"  # Appears as "schema-check-v5-{timestamp}"
)

Real-World Example

Here’s how the OfficeFlow course evaluates the schema-checking behavior:
run_eval.py
import asyncio
import uuid
from langsmith import evaluate

# Import your agent
import agent_v5
from agent_v5 import chat, load_knowledge_base
from eval_schema_check import schema_before_query

async def setup():
    """Load knowledge base before running evals."""
    kb_dir = "./knowledge_base"
    await load_knowledge_base(kb_dir)

def run_agent(inputs: dict) -> dict:
    """Invoke the agent with a fresh thread_id each time."""
    agent_v5.thread_id = str(uuid.uuid4())
    return asyncio.run(chat(inputs["question"]))

if __name__ == "__main__":
    asyncio.run(setup())
    
    results = evaluate(
        run_agent,
        data="officeflow-dataset",
        evaluators=[schema_before_query],
        experiment_prefix="schema-check-v5",
    )
Creating a fresh thread_id for each example ensures test isolation: one example’s state doesn’t affect another.
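To see why, here is a toy agent (not the course agent) whose memory is keyed by thread_id; fresh ids keep one example’s history out of another’s:

```python
import uuid

# Toy agent with per-thread conversation memory (illustration only)
memory: dict[str, list[str]] = {}

def chat(question: str, thread_id: str) -> dict:
    history = memory.setdefault(thread_id, [])
    history.append(question)
    return {"answer": f"Turn {len(history)}: {question}"}

def run_agent(inputs: dict) -> dict:
    thread_id = str(uuid.uuid4())  # fresh state per example
    return chat(inputs["question"], thread_id)

first = run_agent({"question": "a"})
second = run_agent({"question": "b"})
print(first["answer"], "|", second["answer"])  # Turn 1: a | Turn 1: b
```

Because each call gets its own thread_id, both runs see an empty history; with a shared id, the second run would report "Turn 2".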

Experiment Results

The evaluate function returns an ExperimentResults object:
analyze_results.py
results = evaluate(
    your_agent,
    data="officeflow-dataset",
    evaluators=[mentions_officeflow]
)

print(f"Experiment: {results.experiment_name}")
# A link to the results page is also printed to the console when the experiment starts

# Access individual rows: each pairs a run with its example and evaluator feedback
for row in results:
    print(f"Input: {row['example'].inputs}")
    print(f"Output: {row['run'].outputs}")
    print(f"Scores: {row['evaluation_results']}")
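You can also aggregate feedback yourself once scores are collected. A minimal sketch with a hypothetical pass_rate helper over rows already extracted into plain dicts (the row shape here is illustrative, not the SDK's):

```python
# Hypothetical helper: fraction of rows where the named evaluator scored 1
def pass_rate(rows: list[dict], key: str) -> float:
    scores = [row["scores"][key] for row in rows]
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative rows, as you might collect them from an experiment's feedback
rows = [
    {"scores": {"mentions_officeflow": 1}},
    {"scores": {"mentions_officeflow": 1}},
    {"scores": {"mentions_officeflow": 0}},
    {"scores": {"mentions_officeflow": 1}},
]
print(pass_rate(rows, "mentions_officeflow"))  # 0.75
```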

Viewing Results

Experiments appear in the LangSmith UI with:
  • Aggregate metrics - Pass rate, average scores
  • Per-example results - See which inputs failed
  • Trace links - Debug individual runs
  • Comparison view - Compare against other experiments
  1. Access Experiment Results
     • Run your experiment
     • Click the results URL printed to the console
     • Or navigate to Datasets → Your Dataset → Experiments tab
  2. Compare Experiments
     • Select two experiments from the same dataset
     • Click Compare
     • View side-by-side metrics and identify regressions

Multiple Evaluators

Pass multiple evaluators to measure different aspects:
multiple_evaluators.py
from langsmith import evaluate

def mentions_brand(outputs: dict) -> bool:
    return "officeflow" in outputs["answer"].lower()

def is_concise(outputs: dict) -> bool:
    return len(outputs["answer"].split()) < 50

def uses_tools(run, example) -> dict:
    messages = run.outputs.get("messages", [])
    used_tools = any(msg.get("tool_calls") for msg in messages)
    return {"score": 1 if used_tools else 0}

results = evaluate(
    your_agent,
    data="officeflow-dataset",
    evaluators=[mentions_brand, is_concise, uses_tools]
)
Using Dataset-Bound Evaluators

You can attach evaluators directly to datasets in the LangSmith UI. These run automatically:
auto_evaluators.py
from langsmith import aevaluate

# Evaluators bound to the dataset in the UI will run automatically
results = await aevaluate(
    chat_wrapper,
    data="officeflow-dataset"  # No evaluators specified - uses dataset's bound evaluators
)

Dataset-bound evaluators are useful for organization-wide standards that should apply to all experiments on that dataset.

Best Practices

Test Isolation

Ensure each example runs independently:
test_isolation.py
import uuid

def run_agent(inputs: dict) -> dict:
    # Create fresh state for each run
    thread_id = str(uuid.uuid4())

    # Reset any global state
    agent.reset_state()

    return agent.chat(inputs["question"], thread_id=thread_id)

Performance Optimization

For large datasets, use async evaluation:
parallel_eval.py
from langsmith import aevaluate

results = await aevaluate(
    async_agent,
    data="large-dataset",
    evaluators=[evaluator1, evaluator2],
    max_concurrency=10  # Run 10 examples in parallel
)

Reproducibility

Set seeds and model parameters for consistent results:
reproducible_eval.py
def your_agent(inputs: dict) -> dict:
    response = llm.invoke(
        inputs["question"],
        temperature=0,  # Deterministic sampling
        seed=42         # Reproducible outputs
    )
    return {"answer": response}

Next Steps

  • Code-based Eval: Write deterministic evaluators
  • LLM-as-Judge: Evaluate subjective criteria