Test datasets are the foundation of reliable agent evaluation. They provide consistent, repeatable inputs that let you measure your agent’s performance over time and detect regressions as you iterate.

Why Datasets Matter

Datasets enable you to:
  • Track performance across agent versions
  • Catch regressions before deploying changes
  • Benchmark systematically against production scenarios
  • Share evaluation criteria with your team
Start with 10-25 examples covering your core use cases. You can expand the dataset as you discover edge cases through production traces.

Dataset Structure

A dataset is a collection of test cases, where each case typically contains:
  • Input: The user question or scenario
  • Expected Output (optional): Reference answer or behavior
  • Metadata (optional): Tags, difficulty level, scenario type
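As a concrete sketch, a single test case can be represented as a plain dictionary. The field names here ("inputs", "outputs", "metadata") mirror the SDK examples later on this page; the specific values are illustrative:

```python
# One test case for the OfficeFlow support agent.
case = {
    # The user question or scenario
    "inputs": {"question": "Do you carry ballpoint pens?"},
    # Optional reference behavior to compare against during evaluation
    "outputs": {"expected_tool": "query_database"},
    # Optional tags for slicing results later
    "metadata": {"category": "inventory_queries", "difficulty": "easy"},
}
```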

Simple CSV Format

The easiest way to start is with a CSV file:
officeflow-dataset.csv
question
How many reams of copy paper do you currently have available?
I need spiral notebooks for my team — can you tell me how many 3-packs you have right now?
Do you carry ballpoint pens?
I've been trying to order copy paper all week and your website keeps crashing. Just tell me — do you even have any left and how much does it cost?
Can you check if you have spiral notebooks AND staplers in stock?
What is your return policy? How long do I have to return something?
I placed an order yesterday and realized I put the wrong address. Can you change it? What are my options?
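If you prefer to stay in code, a single-column CSV like the one above can be parsed into example dictionaries with the standard library before uploading. This is a minimal sketch; a couple of rows are inlined as a string for illustration:

```python
import csv
import io

# A few rows from officeflow-dataset.csv, inlined for the example
csv_text = """question
Do you carry ballpoint pens?
What is your return policy? How long do I have to return something?
"""

# Each CSV row becomes one example with a single "question" input
examples = [
    {"inputs": {"question": row["question"]}}
    for row in csv.DictReader(io.StringIO(csv_text))
]
```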

Creating Datasets in LangSmith

Step 1: Upload via UI
  • Navigate to Datasets in LangSmith
  • Click New Dataset
  • Name your dataset (e.g., officeflow-dataset)
  • Upload your CSV file
  • Map columns to input fields

Step 2: Create Programmatically

You can also create datasets using the LangSmith SDK:
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    dataset_name="officeflow-dataset",
    description="Customer support queries for OfficeFlow"
)

# Add examples
examples = [
    {
        "inputs": {"question": "How many reams of copy paper do you have?"},
        "outputs": {"expected_tool": "query_database"}
    },
    {
        "inputs": {"question": "What is your return policy?"},
        "outputs": {"expected_tool": "search_knowledge_base"}
    }
]

for example in examples:
    client.create_example(
        dataset_id=dataset.id,
        inputs=example["inputs"],
        outputs=example.get("outputs")
    )
    
Step 3: Import from Production Traces

Leverage real user interactions:
from langsmith import Client

client = Client()

# Find highly rated traces (LangSmith's filter syntax combines clauses with and(...))
runs = client.list_runs(
    project_name="production",
    filter='and(eq(feedback_key, "user_rating"), gte(feedback_score, 4))'
)

# Add them to an existing dataset
dataset = client.read_dataset(dataset_name="officeflow-dataset")
for run in runs:
    client.create_example(
        dataset_id=dataset.id,
        inputs=run.inputs,
        outputs=run.outputs
    )
    

Best Practices

Cover Core Scenarios

Ensure your dataset includes:
  • Happy path queries - Straightforward requests your agent should handle easily
  • Edge cases - Ambiguous, multi-part, or unusual requests
  • Error scenarios - Invalid inputs, out-of-scope questions
  • Complex workflows - Multi-step interactions requiring multiple tools

Example Categories for OfficeFlow

dataset_categories.py
categories = {
    "inventory_queries": [
        "How many reams of copy paper do you have?",
        "Do you carry staplers?"
    ],
    "policy_questions": [
        "What is your return policy?",
        "What payment methods do you accept?"
    ],
    "frustrated_customers": [
        "I've been trying to order copy paper all week and your website keeps crashing.",
        "This is the THIRD time I've called about my lost package."
    ],
    "multi_part_requests": [
        "Can you check if you have spiral notebooks AND staplers in stock?",
        "What's the price on blue pens and do you offer free shipping?"
    ]
}
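A category dictionary like this can be flattened into examples that carry their category as metadata, which makes it easy to slice evaluation results per scenario type later. A sketch, using a subset of the categories above:

```python
categories = {
    "inventory_queries": [
        "How many reams of copy paper do you have?",
        "Do you carry staplers?"
    ],
    "policy_questions": [
        "What is your return policy?",
        "What payment methods do you accept?"
    ],
}

# One example per question, tagged with its source category
examples = [
    {"inputs": {"question": q}, "metadata": {"category": category}}
    for category, questions in categories.items()
    for q in questions
]
```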
    

Add Reference Outputs

While not required, reference outputs help with evaluation:
with_expected_outputs.py
examples = [
    {
        "inputs": {"question": "How many reams of copy paper do you have?"},
        "outputs": {
            "expected_answer": "We currently have 50 reams of copy paper in stock.",
            "should_use_tool": "query_database",
            "should_check_schema": True
        }
    }
]
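With reference outputs in place, a simple code-based evaluator can compare the agent's behavior against them. This is a sketch: it assumes the agent's output dict exposes a hypothetical `tool_called` field, and returns a key/score dict of the shape LangSmith accepts from custom evaluators:

```python
def correct_tool(outputs: dict, reference_outputs: dict) -> dict:
    """Check whether the agent called the tool the reference expects."""
    match = outputs.get("tool_called") == reference_outputs.get("should_use_tool")
    return {"key": "correct_tool", "score": int(match)}

# Example: the agent queried the database, as the reference expects
result = correct_tool(
    {"tool_called": "query_database"},
    {"should_use_tool": "query_database"},
)
```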
    

Using Your Dataset

Once created, reference your dataset by name in experiments:
run_experiment.py
from langsmith import evaluate

results = evaluate(
    your_agent,
    data="officeflow-dataset",  # Reference by name
    evaluators=[...]
)

Datasets are versioned in LangSmith. You can update examples without breaking existing experiments.

Next Steps

Run Experiments

Connect your dataset to evaluators

Code-based Eval

Write deterministic evaluators
