## Why Datasets Matter

Datasets enable you to:

- Track performance across agent versions
- Catch regressions before deploying changes
- Benchmark systematically against production scenarios
- Share evaluation criteria with your team
Start with 10-25 examples covering your core use cases. You can expand the dataset as you discover edge cases through production traces.
## Dataset Structure

A dataset is a collection of test cases, where each case typically contains:

- **Input**: The user question or scenario
- **Expected Output** (optional): Reference answer or behavior
- **Metadata** (optional): Tags, difficulty level, scenario type
### Simple CSV Format

The easiest way to start is with a CSV file such as `officeflow-dataset.csv`.
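As a minimal sketch, the file might look like the following. The column names are an assumption here, chosen to mirror the `question` and `expected_tool` fields used in the SDK examples later in this guide; the standard-library `csv` module is enough to generate it:

```python
import csv

# Illustrative rows; the question/expected_tool column names are an
# assumption matching the SDK examples in this guide.
rows = [
    {"question": "How many reams of copy paper do you have?",
     "expected_tool": "query_database"},
    {"question": "What is your return policy?",
     "expected_tool": "search_knowledge_base"},
]

with open("officeflow-dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_tool"])
    writer.writeheader()
    writer.writerows(rows)
```

A file like this can be imported through the LangSmith UI, or uploaded with the Python SDK's `Client.upload_csv` helper by telling it which columns are inputs and which are outputs.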
## Creating Datasets in LangSmith

You can also create a dataset (here named `officeflow-dataset`) programmatically with the Python SDK:

```python
from langsmith import Client

client = Client()

# Create the dataset
dataset = client.create_dataset(
    dataset_name="officeflow-dataset",
    description="Customer support queries for OfficeFlow",
)

# Add examples
examples = [
    {
        "inputs": {"question": "How many reams of copy paper do you have?"},
        "outputs": {"expected_tool": "query_database"},
    },
    {
        "inputs": {"question": "What is your return policy?"},
        "outputs": {"expected_tool": "search_knowledge_base"},
    },
]

for example in examples:
    client.create_example(
        dataset_id=dataset.id,
        inputs=example["inputs"],
        outputs=example.get("outputs"),
    )
```
You can also promote production traces into dataset examples, for instance runs that received high user ratings:

```python
from langsmith import Client

client = Client()

# Find interesting traces: runs rated 4 or higher by users
runs = client.list_runs(
    project_name="production",
    filter='and(eq(feedback_key, "user_rating"), gte(feedback_score, 4))',
)

# Add them to the dataset
dataset = client.read_dataset(dataset_name="officeflow-dataset")
for run in runs:
    client.create_example(
        dataset_id=dataset.id,
        inputs=run.inputs,
        outputs=run.outputs,
    )
```
## Best Practices

### Cover Core Scenarios

Ensure your dataset includes:

- **Happy path queries** - Straightforward requests your agent should handle easily
- **Edge cases** - Ambiguous, multi-part, or unusual requests
- **Error scenarios** - Invalid inputs, out-of-scope questions
- **Complex workflows** - Multi-step interactions requiring multiple tools
### Example Categories for OfficeFlow

Group examples by the categories above, for instance in a `dataset_categories.py` helper, so you can audit coverage at a glance.
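A sketch of what such a helper might contain; the category keys follow the four scenario types above, and the questions are illustrative:

```python
# Tag each example with the scenario category it covers, so coverage
# gaps are easy to spot. All questions here are illustrative.
CATEGORIES = {
    "happy_path": [
        "How many reams of copy paper do you have?",
        "What is your return policy?",
    ],
    "edge_case": [
        "Do you have paper? Also, when do you close?",
    ],
    "error_scenario": [
        "What's the weather like today?",  # out of scope for OfficeFlow
    ],
    "complex_workflow": [
        "Check stock for staplers and place an order for 10 if available.",
    ],
}

def examples_with_metadata():
    """Flatten categories into (inputs, metadata) pairs for create_example."""
    return [
        ({"question": q}, {"category": cat})
        for cat, questions in CATEGORIES.items()
        for q in questions
    ]
```

Passing the metadata dict to `create_example` lets you later filter experiment results by category.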
### Add Reference Outputs

While not required, reference outputs make evaluation easier by giving evaluators a ground truth to compare against.
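For instance, each example's outputs can carry a reference answer alongside the expected tool. The `reference_answer` field name and the answer text below are illustrative, not part of any required schema:

```python
# Outputs now include a reference answer in addition to the expected
# tool, so evaluators can score answer quality, not just tool routing.
examples = [
    {
        "inputs": {"question": "What is your return policy?"},
        "outputs": {
            "expected_tool": "search_knowledge_base",
            "reference_answer": "Unopened items can be returned within 30 days with a receipt.",
        },
    },
    {
        "inputs": {"question": "How many reams of copy paper do you have?"},
        "outputs": {
            "expected_tool": "query_database",
            "reference_answer": "We currently have 240 reams in stock.",
        },
    },
]

# These are added with client.create_example exactly as in the earlier
# snippet, passing both inputs and outputs.
```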
## Using Your Dataset

Once created, reference your dataset by name when running experiments.
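As a sketch: the evaluator below is plain Python and runnable on its own, while the commented-out `evaluate(...)` call assumes a recent `langsmith` SDK (where evaluators can accept `outputs`/`reference_outputs` keyword parameters), a configured API key, and a placeholder `my_agent` target function:

```python
def exact_tool_match(outputs: dict, reference_outputs: dict) -> bool:
    """Pass if the agent picked the tool the example expects."""
    return outputs.get("tool") == reference_outputs.get("expected_tool")

# from langsmith import evaluate
# evaluate(
#     my_agent,                       # your target function (placeholder)
#     data="officeflow-dataset",      # reference the dataset by name
#     evaluators=[exact_tool_match],
#     experiment_prefix="officeflow-v2",
# )
```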
Datasets are versioned in LangSmith. You can update examples without breaking existing experiments.
## Next Steps

- **Run Experiments** - Connect your dataset to evaluators
- **Code-based Eval** - Write deterministic evaluators