Version-controlled datasets for experiments, evaluation, and fine-tuning, with support for creating datasets from production traces.
Phoenix datasets provide a structured way to manage test cases, golden datasets, and evaluation sets with built-in versioning. Datasets are the foundation for running experiments and systematic testing of LLM applications.
A dataset in Phoenix is a versioned collection of examples: input/output pairs with optional metadata. Each example represents a test case or reference data point for your LLM application.

Datasets support:
Input/Output pairs: Questions and expected answers
Metadata: Additional context like difficulty level, category, or source
Versioning: Track changes over time with immutable versions
Export formats: Share datasets for fine-tuning or external tools
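As a rough illustration of the export use case, the sketch below serializes input/output pairs to JSON Lines, a format many fine-tuning tools accept. It uses only the standard library. The `to_jsonl` helper and the plain-dict example shape are assumptions made for illustration, not part of the Phoenix API.

```python
import json


def to_jsonl(examples):
    """Serialize example dicts to a JSON Lines string, one example per line."""
    lines = []
    for ex in examples:
        record = {
            "input": ex["input"],
            "output": ex["output"],
            # Metadata is optional, so fall back to an empty dict.
            "metadata": ex.get("metadata", {}),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)


# Hypothetical examples in the input/output/metadata shape described above.
examples = [
    {
        "input": {"query": "How do I reset my password?"},
        "output": {"answer": "Click 'Forgot Password' on the login page..."},
        "metadata": {"category": "authentication"},
    },
    {
        "input": {"query": "What are your business hours?"},
        "output": {"answer": "We're open Monday-Friday, 9am-5pm EST."},
    },
]

jsonl = to_jsonl(examples)
print(jsonl)
```

Writing one record per line keeps the file streamable, so large datasets can be read back without loading everything into memory.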
Create datasets programmatically using the Phoenix client:
import phoenix as px

client = px.Client()

# Create a new dataset
dataset = client.create_dataset(
    name="customer-support-qa",
    description="QA pairs for customer support chatbot"
)

# Add examples
examples = [
    {
        "input": {"query": "How do I reset my password?"},
        "output": {"answer": "Click 'Forgot Password' on the login page..."},
        "metadata": {"category": "authentication"}
    },
    {
        "input": {"query": "What are your business hours?"},
        "output": {"answer": "We're open Monday-Friday, 9am-5pm EST."},
        "metadata": {"category": "general"}
    }
]

for example_data in examples:
    dataset.add_example(**example_data)
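Before uploading examples, it can help to check that each one has the expected shape. The following is a minimal sketch using only the standard library. The `validate_example` helper and its rules (non-empty `input` and `output` dicts, optional `metadata` dict) are assumptions for illustration, not checks that Phoenix is known to perform.

```python
def validate_example(example):
    """Return a list of problems found in one example dict (empty if valid)."""
    problems = []
    for key in ("input", "output"):
        value = example.get(key)
        if not isinstance(value, dict) or not value:
            problems.append(f"'{key}' must be a non-empty dict")
    # Metadata is optional, but must be a dict when provided.
    if not isinstance(example.get("metadata", {}), dict):
        problems.append("'metadata' must be a dict when present")
    return problems


good = {
    "input": {"query": "How do I reset my password?"},
    "output": {"answer": "Click 'Forgot Password' on the login page..."},
    "metadata": {"category": "authentication"},
}
bad = {"input": {"query": "What are your business hours?"}, "metadata": "general"}

for candidate in (good, bad):
    print(validate_example(candidate))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a batch at once.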
One of the most powerful features is creating datasets directly from production traces using TraceDataset (from src/phoenix/trace/trace_dataset.py):
Step 1: Export traces to TraceDataset
from datetime import datetime

import phoenix as px
from phoenix.trace import TraceDataset

client = px.Client()

# Get traces from a project (start_time and end_time are passed as datetime objects)
spans_df = client.get_spans_dataframe(
    project_name="production-chatbot",
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 1, 31)
)

# Create TraceDataset
trace_ds = TraceDataset(dataframe=spans_df)
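Once spans are exported, they still need to be turned into input/output examples. The sketch below shows one possible mapping over plain dict records, such as those produced by pandas `spans_df.to_dict("records")`. The column names (`parent_id`, `context.trace_id`, `attributes.input.value`, `attributes.output.value`) and the `spans_to_examples` helper are assumptions for illustration, so check them against the columns in your own spans dataframe.

```python
def spans_to_examples(span_records):
    """Map root-span records to input/output example dicts."""
    examples = []
    for span in span_records:
        # Keep only root spans (no parent), which represent whole requests.
        if span.get("parent_id") is not None:
            continue
        user_input = span.get("attributes.input.value")
        model_output = span.get("attributes.output.value")
        # Skip spans that are missing either side of the pair.
        if not user_input or not model_output:
            continue
        examples.append(
            {
                "input": {"query": user_input},
                "output": {"answer": model_output},
                "metadata": {
                    "source": "production",
                    "trace_id": span.get("context.trace_id"),
                },
            }
        )
    return examples


# Hypothetical span records standing in for rows of a spans dataframe.
records = [
    {
        "context.trace_id": "t1",
        "parent_id": None,
        "attributes.input.value": "How do I reset my password?",
        "attributes.output.value": "Click 'Forgot Password' on the login page...",
    },
    {
        "context.trace_id": "t1",
        "parent_id": "s1",  # child span, skipped
        "attributes.input.value": "retrieval query",
        "attributes.output.value": "retrieved documents",
    },
    {
        "context.trace_id": "t2",
        "parent_id": None,
        "attributes.input.value": "What are your business hours?",
        "attributes.output.value": None,  # no output recorded, skipped
    },
]

trace_examples = spans_to_examples(records)
print(trace_examples)
```

Recording the trace ID in metadata keeps each example traceable back to the production interaction it came from.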
# Get existing dataset
dataset = client.get_dataset(name="customer-support-qa")

# Add or modify examples (creates new version)
dataset.add_example(
    input={"query": "How do I upgrade my plan?"},
    output={"answer": "Navigate to Settings > Billing..."},
    metadata={"category": "billing"}
)

print(f"Current version: {dataset.version_id}")

# List all versions
versions = client.list_dataset_versions(
    dataset_id=dataset.id
)
for version in versions:
    print(f"Version {version.id}: {version.created_at}")

# Load specific version
old_version = client.get_dataset(
    dataset_id=dataset.id,
    version_id="v1-abc123"
)

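To make the idea of immutable versions concrete, here is a small in-memory model in which every change produces a new snapshot and earlier snapshots remain retrievable. This is a conceptual sketch only. The `DatasetVersion` and `VersionedDataset` classes and the `v0`, `v1`, ... identifiers are hypothetical and do not describe how Phoenix stores versions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DatasetVersion:
    """One snapshot of a dataset; the tuple of examples cannot be reassigned."""

    version_id: str
    examples: Tuple[Dict, ...]


class VersionedDataset:
    """Keeps every snapshot so that older versions stay retrievable."""

    def __init__(self) -> None:
        self._versions: List[DatasetVersion] = [DatasetVersion("v0", ())]

    @property
    def latest(self) -> DatasetVersion:
        return self._versions[-1]

    def add_example(self, example: Dict) -> DatasetVersion:
        # Build a new snapshot instead of mutating the current one.
        snapshot = DatasetVersion(
            version_id=f"v{len(self._versions)}",
            examples=self.latest.examples + (dict(example),),
        )
        self._versions.append(snapshot)
        return snapshot

    def get_version(self, version_id: str) -> DatasetVersion:
        for version in self._versions:
            if version.version_id == version_id:
                return version
        raise KeyError(version_id)


ds = VersionedDataset()
ds.add_example({"input": {"query": "How do I upgrade my plan?"}})
ds.add_example({"input": {"query": "What are your business hours?"}})

# Each version keeps the number of examples it had when it was created.
for vid in ("v0", "v1", "v2"):
    print(vid, len(ds.get_version(vid).examples))
```

Because old snapshots never change, an experiment that ran against an earlier version can be reproduced later against exactly the same examples.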
import phoenix as px

client = px.Client()

# Get dataset by name (latest version)
dataset = client.get_dataset(name="customer-support-qa")

# Get specific version
dataset = client.get_dataset(
    name="customer-support-qa",
    version_id="v2-xyz789"
)

# List all datasets
datasets = client.list_datasets()
for ds in datasets:
    print(f"{ds.name}: {len(ds.examples)} examples")

from datetime import datetime, timedelta, timezone

# Filter by metadata
auth_examples = {
    ex_id: ex
    for ex_id, ex in dataset.examples.items()
    if ex.metadata.get('category') == 'authentication'
}
print(f"Found {len(auth_examples)} authentication examples")

# Filter by date
# Assumes updated_at is timezone-aware; comparing it with a naive datetime raises TypeError
recent_cutoff = datetime.now(timezone.utc) - timedelta(days=7)
recent_examples = {
    ex_id: ex
    for ex_id, ex in dataset.examples.items()
    if ex.updated_at > recent_cutoff
}

# Update an existing example
dataset.update_example(
    example_id="example-123",
    output={"answer": "Updated answer with more detail..."},
    metadata={"reviewed": True, "reviewer": "alice"}
)

# Delete an example
dataset.delete_example(example_id="example-456")

Start Small: Begin with 10-20 high-quality examples and expand based on coverage needs.
Curate from Production: Use real user interactions to build datasets that reflect actual usage patterns.
Add Metadata: Tag examples with categories, difficulty, or source to enable filtering and analysis.
Version Regularly: Create new versions when making significant changes to track dataset evolution.
Include Edge Cases: Add examples for error conditions, edge cases, and challenging scenarios.
Review Regularly: Periodically audit examples to ensure they remain relevant and accurate.
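The metadata and coverage practices above can be checked with a quick count of examples per category. This is a minimal sketch using the standard library. The `category_coverage` and `underrepresented` helpers, the `minimum` threshold, and the sample examples are hypothetical and only illustrate the idea.

```python
from collections import Counter


def category_coverage(examples, key="category"):
    """Count examples per metadata value; untagged examples are grouped together."""
    return Counter(
        ex.get("metadata", {}).get(key, "uncategorized") for ex in examples
    )


def underrepresented(coverage, expected, minimum=2):
    """Return expected categories that have fewer than `minimum` examples."""
    return [name for name in expected if coverage.get(name, 0) < minimum]


# Hypothetical examples; only the metadata matters for this check.
examples = [
    {"input": {"query": "How do I reset my password?"}, "metadata": {"category": "authentication"}},
    {"input": {"query": "I can't log in"}, "metadata": {"category": "authentication"}},
    {"input": {"query": "What are your business hours?"}, "metadata": {"category": "general"}},
    {"input": {"query": "Hello"}},
]

coverage = category_coverage(examples)
gaps = underrepresented(coverage, expected=["authentication", "billing", "general"])
print(dict(coverage))
print(gaps)
```

Categories that show up in `gaps` are natural candidates for the next round of curation from production traces.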