Datasets in Phoenix are collections of input/output examples that enable systematic experimentation, evaluation, and fine-tuning of your AI applications. By organizing your data into versioned datasets, you can run experiments, compare results, and track improvements over time.

What are Datasets?

A dataset in Phoenix is a structured collection of examples, where each example consists of:
  • Input: The data provided to your model or application (e.g., user prompts, questions)
  • Output: The expected or reference output (e.g., correct answers, ideal responses)
  • Metadata: Additional information about the example (e.g., difficulty level, category, source)
Datasets are automatically versioned, allowing you to track changes and compare experiments across different versions.
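
For instance, a single example might look like the following (an illustrative sketch; the field names mirror the list above):

```python
# An illustrative dataset example: input, expected output, and metadata
example = {
    "input": {"question": "What is the capital of France?"},
    "output": {"answer": "Paris"},
    "metadata": {"category": "geography", "difficulty": "easy"},
}
```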

What are Experiments?

An experiment is a systematic evaluation of your AI application’s performance on a dataset. Each experiment:
  • Runs a task (your AI application logic) on every example in a dataset
  • Captures the output and execution trace for each run
  • Optionally evaluates the outputs using evaluators (metrics and quality checks)
  • Stores results for comparison and analysis
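Conceptually, a run reduces to a loop like the following plain-Python sketch (not Phoenix's actual internals; `run_experiment_sketch` and its helpers are illustrative):

```python
# Conceptual sketch of an experiment run: execute the task on every
# example, then score each output against the reference with evaluators.
def run_experiment_sketch(examples, task, evaluators=()):
    results = []
    for example in examples:
        output = task(example["input"])  # run your application logic
        scores = {
            name: score_fn(output, example["output"])  # compare to reference
            for name, score_fn in evaluators
        }
        results.append(
            {"input": example["input"], "output": output, "scores": scores}
        )
    return results

examples = [{"input": {"question": "2+2?"}, "output": {"answer": "4"}}]
task = lambda inp: {"answer": "4"}
exact_match = ("exact_match", lambda out, ref: out == ref)
runs = run_experiment_sketch(examples, task, [exact_match])
```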

Key Use Cases

Model Evaluation

Test your models against benchmarks and ground truth data to measure accuracy and quality.

Regression Testing

Ensure changes to your application don’t degrade performance on known examples.

A/B Testing

Compare different models, prompts, or configurations to find the best approach.

Fine-tuning Preparation

Curate high-quality datasets for training and fine-tuning language models.

Workflow Overview

1. Create a Dataset

Build datasets from production traces, upload CSV/DataFrame, or manually create examples.
```python
from phoenix.client import Client

client = Client()

# Create from lists of inputs and outputs
dataset = client.datasets.create_dataset(
    name="qa-dataset",
    inputs=[{"question": "What is AI?"}, {"question": "What is ML?"}],
    outputs=[{"answer": "Artificial Intelligence is..."}, {"answer": "Machine Learning is..."}]
)
```
2. Run an Experiment

Execute your task function on each example and capture results.
```python
from phoenix.experiments import run_experiment

def my_task(input):
    # Your application logic; generate_answer stands in for your own
    # function (e.g., an LLM call)
    question = input["question"]
    return {"answer": generate_answer(question)}

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="baseline-v1"
)
```
3. Evaluate Results

Apply evaluators to measure quality, correctness, and other metrics.
```python
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import create_evaluator

# create_evaluator is a decorator factory: it turns your own scoring
# function into a named evaluator
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return output["answer"] == expected["answer"]

# Add evaluation
evaluated = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[exact_match]
)
```
4. Compare & Analyze

View results in the Phoenix UI and compare across experiments. Access the experiment URL to see detailed breakdowns, trace links, and side-by-side comparisons.
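
At its core, a side-by-side comparison boils down to joining per-example scores from two runs, which the Phoenix UI does for you. A plain-Python sketch with made-up scores:

```python
# Illustrative per-example scores from two experiment runs
baseline = {"ex1": 0.8, "ex2": 0.5}
candidate = {"ex1": 0.9, "ex2": 0.4}

# Score delta per example; negative deltas flag regressions
diff = {ex_id: candidate[ex_id] - baseline[ex_id] for ex_id in baseline}
regressions = [ex_id for ex_id, delta in diff.items() if delta < 0]
```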

Dataset Versioning

Phoenix automatically versions your datasets:
  • Each modification creates a new version
  • Experiments are tied to specific versions for reproducibility
  • You can retrieve and compare any historical version
```python
# Get specific version
versioned_dataset = client.datasets.get_dataset(
    dataset="qa-dataset",
    version_id="v_abc123"
)

# List all versions
versions = client.datasets.get_dataset_versions(dataset="qa-dataset")
for version in versions:
    print(f"Version {version['version_id']} created at {version['created_at']}")
```
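
Given a listing like the one above, picking the most recent version is just a max over `created_at` (a sketch with hypothetical version records; ISO-8601 timestamps sort lexicographically):

```python
# Hypothetical version records, shaped like the listing above
versions = [
    {"version_id": "v_abc123", "created_at": "2024-01-01T00:00:00Z"},
    {"version_id": "v_def456", "created_at": "2024-02-01T00:00:00Z"},
]

# ISO-8601 strings compare chronologically, so max() finds the latest
latest = max(versions, key=lambda v: v["created_at"])
```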

Trace Association

Datasets can be linked to production traces, enabling you to:
  • Create datasets from real user interactions
  • Track which traces contributed to each example
  • Debug issues by reviewing original execution context
```python
# Create dataset with span ID associations
dataset = client.datasets.create_dataset(
    name="production-examples",
    dataframe=spans_df,
    input_keys=["question"],
    output_keys=["answer"],
    span_id_key="context.span_id"  # Links to traces
)
```

Next Steps

Creating Datasets

Learn different ways to create and populate datasets

Running Experiments

Execute experiments and evaluate your AI applications

Dataset Versioning

Manage versions, tags, and export datasets

Evaluators

Understand built-in and custom evaluators
