Basic Experiment
Here’s a simple experiment that runs a task on a dataset:
- Execute answer_question on each example in the dataset
- Capture outputs and execution traces
- Store results in Phoenix
- Display a summary of the experiment
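The flow above can be sketched as follows. This is a minimal sketch, not the canonical implementation: the dataset name is illustrative, `answer_question` stands in for your own application logic, and actually running it requires the `arize-phoenix` package and a reachable Phoenix server.

```python
def answer_question(question: str) -> str:
    # Stand-in for your real application logic (e.g. an LLM call).
    return f"Echo: {question}"

def task(input):
    # Phoenix calls the task once per dataset example.
    return answer_question(input["question"])

def run():
    # Deferred imports: this function needs `arize-phoenix` installed
    # and a Phoenix server to connect to.
    import phoenix as px
    from phoenix.experiments import run_experiment

    dataset = px.Client().get_dataset(name="qa-dataset")  # illustrative name
    return run_experiment(dataset, task, experiment_name="basic-qa")
```

Calling `run()` executes the task on every example, records traces, and prints a summary.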
Task Functions
Task functions can access different parts of the example:

Single Parameter (Input Only)
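When a task declares a single parameter, Phoenix passes the example's input to it. A sketch (the `"question"` field is illustrative of your dataset's schema):

```python
def task(input):
    # With one parameter, the example's input dict is passed in.
    return input["question"].strip().lower()
```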
Multiple Parameters
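Phoenix binds task arguments by parameter name, so a task can also ask for fields such as `expected` and `metadata` alongside `input`. A sketch (the keys inside each dict are illustrative):

```python
def task(input, expected, metadata):
    # Arguments are matched by parameter name: the example's input,
    # its expected output, and its metadata are passed separately.
    question = input["question"]
    hint = metadata.get("hint", "")
    return {"answer": f"{question} {hint}".strip(), "expected": expected["answer"]}
```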
Async Tasks
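Tasks can also be coroutines, which Phoenix can execute concurrently. A sketch where the sleep stands in for an async LLM or API call:

```python
import asyncio

async def task(input):
    # Coroutine tasks let many examples run concurrently; the sleep
    # stands in for an awaited LLM or HTTP call.
    await asyncio.sleep(0)
    return input["question"].upper()
```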
Task Output Format
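Strings, numbers, lists, and dicts are all valid task outputs because they survive a JSON round-trip; a quick check:

```python
import json

def task_text(input):
    return "a plain string answer"

def task_structured(input):
    return {"answer": "yes", "confidence": 0.9, "sources": ["doc-1"]}

# Every valid task output should pass this round-trip without error.
for output in (task_text({}), task_structured({})):
    json.dumps(output)
```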
Tasks must return JSON-serializable data.

Experiments with Evaluators
Evaluate experiment outputs using built-in or custom evaluators:
- Built-in Evaluators
- Custom Evaluators
- LLM-as-Judge
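A sketch combining a built-in evaluator with a custom one. `ContainsAnyKeyword` is assumed to be available from `phoenix.experiments.evaluators` (check your Phoenix version), and the keyword is illustrative:

```python
def exact_match(output, expected) -> bool:
    # Custom evaluator: bound by parameter name, returns a boolean score.
    return output == expected["answer"]

def run(dataset, task):
    # Deferred imports: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import run_experiment
    from phoenix.experiments.evaluators import ContainsAnyKeyword  # assumed name

    return run_experiment(
        dataset,
        task,
        evaluators=[ContainsAnyKeyword(keywords=["Paris"]), exact_match],
    )
```

LLM-as-judge evaluators follow the same shape: a callable (or built-in class) passed in the `evaluators` list.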
Evaluator Functions
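Like tasks, evaluators are bound by parameter name: `output` is the task's result, while `expected`, `input`, and `metadata` come from the example. A sketch returning a float score (field names illustrative):

```python
def contains_answer(output, expected) -> float:
    # Score 1.0 when the expected answer appears in the task output.
    return 1.0 if expected["answer"] in str(output) else 0.0
```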
Evaluators receive the task output and example data.

Adding Evaluations to Existing Experiments
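A sketch that backfills scores onto a completed experiment, assuming the `evaluate_experiment` helper in `phoenix.experiments`:

```python
def nonempty(output) -> bool:
    # A trivial evaluator to backfill onto existing runs.
    return bool(output)

def add_evaluations(experiment):
    # `experiment` is the object returned by run_experiment.
    # Deferred import: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import evaluate_experiment
    return evaluate_experiment(experiment, evaluators=[nonempty])
```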
You can add evaluations to experiments that have already run.

Experiment Configuration
Concurrency
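Concurrency bounds how many dataset examples are in flight at once. A sketch, assuming `run_experiment` accepts a `concurrency` keyword (check your Phoenix version for the exact name and default):

```python
def run(dataset, task):
    # Deferred import: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import run_experiment

    # Up to 10 examples are processed at once; higher values speed up
    # I/O-bound tasks but can hit provider rate limits.
    return run_experiment(dataset, task, concurrency=10)
```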
Control how many examples run in parallel.

Timeout
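A timeout bounds how long a single task invocation may run before it is marked as failed. A sketch, assuming `run_experiment` accepts a `timeout` in seconds (check your Phoenix version):

```python
def run(dataset, task):
    # Deferred import: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import run_experiment

    # Each task invocation gets up to 120 seconds before it is
    # recorded as an error instead of blocking the whole run.
    return run_experiment(dataset, task, timeout=120)
```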
Set a timeout for long-running tasks.

Rate Limiting
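A sketch of rate-limit handling, assuming `run_experiment` accepts a `rate_limit_errors` parameter naming the exception types to back off and retry on (the error class here is a stand-in for your provider's, e.g. `openai.RateLimitError`):

```python
class ProviderRateLimitError(Exception):
    # Stand-in for your provider's rate-limit error type.
    pass

def run(dataset, task):
    # Deferred import: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import run_experiment

    # When the task raises one of these errors, Phoenix backs off and
    # retries the example instead of failing it (parameter name assumed).
    return run_experiment(dataset, task, rate_limit_errors=(ProviderRateLimitError,))
```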
Handle rate limits gracefully.

Dry Run
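A dry run executes the task on a small sample and prints the results without persisting anything. A sketch, assuming `dry_run` accepts either `True` (one example) or a sample size (behavior assumed; check your Phoenix version):

```python
def run(dataset, task):
    # Deferred import: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import run_experiment

    # Runs the task on 3 sampled examples; nothing is stored in Phoenix.
    return run_experiment(dataset, task, dry_run=3)
```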
Test your experiment on a subset without storing results.

Experiment Metadata
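A sketch attaching metadata to a run, assuming an `experiment_metadata` parameter on `run_experiment`; the keys are illustrative of what you might record:

```python
EXPERIMENT_METADATA = {
    # Illustrative keys: record whatever distinguishes this run.
    "model": "gpt-4o-mini",
    "prompt_version": "v2",
    "temperature": 0.0,
}

def run(dataset, task):
    # Deferred import: requires `arize-phoenix` and a Phoenix server.
    from phoenix.experiments import run_experiment
    return run_experiment(dataset, task, experiment_metadata=EXPERIMENT_METADATA)
```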
Add rich metadata to track experiment context.

Accessing Results
Summary Statistics
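Phoenix prints a summary after each run; you can also compute your own statistics from an exported results table. A pandas sketch with illustrative column names (one row per run/evaluator pair):

```python
import pandas as pd

def mean_scores(results: pd.DataFrame) -> pd.Series:
    # Mean evaluation score per evaluator; column names are illustrative
    # of an exported experiment table.
    return results.groupby("evaluator")["score"].mean()
```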
Individual Runs
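The object returned by `run_experiment` exposes per-example runs (the exact attribute names vary by Phoenix version). A helper over generic run records; the `"error"` key is illustrative of what Phoenix stores per run:

```python
def failed_runs(runs):
    # `runs` is an iterable of per-example run records; returns only
    # the ones that recorded an error.
    return [r for r in runs if r.get("error")]
```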
DataFrame Export
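A sketch of exporting results for offline analysis. The `as_dataframe` method name is an assumption here; check your Phoenix version for the exact export method on the returned experiment object:

```python
def export_results(experiment, path="experiment_results.csv"):
    # `experiment` is the object returned by run_experiment; the export
    # method name is assumed and may differ across Phoenix versions.
    df = experiment.as_dataframe()
    df.to_csv(path, index=False)
    return df
```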
Comparing Experiments
Compare multiple experiments in the Phoenix UI:
- Side-by-side output comparison
- Evaluation score differences
- Trace viewing for debugging
- Statistical summaries
Best Practices
Start Small
Test your task on a few examples with dry_run before running the full experiment.

Use Async
Implement async tasks with appropriate concurrency for faster experiments.
Handle Errors
Implement error handling in your task to avoid stopping the entire experiment.
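A sketch of a defensive task wrapper: `answer_question` is a hypothetical stand-in for your logic, and the error marker keeps the output JSON-serializable so the run continues past bad examples:

```python
def answer_question(question: str) -> str:
    # Stand-in task logic that can fail.
    if not question:
        raise ValueError("empty question")
    return question.upper()

def safe_task(input):
    # Catch per-example failures so one bad example doesn't abort the
    # whole experiment; the error marker stays JSON-serializable.
    try:
        return answer_question(input.get("question", ""))
    except Exception as exc:
        return {"error": str(exc)}
```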
Rich Metadata
Add detailed metadata to experiments for better tracking and comparison.
Next Steps
Evaluators
Learn about built-in and custom evaluators
Dataset Versioning
Manage dataset versions and exports