Systematic experiment tracking for testing prompts, models, and retrieval configurations with comprehensive comparison tools
Phoenix experiments enable systematic testing and comparison of LLM application variants. Run your task across a dataset, track results, evaluate outputs, and compare performance—all with full traceability and version control.
Use the run_experiment function (from src/phoenix/experiments/functions.py) to execute a task across a dataset:
1. Define your task

Create a function that takes an input and returns an output:
```python
from openai import OpenAI

client = OpenAI()

def chatbot_task(input):
    """Task function that processes dataset inputs"""
    query = input['query']
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}]
    )
    return {"answer": response.choices[0].message.content}
```
2. Load your dataset
```python
import phoenix as px

client = px.Client()
dataset = client.get_dataset(name="customer-support-qa")
```
3. Run the experiment
```python
from phoenix.experiments import run_experiment

result = run_experiment(
    dataset=dataset,
    task=chatbot_task,
    experiment_name="gpt-4-baseline",
    experiment_description="Baseline with GPT-4 and default settings"
)
```
Attach evaluators to score each run's output automatically:

```python
from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(name="answer_length")
def length_check(output):
    """Check if the answer is appropriately concise"""
    answer = output.get('answer', '')
    word_count = len(answer.split())
    if 20 <= word_count <= 100:
        return 1.0  # Good length
    return 0.0  # Too short or too long

result = run_experiment(
    dataset=dataset,
    task=chatbot_task,
    evaluators=[length_check]
)
```
Run each example multiple times to measure consistency:
```python
result = run_experiment(
    dataset=dataset,
    task=task,
    repetitions=3  # Run each example 3 times
)

# Analyze variance across repetitions
for example_id in dataset.examples.keys():
    runs = [r for r in result.runs.values() if r.dataset_example_id == example_id]
    outputs = [r.output for r in runs]
    print(f"Example {example_id}: {len(set(str(o) for o in outputs))} unique outputs")
```
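Counting unique outputs can be turned into a single consistency number per example. The helper below is a hypothetical sketch, not part of the Phoenix API; it scores how often the repetitions agree with the most common output:

```python
from collections import Counter

def consistency_score(outputs):
    """Fraction of repetitions that match the most common output.

    1.0 means every repetition produced the same output;
    lower values indicate more run-to-run variance.
    """
    if not outputs:
        return 0.0
    counts = Counter(str(o) for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)
```

A score well below 1.0 at a fixed temperature may signal a prompt that underspecifies the desired answer.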
Experiments automatically capture distributed traces for each task execution. Traces are linked to experiment runs and include:
Full span hierarchy of task execution
LLM calls with prompts and completions
Retrieval operations with documents
Tool invocations with parameters
Timing and token usage data
Access traces via:
```python
# Get trace ID for a specific run
run = list(result.runs.values())[0]
print(f"Trace ID: {run.trace_id}")

# View trace in Phoenix UI
print(f"Trace URL: http://localhost:6006/projects/{result.project_name}/traces/{run.trace_id}")
```
Use Consistent Datasets: Always test variants on the same dataset version for fair comparison.
Name Descriptively: Use experiment names that capture what changed (e.g., "gpt-4-temp-0.5" vs "gpt-4-temp-0.9").
Version Your Code: Tag experiments with git commits or version numbers in metadata.
Start Small: Test with dry_run mode before running full experiments.
Track Everything: Add metadata about model settings, prompt versions, and experimental parameters.
Review Failures: Check error rates and investigate failed examples to improve robustness.
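The "Version Your Code" and "Track Everything" practices can be combined into a small helper that builds a metadata dictionary, including the current git commit. This is a sketch: the helper name is ours, and passing the dictionary to run_experiment via an experiment metadata keyword is an assumption about the API, not something shown above.

```python
import subprocess

def build_experiment_metadata(model, temperature, prompt_version):
    """Collect settings plus the current git commit for traceability."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # Not in a git repo, or git not installed
    return {
        "model": model,
        "temperature": temperature,
        "prompt_version": prompt_version,
        "git_commit": commit,
    }

metadata = build_experiment_metadata("gpt-4", 0.5, "v2")
# Pass this when launching the experiment, e.g.:
# run_experiment(dataset=dataset, task=chatbot_task,
#                experiment_metadata=metadata)  # keyword name is an assumption
```

Recording the commit alongside model settings lets you reproduce any experiment later, even after prompts and code have changed.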