The Evaluations API allows you to submit span, trace, and document evaluations to Phoenix and retrieve evaluation results for analysis.

Overview

Evaluations (also called annotations) attach quality metrics to your traces. They can be:
  • Trace evaluations: Score entire traces
  • Span evaluations: Score individual spans
  • Document evaluations: Score retrieved documents in RAG applications
Evaluations are stored with an annotator_kind of "LLM" to distinguish them from human annotations.

Endpoints

Add Evaluations

POST /v1/evaluations
Add span, trace, or document evaluations to Phoenix.

Headers

Content-Type
string
required
  • application/x-protobuf: Protocol Buffer format
  • application/x-pandas-arrow: PyArrow table format (recommended)
Content-Encoding
string
Optional compression: gzip (for protobuf only)
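
As a minimal sketch of gzip-compressing a protobuf request body (the `body` bytes below are a placeholder for a serialized Evaluation message, not real protobuf content):

```python
import gzip

# Placeholder bytes standing in for a serialized protobuf Evaluation message
body = b"serialized-evaluation-message"

# Compress the body and declare the encoding alongside the content type
compressed = gzip.compress(body)
headers = {
    "Content-Type": "application/x-protobuf",
    "Content-Encoding": "gzip",
}

# The server decompresses the body before parsing the protobuf
assert gzip.decompress(compressed) == body
```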

Request Body

The request body format depends on the Content-Type header.

application/x-protobuf

Binary Protocol Buffer containing an Evaluation message with:
  • name (string, required): Name of the evaluation
  • Evaluation data in protobuf format
The evaluation name must not be blank or empty.

application/x-pandas-arrow

PyArrow IPC stream containing a table with one of these index structures:
For trace evaluations:
  • Index: trace_id or context.trace_id
  • Columns: score (float), label (string), explanation (string)
For span evaluations:
  • Index: span_id or context.span_id
  • Columns: score (float), label (string), explanation (string)
For document evaluations:
  • Multi-index: [span_id, document_position] or [context.span_id, document_position]
  • Columns: score (float), label (string), explanation (string)

Response

Returns HTTP 204 (No Content) on success.

Example

import pandas as pd
import pyarrow as pa
import requests

# Create evaluation dataframe
df = pd.DataFrame({
    'score': [0.95, 0.87, 0.72],
    'label': ['correct', 'correct', 'incorrect'],
    'explanation': [
        'Accurate response',
        'Mostly accurate',
        'Contains errors'
    ]
}, index=pd.Index(
    ['trace-id-1', 'trace-id-2', 'trace-id-3'],
    name='trace_id'
))

# Convert to PyArrow table
table = pa.Table.from_pandas(df)

# Serialize to bytes
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
body = sink.getvalue().to_pybytes()

# Send to Phoenix
response = requests.post(
    'http://localhost:6006/v1/evaluations',
    data=body,
    headers={
        'Authorization': 'Bearer your-api-key',
        'Content-Type': 'application/x-pandas-arrow'
    }
)

print(response.status_code)  # 204

Get Evaluations

GET /v1/evaluations
Retrieve span, trace, or document evaluations from a project.

Query Parameters

project_name
string
Name of the project to get evaluations from. Defaults to "default" if omitted.

Response

Returns a streaming response of PyArrow tables in application/x-pandas-arrow format. Each table contains evaluations for a specific evaluation name, grouped by type (trace/span/document).
Content-Type
string
application/x-pandas-arrow
The response streams multiple PyArrow IPC tables, each representing evaluations for one evaluation name.

Response Schema

trace evaluations
Tables with trace_id index containing:
  • name: Evaluation name
  • score: Numeric score
  • label: Categorical label
  • explanation: Text explanation
  • Additional annotation metadata
span evaluations
Tables with span_id index containing the same fields as trace evaluations
document evaluations
Tables with multi-index [span_id, document_position] containing the same fields

Example

import requests
import pyarrow as pa

url = "http://localhost:6006/v1/evaluations"
headers = {"Authorization": "Bearer your-api-key"}
params = {"project_name": "my-project"}

response = requests.get(url, headers=headers, params=params)

if response.status_code == 200:
    # The response body may contain several concatenated IPC streams,
    # one per evaluation name; keep opening streams until the buffer
    # is exhausted.
    source = pa.BufferReader(response.content)
    while True:
        try:
            reader = pa.ipc.open_stream(source)
        except pa.ArrowInvalid:
            break  # no more streams in the buffer
        df = reader.read_all().to_pandas()
        print(df)
else:
    print(f"Request failed with status {response.status_code}")

Evaluation Data Formats

Trace Evaluations

Evaluate entire conversation traces:
import pandas as pd

df = pd.DataFrame({
    'score': [0.95],
    'label': ['correct'],
    'explanation': ['Response is accurate']
}, index=pd.Index(['trace-id-abc'], name='trace_id'))

Span Evaluations

Evaluate individual LLM calls or operations:
df = pd.DataFrame({
    'score': [0.87],
    'label': ['relevant']
}, index=pd.Index(['span-id-123'], name='span_id'))

Document Evaluations

Evaluate retrieved documents in RAG:
df = pd.DataFrame({
    'score': [0.9, 0.8, 0.95]
}, index=pd.MultiIndex.from_tuples(
    [('span-id-1', 0), ('span-id-1', 1), ('span-id-1', 2)],
    names=['span_id', 'document_position']
))

Error Handling

404
error
No evaluations found for the specified project
415
error
Unsupported content type. Must be application/x-protobuf or application/x-pandas-arrow
422
error
Invalid request body:
  • Evaluation name is blank or empty
  • Invalid PyArrow format
  • Invalid data structure (wrong index columns)
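
As a sketch of client-side handling for the status codes above (the helper name is illustrative, not part of the API):

```python
def describe_evaluation_response(status_code: int) -> str:
    """Map Evaluations API status codes to human-readable messages."""
    messages = {
        204: "evaluations accepted",
        404: "no evaluations found for the specified project",
        415: "unsupported content type",
        422: "invalid request body (blank name, bad PyArrow data, or wrong index)",
    }
    return messages.get(status_code, f"unexpected status: {status_code}")

print(describe_evaluation_response(204))  # evaluations accepted
```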

Best Practices

Use PyArrow Format

PyArrow format is more efficient than protobuf for bulk evaluations

Include Explanations

Add explanation text to help understand evaluation decisions

Consistent Naming

Use consistent evaluation names across your project (e.g., “correctness”, “relevance”)

Batch Processing

Send multiple evaluations in a single request for better performance
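
For example, per-trace results for the same evaluation name can be combined into a single dataframe, and hence a single request, with `pandas.concat` (a sketch; the helper is illustrative):

```python
import pandas as pd

def make_trace_eval(trace_id: str, score: float) -> pd.DataFrame:
    """Build a one-row trace-evaluation dataframe with a trace_id index."""
    return pd.DataFrame(
        {"score": [score]},
        index=pd.Index([trace_id], name="trace_id"),
    )

# Combine per-trace results into one table for a single POST
batch = pd.concat([
    make_trace_eval("trace-id-1", 0.9),
    make_trace_eval("trace-id-2", 0.7),
])
print(len(batch))  # 2
```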

Using Phoenix SDK

For easier evaluation workflows, use the Phoenix Evals library:
from phoenix.evals import (
    llm_classify,
    run_relevance_eval,
    OpenAIModel
)

# LLM-based classification
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4"),
    template="Is this response correct?",
    rails=["correct", "incorrect"]
)

# RAG relevance evaluation  
relevance_results = run_relevance_eval(
    dataframe=df,
    model=OpenAIModel(model="gpt-4")
)
See the Evaluation documentation for complete SDK usage.
