Helicone provides a complete toolkit for evaluating and testing your LLM applications. Create datasets from production traffic, measure performance with custom scores, and collect user feedback to continuously improve your AI systems.

Why Evaluation Matters

Measure Quality

Track accuracy, hallucination rates, and custom metrics across your LLM applications

Build Better Models

Use production data to create training datasets and fine-tune models

Catch Regressions

Test changes against consistent evaluation sets before deploying to production

Understand Users

Collect implicit and explicit feedback to learn what responses work best

Evaluation Workflow

Helicone’s evaluation features work together to create a continuous improvement loop:

  1. Capture production data: Log all LLM requests automatically with Helicone’s proxy integration.
  2. Create datasets: Select and curate high-quality examples from production traffic for evaluation and fine-tuning. Learn about Datasets →
  3. Score responses: Run evaluation frameworks (RAGAS, LangSmith, custom) and report scores to Helicone for centralized tracking. Learn about Scores →
  4. Collect feedback: Gather user ratings and behavioral signals to identify what works. Learn about Feedback →
  5. Analyze and iterate: Use evaluation data to refine prompts, switch models, and improve response quality.

Key Features

Datasets

Transform production requests into curated datasets for evaluation and fine-tuning:
  • Select from production: Filter requests using custom properties, scores, or feedback ratings
  • Curate quality examples: Review and edit request/response pairs before adding to datasets
  • Export multiple formats: Download as JSONL for fine-tuning or CSV for analysis
  • API integration: Programmatically create and manage datasets
// Create dataset from production traffic
const response = await fetch('https://api.helicone.ai/v1/helicone-dataset', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${HELICONE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    datasetName: 'Customer Support Examples',
    requestIds: ['req_123', 'req_456', 'req_789']
  })
});

Scores

Report evaluation results from any framework for unified observability:
  • Framework agnostic: Works with RAGAS, LangSmith, or custom evaluation logic
  • Track over time: Visualize how metrics evolve across deployments
  • Compare experiments: Evaluate different prompts, models, or configurations
  • Custom metrics: Track any integer or boolean metric (accuracy, hallucination, safety)
// Report evaluation scores
await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${HELICONE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    scores: {
      accuracy: 92,
      hallucination: 5,
      helpfulness: 88
    }
  })
});
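Note that Helicone scores are integers or booleans, while most evaluation frameworks emit floats in [0, 1]. A small helper can bridge the gap before reporting; the function name here is illustrative, not part of any SDK:

```python
def to_helicone_scores(metrics: dict) -> dict:
    """Convert framework metrics to Helicone-compatible score values.

    Floats in [0, 1] are scaled to integer percentages; ints and
    bools pass through unchanged. (Helper name is illustrative.)
    """
    scores = {}
    for name, value in metrics.items():
        if isinstance(value, (bool, int)):
            scores[name] = value
        elif isinstance(value, float):
            scores[name] = round(value * 100)  # e.g. 0.92 -> 92
        else:
            raise TypeError(f"Unsupported score type for {name!r}: {type(value)}")
    return scores
```

Pass the result as the `scores` object in the request body shown above.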

Feedback

Collect user satisfaction signals to understand response quality:
  • Explicit ratings: Thumbs up/down, star ratings from users
  • Implicit signals: Track acceptance, engagement, and behavioral patterns
  • Production insights: Learn what actually works for real users
  • Dataset curation: Use highly-rated responses for training examples
// Submit user feedback
await fetch(`https://api.helicone.ai/v1/request/${requestId}/feedback`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${HELICONE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    rating: true  // true = positive, false = negative
  })
});
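For implicit signals, you can reduce behavioral events to the boolean rating the feedback endpoint expects. The signal names below are assumptions — instrument whatever your UI actually exposes — and the sketch reports nothing when the evidence is ambiguous:

```python
from typing import Optional

def infer_rating(copied: bool, regenerated: bool, edited_heavily: bool) -> Optional[bool]:
    """Map behavioral signals to an implicit thumbs up/down.

    Signal names are assumptions about your own instrumentation.
    Returns None when the evidence is ambiguous, so you only
    report confident ratings.
    """
    if regenerated or edited_heavily:
        return False  # user rejected or heavily reworked the response
    if copied:
        return True   # user accepted the response as-is
    return None       # not enough signal to report
```

Only submit feedback when the function returns a non-None value.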

Common Evaluation Patterns

RAG Evaluation with RAGAS

Evaluate retrieval-augmented generation for accuracy and groundedness:
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
import requests

# Run RAGAS evaluation
result = evaluate(dataset, metrics=[Faithfulness(), ResponseRelevancy()])

# Report to Helicone
response = requests.post(
    f"https://api.helicone.ai/v1/request/{request_id}/score",
    headers={"Authorization": f"Bearer {HELICONE_API_KEY}"},
    json={
        "scores": {
            "faithfulness": int(result['faithfulness'] * 100),
            "relevancy": int(result['answer_relevancy'] * 100)
        }
    }
)
View full RAGAS integration guide →

Replace Expensive Models

Use production logs from premium models to fine-tune cheaper alternatives:

  1. Log premium model outputs: Start logging successful requests from GPT-4, Claude Sonnet, or other expensive models.
  2. Create task-specific datasets: Filter and curate examples for specific use cases (support, extraction, generation).
  3. Fine-tune smaller models: Export JSONL and train GPT-4o-mini, Gemini Flash, or other cost-effective models.
  4. Evaluate performance: Compare the fine-tuned model against the original using consistent evaluation datasets.
  5. Deploy and iterate: Continuously collect examples to improve the fine-tuned model.
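The JSONL export in step 3 should contain one chat-format record per example. As a sketch, each logged request/response pair becomes a line like the following (the system prompt here is a placeholder for your own):

```python
import json

def to_finetune_record(user_prompt: str, assistant_reply: str,
                       system_prompt: str = "You are a helpful support agent.") -> str:
    """Serialize one request/response pair as a chat-format JSONL line
    suitable for fine-tuning. The default system prompt is a placeholder."""
    record = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": assistant_reply},
        ]
    }
    return json.dumps(record)

# Write one line per example to dataset.jsonl for training.
```

Each line is an independent JSON object, so the file can be streamed and appended to as new curated examples arrive.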

Continuous Improvement Pipeline

Build a data flywheel for ongoing model improvement:
  1. Tag production traffic with custom properties for segmentation
  2. Score automatically using evaluation frameworks or LLM-as-judge
  3. Collect user feedback through explicit ratings and implicit signals
  4. Filter top performers by combining scores and feedback ratings
  5. Auto-curate datasets with requests meeting quality thresholds
  6. Retrain periodically with new high-quality examples
  7. A/B test improvements before full deployment
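Steps 4 and 5 above — filtering top performers and auto-curating datasets — amount to a quality gate over your logged requests. A minimal sketch, assuming you have pulled requests into dicts with `id`, `scores`, and `feedback` fields (the shape is an assumption about your own bookkeeping, not a Helicone API response):

```python
def select_for_dataset(requests, min_score=85, require_positive_feedback=True):
    """Return IDs of requests that clear the quality bar.

    Each item is expected to look like
    {"id": ..., "scores": {...}, "feedback": True/False/None};
    this shape is an assumption about how you store your own logs.
    """
    selected = []
    for req in requests:
        scores = req.get("scores", {})
        if not scores or min(scores.values()) < min_score:
            continue  # at least one metric below threshold
        if require_positive_feedback and req.get("feedback") is not True:
            continue  # no explicit positive rating
        selected.append(req["id"])
    return selected
```

The returned IDs can then be passed as `requestIds` to the dataset-creation endpoint shown earlier.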

Integration Examples

import openai
import os
import requests
import uuid

# Make request through Helicone
client = openai.OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"
    }
)

request_id = str(uuid.uuid4())
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    extra_headers={"Helicone-Request-Id": request_id}
)

# Evaluate the response (evaluate_response is your own scoring logic)
score = evaluate_response(response.choices[0].message.content)

# Report score to Helicone
requests.post(
    f"https://api.helicone.ai/v1/request/{request_id}/score",
    headers={"Authorization": f"Bearer {os.environ['HELICONE_API_KEY']}"},
    json={"scores": {"quality": score}}
)

# Collect user feedback (get_user_feedback is your own UI hook)
user_liked = get_user_feedback()
requests.post(
    f"https://api.helicone.ai/v1/request/{request_id}/feedback",
    headers={"Authorization": f"Bearer {os.environ['HELICONE_API_KEY']}"},
    json={"rating": user_liked}
)

Best Practices

Start Small

Begin with 50-100 carefully curated examples rather than thousands of uncurated ones

Focus on Tasks

Create task-specific datasets and metrics instead of general-purpose evaluations

Combine Signals

Use automated scores AND user feedback for comprehensive quality assessment

Iterate Continuously

Build evaluation into your development workflow, not just during initial testing

Track Over Time

Monitor metrics across deployments to catch regressions early

Test Before Deploy

Evaluate prompt or model changes against consistent test sets
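The "Test Before Deploy" practice can be enforced as a simple gate: run the candidate prompt or model over the same evaluation set as the current baseline and block the rollout if any metric drops. A sketch with an assumed tolerance of 2 points:

```python
def passes_regression_gate(baseline: dict, candidate: dict, tolerance: int = 2) -> bool:
    """True if the candidate's mean score on every metric stays within
    `tolerance` points of the baseline mean. Both dicts map metric name
    to a list of per-example integer scores over the same eval set."""
    for metric, base_scores in baseline.items():
        cand_scores = candidate.get(metric)
        if not cand_scores:
            return False  # candidate was not scored on this metric
        base_mean = sum(base_scores) / len(base_scores)
        cand_mean = sum(cand_scores) / len(cand_scores)
        if cand_mean < base_mean - tolerance:
            return False  # regression beyond tolerance
    return True
```

Wire this into CI so a prompt or model change cannot ship unless the gate passes on the consistent test set.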

Next Steps

Datasets

Create datasets from production traffic

Scores

Track evaluation metrics and performance

Feedback

Collect user satisfaction signals

RAGAS Integration

Evaluate RAG applications with RAGAS

Experiments

Compare different configurations

API Reference

View API documentation

Evaluation is not a one-time task—it’s an ongoing process. Start with basic metrics, build datasets from production, and continuously improve based on real-world performance.
