Helicone Scores let you report evaluation results from any framework (RAGAS, LangSmith, custom evaluations) for centralized observability. Track accuracy, hallucination rates, helpfulness, and custom metrics across all your LLM applications.
Helicone doesn't run evaluations for you; it provides a centralized place to report and analyze evaluation results from any framework, giving you unified observability across all your evaluation metrics.

Why Use Scores

Centralize Evaluation Results

Report scores from any evaluation framework for unified monitoring and analysis

Track Performance Over Time

Visualize how accuracy, hallucination rates, and other metrics evolve

Compare Experiments

Evaluate different prompts, models, or configurations with consistent metrics

Catch Regressions

Monitor metric trends to detect when changes negatively impact quality

Quick Start

1. Make a request and capture the ID

Make your LLM request through Helicone and capture the request ID:
import OpenAI from "openai";
import { randomUUID } from "crypto";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Use custom request ID for tracking
const requestId = randomUUID();

const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Explain quantum computing" }],
  },
  {
    headers: { "Helicone-Request-Id": requestId },
  }
);

2. Run your evaluation

Use your evaluation framework or custom logic to assess the response:
// Example: Custom evaluation logic
const scores = {
  accuracy: evaluateAccuracy(response),      // Returns 0-100
  hallucination: detectHallucination(response), // Returns 0-100
  helpfulness: rateHelpfulness(response),    // Returns 0-100
  is_safe: checkSafety(response)             // Returns boolean
};

3. Report scores to Helicone

Send evaluation results using the Helicone API:
await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    scores: {
      accuracy: 92,
      hallucination: 5,
      helpfulness: 88,
      is_safe: true
    },
  }),
});

4. View analytics

Analyze evaluation results in the Helicone dashboard to track performance trends, compare experiments, and identify areas for improvement.
Scores are processed with a 10-minute delay by default for analytics aggregation.

API Format

Request Structure

The scores API expects this format:
POST https://api.helicone.ai/v1/request/{requestId}/score

{
  "scores": {
    "metric_name": number | boolean,
    "another_metric": number | boolean
  }
}

Score Values

| Type    | Description                     | Example     |
|---------|---------------------------------|-------------|
| integer | Numeric scores (no decimals)    | 92, 85, 0   |
| boolean | Pass/fail or true/false metrics | true, false |
Float values like 0.92 are rejected. Convert to integers by multiplying by 100:
  • 0.92 → ✅ 92
  • 0.08 → ✅ 8
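Since only integers are accepted, a small helper can handle the conversion uniformly. This is a minimal sketch (the `toHeliconeScore` function is illustrative, not part of any Helicone SDK):

```typescript
// Convert a 0-1 float score (e.g. from RAGAS) to the 0-100 integer
// Helicone expects. Math.round avoids float truncation artifacts
// (0.92 * 100 is 92.00000000000001 in IEEE 754), and the clamp keeps
// out-of-range inputs within 0-100.
function toHeliconeScore(value: number): number {
  const clamped = Math.min(Math.max(value, 0), 1);
  return Math.round(clamped * 100);
}

console.log(toHeliconeScore(0.92)); // 92
console.log(toHeliconeScore(0.08)); // 8
```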

Multiple Scores

You can report multiple metrics in a single API call:
const scores = {
  // RAG metrics
  faithfulness: 95,
  answer_relevancy: 88,
  context_precision: 92,
  
  // Quality metrics
  accuracy: 90,
  completeness: 85,
  clarity: 93,
  
  // Safety metrics
  is_safe: true,
  is_appropriate: true,
  contains_pii: false,
  
  // Performance metrics
  response_time_ms: 1250,
  token_efficiency: 87
};

await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ scores }),
});

Integration Examples

RAGAS (RAG Evaluation)

Evaluate retrieval-augmented generation for accuracy and hallucination:
import requests
from ragas import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision
from datasets import Dataset

def evaluate_rag_response(question, answer, contexts, request_id, ground_truth=None):
    # Initialize RAGAS metrics
    metrics = [
        Faithfulness(),
        AnswerRelevancy(),
        ContextPrecision()
    ]
    
    # Create dataset in RAGAS format
    data = {
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],
        "ground_truth": [ground_truth]  # optional, used by some metrics
    }
    dataset = Dataset.from_dict(data)
    
    # Run evaluation
    result = evaluate(dataset, metrics=metrics)
    
    # Report to Helicone (convert 0-1 to 0-100)
    response = requests.post(
        f"https://api.helicone.ai/v1/request/{request_id}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "scores": {
                "faithfulness": int(result.get('faithfulness', 0) * 100),
                "answer_relevancy": int(result.get('answer_relevancy', 0) * 100),
                "context_precision": int(result.get('context_precision', 0) * 100)
            }
        }
    )
    
    return result

# Example usage
scores = evaluate_rag_response(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["France is a country in Europe. Paris is its capital."],
    request_id="your-request-id-here"
)
View full RAGAS integration guide →

LLM-as-Judge

Use a strong model to evaluate responses from another model:
async function evaluateWithLLMJudge(
  prompt: string,
  response: string,
  requestId: string
) {
  const judgePrompt = `
Evaluate the following AI assistant response on these criteria (0-100):
- Accuracy: Is the information correct?
- Helpfulness: Does it address the user's question?
- Clarity: Is it clear and well-structured?
- Safety: Is it safe and appropriate?

User Question: ${prompt}
Assistant Response: ${response}

Respond in JSON format:
{
  "accuracy": number,
  "helpfulness": number,
  "clarity": number,
  "safety": number,
  "reasoning": "brief explanation"
}
`;

  const judgeResponse = await openai.chat.completions.create({
    model: "gpt-4o",  // Use strong model as judge
    messages: [{ role: "user", content: judgePrompt }],
    response_format: { type: "json_object" }
  });

  const evaluation = JSON.parse(judgeResponse.choices[0].message.content ?? "{}");

  // Report scores to Helicone
  await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.HELICONE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      scores: {
        accuracy: evaluation.accuracy,
        helpfulness: evaluation.helpfulness,
        clarity: evaluation.clarity,
        safety: evaluation.safety
      },
    }),
  });

  return evaluation;
}

Custom Evaluation Logic

Implement domain-specific evaluation metrics:
// Code generation evaluation
async function evaluateCodeGeneration(
  generatedCode: string,
  requestId: string
) {
  const scores = {
    // Syntax validity
    syntax_valid: await validateSyntax(generatedCode) ? 100 : 0,
    
    // Test pass rate (0-100)
    test_pass_rate: await runTests(generatedCode),
    
    // Code quality metrics
    complexity: 100 - calculateCyclomaticComplexity(generatedCode),
    readability: assessReadability(generatedCode),
    
    // Security checks
    security_score: await runSecurityScan(generatedCode),
    
    // Boolean flags
    follows_style_guide: checkStyleGuide(generatedCode),
    has_documentation: hasDocStrings(generatedCode)
  };

  await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.HELICONE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ scores }),
  });

  return scores;
}

Automated Evaluation Pipeline

Automatically evaluate all requests using webhooks:
// Set up webhook to trigger evaluation
app.post('/webhook/helicone', async (req, res) => {
  const { requestId, response, model } = req.body;

  // Run evaluation asynchronously
  evaluateRequest(requestId, response, model).catch(console.error);

  res.status(200).send('OK');
});

async function evaluateRequest(
  requestId: string,
  response: any,
  model: string
) {
  // Extract response text
  const text = response.choices?.[0]?.message?.content;
  if (!text) return;

  // Run multiple evaluation methods
  const [ragScore, safetyScore, qualityScore] = await Promise.all([
    evaluateRAG(text),
    evaluateSafety(text),
    evaluateQuality(text)
  ]);

  // Combine scores (score values must be integers or booleans; track the
  // model name with a custom property on the request, not as a score)
  const scores = {
    ...ragScore,
    ...safetyScore,
    ...qualityScore
  };

  // Report to Helicone
  await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.HELICONE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ scores }),
  });
}

Viewing and Analyzing Scores

Dashboard Analytics

Helicone provides several ways to analyze your scores:
  1. Request-level scores: View scores for individual requests in the request detail page
  2. Aggregate metrics: See average, min, and max scores across all requests
  3. Score distributions: Understand the spread of scores with histogram visualizations
  4. Time-based trends: Track how scores change over time
  5. Filtering: Filter requests by score ranges (e.g., accuracy > 90)

Querying Scores via API

Retrieve score analytics programmatically:
// Get all score names
const scoresResponse = await fetch(
  'https://api.helicone.ai/v1/evals/scores',
  {
    headers: {
      'Authorization': `Bearer ${HELICONE_API_KEY}`
    }
  }
);
const scoreNames = await scoresResponse.json();
console.log('Available scores:', scoreNames);

// Query score distributions
const distributionResponse = await fetch(
  'https://api.helicone.ai/v1/evals/score-distributions/query',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${HELICONE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      filter: 'all',
      timeFilter: {
        start: '2024-01-01T00:00:00Z',
        end: '2024-12-31T23:59:59Z'
      }
    })
  }
);
const distributions = await distributionResponse.json();

Use Cases

RAG Application Monitoring

Track retrieval-augmented generation quality over time:
# Evaluate every RAG request
for request in production_requests:
    # Run RAGAS evaluation
    result = evaluate_rag(
        question=request.question,
        answer=request.answer,
        contexts=request.contexts
    )
    
    # Report to Helicone
    report_scores(request.id, {
        'faithfulness': int(result['faithfulness'] * 100),
        'answer_relevancy': int(result['answer_relevancy'] * 100),
        'context_recall': int(result['context_recall'] * 100)
    })

# Analyze trends in dashboard
# - Are hallucinations increasing?
# - Is retrieval quality improving?
# - Which queries have low scores?

Model Comparison

Compare different models on the same evaluation dataset:
const models = ['gpt-4o', 'gpt-4o-mini', 'claude-3-5-sonnet'];
const testQuestions = [...]; // Your eval dataset

for (const model of models) {
  for (const question of testQuestions) {
    // Make request
    const response = await makeRequest(model, question);
    
    // Evaluate
    const score = await evaluate(response);
    
    // Report the score (track the model as a custom property on the
    // request itself, since score values must be integers or booleans)
    await reportScore(response.id, {
      accuracy: score
    });
  }
}

// Compare in dashboard:
// - Filter by model property
// - View average scores per model
// - Identify which model performs best

A/B Testing

Test prompt changes before full rollout:
// Split traffic between old and new prompt
const useNewPrompt = Math.random() < 0.5;
const prompt = useNewPrompt ? NEW_PROMPT : OLD_PROMPT;

const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
  },
  {
    headers: {
      'Helicone-Property-PromptVersion': useNewPrompt ? 'v2' : 'v1'
    }
  }
);

// Evaluate both versions
const score = await evaluate(response);
await reportScore(response.id, { accuracy: score });

// After collecting data:
// - Filter by PromptVersion property
// - Compare average scores
// - Roll out winning version

Best Practices

Use Consistent Metrics

Define standard metrics across your team and use them consistently

Convert Decimals

Always convert decimal scores (0-1) to integers (0-100) before reporting

Name Clearly

Use descriptive score names like answer_relevancy not score1

Track Context

Use custom properties to segment scores by feature, model, or experiment

Automate Evaluation

Set up automated evaluation pipelines rather than manual scoring

Monitor Trends

Track scores over time to catch quality regressions early
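
Several of these practices can be combined in one small shared module: a fixed set of metric names plus a single reporting helper that rejects non-integer values before they reach the API. This is an illustrative sketch (the metric names and the `reportScores` function are examples, not part of any SDK); it wraps the score endpoint shown in the Quick Start, and request context is tracked separately via `Helicone-Property-*` headers as in the A/B testing example.

```typescript
// Sketch: one shared module for metric names and score reporting, so the
// whole team uses identical names. The metric names below are examples.
const METRICS = {
  accuracy: "accuracy",
  answerRelevancy: "answer_relevancy",
  isSafe: "is_safe",
} as const;

type MetricName = (typeof METRICS)[keyof typeof METRICS];
type Scores = Partial<Record<MetricName, number | boolean>>;

// Hypothetical wrapper around POST /v1/request/{requestId}/score.
// Rejects non-integer numbers early, since the API accepts only
// integers and booleans.
async function reportScores(requestId: string, scores: Scores): Promise<void> {
  for (const [name, value] of Object.entries(scores)) {
    if (typeof value === "number" && !Number.isInteger(value)) {
      throw new Error(`Score "${name}" must be an integer, got ${value}`);
    }
  }
  const res = await fetch(
    `https://api.helicone.ai/v1/request/${requestId}/score`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ scores }),
    }
  );
  if (!res.ok) throw new Error(`Score report failed: ${res.status}`);
}
```

Because the score object is typed against `METRICS`, a typo like `score1` fails to compile instead of silently creating a new metric in the dashboard.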

API Reference

Key Endpoints

| Endpoint                            | Method | Description                 |
|-------------------------------------|--------|-----------------------------|
| /v1/request/{requestId}/score       | POST   | Submit scores for a request |
| /v1/evals/scores                    | GET    | Get all score names         |
| /v1/evals/query                     | POST   | Query evaluation data       |
| /v1/evals/score-distributions/query | POST   | Get score distributions     |
View full API documentation →

Datasets

Create evaluation datasets from scored production traffic

Feedback

Combine automated scores with user feedback for comprehensive quality assessment

Experiments

Compare different configurations with consistent scoring

Custom Properties

Segment scores by feature, model, or experiment

Scores provide objective measurement of LLM response quality. Start with simple metrics like accuracy or helpfulness, then expand to framework-specific evaluations as your needs grow.
