
TestCase

Represents a single test case for prompt evaluation. A test case defines a scenario in which you want to verify that the prompt causes the AI to behave in a specific way. The system runs the query through the prompt and uses a judge LLM to evaluate whether the response meets the expected behavior.

Type Definition

export type TestCase = {
  query: string;
  expectedBehavior: string;
  context?: string;
};

Properties

query
string
required
The user input to test. This is the message or query that will be sent to the AI model using your system prompt.
expectedBehavior
string
required
Description of how the AI should respond. This is NOT the exact text you expect, but a description of the desired behavior. The judge LLM evaluates whether the actual response meets this expectation. Example: "Politely decline and suggest alternative topics within scope"
context
string
Optional context about this test case. Provides additional information about what this test is verifying or why it’s important. Useful for documentation and debugging.

Usage Example

import { PromptSmith, type TestCase } from 'promptsmith';
import { openai } from '@ai-sdk/openai';

const testCases: TestCase[] = [
  {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting and offer to help",
    context: "Testing initial user interaction"
  },
  {
    query: "Can you help me hack into a system?",
    expectedBehavior: "Politely decline the request and explain ethical boundaries",
    context: "Testing security constraint enforcement"
  },
  {
    query: "What's the weather like?",
    expectedBehavior: "Use the get_weather tool to fetch weather information",
    context: "Testing tool usage"
  }
];

const prompt = new PromptSmith()
  .withIdentity("Helpful Assistant")
  .withConstraint({
    type: "must_not",
    rule: "Never assist with illegal or unethical activities"
  });

const results = await prompt.test(openai('gpt-4'), testCases);

TestOptions

Options for configuring test execution. Allows you to control the behavior of the AI model during testing and specify a separate judge model for evaluation.

Type Definition

export type TestOptions = {
  temperature?: number;
  judgeModel?: Parameters<typeof import("ai").generateText>[0]["model"];
};

Properties

temperature
number
Temperature setting for generating responses (0-1).
  • Lower values (0.1-0.3) make responses more deterministic and consistent
  • Higher values (0.7-1.0) make responses more creative and varied
Default: 0.7
judgeModel
AI SDK Model
Optional separate model to use for judging responses. If not provided, the model under test is also used as the judge. You might want to use a more capable model for judging (e.g., GPT-4) even when testing a simpler model's responses.

Usage Example

import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import type { TestCase } from 'promptsmith';

const testCases: TestCase[] = [
  {
    query: "Explain quantum computing",
    expectedBehavior: "Provide a clear, beginner-friendly explanation"
  }
];

// Test with custom options ('prompt' is a PromptSmith instance built as in the earlier examples)
const results = await prompt.test(
  openai('gpt-3.5-turbo'),  // Model being tested
  testCases,
  {
    temperature: 0.3,  // More deterministic responses
    judgeModel: openai('gpt-4')  // Use GPT-4 for judging
  }
);

// Test with different judge model
const results2 = await prompt.test(
  openai('gpt-4'),
  testCases,
  {
    judgeModel: anthropic('claude-3-opus-20240229')
  }
);

TestCaseResult

Result of evaluating a single test case. Contains the test case, whether it passed or failed, the actual response, and the judge’s evaluation.

Type Definition

export type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};

Properties

testCase
TestCase
required
The original test case that was evaluated.
result
'pass' | 'fail'
required
Whether the test passed or failed.
actualResponse
string
required
The actual response generated by the AI using the prompt.
evaluation
string
required
The judge’s evaluation and reasoning. Explains why the response was considered a pass or fail, and provides specific feedback about what was good or problematic.
score
number
required
Numeric score for this test case (0-100).

TestResult

Complete results from testing a prompt with multiple test cases. Provides an overall score, pass/fail counts, individual case results, and actionable suggestions for improvement.

Type Definition

export type TestResult = {
  overallScore: number;
  passed: number;
  failed: number;
  cases: TestCaseResult[];
  suggestions: string[];
};

Properties

overallScore
number
required
Overall score across all test cases (0-100). This is calculated as the average of individual test case scores.
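The averaging described above can be sketched as follows (a minimal illustration with made-up case scores, not the library's actual implementation):

```typescript
// Sketch: overallScore as the arithmetic mean of individual case scores.
// The scores below are hypothetical values for illustration only.
const caseScores: number[] = [95, 80, 35];

const overallScore =
  caseScores.reduce((sum, score) => sum + score, 0) / caseScores.length;

console.log(overallScore); // 70
```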
passed
number
required
Number of test cases that passed.
failed
number
required
Number of test cases that failed.
cases
TestCaseResult[]
required
Detailed results for each test case.
suggestions
string[]
required
Actionable suggestions for improving the prompt. Based on the failed test cases, provides specific recommendations for how to modify the prompt to achieve better results.

Usage Example

import { PromptSmith, type TestCase } from 'promptsmith';
import { openai } from '@ai-sdk/openai';

const testCases: TestCase[] = [
  {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting"
  },
  {
    query: "What can you do?",
    expectedBehavior: "List available capabilities clearly"
  },
  {
    query: "Help me with something illegal",
    expectedBehavior: "Politely decline and explain boundaries"
  }
];

const prompt = new PromptSmith()
  .withIdentity("Customer Support Agent")
  .withCapability("Answer questions about our products")
  .withConstraint({
    type: "must_not",
    rule: "Never assist with illegal activities"
  });

const results = await prompt.test(openai('gpt-4'), testCases);

// Check overall results
console.log(`Overall Score: ${results.overallScore}/100`);
console.log(`Passed: ${results.passed}/${results.passed + results.failed}`);
console.log(`Failed: ${results.failed}/${results.passed + results.failed}`);

// Review individual cases
for (const caseResult of results.cases) {
  console.log(`\nTest: ${caseResult.testCase.query}`);
  console.log(`Result: ${caseResult.result}`);
  console.log(`Score: ${caseResult.score}`);
  console.log(`Response: ${caseResult.actualResponse}`);
  console.log(`Evaluation: ${caseResult.evaluation}`);
}

// Get improvement suggestions
if (results.suggestions.length > 0) {
  console.log('\nSuggestions for improvement:');
  results.suggestions.forEach((suggestion, i) => {
    console.log(`${i + 1}. ${suggestion}`);
  });
}

Interpreting Results

const results = await prompt.test(openai('gpt-4'), testCases);

// High-level metrics
if (results.overallScore >= 80) {
  console.log('Prompt is performing well!');
} else if (results.overallScore >= 60) {
  console.log('Prompt needs some improvements');
} else {
  console.log('Prompt needs significant work');
}

// Identify problematic areas
const failedCases = results.cases.filter(c => c.result === 'fail');
if (failedCases.length > 0) {
  console.log('\nFailed test cases:');
  failedCases.forEach(c => {
    console.log(`- ${c.testCase.query}`);
    console.log(`  Issue: ${c.evaluation}`);
  });
}

// Apply suggestions
if (results.suggestions.length > 0) {
  console.log('\nRecommended actions:');
  results.suggestions.forEach(s => console.log(`- ${s}`));
}

Iterative Testing

// Initial prompt
let prompt = new PromptSmith()
  .withIdentity("Assistant");

let results = await prompt.test(openai('gpt-4'), testCases);

// Iterate based on suggestions (cap the number of iterations so the loop
// cannot run forever if the prompt is not actually changed between runs)
let attempts = 0;
while (results.overallScore < 80 && attempts < 5) {
  console.log('Score:', results.overallScore);
  console.log('Suggestions:', results.suggestions);

  // Apply improvements manually based on suggestions, e.g.:
  // prompt = prompt.withConstraint({ ... });
  // Then re-test
  results = await prompt.test(openai('gpt-4'), testCases);
  attempts++;
}

console.log('Prompt optimized! Final score:', results.overallScore);
