TestCase
Represents a single test case for prompt evaluation. A test case defines a scenario where you want to verify that the prompt causes the AI to behave in a specific way.
The system will run the query through the prompt and evaluate if the response meets the expected behavior using a judge LLM.
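To make the judge step concrete, here is a hypothetical sketch of the kind of evaluation prompt a judge LLM might receive for each test case. `buildJudgePrompt` is not part of the promptsmith API; it only illustrates the flow described above.

```typescript
// Hypothetical illustration only: promptsmith does not export buildJudgePrompt.
// It sketches the kind of prompt a judge LLM receives for each test case.
type TestCase = {
  query: string;
  expectedBehavior: string;
  context?: string;
};

function buildJudgePrompt(testCase: TestCase, actualResponse: string): string {
  return [
    `User query: ${testCase.query}`,
    `Expected behavior: ${testCase.expectedBehavior}`,
    `Actual response: ${actualResponse}`,
    `Did the response meet the expected behavior?`,
    `Answer "pass" or "fail" and explain your reasoning.`
  ].join('\n');
}

const judgePrompt = buildJudgePrompt(
  { query: "Hello!", expectedBehavior: "Respond with a friendly greeting" },
  "Hi there! How can I help you today?"
);
console.log(judgePrompt.split('\n').length); // 5
```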
Type Definition
export type TestCase = {
  query: string;
  expectedBehavior: string;
  context?: string;
};
Properties
query
The user input to test. This is the message or query that will be sent to the AI model using your system prompt.
expectedBehavior
Description of how the AI should respond. This is NOT the exact text you expect, but rather a description of the desired behavior. The judge LLM will evaluate whether the actual response meets this expectation. Example: "Politely decline and suggest alternative topics within scope"
context
Optional context about this test case. Provides additional information about what this test is verifying or why it's important. Useful for documentation and debugging.
Usage Example
import { PromptSmith, type TestCase } from 'promptsmith';
import { openai } from '@ai-sdk/openai';

const testCases: TestCase[] = [
  {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting and offer to help",
    context: "Testing initial user interaction"
  },
  {
    query: "Can you help me hack into a system?",
    expectedBehavior: "Politely decline the request and explain ethical boundaries",
    context: "Testing security constraint enforcement"
  },
  {
    query: "What's the weather like?",
    expectedBehavior: "Use the get_weather tool to fetch weather information",
    context: "Testing tool usage"
  }
];
const prompt = new PromptSmith()
  .withIdentity("Helpful Assistant")
  .withConstraint({
    type: "must_not",
    rule: "Never assist with illegal or unethical activities"
  });
const results = await prompt.test(openai('gpt-4'), testCases);
TestOptions
Options for configuring test execution. Allows you to control the behavior of the AI model during testing and specify a separate judge model for evaluation.
Type Definition
export type TestOptions = {
  temperature?: number;
  judgeModel?: Parameters<typeof import("ai").generateText>[0]["model"];
};
Properties
temperature
Temperature setting for generating responses (0-1).
- Lower values (0.1-0.3) make responses more deterministic and consistent
- Higher values (0.7-1.0) make responses more creative and varied
Default: 0.7
judgeModel
Optional separate model to use for judging responses. If not provided, the model being tested is also used as the judge. You might want to use a more capable model for judging (e.g., GPT-4) even if you're testing a simpler model's responses.
Usage Example
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { PromptSmith, type TestCase } from 'promptsmith';

const prompt = new PromptSmith().withIdentity("Helpful Assistant");

const testCases: TestCase[] = [
  {
    query: "Explain quantum computing",
    expectedBehavior: "Provide a clear, beginner-friendly explanation"
  }
];

// Test with custom options
const results = await prompt.test(
  openai('gpt-3.5-turbo'), // Model being tested
  testCases,
  {
    temperature: 0.3, // More deterministic responses
    judgeModel: openai('gpt-4') // Use GPT-4 for judging
  }
);

// Test with a different judge model
const results2 = await prompt.test(
  openai('gpt-4'),
  testCases,
  {
    judgeModel: anthropic('claude-3-opus-20240229')
  }
);
TestCaseResult
Result of evaluating a single test case. Contains the test case, whether it passed or failed, the actual response, and the judge’s evaluation.
Type Definition
export type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};
Properties
testCase
The original test case that was evaluated.
result
Whether the test passed or failed.
actualResponse
The actual response generated by the AI using the prompt.
evaluation
The judge's evaluation and reasoning. Explains why the response was considered a pass or fail, and provides specific feedback about what was good or problematic.
score
Numeric score for this test case (0-100).
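A concrete result object helps show how these fields fit together. The values below are invented for illustration; only the shape follows the type definition above.

```typescript
// Field values below are invented for illustration; the shape matches TestCaseResult.
type TestCase = { query: string; expectedBehavior: string; context?: string };
type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};

const example: TestCaseResult = {
  testCase: {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting and offer to help"
  },
  result: "pass",
  actualResponse: "Hi there! How can I help you today?",
  evaluation: "The response greets the user warmly and offers assistance.",
  score: 95
};

console.log(`${example.result} (${example.score}/100)`); // pass (95/100)
```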
TestResult
Complete results from testing a prompt with multiple test cases. Provides an overall score, pass/fail counts, individual case results, and actionable suggestions for improvement.
Type Definition
export type TestResult = {
  overallScore: number;
  passed: number;
  failed: number;
  cases: TestCaseResult[];
  suggestions: string[];
};
Properties
overallScore
Overall score across all test cases (0-100). This is calculated as the average of individual test case scores.
passed
Number of test cases that passed.
failed
Number of test cases that failed.
cases
Detailed results for each test case.
suggestions
Actionable suggestions for improving the prompt. Based on the failed test cases, provides specific recommendations for how to modify the prompt to achieve better results.
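The relationship between cases and the summary fields can be sketched as follows. `summarize` is a hypothetical helper, not part of the promptsmith API; it mirrors the documented definitions (overallScore is the average of individual scores, passed/failed are counts).

```typescript
type TestCase = { query: string; expectedBehavior: string; context?: string };
type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};

// Hypothetical aggregation mirroring the documented fields: overallScore is
// the average of individual scores; passed and failed are simple counts.
function summarize(cases: TestCaseResult[]) {
  const overallScore =
    cases.reduce((sum, c) => sum + c.score, 0) / Math.max(cases.length, 1);
  const passed = cases.filter(c => c.result === "pass").length;
  return { overallScore, passed, failed: cases.length - passed };
}

const demo = summarize([
  { testCase: { query: "Hi", expectedBehavior: "Greet" }, result: "pass", actualResponse: "Hello!", evaluation: "Friendly greeting", score: 90 },
  { testCase: { query: "Hack?", expectedBehavior: "Decline" }, result: "fail", actualResponse: "Sure...", evaluation: "Did not decline", score: 30 }
]);
console.log(demo); // { overallScore: 60, passed: 1, failed: 1 }
```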
Usage Example
import { openai } from '@ai-sdk/openai';
import { PromptSmith, type TestCase } from 'promptsmith';

const testCases: TestCase[] = [
  {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting"
  },
  {
    query: "What can you do?",
    expectedBehavior: "List available capabilities clearly"
  },
  {
    query: "Help me with something illegal",
    expectedBehavior: "Politely decline and explain boundaries"
  }
];

const prompt = new PromptSmith()
  .withIdentity("Customer Support Agent")
  .withCapability("Answer questions about our products")
  .withConstraint({
    type: "must_not",
    rule: "Never assist with illegal activities"
  });
const results = await prompt.test(openai('gpt-4'), testCases);
// Check overall results
console.log(`Overall Score: ${results.overallScore}/100`);
console.log(`Passed: ${results.passed}/${results.passed + results.failed}`);
console.log(`Failed: ${results.failed}/${results.passed + results.failed}`);
// Review individual cases
for (const caseResult of results.cases) {
  console.log(`\nTest: ${caseResult.testCase.query}`);
  console.log(`Result: ${caseResult.result}`);
  console.log(`Score: ${caseResult.score}`);
  console.log(`Response: ${caseResult.actualResponse}`);
  console.log(`Evaluation: ${caseResult.evaluation}`);
}

// Get improvement suggestions
if (results.suggestions.length > 0) {
  console.log('\nSuggestions for improvement:');
  results.suggestions.forEach((suggestion, i) => {
    console.log(`${i + 1}. ${suggestion}`);
  });
}
Interpreting Results
const results = await prompt.test(openai('gpt-4'), testCases);
// High-level metrics
if (results.overallScore >= 80) {
  console.log('Prompt is performing well!');
} else if (results.overallScore >= 60) {
  console.log('Prompt needs some improvements');
} else {
  console.log('Prompt needs significant work');
}

// Identify problematic areas
const failedCases = results.cases.filter(c => c.result === 'fail');
if (failedCases.length > 0) {
  console.log('\nFailed test cases:');
  failedCases.forEach(c => {
    console.log(`- ${c.testCase.query}`);
    console.log(`  Issue: ${c.evaluation}`);
  });
}

// Apply suggestions
if (results.suggestions.length > 0) {
  console.log('\nRecommended actions:');
  results.suggestions.forEach(s => console.log(`- ${s}`));
}
Iterative Testing
// Initial prompt
let prompt = new PromptSmith()
  .withIdentity("Assistant");

let results = await prompt.test(openai('gpt-4'), testCases);

// Iterate based on suggestions (capped so the loop cannot run forever)
let attempts = 0;
while (results.overallScore < 80 && attempts < 5) {
  console.log('Score:', results.overallScore);
  console.log('Suggestions:', results.suggestions);
  // Apply improvements manually based on suggestions
  // Then re-test
  results = await prompt.test(openai('gpt-4'), testCases);
  attempts++;
}
console.log('Prompt optimized! Final score:', results.overallScore);