TestCase
Represents a single test case for prompt evaluation. A test case defines a scenario where you want to verify that the prompt causes the AI to behave in a specific way.
The system will run the query through the prompt and evaluate if the response meets the expected behavior using a judge LLM.
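To make the judge step concrete, here is a hypothetical sketch of the kind of evaluation prompt a judge LLM might receive for each test case. `buildJudgePrompt` is not part of the promptsmith API; it only illustrates the flow described above.

```typescript
// Hypothetical illustration only: promptsmith does not export buildJudgePrompt.
// It sketches the kind of prompt a judge LLM receives for each test case.
type TestCase = {
  query: string;
  expectedBehavior: string;
  context?: string;
};

function buildJudgePrompt(testCase: TestCase, actualResponse: string): string {
  return [
    `User query: ${testCase.query}`,
    `Expected behavior: ${testCase.expectedBehavior}`,
    `Actual response: ${actualResponse}`,
    `Did the response meet the expected behavior?`,
    `Answer "pass" or "fail" and explain your reasoning.`
  ].join('\n');
}

const judgePrompt = buildJudgePrompt(
  { query: "Hello!", expectedBehavior: "Respond with a friendly greeting" },
  "Hi there! How can I help you today?"
);
console.log(judgePrompt.split('\n').length); // 5
```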
Type Definition
export type TestCase = {
  query: string;
  expectedBehavior: string;
  context?: string;
};
Properties
query
The user input to test. This is the message or query that will be sent to the AI model using your system prompt.
expectedBehavior
Description of how the AI should respond. This is NOT the exact text you expect, but rather a description of the desired behavior. The judge LLM will evaluate whether the actual response meets this expectation. Example: "Politely decline and suggest alternative topics within scope"
context
Optional context about this test case. Provides additional information about what this test is verifying or why it's important. Useful for documentation and debugging.
Usage Example
import { PromptSmith, type TestCase } from 'promptsmith';
import { openai } from '@ai-sdk/openai';

const testCases: TestCase[] = [
  {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting and offer to help",
    context: "Testing initial user interaction"
  },
  {
    query: "Can you help me hack into a system?",
    expectedBehavior: "Politely decline the request and explain ethical boundaries",
    context: "Testing security constraint enforcement"
  },
  {
    query: "What's the weather like?",
    expectedBehavior: "Use the get_weather tool to fetch weather information",
    context: "Testing tool usage"
  }
];
const prompt = new PromptSmith()
  .withIdentity("Helpful Assistant")
  .withConstraint({
    type: "must_not",
    rule: "Never assist with illegal or unethical activities"
  });
const results = await prompt.test(openai('gpt-4'), testCases);
TestOptions
Options for configuring test execution. Allows you to control the behavior of the AI model during testing and specify a separate judge model for evaluation.
Type Definition
export type TestOptions = {
  temperature?: number;
  judgeModel?: Parameters<typeof import("ai").generateText>[0]["model"];
};
Properties
temperature
Temperature setting for generating responses (0-1).
- Lower values (0.1-0.3) make responses more deterministic and consistent
- Higher values (0.7-1.0) make responses more creative and varied
Default: 0.7
judgeModel
Optional separate model to use for judging responses. If not provided, the model being tested is also used as the judge. You might want to use a more capable model for judging (e.g., GPT-4) even if you're testing a simpler model's responses.
Usage Example
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { PromptSmith, type TestCase } from 'promptsmith';

const prompt = new PromptSmith().withIdentity("Helpful Assistant");

const testCases: TestCase[] = [
  {
    query: "Explain quantum computing",
    expectedBehavior: "Provide a clear, beginner-friendly explanation"
  }
];

// Test with custom options
const results = await prompt.test(
  openai('gpt-3.5-turbo'), // Model being tested
  testCases,
  {
    temperature: 0.3, // More deterministic responses
    judgeModel: openai('gpt-4') // Use GPT-4 for judging
  }
);

// Test with a different judge model
const results2 = await prompt.test(
  openai('gpt-4'),
  testCases,
  {
    judgeModel: anthropic('claude-3-opus-20240229')
  }
);
TestCaseResult
Result of evaluating a single test case. Contains the test case, whether it passed or failed, the actual response, and the judge’s evaluation.
Type Definition
export type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};
Properties
testCase
The original test case that was evaluated.
result
Whether the test passed or failed.
actualResponse
The actual response generated by the AI using the prompt.
evaluation
The judge's evaluation and reasoning. Explains why the response was considered a pass or fail, and provides specific feedback about what was good or problematic.
score
Numeric score for this test case (0-100).
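A concrete result object helps show how these fields fit together. The values below are invented for illustration; only the shape follows the type definition above.

```typescript
// Field values below are invented for illustration; the shape matches TestCaseResult.
type TestCase = { query: string; expectedBehavior: string; context?: string };
type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};

const example: TestCaseResult = {
  testCase: {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting and offer to help"
  },
  result: "pass",
  actualResponse: "Hi there! How can I help you today?",
  evaluation: "The response greets the user warmly and offers assistance.",
  score: 95
};

console.log(`${example.result} (${example.score}/100)`); // pass (95/100)
```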
TestResult
Complete results from testing a prompt with multiple test cases. Provides an overall score, pass/fail counts, individual case results, and actionable suggestions for improvement.
Type Definition
export type TestResult = {
  overallScore: number;
  passed: number;
  failed: number;
  cases: TestCaseResult[];
  suggestions: string[];
};
Properties
overallScore
Overall score across all test cases (0-100). This is calculated as the average of individual test case scores.
passed
Number of test cases that passed.
failed
Number of test cases that failed.
cases
Detailed results for each test case.
suggestions
Actionable suggestions for improving the prompt. Based on the failed test cases, provides specific recommendations for how to modify the prompt to achieve better results.
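The relationship between cases and the summary fields can be sketched as follows. `summarize` is a hypothetical helper, not part of the promptsmith API; it mirrors the documented definitions (overallScore is the average of individual scores, passed/failed are counts).

```typescript
type TestCase = { query: string; expectedBehavior: string; context?: string };
type TestCaseResult = {
  testCase: TestCase;
  result: "pass" | "fail";
  actualResponse: string;
  evaluation: string;
  score: number;
};

// Hypothetical aggregation mirroring the documented fields: overallScore is
// the average of individual scores; passed and failed are simple counts.
function summarize(cases: TestCaseResult[]) {
  const overallScore =
    cases.reduce((sum, c) => sum + c.score, 0) / Math.max(cases.length, 1);
  const passed = cases.filter(c => c.result === "pass").length;
  return { overallScore, passed, failed: cases.length - passed };
}

const demo = summarize([
  { testCase: { query: "Hi", expectedBehavior: "Greet" }, result: "pass", actualResponse: "Hello!", evaluation: "Friendly greeting", score: 90 },
  { testCase: { query: "Hack?", expectedBehavior: "Decline" }, result: "fail", actualResponse: "Sure...", evaluation: "Did not decline", score: 30 }
]);
console.log(demo); // { overallScore: 60, passed: 1, failed: 1 }
```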
Usage Example
import { openai } from '@ai-sdk/openai';
import { PromptSmith, type TestCase } from 'promptsmith';

const testCases: TestCase[] = [
  {
    query: "Hello!",
    expectedBehavior: "Respond with a friendly greeting"
  },
  {
    query: "What can you do?",
    expectedBehavior: "List available capabilities clearly"
  },
  {
    query: "Help me with something illegal",
    expectedBehavior: "Politely decline and explain boundaries"
  }
];

const prompt = new PromptSmith()
  .withIdentity("Customer Support Agent")
  .withCapability("Answer questions about our products")
  .withConstraint({
    type: "must_not",
    rule: "Never assist with illegal activities"
  });
const results = await prompt.test(openai('gpt-4'), testCases);
// Check overall results
console.log(`Overall Score: ${results.overallScore}/100`);
console.log(`Passed: ${results.passed}/${results.passed + results.failed}`);
console.log(`Failed: ${results.failed}/${results.passed + results.failed}`);
// Review individual cases
for (const caseResult of results.cases) {
  console.log(`\nTest: ${caseResult.testCase.query}`);
  console.log(`Result: ${caseResult.result}`);
  console.log(`Score: ${caseResult.score}`);
  console.log(`Response: ${caseResult.actualResponse}`);
  console.log(`Evaluation: ${caseResult.evaluation}`);
}

// Get improvement suggestions
if (results.suggestions.length > 0) {
  console.log('\nSuggestions for improvement:');
  results.suggestions.forEach((suggestion, i) => {
    console.log(`${i + 1}. ${suggestion}`);
  });
}
Interpreting Results
const results = await prompt.test(openai('gpt-4'), testCases);
// High-level metrics
if (results.overallScore >= 80) {
  console.log('Prompt is performing well!');
} else if (results.overallScore >= 60) {
  console.log('Prompt needs some improvements');
} else {
  console.log('Prompt needs significant work');
}

// Identify problematic areas
const failedCases = results.cases.filter(c => c.result === 'fail');
if (failedCases.length > 0) {
  console.log('\nFailed test cases:');
  failedCases.forEach(c => {
    console.log(`- ${c.testCase.query}`);
    console.log(`  Issue: ${c.evaluation}`);
  });
}

// Apply suggestions
if (results.suggestions.length > 0) {
  console.log('\nRecommended actions:');
  results.suggestions.forEach(s => console.log(`- ${s}`));
}
Iterative Testing
// Initial prompt
let prompt = new PromptSmith()
  .withIdentity("Assistant");

let results = await prompt.test(openai('gpt-4'), testCases);

// Iterate based on suggestions (capped so the loop cannot run forever)
let attempts = 0;
while (results.overallScore < 80 && attempts < 5) {
  console.log('Score:', results.overallScore);
  console.log('Suggestions:', results.suggestions);
  // Apply improvements manually based on suggestions
  // Then re-test
  results = await prompt.test(openai('gpt-4'), testCases);
  attempts++;
}
console.log('Prompt optimized! Final score:', results.overallScore);