Evaluate PAS2’s hallucination detection accuracy using benchmark datasets
The PAS2 library includes a benchmarking tool for evaluating detection accuracy against labeled datasets. This helps you measure performance and tune parameters for your use case.
Your benchmark dataset should be in JSON Lines format (.jsonl). Each line is one JSON object with an ID, a user_query, and a hallucination label ("yes" or "no"):
```json
{"ID": "sample_1", "user_query": "Who was the first person to land on the moon?", "hallucination": "no"}
{"ID": "sample_2", "user_query": "What is the capital of France?", "hallucination": "no"}
{"ID": "sample_3", "user_query": "How many planets are in our solar system?", "hallucination": "no"}
```
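Before running a long benchmark, it can help to check that every line of the dataset has the expected shape. The following is a minimal sketch, not part of PAS2: the helper name `validate_jsonl_lines` and the constants are hypothetical, and the accepted labels are assumed to be "yes" and "no" as used by the benchmark script.

```python
import json

# Assumed schema, taken from the sample lines above.
REQUIRED_KEYS = {"ID", "user_query", "hallucination"}
VALID_LABELS = {"yes", "no"}


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed samples."""
    problems = []
    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append((line_num, f"invalid JSON: {e.msg}"))
            continue
        if not isinstance(sample, dict):
            problems.append((line_num, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            problems.append((line_num, f"missing keys: {sorted(missing)}"))
            continue
        label = str(sample["hallucination"]).strip().lower()
        if label not in VALID_LABELS:
            problems.append(
                (line_num, f"unexpected label: {sample['hallucination']!r}")
            )
    return problems
```

To check a file, pass the open file handle, for example `validate_jsonl_lines(open(path, encoding="utf-8"))`. An empty list means no problems were found.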
The benchmark script processes each sample through the PAS2 detection pipeline:
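```python
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# PAS2 must be imported from the PAS2 library (import omitted in this excerpt).


def run_hallucination_evaluator(json_file: str, num_samples: Optional[int] = None):
    pas2 = PAS2()  # Initialize the PAS2 library

    # Load data from the JSON Lines file, one sample per non-empty line
    data = []
    with open(json_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if line:
                try:
                    sample = json.loads(line)
                    data.append(sample)
                    # Stop early once the requested number of samples is loaded
                    if num_samples and len(data) >= num_samples:
                        break
                except json.JSONDecodeError as e:
                    logger.error(f"Error decoding JSON on line {line_num}: {e}")
```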
Each sample is evaluated using the PAS2 detection method:
```python
    # Continues inside run_hallucination_evaluator
    correct_detections = 0

    for idx, sample in enumerate(data):
        sample_id = sample.get('ID')
        user_query = sample.get('user_query')
        true_label = sample.get('hallucination')  # 'yes' or 'no'

        # Use the PAS2 library to detect hallucination
        hallucinated, initial_response, all_questions, all_responses = pas2.detect_hallucination(
            user_query,
            n_paraphrases=5,
            similarity_threshold=0.9,
            match_percentage_threshold=0.7
        )

        # Convert 'yes'/'no' to boolean for comparison
        true_hallucinated = true_label.strip().lower() == 'yes'

        # Compare the detected hallucination with the true label
        if hallucinated == true_hallucinated:
            correct_detections += 1
```
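The loop above only counts correct detections, which is enough to compute accuracy. If you also record each (predicted, true) pair, you can derive precision and recall as well. The sketch below is one possible way to do that. The function name `summarize_detections` is hypothetical and not part of PAS2, and "positive" is assumed to mean "hallucination detected".

```python
def summarize_detections(pairs):
    """Summarize (predicted_hallucinated, true_hallucinated) boolean pairs.

    Returns accuracy, precision, and recall, treating "hallucinated" as the
    positive class. Ratios with an empty denominator are reported as 0.0.
    """
    tp = fp = tn = fn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    return {
        "total": total,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

In the benchmark loop, you could append `(hallucinated, true_hallucinated)` to a list for each sample and pass that list to the function once the loop finishes.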
The legacy benchmark uses older parameter names (similarity_threshold, match_percentage_threshold) that are not present in the current PAS2 implementation. When adapting this code, use the current API with n_paraphrases only.
Add delays between API calls to respect rate limits:
```python
import time

for sample in data:
    result = pas2.detect_hallucination(sample['user_query'])
    # Process result...

    # Wait 1 second between samples
    time.sleep(1)
```
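A fixed `time.sleep(1)` adds a full second on top of however long each detection call takes. As an alternative, the sketch below enforces a minimum interval between the start of consecutive calls, so time spent in the call itself counts toward the delay. The `Pacer` class is a hypothetical helper, not part of PAS2, and the interval you need depends on your provider's rate limits.

```python
import time


class Pacer:
    """Ensure at least `min_interval` seconds between consecutive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._last = None    # time of the previous wait(), None before first use

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

To use it, create `pacer = Pacer(1.0)` before the loop and call `pacer.wait()` at the top of each iteration, before the detection call.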