
Overview

The get_metrics.py script calculates standard classification metrics to evaluate the performance of the NL2FOL system on logical fallacy detection tasks.

How It Works

  1. Load Results CSV: Reads the results file generated by fol_to_cvc.py containing predictions.
  2. Extract Labels and Predictions:
     • Ground truth labels: label column (0=fallacy, 1=valid)
     • Predictions: result column (“LF”=fallacy, “Valid”=valid)
  3. Convert to Numerical Format: Maps categorical results to binary format:
     • “LF” → 0 (fallacy)
     • “Valid” → 1 (valid)
     • Empty/Error → 1 (conservative fallback)
  4. Calculate Metrics: Computes accuracy, precision, recall, and F1 score using scikit-learn.
  5. Display Results: Prints the metrics as a tuple: (accuracy, precision, recall, f1)

Command Usage

Basic Command

python3 eval/get_metrics.py <path_to_results_csv>

Parameters

filename (string, required)
Path to the CSV file containing results from fol_to_cvc.py. Expected format: results/<run_name>_results.csv. Required columns:
  • label - Ground truth (0 or 1)
  • result - Prediction (“LF”, “Valid”, or empty)
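For reference, a results file with the required columns might look like this (the rows below are illustrative, not real fol_to_cvc.py output):

```python
import io
import pandas as pd

# Illustrative results CSV; real files are produced by fol_to_cvc.py
csv_text = """label,result
0,LF
1,Valid
0,Valid
1,
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())   # ['label', 'result']
print(len(df))               # 4 rows; the empty result is read as NaN
```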

Example Usage

python3 eval/get_metrics.py results/llama_experiment_1_results.csv

Output Format

The script outputs a tuple of four metrics:
(accuracy, precision, recall, f1_score)
Example Output:
(0.8421052631578947, 0.8333333333333334, 0.8695652173913043, 0.8510638297872342)
Interpreted as:
  • Accuracy: 84.21%
  • Precision: 83.33%
  • Recall: 86.96%
  • F1 Score: 85.11%
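The raw tuple can be turned into the percentage view above with a small helper (format_metrics is a hypothetical convenience function, not part of get_metrics.py):

```python
def format_metrics(metrics):
    """Format an (accuracy, precision, recall, f1) tuple as percentages."""
    names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
    return {name: f"{value:.2%}" for name, value in zip(names, metrics)}

metrics = (0.8421052631578947, 0.8333333333333334,
           0.8695652173913043, 0.8510638297872342)
print(format_metrics(metrics))
# {'Accuracy': '84.21%', 'Precision': '83.33%', 'Recall': '86.96%', 'F1 Score': '85.11%'}
```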

Metrics Explained

Accuracy

accuracy = (TP + TN) / (TP + TN + FP + FN)
Percentage of correct predictions (both valid arguments and fallacies). Interpretation: Overall correctness of the system.

Precision

precision = TP / (TP + FP)
Of all arguments marked as fallacies, what percentage are actually fallacies? Interpretation: How reliable are fallacy detections? High precision means few false alarms.

Recall (Sensitivity)

recall = TP / (TP + FN)
Of all actual fallacies, what percentage are detected? Interpretation: How many fallacies are caught? High recall means few fallacies slip through.

F1 Score

f1 = 2 * (precision * recall) / (precision + recall)
Harmonic mean of precision and recall. Interpretation: Balanced measure of overall performance.
F1 score is particularly useful when classes are imbalanced or when both false positives and false negatives are equally important.
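The four formulas can be sanity-checked against scikit-learn on a small synthetic example (counts chosen to match the confusion matrix shown later on this page: TP=44, TN=42, FP=8, FN=6, with fallacy as the positive class):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Build label/prediction arrays realizing TP=44, TN=42, FP=8, FN=6
TP, TN, FP, FN = 44, 42, 8, 6
y_true = np.array([1] * TP + [0] * TN + [0] * FP + [1] * FN)
y_pred = np.array([1] * TP + [0] * TN + [1] * FP + [0] * FN)

prec = TP / (TP + FP)
rec = TP / (TP + FN)

assert abs(accuracy_score(y_true, y_pred) - (TP + TN) / (TP + TN + FP + FN)) < 1e-12
assert abs(precision_score(y_true, y_pred) - prec) < 1e-12
assert abs(recall_score(y_true, y_pred) - rec) < 1e-12
assert abs(f1_score(y_true, y_pred) - 2 * prec * rec / (prec + rec)) < 1e-12
print("all four formulas match scikit-learn")
```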

Label Encoding

The script inverts labels because it evaluates fallacy detection (not validity detection):
# Original labels
label = df['label']           # 0=fallacy, 1=valid

# Predictions
preds = pd.Categorical(
    df['result'], 
    categories=['LF', 'Valid']
).codes                        # 0=LF (fallacy), 1=Valid
preds = np.where(preds == -1, 1, preds)  # Handle missing values

# Invert for fallacy detection
get_results(1 - label, 1 - preds)
After inversion:
  • 1 = Fallacy
  • 0 = Valid argument
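The encoding and the inversion can be verified on a toy frame (rows are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 0], 'result': ['LF', 'Valid', '']})

preds = pd.Categorical(df['result'], categories=['LF', 'Valid']).codes
print(list(preds))            # [0, 1, -1]: '' is not a category, so its code is -1
preds = np.where(preds == -1, 1, preds)

# After inversion, 1 = fallacy and 0 = valid
print(list(1 - df['label']))  # [1, 0, 1]
print(list(1 - preds))        # [1, 0, 0]: the missing prediction counts as valid
```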

Implementation Details

Core Function

def get_results(label, preds):
    acc = accuracy_score(label, preds)
    prec = precision_score(label, preds)
    rec = recall_score(label, preds)
    f1 = f1_score(label, preds)
    return acc, prec, rec, f1

Full Script Structure

import numpy as np
import pandas as pd
import argparse
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score
)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Evaluate model performance.'
    )
    parser.add_argument(
        'filename', 
        type=str, 
        help='Path to the CSV file'
    )
    args = parser.parse_args()
    
    df = pd.read_csv(args.filename)
    label = df['label']
    preds = pd.Categorical(
        df['result'], 
        categories=['LF','Valid']
    ).codes
    preds = np.where(preds == -1, 1, preds)
    
    print(get_results(1 - label, 1 - preds))

Understanding Results

High Accuracy, Low Precision

Scenario: System flags fallacies aggressively on a mostly-valid dataset.
Accuracy: 90%
Precision: 50%
Recall: 95%
Interpretation: Catches most fallacies but has many false positives.

High Precision, Low Recall

Scenario: System only flags obvious fallacies.
Accuracy: 85%
Precision: 95%
Recall: 60%
Interpretation: When it flags a fallacy, it’s usually correct, but misses many fallacies.

Balanced Performance

Accuracy: 85%
Precision: 84%
Recall: 86%
F1: 85%
Interpretation: Well-balanced system with consistent performance.
For fallacy detection in educational contexts, prioritize recall (don’t miss fallacies). For automated content moderation, prioritize precision (avoid false accusations).

Handling Missing Values

Empty predictions (from processing errors) are treated as “Valid”:
preds = np.where(preds == -1, 1, preds)
This conservative approach:
  • Avoids false fallacy accusations
  • Penalizes recall (increases false negatives)
  • Reflects real-world system behavior
Alternative: Filter out missing values before evaluation:
# Remove rows with missing predictions
df_clean = df[df['result'].notna() & (df['result'] != '')]
label = df_clean['label']
preds = pd.Categorical(df_clean['result'], categories=['LF','Valid']).codes

Confusion Matrix Analysis

Extend the script for detailed analysis:
from sklearn.metrics import confusion_matrix, classification_report

# Confusion matrix
cm = confusion_matrix(1 - label, 1 - preds)
print("\nConfusion Matrix:")
print("                Predicted")
print("                Valid  Fallacy")
print(f"Actual Valid    {cm[0,0]:5d}  {cm[0,1]:5d}")
print(f"Actual Fallacy  {cm[1,0]:5d}  {cm[1,1]:5d}")

# Detailed report
print("\nClassification Report:")
print(classification_report(
    1 - label, 
    1 - preds,
    target_names=['Valid', 'Fallacy']
))
Example Output:
Confusion Matrix:
                Predicted
                Valid  Fallacy
Actual Valid       42      8
Actual Fallacy      6     44

Classification Report:
              precision    recall  f1-score   support

       Valid       0.88      0.84      0.86        50
     Fallacy       0.85      0.88      0.86        50

    accuracy                           0.86       100
   macro avg       0.86      0.86      0.86       100
weighted avg       0.86      0.86      0.86       100

Per-Dataset Analysis

Compare performance across datasets:
import numpy as np
import pandas as pd
from get_metrics import get_results

datasets = ['logic', 'logicclimate', 'folio']
results = []

for dataset in datasets:
    df = pd.read_csv(f'results/{dataset}_results.csv')
    label = df['label']
    preds = pd.Categorical(df['result'], categories=['LF','Valid']).codes
    preds = np.where(preds == -1, 1, preds)
    
    metrics = get_results(1 - label, 1 - preds)
    results.append({
        'Dataset': dataset,
        'Accuracy': metrics[0],
        'Precision': metrics[1],
        'Recall': metrics[2],
        'F1': metrics[3]
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

Statistical Significance

For comparing models, use bootstrap confidence intervals:
from sklearn.utils import resample
import numpy as np

def bootstrap_ci(y_true, y_pred, metric_func, n_iterations=1000):
    scores = []
    for _ in range(n_iterations):
        indices = resample(np.arange(len(y_true)), n_samples=len(y_true))
        y_true_boot = y_true[indices]
        y_pred_boot = y_pred[indices]
        scores.append(metric_func(y_true_boot, y_pred_boot))
    
    lower = np.percentile(scores, 2.5)
    upper = np.percentile(scores, 97.5)
    return lower, upper

# Calculate 95% CI for F1 score
lower, upper = bootstrap_ci(
    1 - label, 
    1 - preds, 
    f1_score
)
print(f"F1 Score: {f1_score(1-label, 1-preds):.3f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")

Visualization

Create visual reports:
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of metrics
metrics = get_results(1 - label, 1 - preds)
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1']

plt.figure(figsize=(10, 6))
plt.bar(metric_names, metrics)
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('Model Performance Metrics')
plt.grid(axis='y', alpha=0.3)
plt.savefig('metrics.png')
plt.show()

# Confusion matrix heatmap
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(1 - label, 1 - preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Valid', 'Fallacy'],
            yticklabels=['Valid', 'Fallacy'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
plt.show()

Export Results

Save metrics to file:
import json

metrics = get_results(1 - label, 1 - preds)
results_dict = {
    'accuracy': float(metrics[0]),
    'precision': float(metrics[1]),
    'recall': float(metrics[2]),
    'f1_score': float(metrics[3])
}

with open('metrics.json', 'w') as f:
    json.dump(results_dict, f, indent=2)

Benchmarking

Compare against baseline methods:
# Random baseline
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy.fit(np.zeros((len(label), 1)), label)
baseline_preds = dummy.predict(np.zeros((len(label), 1)))

print("Baseline (Random):")
print(get_results(1 - label, 1 - baseline_preds))

print("\nNL2FOL System:")
print(get_results(1 - label, 1 - preds))

Error Analysis

Identify problem cases:
# Find false positives and false negatives
# Work in the fallacy-detection encoding (1 = fallacy, 0 = valid)
df['pred_inverted'] = 1 - preds
df['label_inverted'] = 1 - label

false_positives = df[
    (df['pred_inverted'] == 1) & (df['label_inverted'] == 0)
]
false_negatives = df[
    (df['pred_inverted'] == 0) & (df['label_inverted'] == 1)
]

print(f"False Positives: {len(false_positives)}")
print(f"False Negatives: {len(false_negatives)}")

# Export for analysis
false_positives.to_csv('false_positives.csv', index=False)
false_negatives.to_csv('false_negatives.csv', index=False)
Common Pitfalls
  1. Label Confusion: Ensure consistent encoding (0/1 vs LF/Valid)
  2. Missing Values: Decide how to handle processing errors
  3. Class Imbalance: Consider using balanced accuracy or weighted F1
  4. Data Leakage: Never evaluate on training data
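For pitfall 3, scikit-learn ships imbalance-aware variants. The sketch below uses a toy skewed dataset (values are illustrative) to show why plain accuracy can mislead:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Toy imbalanced data: 90 valid (0), 10 fallacies (1); a degenerate
# classifier that always predicts "valid" still scores 90% plain accuracy
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

print(f"Plain accuracy:    {np.mean(y_true == y_pred):.2f}")                # 0.90
print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")  # 0.50
print(f"Weighted F1:       {f1_score(y_true, y_pred, average='weighted', zero_division=0):.2f}")
```

Balanced accuracy averages per-class recall, so the all-valid classifier drops to 0.50; weighted F1 likewise penalizes the missed fallacy class.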

Next Steps

After evaluating performance:
  1. Error Analysis: Review false positives and false negatives
  2. Prompt Refinement: Adjust prompts based on error patterns
  3. Model Comparison: Test different LLMs or NLI models
  4. Dataset Expansion: Evaluate on diverse fallacy types
  5. Ablation Studies: Test impact of each pipeline component
See Development Guide for tips on improving the system.
