Overview
The get_metrics.py script calculates standard classification metrics to evaluate the performance of the NL2FOL system on logical fallacy detection tasks.
How It Works
Load Results CSV
Reads the results file generated by fol_to_cvc.py containing predictions.
Extract Labels and Predictions
Ground truth labels: label column (0=fallacy, 1=valid)
Predictions: result column (“LF”=fallacy, “Valid”=valid)
Convert to Numerical Format
Maps categorical results to binary format:
“LF” → 0 (fallacy)
“Valid” → 1 (valid)
Empty/Error → 1 (conservative fallback)
Calculate Metrics
Computes accuracy, precision, recall, and F1 score using scikit-learn.
Display Results
Prints the metrics as a tuple: (accuracy, precision, recall, f1)
Command Usage
Basic Command
```bash
python3 eval/get_metrics.py <path_to_results_csv>
```
Parameters
filename (required): Path to the CSV file containing results from fol_to_cvc.py.
Expected format: results/<run_name>_results.csv
Required columns:
label - Ground truth (0 or 1)
result - Prediction (“LF”, “Valid”, or empty)
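For reference, a minimal results file might look like the rows below (the rows are invented for illustration; real files from fol_to_cvc.py may contain additional columns):

```python
import io
import pandas as pd

# Invented sample rows illustrating the two required columns; an empty
# 'result' cell represents a row where processing failed.
csv_text = """label,result
0,LF
1,Valid
0,
1,Valid
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df['label'].tolist())    # [0, 1, 0, 1]
print(df['result'].tolist())   # ['LF', 'Valid', nan, 'Valid']
```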
Example Usage
Basic Usage

```bash
python3 eval/get_metrics.py results/llama_experiment_1_results.csv
```

Multiple Experiments

```bash
for f in results/*_results.csv; do
    echo "$f"
    python3 eval/get_metrics.py "$f"
done
```

Save to File

```bash
python3 eval/get_metrics.py results/llama_experiment_1_results.csv > metrics.txt
```
The script outputs a tuple of four metrics:
(accuracy, precision, recall, f1_score)
Example Output:
(0.8421052631578947, 0.8333333333333334, 0.8695652173913043, 0.8510638297872342)
Interpreted as:
Accuracy: 84.21%
Precision: 83.33%
Recall: 86.96%
F1 Score: 85.11%
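If percentages are preferred over the raw tuple, a small snippet (not part of the script itself) can format the output:

```python
# Format the metric tuple as percentages (values from the example above)
metrics = (0.8421052631578947, 0.8333333333333334,
           0.8695652173913043, 0.8510638297872342)

for name, value in zip(('Accuracy', 'Precision', 'Recall', 'F1 Score'), metrics):
    print(f'{name}: {value:.2%}')
```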
Metrics Explained
Accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)
Percentage of correct predictions (both valid arguments and fallacies).
Interpretation: Overall correctness of the system.
Precision
precision = TP / (TP + FP)
Of all arguments marked as fallacies, what percentage are actually fallacies?
Interpretation: How reliable are fallacy detections? High precision means few false alarms.
Recall (Sensitivity)
recall = TP / (TP + FN)
Of all actual fallacies, what percentage are detected?
Interpretation: How many fallacies are caught? High recall means few fallacies slip through.
F1 Score
f1 = 2 * (precision * recall) / (precision + recall)
Harmonic mean of precision and recall.
Interpretation: Balanced measure of overall performance.
F1 score is particularly useful when classes are imbalanced or when both false positives and false negatives are equally important.
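As a sanity check, the four formulas can be verified on a small synthetic example (counts chosen arbitrarily: TP=44, FN=6, FP=8, TN=42, with fallacy as the positive class):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic labels realizing TP=44, FN=6, FP=8, TN=42 (1=fallacy, 0=valid)
y_true = np.array([1] * 50 + [0] * 50)
y_pred = np.array([1] * 44 + [0] * 6 + [1] * 8 + [0] * 42)

print(accuracy_score(y_true, y_pred))   # (44+42)/100 = 0.86
print(precision_score(y_true, y_pred))  # 44/52 ≈ 0.846
print(recall_score(y_true, y_pred))     # 44/50 = 0.88
print(f1_score(y_true, y_pred))         # 2*44/(2*44+8+6) ≈ 0.863
```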
Label Encoding
The script inverts labels because it evaluates fallacy detection (not validity detection):
```python
# Original labels
label = df['label']  # 0=fallacy, 1=valid

# Predictions
preds = pd.Categorical(
    df['result'],
    categories=['LF', 'Valid']
).codes  # 0=LF (fallacy), 1=Valid

preds = np.where(preds == -1, 1, preds)  # Handle missing values

# Invert for fallacy detection
get_results(1 - label, 1 - preds)
```
After inversion:
1 = Fallacy
0 = Valid argument
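The encoding and inversion can be traced on a few invented rows:

```python
import numpy as np
import pandas as pd

# Toy predictions (values invented); None simulates a processing error
df = pd.DataFrame({'result': ['LF', 'Valid', None, 'LF']})

preds = pd.Categorical(df['result'], categories=['LF', 'Valid']).codes
print(preds.tolist())        # [0, 1, -1, 0] -- -1 marks the missing value

preds = np.where(preds == -1, 1, preds)
print(preds.tolist())        # [0, 1, 1, 0] -- missing treated as Valid

print((1 - preds).tolist())  # [1, 0, 0, 1] -- 1=fallacy, 0=valid
```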
Implementation Details
Core Function
```python
def get_results(label, preds):
    acc = accuracy_score(label, preds)
    prec = precision_score(label, preds)
    rec = recall_score(label, preds)
    f1 = f1_score(label, preds)
    return acc, prec, rec, f1
```
Full Script Structure
```python
import numpy as np
import pandas as pd
import argparse
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score
)

def get_results(label, preds):
    acc = accuracy_score(label, preds)
    prec = precision_score(label, preds)
    rec = recall_score(label, preds)
    f1 = f1_score(label, preds)
    return acc, prec, rec, f1

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Evaluate model performance.'
    )
    parser.add_argument(
        'filename',
        type=str,
        help='Path to the CSV file'
    )
    args = parser.parse_args()

    df = pd.read_csv(args.filename)
    label = df['label']
    preds = pd.Categorical(
        df['result'],
        categories=['LF', 'Valid']
    ).codes
    preds = np.where(preds == -1, 1, preds)

    print(get_results(1 - label, 1 - preds))
```
Understanding Results
High Recall, Low Precision
Scenario: System flags fallacies aggressively, marking many valid arguments as fallacies.
Accuracy: 90%
Precision: 50%
Recall: 95%
Interpretation: Catches most fallacies but raises many false alarms.
High Precision, Low Recall
Scenario: System only flags obvious fallacies.
Accuracy: 85%
Precision: 95%
Recall: 60%
Interpretation: When it flags a fallacy, it’s usually correct, but misses many fallacies.
Balanced Performance
Accuracy: 85%
Precision: 84%
Recall: 86%
F1: 85%
Interpretation: Well-balanced system with consistent performance.
For fallacy detection in educational contexts, prioritize recall (don’t miss fallacies). For automated content moderation, prioritize precision (avoid false accusations).
Handling Missing Values
Empty predictions (from processing errors) are treated as “Valid”:
```python
preds = np.where(preds == -1, 1, preds)
```
This conservative approach:
Avoids false fallacy accusations
Penalizes recall (increases false negatives)
Reflects real-world system behavior
Alternative: Filter out missing values before evaluation:
```python
# Remove rows with missing predictions
df_clean = df[df['result'].notna() & (df['result'] != '')]
label = df_clean['label']
preds = pd.Categorical(df_clean['result'], categories=['LF', 'Valid']).codes
```
Confusion Matrix Analysis
Extend the script for detailed analysis:
```python
from sklearn.metrics import confusion_matrix, classification_report

# Confusion matrix (after inversion: 0=Valid, 1=Fallacy)
cm = confusion_matrix(1 - label, 1 - preds)
print("\nConfusion Matrix:")
print("                Predicted")
print("                Valid  Fallacy")
print(f"Actual Valid    {cm[0, 0]:5d}  {cm[0, 1]:5d}")
print(f"Actual Fallacy  {cm[1, 0]:5d}  {cm[1, 1]:5d}")

# Detailed report
print("\nClassification Report:")
print(classification_report(
    1 - label,
    1 - preds,
    target_names=['Valid', 'Fallacy']
))
```
Example Output:

```
Confusion Matrix:
                Predicted
                Valid  Fallacy
Actual Valid       42      8
Actual Fallacy      6     44

Classification Report:
              precision    recall  f1-score   support

       Valid       0.88      0.84      0.86        50
     Fallacy       0.85      0.88      0.86        50

    accuracy                           0.86       100
   macro avg       0.86      0.86      0.86       100
weighted avg       0.86      0.86      0.86       100
```
Per-Dataset Analysis
Compare performance across datasets:
```python
import numpy as np
import pandas as pd
from get_metrics import get_results

datasets = ['logic', 'logicclimate', 'folio']
results = []

for dataset in datasets:
    df = pd.read_csv(f'results/{dataset}_results.csv')
    label = df['label']
    preds = pd.Categorical(df['result'], categories=['LF', 'Valid']).codes
    preds = np.where(preds == -1, 1, preds)
    metrics = get_results(1 - label, 1 - preds)
    results.append({
        'Dataset': dataset,
        'Accuracy': metrics[0],
        'Precision': metrics[1],
        'Recall': metrics[2],
        'F1': metrics[3]
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
```
Statistical Significance
For comparing models, use bootstrap confidence intervals:
```python
import numpy as np
from sklearn.utils import resample

def bootstrap_ci(y_true, y_pred, metric_func, n_iterations=1000):
    y_true = np.asarray(y_true)  # ensure positional indexing below
    y_pred = np.asarray(y_pred)
    scores = []
    for _ in range(n_iterations):
        indices = resample(range(len(y_true)), n_samples=len(y_true))
        scores.append(metric_func(y_true[indices], y_pred[indices]))
    lower = np.percentile(scores, 2.5)
    upper = np.percentile(scores, 97.5)
    return lower, upper

# Calculate 95% CI for F1 score
lower, upper = bootstrap_ci(
    1 - label,
    1 - preds,
    f1_score
)
print(f"F1 Score: {f1_score(1 - label, 1 - preds):.3f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```
Visualization
Create visual reports:
```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Bar chart of metrics
metrics = get_results(1 - label, 1 - preds)
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1']

plt.figure(figsize=(10, 6))
plt.bar(metric_names, metrics)
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('Model Performance Metrics')
plt.grid(axis='y', alpha=0.3)
plt.savefig('metrics.png')
plt.show()

# Confusion matrix heatmap
cm = confusion_matrix(1 - label, 1 - preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Valid', 'Fallacy'],
            yticklabels=['Valid', 'Fallacy'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
plt.show()
```
Export Results
Save metrics to file:
```python
import json

metrics = get_results(1 - label, 1 - preds)
results_dict = {
    'accuracy': float(metrics[0]),
    'precision': float(metrics[1]),
    'recall': float(metrics[2]),
    'f1_score': float(metrics[3])
}

with open('metrics.json', 'w') as f:
    json.dump(results_dict, f, indent=2)
```
Benchmarking
Compare against baseline methods:
```python
# Random baseline
import numpy as np
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy.fit(np.zeros((len(label), 1)), label)
baseline_preds = dummy.predict(np.zeros((len(label), 1)))

print("Baseline (Random):")
print(get_results(1 - label, 1 - baseline_preds))
print("\nNL2FOL System:")
print(get_results(1 - label, 1 - preds))
```
Error Analysis
Identify problem cases:
```python
# Find false positives and false negatives (fallacy = positive class)
df['pred_inverted'] = 1 - preds   # 1=predicted fallacy, 0=predicted valid
df['label_inverted'] = 1 - label  # 1=actual fallacy, 0=actual valid

false_positives = df[
    (df['pred_inverted'] == 1) & (df['label_inverted'] == 0)
]
false_negatives = df[
    (df['pred_inverted'] == 0) & (df['label_inverted'] == 1)
]

print(f"False Positives: {len(false_positives)}")
print(f"False Negatives: {len(false_negatives)}")

# Export for analysis
false_positives.to_csv('false_positives.csv', index=False)
false_negatives.to_csv('false_negatives.csv', index=False)
```
Common Pitfalls
Label Confusion: Ensure consistent encoding (0/1 vs LF/Valid)
Missing Values: Decide how to handle processing errors
Class Imbalance: Consider using balanced accuracy or weighted F1
Data Leakage: Never evaluate on training data
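To illustrate the class-imbalance point, compare plain accuracy with balanced accuracy on an invented skewed dataset where a trivial classifier predicts "valid" everywhere:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 90 valid (0) vs 10 fallacies (1); trivial model predicts all valid
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))           # 0.9 -- looks deceptively good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- exposes the trivial model
print(f1_score(y_true, y_pred, average='weighted', zero_division=0))
```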
Next Steps
After evaluating performance:
Error Analysis: Review false positives and false negatives
Prompt Refinement: Adjust prompts based on error patterns
Model Comparison: Test different LLMs or NLI models
Dataset Expansion: Evaluate on diverse fallacy types
Ablation Studies: Test impact of each pipeline component
See Development Guide for tips on improving the system.