
Overview

This guide walks you through running the complete NL2FOL pipeline: translating natural language to first-order logic, converting the formulas to SMT format, and evaluating the results.

Basic Workflow

1. Convert Natural Language to First-Order Logic: use nl_to_fol.py to process a dataset and generate logical formulas.
2. Convert FOL to SMT and Run Solver: use fol_to_cvc.py to convert the formulas to SMT format and verify them.
3. Evaluate Results: use get_metrics.py to calculate accuracy, precision, recall, and F1 score.

Step 1: Natural Language to FOL

The main entry point is nl_to_fol.py, which processes datasets and generates first-order logic formulas.

Command Syntax

python3 src/nl_to_fol.py \
  --model_name <model_name> \
  --nli_model_name <nli_model_name> \
  --run_name <run_name> \
  --dataset <logic|logicclimate|nli|folio> \
  --length <number_of_samples>

Example: Using GPT-4o

python3 src/nl_to_fol.py \
  --model_name gpt-4o \
  --nli_model_name roberta-large-mnli \
  --run_name my_first_run \
  --dataset logic \
  --length 10
GPT models (starting with “gpt-”) use the OpenAI API. Ensure your OPENAI_API_KEY environment variable is set.

Example: Using Llama

python3 src/nl_to_fol.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --nli_model_name roberta-large-mnli \
  --run_name llama_run \
  --dataset logic \
  --length 10

Available Datasets

# General logical fallacies (3,227 examples)
python3 src/nl_to_fol.py \
  --model_name gpt-4o \
  --nli_model_name roberta-large-mnli \
  --run_name logic_test \
  --dataset logic \
  --length 50

Output

This generates a CSV file at results/<run_name>.csv with columns:
  • articles: Original natural language text
  • Claim: Extracted claim from the sentence
  • Implication: Extracted implication
  • Referring Expressions - Claim/Implication: Entity references
  • Property Implications: Property relationships
  • Equal Entities: Entities identified as equal
  • Subset Entities: Subset relationships between entities
  • Claim Lfs: Logical form for claim
  • Implication Lfs: Logical form for implication
  • Logical Form: Final combined formula (version 1)
  • Logical Form 2: Final combined formula (version 2)
  • label: Ground truth (0=fallacy, 1=valid)
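A quick way to sanity-check a finished run is to load the CSV and count labels. A minimal sketch using only the standard library; the two rows below are fabricated for illustration, not real pipeline output:

```python
import csv
import io

# Hypothetical two-row excerpt of results/<run_name>.csv
# (column subset shown for brevity; contents invented for illustration).
csv_text = """articles,Logical Form 2,label
"All birds can fly. Penguins are birds. Therefore, penguins can fly.","(exists a (canFly(a))) -> (forall b (canFly(b)))",0
"A sound argument.","(exists c (P(c)))",1
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
fallacies = [r for r in rows if r["label"] == "0"]  # csv reads labels as strings
print(f"{len(fallacies)} of {len(rows)} examples are labeled as fallacies")
```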

Step 2: FOL to SMT Verification

Convert the generated logical formulas to SMT format and verify them using CVC4/CVC5.

Command Syntax

python3 src/fol_to_cvc.py <run_name>

Example

python3 src/fol_to_cvc.py my_first_run
This will:
  1. Read the FOL formulas from results/my_first_run.csv
  2. Generate SMT-LIB files in results/my_first_run_smt/
  3. Run CVC4 on each formula
  4. Save results to results/my_first_run_results.csv
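The generated files check validity by refutation: the negation of the formula is asserted, and an unsatisfiable result means the original formula holds. A hypothetical SMT-LIB fragment for the penguin example (the sort and predicate declarations are illustrative, not the exact encoding the script emits):

```
(set-logic ALL)
(declare-fun canFly (Int) Bool)
; Assert the NEGATION of the candidate formula:
; if this is unsat, the original implication is valid.
(assert (not (=> (exists ((a Int)) (canFly a))
                 (forall ((b Int)) (canFly b)))))
(check-sat)
```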

SMT Results

The solver produces one of three results:

• Valid: the negated formula is unsatisfiable, so the argument is logically valid.
• LF: the negated formula is satisfiable (or the solver returns unknown), so the argument is flagged as containing a logical fallacy.
• Empty: a processing error occurred.
The code uses “Logical Form 2” by default, which applies different quantifier scoping heuristics than “Logical Form”.
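The mapping from solver output to verdict can be sketched as follows (the function name is ours, not part of the codebase):

```python
def classify(solver_output: str) -> str:
    """Map a CVC4/CVC5 result line to the verdict recorded in the results CSV.

    The solver checks the NEGATION of the formula: 'unsat' means the
    original formula is valid; 'sat' or 'unknown' flags a potential fallacy.
    """
    verdict = solver_output.strip()
    if verdict == "unsat":
        return "Valid"
    if verdict in ("sat", "unknown"):
        return "LF"
    return ""  # anything else: processing error, leave the cell empty

print(classify("unsat"))    # Valid
print(classify("unknown"))  # LF
```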

Step 3: Evaluate Results

Calculate performance metrics on the results.

Command Syntax

python3 eval/get_metrics.py <results_csv_path>

Example

python3 eval/get_metrics.py results/my_first_run_results.csv

Output Metrics

The script outputs:
(accuracy, precision, recall, f1_score)
Example output:
(0.85, 0.82, 0.88, 0.85)
This represents:
  • Accuracy: 85% of predictions are correct
  • Precision: 82% of predicted fallacies are actual fallacies
  • Recall: 88% of actual fallacies are detected
  • F1 Score: 0.85, the harmonic mean of precision and recall
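For reference, all four metrics can be recomputed from raw predictions. A self-contained sketch with toy labels; note that treating label 0 (fallacy) as the positive class is our assumption about get_metrics.py's convention:

```python
def metrics(y_true, y_pred, positive=0):
    """Accuracy, precision, recall, and F1, treating `positive`
    (here 0 = fallacy) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Toy example: 0 = fallacy, 1 = valid
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0]
acc, prec, rec, f1 = metrics(y_true, y_pred)
```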

Complete Example

Here’s a complete end-to-end example:
# Step 1: Generate FOL formulas from 20 examples
python3 src/nl_to_fol.py \
  --model_name gpt-4o \
  --nli_model_name roberta-large-mnli \
  --run_name quickstart_demo \
  --dataset logic \
  --length 20

# Step 2: Convert to SMT and verify
python3 src/fol_to_cvc.py quickstart_demo

# Step 3: Evaluate results
python3 eval/get_metrics.py results/quickstart_demo_results.csv
Processing can take several minutes per example due to multiple LLM calls. Start with small dataset sizes (10-20 examples) for testing.

Understanding the NL2FOL Class

The core logic is implemented in the NL2FOL class from src/nl_to_fol.py:15. Here’s how it works:

Initialization

from nl_to_fol import NL2FOL

nl2fol = NL2FOL(
    sentence="All birds can fly. Penguins are birds. Therefore, penguins can fly.",
    model_type='gpt',
    pipeline=None,
    tokenizer=None,
    nli_model=nli_model,
    nli_tokenizer=nli_tokenizer,
    debug=True
)

Conversion Pipeline

The convert_to_first_order_logic() method executes these steps:
1. Extract Claim and Implication: separates the premise (claim) from the conclusion (implication).

nl2fol.extract_claim_and_implication()
# claim: "All birds can fly. Penguins are birds."
# implication: "Penguins can fly."

2. Get Referring Expressions: identifies the entities mentioned in the text.

nl2fol.get_referring_expressions()
# claim_ref_exp: "birds, penguins"

3. Determine Entity Relations: uses NLI to find relationships between entities.

nl2fol.get_entity_relations()
# subset_entities: [("penguins", "birds")]

4. Map Entities to Variables: creates logical variable mappings.

nl2fol.get_entity_mapping()
# entity_mappings: {"birds": "a", "penguins": "b"}

5. Extract Properties: identifies predicates and properties.

nl2fol.get_properties()
# claim_properties: "canFly(a)"

6. Generate Logical Forms: creates FOL formulas for the claim and implication.

nl2fol.get_fol()
# claim_lf: "canFly(a) and isPenguin(b)"
# implication_lf: "canFly(b)"

7. Combine into Final Formula: applies quantifiers and creates the final implication.

nl2fol.get_final_lf()
# final_lf: "(exists a (canFly(a))) -> (forall b (canFly(b)))"
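The final step can be illustrated with plain string assembly. This is a simplification of what get_final_lf() does (the real method also produces the second quantifier-scoping variant), and the helper below is ours, not part of the codebase:

```python
def combine(claim_lf, implication_lf, claim_vars, impl_vars):
    """Wrap the claim in existentials and the implication in universals,
    then join them with '->' to form the final formula."""
    claim = claim_lf
    for v in claim_vars:
        claim = f"(exists {v} ({claim}))"
    impl = implication_lf
    for v in impl_vars:
        impl = f"(forall {v} ({impl}))"
    return f"{claim} -> {impl}"

final_lf = combine("canFly(a)", "canFly(b)", ["a"], ["b"])
print(final_lf)  # (exists a (canFly(a))) -> (forall b (canFly(b)))
```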

Advanced Usage

Custom Sentence Processing

You can process individual sentences programmatically:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from nl_to_fol import NL2FOL
from openai import OpenAI

# Initialize NLI model
nli_tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

# Process a sentence
sentence = "All politicians are corrupt. John is a politician. Therefore, John is corrupt."

nl2fol = NL2FOL(
    sentence=sentence,
    model_type='gpt',
    pipeline=None,
    tokenizer=None,
    nli_model=nli_model,
    nli_tokenizer=nli_tokenizer,
    debug=True
)

final_lf, final_lf2 = nl2fol.convert_to_first_order_logic()
print(f"Logical Form: {final_lf}")

Interpreting SMT Results

To understand why a formula was classified a certain way:
python3 src/interpret_smt_results.py \
  results/my_run_smt/5_out.txt \
  results/my_run.csv
This provides detailed explanations of the SMT solver’s reasoning.

Troubleshooting

• API rate limits: add delays between requests or reduce the dataset size (e.g. --length 5).
• Out-of-memory errors with local models: use smaller models or fall back to CPU inference (export CUDA_VISIBLE_DEVICES="").
• Solver timeouts: the code automatically enables finite model finding for "unknown" results. If it still times out, the formula may be too complex.
• Missing or empty results: check that the CVC4/CVC5 binary exists and is executable, that SMT files were generated in results/<run_name>_smt/, and that the FOL formulas in the CSV are well-formed.
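If a particular file keeps returning unknown, you can rerun it by hand with finite model finding enabled. A sketch using cvc5's command-line flags (the file path is illustrative):

```shell
cvc5 --finite-model-find --tlimit=10000 results/my_run_smt/5.smt2
```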

Next Steps

API Reference

Explore detailed API documentation

Examples

View more complex usage examples
