Overview
bun_nltk provides multiple parsing algorithms for context-free grammars (CFG) and probabilistic context-free grammars (PCFG).
Grammar Formats
CFG Grammar Syntax
Grammar Rule Syntax
- Format: LHS -> RHS1 | RHS2 | ...
- Terminals: Quoted strings, 'word' or "word"
- Nonterminals: Unquoted symbols
- Alternatives: Separated by |
- Comments: Lines starting with #
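To make the rule syntax concrete, here is a minimal sketch of a reader for it. It does not use bun_nltk: the `GrammarSymbol` and `Production` types and the `parseCfgRules` function are hypothetical names chosen for illustration, and the sketch assumes that `|` never appears inside a quoted terminal.

```typescript
// Hypothetical types and helper for illustration; not the bun_nltk API.
type GrammarSymbol = { kind: "terminal" | "nonterminal"; value: string };
type Production = { lhs: string; rhs: GrammarSymbol[] };

function parseCfgRules(text: string): Production[] {
  const productions: Production[] = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and comment lines starting with '#'.
    if (line === "" || line.startsWith("#")) continue;
    const arrow = line.indexOf("->");
    if (arrow < 0) throw new Error(`Missing '->' in rule: ${line}`);
    const lhs = line.slice(0, arrow).trim();
    // Alternatives are separated by '|' (assumed not to occur inside quotes).
    for (const alt of line.slice(arrow + 2).split("|")) {
      // A token is a quoted string or a run of non-space characters.
      const tokens: string[] = alt.match(/'[^']*'|"[^"]*"|\S+/g) ?? [];
      const rhs = tokens.map((tok): GrammarSymbol =>
        /^(['"]).*\1$/.test(tok)
          ? { kind: "terminal", value: tok.slice(1, -1) }
          : { kind: "nonterminal", value: tok },
      );
      productions.push({ lhs, rhs });
    }
  }
  return productions;
}

const rules = parseCfgRules(`
# A tiny grammar
S -> NP VP
NP -> 'dogs' | "cats"
VP -> 'sleep'
`);
console.log(rules.length); // 4
```

Each alternative becomes its own production, so the `NP` line above contributes two entries.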
PCFG Grammar Syntax
Add probabilities in square brackets:
- Format: [0.7] after RHS
- Probabilities for each LHS should sum to 1.0
- Omitted probabilities are distributed uniformly
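The two probability rules can be sketched as follows. This is not bun_nltk code: `fillProbabilities` and `sumsToOne` are hypothetical helpers, and the sketch assumes that "distributed uniformly" means the probability mass left over by the annotated alternatives is split evenly among the unannotated ones. The library's exact behavior may differ.

```typescript
// Hypothetical helpers for illustration; not the bun_nltk API.
function fillProbabilities(probs: (number | undefined)[]): number[] {
  const stated = probs.reduce<number>((sum, p) => sum + (p ?? 0), 0);
  const missing = probs.filter((p) => p === undefined).length;
  // Assumption: leftover mass is split evenly among unannotated alternatives.
  const share = missing > 0 ? Math.max(0, 1 - stated) / missing : 0;
  return probs.map((p) => p ?? share);
}

function sumsToOne(probs: number[], tolerance = 1e-9): boolean {
  const total = probs.reduce((sum, p) => sum + p, 0);
  return Math.abs(total - 1) <= tolerance;
}

// Alternatives of: NP -> Det N [0.5] | 'dogs' | 'cats'
const np = fillProbabilities([0.5, undefined, undefined]);
console.log(np); // [0.5, 0.25, 0.25]
console.log(sumsToOne(np)); // true
```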
Chart Parser (CYK)
The chart parser converts the grammar to Chomsky Normal Form (CNF), then runs the CYK algorithm, which fills a table of spans bottom-up using dynamic programming.
Basic Parsing
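As an illustration of the dynamic programming involved, here is a minimal, self-contained CYK recognizer. It does not call bun_nltk: the `CnfGrammar` type and `cykRecognize` function are hypothetical, and the grammar is assumed to be in Chomsky Normal Form already.

```typescript
// Hypothetical types for illustration; the grammar must already be in CNF.
type CnfGrammar = {
  start: string;
  lexical: Array<[lhs: string, word: string]>; // A -> 'word'
  binary: Array<[lhs: string, left: string, right: string]>; // A -> B C
};

function cykRecognize(grammar: CnfGrammar, tokens: string[]): boolean {
  const n = tokens.length;
  if (n === 0) return false;
  // cell(i, len) holds the nonterminals that derive tokens[i .. i + len - 1].
  const table = new Map<string, Set<string>>();
  const cell = (i: number, len: number): Set<string> => {
    const key = `${i}:${len}`;
    let set = table.get(key);
    if (!set) {
      set = new Set<string>();
      table.set(key, set);
    }
    return set;
  };
  // Spans of length 1 come from lexical rules.
  tokens.forEach((word, i) => {
    for (const [lhs, w] of grammar.lexical) {
      if (w === word) cell(i, 1).add(lhs);
    }
  });
  // Longer spans combine two shorter spans at every split point.
  for (let len = 2; len <= n; len++) {
    for (let i = 0; i + len <= n; i++) {
      for (let split = 1; split < len; split++) {
        const left = cell(i, split);
        const right = cell(i + split, len - split);
        for (const [lhs, b, c] of grammar.binary) {
          if (left.has(b) && right.has(c)) cell(i, len).add(lhs);
        }
      }
    }
  }
  return cell(0, n).has(grammar.start);
}

const toyGrammar: CnfGrammar = {
  start: "S",
  lexical: [["NP", "dogs"], ["NP", "cats"], ["V", "chase"]],
  binary: [["S", "NP", "VP"], ["VP", "V", "NP"]],
};
console.log(cykRecognize(toyGrammar, ["dogs", "chase", "cats"])); // true
console.log(cykRecognize(toyGrammar, ["chase", "dogs"])); // false
```

The three nested loops over span length, start position, and split point give the familiar cubic running time in the number of tokens.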
Controlling Parse Trees
Earley Parser
The Earley algorithm handles arbitrary CFGs without conversion.
Recognition Only
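To show why no grammar conversion is needed, here is a minimal Earley recognizer sketch that accepts a left-recursive grammar directly. It is not the library's implementation: `Rule`, `Item`, and `earleyRecognize` are hypothetical names, any symbol without a rule is treated as a terminal, and the sketch assumes the grammar has no empty right-hand sides.

```typescript
// Hypothetical types for illustration; not the bun_nltk API.
type Rule = { lhs: string; rhs: string[] }; // symbols without a rule are terminals
type Item = { rule: Rule; dot: number; origin: number };

function earleyRecognize(rules: Rule[], start: string, tokens: string[]): boolean {
  const nonterminals = new Set(rules.map((r) => r.lhs));
  // chart.get(k) holds the items whose dot sits at input position k.
  const chart = new Map<number, Item[]>();
  const itemsAt = (k: number): Item[] => {
    let items = chart.get(k);
    if (!items) {
      items = [];
      chart.set(k, items);
    }
    return items;
  };
  const add = (k: number, item: Item): void => {
    const items = itemsAt(k);
    const duplicate = items.some(
      (x) => x.rule === item.rule && x.dot === item.dot && x.origin === item.origin,
    );
    if (!duplicate) items.push(item);
  };

  for (const rule of rules) {
    if (rule.lhs === start) add(0, { rule, dot: 0, origin: 0 });
  }
  for (let k = 0; k <= tokens.length; k++) {
    const items = itemsAt(k);
    // The list grows while it is processed, so iterate by index.
    for (let i = 0; i < items.length; i++) {
      const item = items[i];
      if (!item) continue;
      const next: string | undefined =
        item.dot < item.rule.rhs.length ? item.rule.rhs[item.dot] : undefined;
      if (next === undefined) {
        // Completer: advance items that were waiting for this rule's LHS.
        for (const parent of itemsAt(item.origin)) {
          if (parent.rule.rhs[parent.dot] === item.rule.lhs) {
            add(k, { rule: parent.rule, dot: parent.dot + 1, origin: parent.origin });
          }
        }
      } else if (nonterminals.has(next)) {
        // Predictor: start every rule for the expected nonterminal here.
        for (const rule of rules) {
          if (rule.lhs === next) add(k, { rule, dot: 0, origin: k });
        }
      } else if (tokens[k] === next) {
        // Scanner: the expected terminal matches the current token.
        add(k + 1, { rule: item.rule, dot: item.dot + 1, origin: item.origin });
      }
    }
  }
  return itemsAt(tokens.length).some(
    (it) => it.rule.lhs === start && it.dot === it.rule.rhs.length && it.origin === 0,
  );
}

// Left-recursive and not in CNF: E -> E '+' T | T, T -> 'n'
const exprRules: Rule[] = [
  { lhs: "E", rhs: ["E", "+", "T"] },
  { lhs: "E", rhs: ["T"] },
  { lhs: "T", rhs: ["n"] },
];
console.log(earleyRecognize(exprRules, "E", ["n", "+", "n"])); // true
console.log(earleyRecognize(exprRules, "E", ["n", "+"])); // false
```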
Full Parsing
earleyParse uses recognition to validate, then falls back to chart parsing for tree construction.
Probabilistic Parsing
PCFG parsing finds the most likely parse tree.
Basic PCFG Parsing
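One common way to find the most likely tree is a Viterbi variant of CYK that keeps, for each span and nonterminal, only the highest-probability derivation. The sketch below illustrates that idea under stated assumptions; it is not the bun_nltk implementation. The `Pcfg`, `PTree`, and `Best` types and the `viterbiParse` function are hypothetical, and the grammar is assumed to be in Chomsky Normal Form.

```typescript
// Hypothetical types for illustration; the grammar must already be in CNF.
type PTree = { label: string; children: (PTree | string)[] };
type Best = { logProb: number; tree: PTree };
type Pcfg = {
  start: string;
  lexical: Array<[lhs: string, word: string, prob: number]>;
  binary: Array<[lhs: string, left: string, right: string, prob: number]>;
};

function viterbiParse(grammar: Pcfg, tokens: string[]): Best | undefined {
  const n = tokens.length;
  if (n === 0) return undefined;
  // cell(i, len) maps a nonterminal to its best derivation of that span.
  const table = new Map<string, Map<string, Best>>();
  const cell = (i: number, len: number): Map<string, Best> => {
    const key = `${i}:${len}`;
    let entries = table.get(key);
    if (!entries) {
      entries = new Map<string, Best>();
      table.set(key, entries);
    }
    return entries;
  };
  const update = (i: number, len: number, label: string, candidate: Best): void => {
    const current = cell(i, len).get(label);
    if (!current || candidate.logProb > current.logProb) cell(i, len).set(label, candidate);
  };

  tokens.forEach((word, i) => {
    for (const [lhs, w, p] of grammar.lexical) {
      if (w === word) {
        update(i, 1, lhs, { logProb: Math.log(p), tree: { label: lhs, children: [word] } });
      }
    }
  });
  for (let len = 2; len <= n; len++) {
    for (let i = 0; i + len <= n; i++) {
      for (let split = 1; split < len; split++) {
        for (const [lhs, b, c, p] of grammar.binary) {
          const left = cell(i, split).get(b);
          const right = cell(i + split, len - split).get(c);
          if (left && right) {
            // Log probabilities add where probabilities multiply.
            update(i, len, lhs, {
              logProb: Math.log(p) + left.logProb + right.logProb,
              tree: { label: lhs, children: [left.tree, right.tree] },
            });
          }
        }
      }
    }
  }
  return cell(0, n).get(grammar.start);
}

// 'fish' is ambiguous between noun and verb; the rule weights decide.
const fishGrammar: Pcfg = {
  start: "S",
  lexical: [["N", "fish", 1.0], ["V", "fish", 1.0]],
  binary: [["S", "N", "V", 0.9], ["S", "V", "N", 0.1]],
};
const best = viterbiParse(fishGrammar, ["fish", "fish"]);
console.log(best ? Math.exp(best.logProb).toFixed(2) : "no parse"); // "0.90"
```

Working in log space avoids numeric underflow on long sentences, where products of many small probabilities would otherwise round to zero.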
Text Parsing Helpers
Convenience functions that tokenize and parse text.
Parse Text with CFG
Parse Text with Earley
Parse Text with PCFG
Advanced Grammar Examples
Arithmetic Expression Grammar
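A grammar of this kind, written in the rule syntax described under CFG Grammar Syntax, might look like the following. It is an illustrative sketch rather than a grammar shipped with the library; the nonterminal names are arbitrary, and it assumes each operator, parenthesis, and digit arrives as its own token.

```text
# Arithmetic over single digits, with the usual precedence
Expr -> Expr '+' Term | Expr '-' Term | Term
Term -> Term '*' Factor | Term '/' Factor | Factor
Factor -> '(' Expr ')' | Digit
Digit -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
```

Layering Expr, Term, and Factor makes multiplication and division bind tighter than addition and subtraction, and the left recursion makes all four operators left-associative.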
Simple English Grammar
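As a sketch in the same rule syntax, a small English fragment could be written as follows. The vocabulary and category names are illustrative assumptions, not taken from the library.

```text
# A small English fragment
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'park'
V -> 'saw' | 'chased'
P -> 'in' | 'with'
```

Because a PP can attach to either an NP or a VP, a sentence such as "the dog saw a cat in the park" has two parse trees under this grammar, which makes it a useful test case for parsers that return multiple trees.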
Working with Parse Trees
Traverse Parse Tree
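A depth-first walk is the usual way to visit every node. The sketch below assumes a tree shape in which each node has a `label` and a list of `children` that are either subtrees or terminal strings. That `ParseTree` type and the `walk` helper are hypothetical; the tree objects returned by bun_nltk may be structured differently.

```typescript
// Hypothetical tree shape for illustration; bun_nltk's tree type may differ.
type ParseTree = { label: string; children: (ParseTree | string)[] };

// Pre-order traversal: visit a node, then its children from left to right.
function walk(
  node: ParseTree | string,
  visit: (node: ParseTree | string, depth: number) => void,
  depth = 0,
): void {
  visit(node, depth);
  if (typeof node === "string") return;
  for (const child of node.children) walk(child, visit, depth + 1);
}

const tree: ParseTree = {
  label: "S",
  children: [
    { label: "NP", children: ["dogs"] },
    { label: "VP", children: ["sleep"] },
  ],
};

const lines: string[] = [];
walk(tree, (node, depth) => {
  const text = typeof node === "string" ? `'${node}'` : node.label;
  lines.push("  ".repeat(depth) + text);
});
console.log(lines.join("\n"));
// S
//   NP
//     'dogs'
//   VP
//     'sleep'
```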
Extract Terminals
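Collecting the terminals in left-to-right order recovers the parsed sentence. This sketch assumes a hypothetical `ParseTree` shape (a `label` plus `children` that are subtrees or strings) and a hypothetical `leaves` helper; neither is taken from the bun_nltk API.

```typescript
// Hypothetical tree shape for illustration; bun_nltk's tree type may differ.
type ParseTree = { label: string; children: (ParseTree | string)[] };

// Returns the terminal strings of the tree in left-to-right order.
function leaves(node: ParseTree | string): string[] {
  if (typeof node === "string") return [node];
  const result: string[] = [];
  for (const child of node.children) result.push(...leaves(child));
  return result;
}

const sentenceTree: ParseTree = {
  label: "S",
  children: [
    { label: "NP", children: ["dogs"] },
    {
      label: "VP",
      children: [
        { label: "V", children: ["chase"] },
        { label: "NP", children: ["cats"] },
      ],
    },
  ],
};
console.log(leaves(sentenceTree)); // ["dogs", "chase", "cats"]
```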
Find Subtrees by Label
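Finding every constituent with a given label is a filtered traversal. The sketch below assumes a hypothetical `ParseTree` shape (a `label` plus `children` that are subtrees or strings); the `findSubtrees` helper is likewise an illustration, not a bun_nltk function.

```typescript
// Hypothetical tree shape for illustration; bun_nltk's tree type may differ.
type ParseTree = { label: string; children: (ParseTree | string)[] };

// Returns all subtrees (including the root) whose label matches, in pre-order.
function findSubtrees(node: ParseTree, label: string): ParseTree[] {
  const matches: ParseTree[] = node.label === label ? [node] : [];
  for (const child of node.children) {
    if (typeof child !== "string") matches.push(...findSubtrees(child, label));
  }
  return matches;
}

const chaseTree: ParseTree = {
  label: "S",
  children: [
    { label: "NP", children: ["dogs"] },
    {
      label: "VP",
      children: [
        { label: "V", children: ["chase"] },
        { label: "NP", children: ["cats"] },
      ],
    },
  ],
};
const nounPhrases = findSubtrees(chaseTree, "NP");
console.log(nounPhrases.map((np) => np.children[0])); // ["dogs", "cats"]
```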
Performance Optimization
The parser uses native code for:
- CYK recognition (bitset operations)
- CNF conversion and caching
- Chart cell operations
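To give a sense of what bitset-based CYK recognition can look like, here is a sketch in which every chart cell is a 32-bit mask of nonterminal ids stored in one flat typed array. It is an assumption about the general technique, not a description of the library's native code. The `cykBitset` function and its parameters are hypothetical, and the sketch is limited to grammars with at most 32 nonterminals in Chomsky Normal Form.

```typescript
// Hypothetical sketch: each chart cell is a 32-bit mask of nonterminal ids (0..31).
function cykBitset(
  lexicalMasks: number[], // lexicalMasks[i] = mask of nonterminals deriving token i
  binary: Array<[lhs: number, left: number, right: number]>, // A -> B C as ids
  start: number,
): boolean {
  const n = lexicalMasks.length;
  if (n === 0) return false;
  const stride = n + 1;
  // Cell (i, len) lives at index i * stride + len in one flat array.
  const cells = new Uint32Array(n * stride);
  for (let i = 0; i < n; i++) cells[i * stride + 1] = lexicalMasks[i] ?? 0;
  for (let len = 2; len <= n; len++) {
    for (let i = 0; i + len <= n; i++) {
      let mask = 0;
      for (let split = 1; split < len; split++) {
        const left = cells[i * stride + split] ?? 0;
        const right = cells[(i + split) * stride + (len - split)] ?? 0;
        // Empty cells are skipped with a single integer comparison.
        if (left === 0 || right === 0) continue;
        for (const [a, b, c] of binary) {
          if (((left >>> b) & 1) === 1 && ((right >>> c) & 1) === 1) mask |= 1 << a;
        }
      }
      cells[i * stride + len] = mask;
    }
  }
  return ((((cells[n] ?? 0) >>> start) & 1) === 1);
}

// Ids: S = 0, NP = 1, VP = 2, V = 3
const S = 0, NP = 1, VP = 2, V = 3;
const binaryRules: Array<[number, number, number]> = [
  [S, NP, VP], // S -> NP VP
  [VP, V, NP], // VP -> V NP
];
// "dogs chase cats" -> NP V NP
console.log(cykBitset([1 << NP, 1 << V, 1 << NP], binaryRules, S)); // true
// "chase dogs" -> V NP, which is a VP but not an S
console.log(cykBitset([1 << V, 1 << NP], binaryRules, S)); // false
```

Compared with a chart of hash sets, a flat array of integer masks avoids per-cell allocations and turns membership tests into shifts and bitwise ANDs.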