Overview
Chunking identifies non-overlapping linguistic structures like noun phrases, verb phrases, or named entities from part-of-speech tagged text.
Regexp Chunk Parser
Basic Usage
Grammar Syntax
Rule Format
- LABEL: Chunk label (alphanumeric + underscore)
- TAG_PATTERN: Regular expression in angle brackets `<...>`
- Quantifiers: `?` (0-1), `*` (0+), `+` (1+)
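The rule pieces listed above can be illustrated with a small parser. This is a minimal sketch, not the library's implementation: it assumes the NLTK-style layout `LABEL: {<TAG_PATTERN>...}`, which this page does not spell out, and the names `ParsedChunkRule` and `parseChunkRule` are hypothetical.

```typescript
// Hypothetical shape for a parsed rule; not the library's internal representation.
interface ParsedChunkRule {
  label: string; // chunk label: letters, digits, underscore
  tagPatterns: string[]; // e.g. ["<DT>?", "<JJ>*", "<NN>+"]
}

// Parse one rule of the ASSUMED form  LABEL: {<TAG>...}
function parseChunkRule(line: string): ParsedChunkRule {
  const m = /^\s*(\w+)\s*:\s*\{(.*)\}\s*$/.exec(line);
  if (m === null) {
    throw new Error(`Not a chunk rule: ${line}`);
  }
  const label = m[1] ?? "";
  const body = m[2] ?? "";
  // Each element is an angle-bracketed tag pattern plus an optional quantifier.
  const tagPatterns: string[] = body.match(/<[^>]+>[?*+]?/g) ?? [];
  if (tagPatterns.length === 0) {
    throw new Error(`Rule has no tag patterns: ${line}`);
  }
  return { label, tagPatterns };
}
```

For example, `parseChunkRule("NP: {<DT>?<JJ>*<NN>+}")` yields the label `NP` and three tag patterns, one per bracketed element.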
Tag Patterns
Pattern Examples
| Pattern | Matches |
|---|---|
| `<NN>` | Exactly “NN” |
| `<NN.*>` | “NN”, “NNS”, “NNP”, “NNPS” |
| `<VB\|VBD\|VBZ>` | “VB” or “VBD” or “VBZ” |
| `<DT>?` | Optional determiner |
| `<JJ>*` | Zero or more adjectives |
| `<NN>+` | One or more nouns |
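One common way to evaluate patterns like those in the table is to encode the tag sequence as a string such as `<DT><JJ><NN>` and translate the tag pattern into an ordinary regular expression over that string. The sketch below shows that technique under stated assumptions: tags never contain `<` or `>`, and `compileTagPattern` and `matchTagSpans` are hypothetical helpers, not functions exported by the library.

```typescript
// Translate a tag pattern such as "<DT>?<JJ>*<NN.*>+" into a RegExp that runs
// over an encoded tag string like "<DT><JJ><NN>". Inside <...>, "." is narrowed
// to "[^<>]" so a wildcard cannot run across a tag boundary.
function compileTagPattern(pattern: string): RegExp {
  const source = pattern.replace(
    /<([^>]+)>/g,
    (_whole: string, inner: string): string => {
      const safe = inner.replace(/\./g, "[^<>]");
      return `(?:<(?:${safe})>)`;
    },
  );
  return new RegExp(source, "g");
}

// Return [start, end) token index spans where the pattern matches.
function matchTagSpans(tags: string[], pattern: string): Array<[number, number]> {
  const starts: number[] = []; // character offset where each tag begins
  let encoded = "";
  for (const tag of tags) {
    starts.push(encoded.length);
    encoded += `<${tag}>`;
  }
  const re = compileTagPattern(pattern);
  const spans: Array<[number, number]> = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(encoded)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex += 1; // skip empty matches from all-optional patterns
      continue;
    }
    const endChar = m.index + m[0].length;
    const start = starts.indexOf(m.index);
    const end = starts.filter((s) => s < endChar).length;
    spans.push([start, end]);
  }
  return spans;
}
```

Matches are greedy and non-overlapping, scanning left to right, which mirrors the non-overlapping chunks described in the overview.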
Advanced Chunking
Named Entity Recognition
Multi-Rule Grammars
IOB Format Conversion
Chunk Tree to IOB
Convert chunk trees to Inside-Outside-Begin (IOB) format:
IOB Tags
- B-LABEL: Beginning of chunk
- I-LABEL: Inside chunk (continuation)
- O: Outside any chunk
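The tagging scheme above can be sketched in a few lines. The types here (`IobSketchToken`, `IobSketchChunk`, `IobSketchRow`) are assumptions made for illustration, since the actual shapes of `ChunkElement` and `IobRow` are not shown on this page, and `toIobSketch` is a stand-in rather than the library's `chunkTreeToIob`.

```typescript
// Hypothetical shapes; the library's ChunkElement / IobRow may differ.
interface IobSketchToken {
  token: string;
  tag: string;
}
interface IobSketchChunk {
  label: string;
  tokens: IobSketchToken[];
}
type IobSketchElement = IobSketchToken | IobSketchChunk;
interface IobSketchRow {
  token: string;
  tag: string;
  iob: string; // "B-LABEL", "I-LABEL", or "O"
}

function toIobSketch(tree: IobSketchElement[]): IobSketchRow[] {
  const rows: IobSketchRow[] = [];
  for (const el of tree) {
    if ("label" in el) {
      // First token of a chunk gets B-, the rest get I-.
      el.tokens.forEach((t, i) => {
        const prefix = i === 0 ? "B" : "I";
        rows.push({ token: t.token, tag: t.tag, iob: `${prefix}-${el.label}` });
      });
    } else {
      // Tokens outside any chunk are tagged O.
      rows.push({ token: el.token, tag: el.tag, iob: "O" });
    }
  }
  return rows;
}
```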
Type Definitions
Practical Examples
Extract Noun Phrases
Extract Action Phrases
Custom Entity Types
Performance Notes
The chunker uses native code optimization for better performance:
- Pattern compilation is cached
- Tag matching uses precompiled regex
- IOB encoding uses efficient native implementation
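To make the caching point concrete, here is a minimal memoization sketch. It only illustrates the general idea of compiling each pattern once and reusing it; the library's native cache is not documented here, and `getCompiledPattern` and `patternCompileCount` are hypothetical names.

```typescript
// Cache of compiled patterns keyed by their source text (illustrative only).
const compiledPatternCache = new Map<string, RegExp>();
let patternCompileCount = 0; // counts real compilations, for demonstration

function getCompiledPattern(source: string): RegExp {
  let re = compiledPatternCache.get(source);
  if (re === undefined) {
    re = new RegExp(source); // compile only on a cache miss
    patternCompileCount += 1;
    compiledPatternCache.set(source, re);
  }
  return re;
}
```

Repeated calls with the same source return the same `RegExp` object, so a grammar applied to many sentences pays the compilation cost once per pattern.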
Working with Chunk Trees
Filter Chunks by Label
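A minimal sketch of filtering, assuming a chunk is an object with a `label` and a `tokens` array and a plain token has no `label`. These shapes and the helper names are assumptions for illustration; the page does not show the real `ChunkElement` definition.

```typescript
// Hypothetical shapes; adjust to the library's actual ChunkElement type.
interface FilterSketchToken {
  token: string;
  tag: string;
}
interface FilterSketchChunk {
  label: string;
  tokens: FilterSketchToken[];
}
type FilterSketchElement = FilterSketchToken | FilterSketchChunk;

// Type guard: chunks carry a label, plain tokens do not.
function isSketchChunk(el: FilterSketchElement): el is FilterSketchChunk {
  return "label" in el;
}

// Keep only the chunks with the requested label, in original order.
function chunksWithLabel(tree: FilterSketchElement[], label: string): FilterSketchChunk[] {
  return tree.filter(isSketchChunk).filter((chunk) => chunk.label === label);
}
```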
Extract Chunk Text
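A minimal sketch of recovering the surface text of a chunk by joining its tokens with spaces. The `TextSketchChunk` shape and the `chunkText` helper are assumptions for illustration, not part of the documented API.

```typescript
// Hypothetical chunk shape; the library's actual type may differ.
interface TextSketchChunk {
  label: string;
  tokens: Array<{ token: string; tag: string }>;
}

// Join the chunk's token strings with single spaces.
function chunkText(chunk: TextSketchChunk): string {
  return chunk.tokens.map((t) => t.token).join(" ");
}
```

Joining on a single space is a simplification: it does not restore original spacing around punctuation.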
API Reference
regexpChunkParse(tokens, grammar)
Parses POS-tagged tokens into chunks using regular expression patterns.
Parameters:
- tokens: TaggedToken[] - POS-tagged tokens
- grammar: string - Chunk grammar rules
Returns:
ChunkElement[] - Mixed array of chunks and tokens
chunkTreeToIob(tree)
Converts chunk tree to IOB format.
Parameters:
- tree: ChunkElement[] - Chunk tree
Returns:
IobRow[] - IOB-tagged tokens