Overview
bun_nltk provides corpus readers for loading and processing text collections. The system supports bundled corpora, custom collections, and a registry for downloading remote corpora.
Quick Start
Load Bundled Mini Corpus
Accessing Corpus Data
File IDs
Raw Text
Words
Sentences
Paragraphs
Categories
Creating Custom Corpora
From In-Memory Data
From Index File
Corpus Registry System
Registry Manifest
Define downloadable corpora:
Download Corpus from Registry
Custom Fetch Function
Load Registry Manifest
Filtering Options
All corpus methods accept filtering options:
Examples
Practical Examples
Word Frequency Analysis
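As a minimal sketch of this kind of analysis, the snippet below counts token frequencies over a plain array of words. It assumes the tokens have already been obtained (for example from a reader's words() call) and does not call bun_nltk itself; the helper names wordFrequencies and topN are hypothetical.

```typescript
// Hypothetical helpers; they operate on a plain string[] of tokens.

// Count how often each lowercased token occurs.
function wordFrequencies(words: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of words) {
    const key = word.toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Return the n most frequent entries; ties are broken alphabetically.
function topN(counts: Map<string, number>, n: number): [string, number][] {
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n);
}

// Sample tokens standing in for corpus output.
const sampleWords = ["The", "cat", "saw", "the", "dog", "and", "the", "cat"];
const topWords = topN(wordFrequencies(sampleWords), 2);
console.log(topWords); // [["the", 3], ["cat", 2]]
```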
Category Comparison
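One way to compare categories is to measure how often a term occurs relative to each category's size. The sketch below assumes the tokens for each category have already been collected into a plain object; relativeFrequencyByCategory is a hypothetical helper, not part of bun_nltk.

```typescript
// Hypothetical helper: relative frequency of `term` within each category.
function relativeFrequencyByCategory(
  wordsByCategory: Record<string, string[]>,
  term: string,
): Record<string, number> {
  const target = term.toLowerCase();
  const result: Record<string, number> = {};
  for (const [category, words] of Object.entries(wordsByCategory)) {
    const hits = words.filter((w) => w.toLowerCase() === target).length;
    // Guard against division by zero for empty categories.
    result[category] = words.length === 0 ? 0 : hits / words.length;
  }
  return result;
}

// Sample data standing in for per-category corpus output.
const sampleByCategory: Record<string, string[]> = {
  news: ["the", "market", "rose", "today"],
  fiction: ["the", "dragon", "and", "the", "knight", "slept"],
};
const termFreq = relativeFrequencyByCategory(sampleByCategory, "the");
console.log(termFreq); // news: 1 of 4 tokens, fiction: 2 of 6 tokens
```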
Build Training Data
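A possible shape for training data is a flat list of (text, label) pairs, with the category used as the label. The sketch below assumes tokenized sentences have already been grouped by category; the LabeledExample type and buildTrainingData function are hypothetical names introduced for illustration.

```typescript
// Hypothetical record type for one training example.
interface LabeledExample {
  text: string;
  label: string;
}

// Flatten tokenized sentences grouped by category into labeled examples.
function buildTrainingData(
  sentsByCategory: Record<string, string[][]>,
): LabeledExample[] {
  const examples: LabeledExample[] = [];
  for (const [label, sents] of Object.entries(sentsByCategory)) {
    for (const tokens of sents) {
      if (tokens.length === 0) continue; // skip empty sentences
      examples.push({ text: tokens.join(" "), label });
    }
  }
  return examples;
}

// Sample data standing in for per-category sentence output.
const sampleSents: Record<string, string[][]> = {
  news: [["Stocks", "rose", "."]],
  fiction: [["The", "dragon", "slept", "."], []],
};
const trainingExamples = buildTrainingData(sampleSents);
console.log(trainingExamples.length); // 2
```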
Extract Sentences by Length
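Filtering by length reduces to a bounds check on the token count of each sentence. The sketch below assumes sentences arrive as arrays of tokens; sentencesByLength is a hypothetical helper and the bounds are inclusive by choice.

```typescript
// Hypothetical helper: keep sentences whose token count is within
// [minTokens, maxTokens], both bounds inclusive.
function sentencesByLength(
  sents: string[][],
  minTokens: number,
  maxTokens: number,
): string[][] {
  return sents.filter((s) => s.length >= minTokens && s.length <= maxTokens);
}

// Sample tokenized sentences of length 2, 4, and 8.
const tokenizedSents: string[][] = [
  ["Hi", "."],
  ["The", "cat", "sat", "."],
  ["A", "very", "long", "sentence", "with", "many", "tokens", "."],
];
const mediumSents = sentencesByLength(tokenizedSents, 3, 5);
console.log(mediumSents.map((s) => s.join(" "))); // ["The cat sat ."]
```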
Create Vocabulary List
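A vocabulary list can be sketched as the sorted set of distinct lowercased tokens, optionally dropping rare words. The helper below is hypothetical and works on a plain token array; the rule that a token must contain at least one ASCII letter is a simplifying assumption made here to exclude punctuation.

```typescript
// Hypothetical helper: sorted list of distinct lowercased tokens that
// occur at least `minCount` times. Tokens without an ASCII letter
// (punctuation, bare numbers) are skipped.
function buildVocabulary(words: string[], minCount = 1): string[] {
  const counts = new Map<string, number>();
  for (const w of words) {
    if (!/[a-z]/i.test(w)) continue;
    const key = w.toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count >= minCount)
    .map(([word]) => word)
    .sort();
}

// Sample tokens standing in for corpus output.
const vocabSample = ["The", "cat", ",", "the", "Dog", "cat", "."];
const fullVocab = buildVocabulary(vocabSample);
const frequentVocab = buildVocabulary(vocabSample, 2);
console.log(fullVocab);     // ["cat", "dog", "the"]
console.log(frequentVocab); // ["cat", "the"]
```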
Performance Notes
- Bundled corpus is cached (singleton pattern)
- File reading is lazy (only loads when accessed)
- Tokenization uses optimized Punkt for sentences
- Word tokenization uses wordTokenizeSubset
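The caching and lazy-loading notes above describe a common pattern, illustrated generically below. This is not bun_nltk's actual implementation; the lazy helper and the placeholder loader are hypothetical and only show how a value can be loaded on first access and reused afterwards.

```typescript
// Generic lazy, cached loader: `load` runs at most once, on first access.
function lazy<T>(load: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: load() };
    }
    return cached.value;
  };
}

// Placeholder loader that records how many times it actually runs.
let loadCount = 0;
const getCorpusText = lazy(() => {
  loadCount += 1;
  return "placeholder corpus text";
});

console.log(loadCount); // 0: nothing is loaded until first access
const firstRead = getCorpusText();
const secondRead = getCorpusText();
console.log(loadCount); // 1: the second call reuses the cached value
```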
Type Definitions
API Reference
Loading Functions
- loadBundledMiniCorpus(rootPath?) - Load bundled corpus
- loadCorpusBundleFromIndex(indexPath) - Load from index file
- loadCorpusRegistryManifest(manifestPath) - Load registry manifest
- downloadCorpusRegistry(manifest, outDir, options?) - Download corpus
CorpusReader Methods
- fileIds(options?) - Get file IDs
- raw(options?) - Get raw text
- words(options?) - Get tokenized words
- sents(options?) - Get sentences
- paras(options?) - Get paragraphs
- categories() - Get all categories
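To make the shape of these methods concrete, here is a toy in-memory reader that mirrors a subset of the method names listed above. It is a standalone sketch, not the library's CorpusReader: the ToyFile and ToyOptions types, the option fields fileIds and categories, the return types, and the regex tokenizer are all assumptions made for illustration.

```typescript
// Hypothetical types; the real option and return types may differ.
interface ToyFile {
  id: string;
  text: string;
  categories: string[];
}
interface ToyOptions {
  fileIds?: string[];
  categories?: string[];
}

class ToyCorpusReader {
  private readonly files: ToyFile[];

  constructor(files: ToyFile[]) {
    this.files = files;
  }

  // Apply the (assumed) filters: a file must match every filter given.
  private select(options: ToyOptions = {}): ToyFile[] {
    const { fileIds, categories } = options;
    return this.files.filter(
      (f) =>
        (fileIds === undefined || fileIds.includes(f.id)) &&
        (categories === undefined ||
          f.categories.some((c) => categories.includes(c))),
    );
  }

  fileIds(options?: ToyOptions): string[] {
    return this.select(options).map((f) => f.id);
  }

  raw(options?: ToyOptions): string {
    return this.select(options).map((f) => f.text).join("\n");
  }

  // Simplistic tokenizer: runs of word characters or single punctuation marks.
  words(options?: ToyOptions): string[] {
    return this.raw(options).match(/\w+|[^\w\s]/g) ?? [];
  }

  categories(): string[] {
    const all = new Set<string>();
    for (const f of this.files) f.categories.forEach((c) => all.add(c));
    return Array.from(all).sort();
  }
}

const toyReader = new ToyCorpusReader([
  { id: "a.txt", text: "Stocks rose.", categories: ["news"] },
  { id: "b.txt", text: "The dragon slept.", categories: ["fiction"] },
]);
console.log(toyReader.fileIds({ categories: ["news"] })); // ["a.txt"]
console.log(toyReader.words({ fileIds: ["b.txt"] })); // ["The", "dragon", "slept", "."]
console.log(toyReader.categories()); // ["fiction", "news"]
```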