Overview
bun_nltk includes functions to load corpus bundles from the filesystem. A corpus bundle consists of:
- An index.json file that lists all files and their metadata
- Individual text files referenced by the index
loadBundledMiniCorpus()
Loads the default mini corpus bundle that ships with bun_nltk. Results are cached for subsequent calls.
rootPath (optional): Custom path to a corpus bundle directory. If not provided, loads the built-in mini corpus.
Returns: a CorpusReader instance
Caching: When called without rootPath, the result is cached. Subsequent calls return the cached instance for better performance.
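The caching rule above can be illustrated with a small memoization sketch. This is not the library's implementation: `FakeReader` and `loadMiniCorpusSketch` are hypothetical stand-ins, and the sketch assumes that calls with a rootPath are loaded fresh each time, which the text implies but does not state.

```typescript
// Hypothetical stand-in for a loaded corpus; NOT the real CorpusReader.
interface FakeReader {
  rootPath: string;
}

let loadCount = 0; // how many times the "expensive" load actually ran
let cachedDefault: FakeReader | undefined;

// Hypothetical loader mirroring the documented rule:
// only the call without rootPath is cached.
function loadMiniCorpusSketch(rootPath?: string): FakeReader {
  if (rootPath !== undefined) {
    // ASSUMPTION: custom paths bypass the cache and load every time.
    loadCount += 1;
    return { rootPath };
  }
  if (cachedDefault === undefined) {
    loadCount += 1;
    cachedDefault = { rootPath: "<built-in>" };
  }
  return cachedDefault;
}

const firstDefault = loadMiniCorpusSketch();
const secondDefault = loadMiniCorpusSketch(); // served from the cache
const customOne = loadMiniCorpusSketch("./my-corpus");
const customTwo = loadMiniCorpusSketch("./my-corpus"); // loaded again

console.log(firstDefault === secondDefault, customOne === customTwo, loadCount);
```

If repeated loads of a custom bundle matter for your workload, holding on to the returned instance yourself avoids relying on any caching behavior.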
Example: Default Mini Corpus
Example: Custom Corpus Path
loadCorpusBundleFromIndex()
Loads a corpus bundle from a custom index.json file. Use this when you have your own corpus data organized as a bundle.
indexPath: Absolute or relative path to the index.json file
Returns: a CorpusReader instance
Note: File paths in the index are resolved relative to the directory containing index.json.
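To make the resolution rule in the note concrete, here is a minimal sketch using Node's standard path module (also available under Bun). The helper name `resolveBundlePath` and the sample paths are hypothetical; the library's internal logic may differ in detail.

```typescript
import { dirname, resolve } from "node:path";

// Resolve a file path listed in the index against the directory
// that contains index.json, not against the current working directory.
function resolveBundlePath(indexPath: string, relativeFilePath: string): string {
  const bundleDir = dirname(resolve(indexPath));
  return resolve(bundleDir, relativeFilePath);
}

// Hypothetical paths, for illustration only.
const resolvedFile = resolveBundlePath("/data/my-corpus/index.json", "news/article1.txt");
console.log(resolvedFile); // on a POSIX system: /data/my-corpus/news/article1.txt
```

A practical consequence is that a bundle directory can be moved as a whole without editing the paths inside its index.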
Example: Load Custom Index
Example: Filtering by Category
Corpus Bundle Format
A corpus bundle consists of an index.json file and the individual text files it references.
index.json Structure
Example index.json
Directory Structure Example
Creating Your Own Corpus Bundle
To create a custom corpus bundle:
- Organize text files in a directory structure
- Create index.json with file metadata
- Load with loadCorpusBundleFromIndex()
Example: Build Custom Bundle
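The following sketch walks through the three steps using only Node's standard fs, os, and path modules (also available under Bun). The index schema is an assumption: the `IndexEntry` fields (`id`, `path`, `categories`) and the top-level `files` key are hypothetical and should be checked against the real index.json structure. The final load step is left as a comment so the sketch runs without the library installed.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// ASSUMPTION: hypothetical entry shape; the real index.json schema may differ.
interface IndexEntry {
  id: string;
  path: string; // relative to the directory containing index.json
  categories: string[];
}

// Step 1: organize text files in a directory structure (a temp directory here).
const bundleDir = mkdtempSync(join(tmpdir(), "corpus-bundle-"));
const documents: Array<IndexEntry & { text: string }> = [
  { id: "news-1", path: "news/story.txt", categories: ["news"], text: "Markets rose on Monday." },
  { id: "fiction-1", path: "fiction/tale.txt", categories: ["fiction"], text: "Once upon a time." },
];
for (const doc of documents) {
  const target = join(bundleDir, doc.path);
  mkdirSync(dirname(target), { recursive: true });
  writeFileSync(target, doc.text, "utf8");
}

// Step 2: create index.json with file metadata (the text stays in the files).
const entries: IndexEntry[] = documents.map(({ id, path, categories }) => ({ id, path, categories }));
const indexPath = join(bundleDir, "index.json");
writeFileSync(indexPath, JSON.stringify({ files: entries }, null, 2), "utf8");

// Read the index back to confirm it is valid JSON with the expected entries.
const roundTrip = JSON.parse(readFileSync(indexPath, "utf8")) as { files: IndexEntry[] };
console.log(`Wrote ${roundTrip.files.length} entries to ${indexPath}`);

// Step 3: pass indexPath to loadCorpusBundleFromIndex() (not called in this sketch).
```

Keeping the paths in the index relative to the bundle directory matches the resolution rule described for loadCorpusBundleFromIndex().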
Performance Considerations
- Caching: loadBundledMiniCorpus() caches results when called without arguments
- Loading: All corpus files are loaded into memory when creating a CorpusReader
- Filtering: File selection happens in memory; filtering is fast even for large category sets
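As a rough illustration of why in-memory filtering stays cheap, the sketch below selects files by category with a Set lookup. `LoadedFile` and `filterByCategories` are hypothetical names for this illustration and are not part of the CorpusReader API.

```typescript
// Hypothetical in-memory record for one loaded file; not the library's real type.
interface LoadedFile {
  id: string;
  categories: string[];
  text: string;
}

const loadedFiles: LoadedFile[] = [
  { id: "news-1", categories: ["news"], text: "Markets rose on Monday." },
  { id: "fiction-1", categories: ["fiction"], text: "Once upon a time." },
  { id: "news-2", categories: ["news", "sports"], text: "The home team won." },
];

// Select files that carry at least one of the wanted categories.
function filterByCategories(files: LoadedFile[], wanted: string[]): LoadedFile[] {
  const wantedSet = new Set(wanted); // constant-time membership checks
  return files.filter((file) => file.categories.some((category) => wantedSet.has(category)));
}

const newsIds = filterByCategories(loadedFiles, ["news"]).map((file) => file.id);
console.log(newsIds);
```

Because the wanted categories are held in a Set, the cost grows with the number of files and their category tags rather than with the number of categories requested.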
See Also
- CorpusReader - CorpusReader class methods
- Registry - Download corpora from remote registries