CorpusReader
The CorpusReader class provides a unified interface for accessing corpus files and their contents. It supports filtering by file IDs and categories, and offers multiple ways to extract text data.
Constructor
- files: Array of CorpusFile objects
Methods
fileIds()
Returns an array of file IDs in the corpus, optionally filtered by file IDs or categories.
- options: Optional filtering options
  - fileIds: Array of file IDs to include
  - categories: Array of categories to filter by
raw()
Returns the raw text content of the selected files, joined with newlines.
- options: Optional filtering options (same as fileIds())
words()
Extracts and tokenizes words from the corpus text. Words are converted to lowercase.
- options: Optional filtering options (same as fileIds())
sents()
Extracts sentences from the corpus using the Punkt sentence tokenizer. Empty sentences are filtered out.
- options: Optional filtering options (same as fileIds())
paras()
Extracts paragraphs from the corpus. Paragraphs are identified by double newlines (\n\n or \r\n\r\n).
- options: Optional filtering options (same as fileIds())
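The paragraph rule can be illustrated with a small standalone sketch. The helper name splitParagraphs is hypothetical and this is not the library's implementation; it assumes that runs of blank lines count as a single separator and that surrounding whitespace and empty paragraphs are discarded, which the description above does not confirm.

```typescript
// Hypothetical helper illustrating the double-newline rule; not the library's code.
function splitParagraphs(text: string): string[] {
  return text
    // Split on two or more consecutive line breaks (\n\n or \r\n\r\n).
    .split(/(?:\r?\n){2,}/)
    // Assumption: trim surrounding whitespace from each paragraph.
    .map((paragraph) => paragraph.trim())
    // Assumption: drop paragraphs that are empty after trimming.
    .filter((paragraph) => paragraph.length > 0);
}

const sample = "First paragraph.\n\nSecond paragraph,\nstill second.\r\n\r\nThird.";
console.log(splitParagraphs(sample).length); // 3
```

A single newline inside a paragraph is left untouched, so hard-wrapped text stays together as one paragraph.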
categories()
Returns all unique categories present in the corpus, sorted alphabetically.
Filtering Behavior
All methods that accept ReadOptions support flexible filtering:
- File IDs: Case-insensitive matching
- Categories: Case-insensitive matching; files must have at least one matching category
- Multiple filters: When both fileIds and categories are provided, files must match both criteria
- Results: Always sorted alphabetically by file ID
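The rules above can be sketched in a few lines of TypeScript. The CorpusFile and ReadOptions shapes below (id, categories, text) and the selectFiles helper are assumptions made for illustration; the library's actual types and internals may differ. Sorting uses plain string comparison, which is also an assumption about what "alphabetically" means here.

```typescript
// Hypothetical shapes; the real CorpusFile and ReadOptions types may differ.
interface CorpusFile {
  id: string;
  categories: string[];
  text: string;
}

interface ReadOptions {
  fileIds?: string[];
  categories?: string[];
}

// Hypothetical helper that applies the documented filtering rules.
function selectFiles(files: CorpusFile[], options: ReadOptions = {}): CorpusFile[] {
  // Lowercase the requested values once so matching is case-insensitive.
  const wantedIds = options.fileIds?.map((id) => id.toLowerCase());
  const wantedCategories = options.categories?.map((c) => c.toLowerCase());

  return files
    .filter((file) => {
      // A missing filter matches every file.
      const idMatches = !wantedIds || wantedIds.includes(file.id.toLowerCase());
      // A file needs at least one matching category.
      const categoryMatches =
        !wantedCategories ||
        file.categories.some((c) => wantedCategories.includes(c.toLowerCase()));
      // When both filters are given, both must match.
      return idMatches && categoryMatches;
    })
    // filter() returned a new array, so sorting does not mutate the input.
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}

const demoFiles: CorpusFile[] = [
  { id: "news/b.txt", categories: ["News"], text: "B" },
  { id: "fiction/a.txt", categories: ["Fiction"], text: "A" },
  { id: "news/a.txt", categories: ["news", "Sports"], text: "C" },
];

// Category matching ignores case; results come back ordered by file ID.
console.log(selectFiles(demoFiles, { categories: ["NEWS"] }).map((f) => f.id));
```

Under these assumptions, a method such as raw() would amount to joining the text of the selected files with newlines.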
See Also
- Bundled Corpora - Loading corpus bundles
- Registry - Downloading corpora from remote registries