CorpusReader

The CorpusReader class provides a unified interface for accessing corpus files and their contents. It supports filtering by file IDs and categories, and offers multiple ways to extract text data.

Constructor

new CorpusReader(files: CorpusFile[])

Creates a new corpus reader from an array of corpus files.

Parameters:

files: Array of CorpusFile objects

Types:

type CorpusFile = {
  id: string;
  text: string;
  categories: string[];
};

type ReadOptions = {
  fileIds?: string[];
  categories?: string[];
};
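
To make the CorpusFile shape concrete, here is a small sketch that builds an array of files in memory. The file IDs, texts, and categories are made up for illustration. The commented-out constructor call assumes CorpusReader can be imported from the package entry point, which this page does not state.

```typescript
// Local copy of the documented type so the sketch is self-contained.
type CorpusFile = {
  id: string;
  text: string;
  categories: string[];
};

// Hypothetical sample data; these IDs and texts are illustrative only.
const sampleFiles: CorpusFile[] = [
  {
    id: "notes_001.txt",
    text: "First note.\n\nSecond paragraph.",
    categories: ["notes"],
  },
  {
    id: "memo_001.txt",
    text: "A short memo.",
    categories: ["memo", "notes"],
  },
];

// Assumption: CorpusReader is exported by the package.
// const reader = new CorpusReader(sampleFiles);
```

A file may carry several categories, as memo_001.txt does here.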

Methods

fileIds()

Returns an array of file IDs in the corpus, optionally filtered by file IDs or categories.

fileIds(options?: ReadOptions): string[]

Parameters:

options: Optional filtering options
- fileIds: Array of file IDs to include
- categories: Array of categories to filter by

Returns: Array of file ID strings

Example:

import { loadBundledMiniCorpus } from "bun_nltk";

const corpus = loadBundledMiniCorpus();

// Get all file IDs
const allIds = corpus.fileIds();
console.log(allIds);
// => ["brown_news_001.txt", "brown_romance_001.txt", ...]

// Get file IDs by category
const newsIds = corpus.fileIds({ categories: ["news"] });
console.log(newsIds);
// => ["brown_news_001.txt", ...]

raw()

Returns the raw text content from selected files, joined with newlines.

raw(options?: ReadOptions): string

Parameters:

options: Optional filtering options (same as fileIds())

Returns: Concatenated text string

Example:

const corpus = loadBundledMiniCorpus();

// Get all raw text
const allText = corpus.raw();

// Get raw text from specific category
const newsText = corpus.raw({ categories: ["news"] });

words()

Extracts and tokenizes words from the corpus text. Words are converted to lowercase.

words(options?: ReadOptions): string[]

Parameters:

options: Optional filtering options (same as fileIds())

Returns: Array of lowercase word tokens

Example:

const corpus = loadBundledMiniCorpus();

// Get all words
const allWords = corpus.words();
console.log(allWords.slice(0, 5));
// => ["the", "fulton", "county", "grand", "jury"]

// Get words from specific files
const fileWords = corpus.words({ fileIds: ["brown_news_001.txt"] });
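
Because words() returns a flat array of lowercase tokens, it can feed a frequency count directly. The sketch below is one possible way to do that. The wordFrequencies helper and the sample token list are hypothetical and stand in for real corpus output.

```typescript
// Count how often each token occurs in a list of words.
function wordFrequencies(words: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of words) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Stand-in for the result of corpus.words(); real output depends on the corpus.
const sampleWords = ["the", "jury", "said", "the", "jury", "met"];
const freq = wordFrequencies(sampleWords);

// Sort by descending count, breaking ties alphabetically.
const ranked = [...freq.entries()].sort(
  (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0),
);
```

Since the tokens are already lowercase, no extra case normalization is needed before counting.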

sents()

Extracts sentences from the corpus using the Punkt sentence tokenizer. Empty sentences are filtered out.

sents(options?: ReadOptions): string[]

Parameters:

options: Optional filtering options (same as fileIds())

Returns: Array of sentence strings (trimmed)

Example:

const corpus = loadBundledMiniCorpus();

// Get all sentences
const sentences = corpus.sents();
console.log(sentences[0]);
// => "The Fulton County Grand Jury said Friday an investigation..."

// Get sentences by category
const newsSents = corpus.sents({ categories: ["news"] });

paras()

Extracts paragraphs from the corpus. Paragraphs are identified by double newlines (\n\n or \r\n\r\n).

paras(options?: ReadOptions): string[]

Parameters:

options: Optional filtering options (same as fileIds())

Returns: Array of paragraph strings (trimmed, non-empty)

Example:

const corpus = loadBundledMiniCorpus();

// Get all paragraphs
const paragraphs = corpus.paras();

// Get paragraphs from specific category
const romanceParas = corpus.paras({ categories: ["romance"] });
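
The paragraph rule described above (split on double newlines, trim, drop empties) can be approximated with a short helper. This is a minimal sketch and not the library's actual implementation. The splitParagraphs name is hypothetical.

```typescript
// Split text on double newlines (\n\n or \r\n\r\n), then trim each
// piece and drop any that end up empty.
function splitParagraphs(text: string): string[] {
  return text
    .split(/\r\n\r\n|\n\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}

// Mixed line endings and surplus blank lines, for illustration.
const sampleText = "One.\n\nTwo.\r\n\r\n\r\n\r\n  Three.  \n\n";
const sampleParas = splitParagraphs(sampleText);
```

Runs of more than two newlines produce empty pieces after the split, which is why the final filter step matters.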

categories()

Returns all unique categories present in the corpus, sorted alphabetically.

categories(): string[]

Returns: Array of category strings (lowercase, sorted)

Example:

const corpus = loadBundledMiniCorpus();

const cats = corpus.categories();
console.log(cats);
// => ["news", "romance", ...]
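
The documented result (unique, lowercase, sorted) could be derived from a list of files roughly as follows. This is a sketch under those stated properties only. The uniqueCategories helper is hypothetical and the library may compute the list differently.

```typescript
// Collect every category across all files, lowercase it,
// de-duplicate with a Set, and return the values sorted.
function uniqueCategories(files: { categories: string[] }[]): string[] {
  const seen = new Set<string>();
  for (const file of files) {
    for (const category of file.categories) {
      seen.add(category.toLowerCase());
    }
  }
  return [...seen].sort();
}

// Illustrative input with mixed casing and a repeated category.
const sampleCategories = uniqueCategories([
  { categories: ["News", "romance"] },
  { categories: ["news"] },
]);
```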

Filtering Behavior

All methods that accept ReadOptions apply the same filtering rules:

- File IDs: Case-insensitive matching
- Categories: Case-insensitive matching; a file matches if it has at least one of the requested categories
- Multiple filters: When both fileIds and categories are provided, a file must satisfy both
- Results: Always sorted alphabetically by file ID

Example:

const corpus = loadBundledMiniCorpus();

// Combine file IDs and categories
const filtered = corpus.words({
  fileIds: ["brown_news_001.txt", "brown_romance_001.txt"],
  categories: ["news"]
});
// Only returns words from brown_news_001.txt
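
The filtering rules can be expressed as a small selection function. The sketch below is an illustration of the documented behavior, not the library's source. The selectFiles name is hypothetical, and treating an empty filter array as "match nothing" is an assumption, since this page does not describe that case.

```typescript
type CorpusFile = { id: string; text: string; categories: string[] };
type ReadOptions = { fileIds?: string[]; categories?: string[] };

function selectFiles(files: CorpusFile[], options: ReadOptions = {}): CorpusFile[] {
  // Lowercase the requested values once so matching is case-insensitive.
  // Assumption: an empty array matches nothing; an omitted filter matches everything.
  const wantedIds = options.fileIds
    ? new Set(options.fileIds.map((id) => id.toLowerCase()))
    : null;
  const wantedCategories = options.categories
    ? new Set(options.categories.map((c) => c.toLowerCase()))
    : null;

  return files
    // A file must pass the ID filter, if one is given...
    .filter((f) => wantedIds === null || wantedIds.has(f.id.toLowerCase()))
    // ...and share at least one category with the category filter, if given.
    .filter(
      (f) =>
        wantedCategories === null ||
        f.categories.some((c) => wantedCategories.has(c.toLowerCase())),
    )
    // filter() returned a new array, so sorting here leaves the input untouched.
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}
```

Chaining the two filters gives the "must match both" behavior: a file listed in fileIds is still dropped if none of its categories match.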
