Bundled Corpora

Overview

bun_nltk includes functions to load corpus bundles from the filesystem. Corpus bundles consist of:

An index.json file that lists all files and their metadata
Individual text files referenced by the index

loadBundledMiniCorpus()

Loads the default mini corpus bundle that ships with bun_nltk. Results are cached for subsequent calls.

function loadBundledMiniCorpus(rootPath?: string): CorpusReader

Parameters:

rootPath (optional): Custom path to a corpus bundle directory. If not provided, loads the built-in mini corpus.

Returns: A CorpusReader instance

Caching: When called without rootPath, the result is cached. Subsequent calls return the cached instance for better performance.
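The caching behaviour can be pictured as memoization of the no-argument call. The following is a minimal sketch, not bun_nltk's actual implementation: ReaderLike, loadFromDisk, and loadMiniCorpusSketch are hypothetical stand-ins, and the sketch assumes that calls with a rootPath bypass the cache, which the description implies but does not state.

```typescript
// Hypothetical stand-in for CorpusReader; the real class has more methods.
type ReaderLike = { root: string };

// Counts how often the (expensive) load actually runs.
let loadCount = 0;

// Hypothetical loader; a real one would read index.json and the text files.
function loadFromDisk(root: string): ReaderLike {
  loadCount += 1;
  return { root };
}

const DEFAULT_ROOT = "<built-in mini corpus>"; // placeholder, not a real path
let cachedDefault: ReaderLike | undefined;

function loadMiniCorpusSketch(rootPath?: string): ReaderLike {
  if (rootPath !== undefined) {
    // Assumption: a custom path is loaded fresh on every call.
    return loadFromDisk(rootPath);
  }
  // Default bundle: load once, then hand back the same instance.
  if (cachedDefault === undefined) {
    cachedDefault = loadFromDisk(DEFAULT_ROOT);
  }
  return cachedDefault;
}

const first = loadMiniCorpusSketch();
const second = loadMiniCorpusSketch();
console.log(first === second); // true
console.log(loadCount);        // 1
```

Because the same instance is returned each time, any code that holds the default reader shares it with every other caller.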

Example: Default Mini Corpus

import { loadBundledMiniCorpus } from "bun_nltk";

const corpus = loadBundledMiniCorpus();

// Access corpus data
const words = corpus.words();
const sentences = corpus.sents();
const categories = corpus.categories();

console.log(`Loaded ${words.length} words from ${categories.length} categories`);

Example: Custom Corpus Path

import { loadBundledMiniCorpus } from "bun_nltk";

// Load from a custom directory
const customCorpus = loadBundledMiniCorpus("/path/to/custom/corpus");

const fileIds = customCorpus.fileIds();
console.log("Files:", fileIds);

loadCorpusBundleFromIndex()

Loads a corpus bundle from a custom index.json file. Use this when you have your own corpus data organized as a bundle.

function loadCorpusBundleFromIndex(indexPath: string): CorpusReader

Parameters:

indexPath: Absolute or relative path to the index.json file

Returns: A CorpusReader instance

Note: File paths in the index are resolved relative to the directory containing index.json.
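The resolution rule in the note can be illustrated with Node's path module, which Bun also provides. This is a sketch of the rule only; resolveEntryPaths and IndexFile are hypothetical names, and bun_nltk may resolve paths differently internally.

```typescript
import { dirname, join } from "node:path";

// Shape of one entry in index.json (mirrors the documented index format).
type IndexFile = { id: string; path: string; categories?: string[] };

// Resolve each entry's path against the folder that holds index.json,
// not against the current working directory.
function resolveEntryPaths(indexPath: string, files: IndexFile[]): string[] {
  const baseDir = dirname(indexPath);
  return files.map((file) => join(baseDir, file.path));
}

const resolved = resolveEntryPaths("my-corpus/index.json", [
  { id: "brown_news_001.txt", path: "brown/news/001.txt", categories: ["news"] },
]);
console.log(resolved[0]); // my-corpus/brown/news/001.txt on POSIX systems
```

In practice this means a bundle directory can be moved or copied as a unit without editing index.json.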

Example: Load Custom Index

import { loadCorpusBundleFromIndex } from "bun_nltk";

const corpus = loadCorpusBundleFromIndex("./my-corpus/index.json");

const newsWords = corpus.words({ categories: ["news"] });
console.log(`Found ${newsWords.length} words in news category`);

Example: Filtering by Category

import { loadCorpusBundleFromIndex } from "bun_nltk";

const corpus = loadCorpusBundleFromIndex("./corpora/brown/index.json");

// Get sentences from romance category only
const romanceSents = corpus.sents({ categories: ["romance"] });

// Get words from multiple categories
const mixedWords = corpus.words({ 
  categories: ["news", "editorial"] 
});

Corpus Bundle Format

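A corpus bundle consists of an index.json file plus the text files it references. Paths in the index are relative to the directory containing index.json.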

index.json Structure

type CorpusMiniIndex = {
  version: number;
  files: Array<{
    id: string;
    path: string;
    categories?: string[];
  }>;
};

Example index.json

{
  "version": 1,
  "files": [
    {
      "id": "brown_news_001.txt",
      "path": "brown/news/001.txt",
      "categories": ["news"]
    },
    {
      "id": "brown_romance_001.txt",
      "path": "brown/romance/001.txt",
      "categories": ["romance"]
    }
  ]
}
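A hand-written or generated index.json can be checked against this shape before loading. The type guard below is a sketch and not part of bun_nltk; isCorpusMiniIndex and isStringArray are hypothetical helpers, and the checks cover only the fields shown in the type.

```typescript
type CorpusMiniIndex = {
  version: number;
  files: Array<{ id: string; path: string; categories?: string[] }>;
};

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Runtime check mirroring the CorpusMiniIndex type.
function isCorpusMiniIndex(value: unknown): value is CorpusMiniIndex {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { version?: unknown; files?: unknown };
  if (typeof candidate.version !== "number") return false;
  if (!Array.isArray(candidate.files)) return false;
  return candidate.files.every((entry: unknown) => {
    if (typeof entry !== "object" || entry === null) return false;
    const file = entry as { id?: unknown; path?: unknown; categories?: unknown };
    return (
      typeof file.id === "string" &&
      typeof file.path === "string" &&
      (file.categories === undefined || isStringArray(file.categories))
    );
  });
}

const parsed: unknown = JSON.parse(
  '{"version":1,"files":[{"id":"a.txt","path":"texts/a.txt","categories":["news"]}]}'
);
console.log(isCorpusMiniIndex(parsed));         // true
console.log(isCorpusMiniIndex({ version: 1 })); // false (no files array)
```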

Directory Structure Example

my-corpus/
├── index.json
└── brown/
    ├── news/
    │   └── 001.txt
    └── romance/
        └── 001.txt

Creating Your Own Corpus Bundle

To create a custom corpus bundle:

Organize text files in a directory structure
Create index.json with file metadata
Load with loadCorpusBundleFromIndex()

Example: Build Custom Bundle

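import { mkdirSync, writeFileSync } from "fs";
import { loadCorpusBundleFromIndex } from "bun_nltk";
import type { CorpusMiniIndex } from "bun_nltk";

// Create index. Paths are relative to the directory containing index.json,
// so texts/doc1.txt and texts/doc2.txt must already exist under ./my-corpus
const index: CorpusMiniIndex = {
  version: 1,
  files: [
    {
      id: "doc1.txt",
      path: "texts/doc1.txt",
      categories: ["technical"]
    },
    {
      id: "doc2.txt",
      path: "texts/doc2.txt",
      categories: ["technical", "tutorial"]
    }
  ]
};

// Write index, creating the bundle directory first if it does not exist
mkdirSync("./my-corpus", { recursive: true });
writeFileSync("./my-corpus/index.json", JSON.stringify(index, null, 2));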

// Load corpus
const corpus = loadCorpusBundleFromIndex("./my-corpus/index.json");

// Use corpus
const technicalWords = corpus.words({ categories: ["technical"] });
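For bundles with many files, the index can be generated rather than written by hand. The following is a sketch under stated assumptions: buildIndex is a hypothetical helper (not part of bun_nltk), paths use forward slashes relative to the bundle root, each file's category is taken from its parent directory name, and ids follow the underscore-joined convention seen in the example index.json.

```typescript
type CorpusMiniIndex = {
  version: number;
  files: Array<{ id: string; path: string; categories?: string[] }>;
};

// Hypothetical helper: build an index from paths relative to the bundle root.
function buildIndex(relativePaths: string[]): CorpusMiniIndex {
  const files = relativePaths.map((path) => {
    const parts = path.split("/");
    // Assumption: the parent directory name doubles as the category.
    const category = parts.length > 1 ? parts[parts.length - 2] : undefined;
    return {
      id: parts.join("_"), // e.g. brown/news/001.txt -> brown_news_001.txt
      path,
      // Files at the bundle root get no categories field at all.
      ...(category !== undefined ? { categories: [category] } : {}),
    };
  });
  return { version: 1, files };
}

const generated = buildIndex(["brown/news/001.txt", "brown/romance/001.txt"]);
console.log(JSON.stringify(generated, null, 2));
```

The resulting object can be written with JSON.stringify and loaded with loadCorpusBundleFromIndex() in the same way as the hand-written index.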

Performance Considerations

Caching: loadBundledMiniCorpus() caches results when called without arguments
Loading: All corpus files are loaded into memory when creating a CorpusReader
Filtering: File selection happens in memory; filtering is fast even for large category sets
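The in-memory file selection described here might look roughly like the sketch below. This is an illustration, not bun_nltk's internals: selectFileIds is a hypothetical function, and it assumes a file matches when it belongs to any of the requested categories.

```typescript
type IndexFile = { id: string; path: string; categories?: string[] };

// Select file ids by category using only data already held in memory.
function selectFileIds(files: IndexFile[], categories?: string[]): string[] {
  // No filter given: every file is selected.
  if (categories === undefined) return files.map((file) => file.id);
  // A Set keeps each membership test constant-time.
  const wanted = new Set(categories);
  return files
    .filter((file) => (file.categories ?? []).some((c) => wanted.has(c)))
    .map((file) => file.id);
}

const sampleFiles: IndexFile[] = [
  { id: "doc1.txt", path: "texts/doc1.txt", categories: ["technical"] },
  { id: "doc2.txt", path: "texts/doc2.txt", categories: ["technical", "tutorial"] },
  { id: "doc3.txt", path: "texts/doc3.txt" },
];

console.log(selectFileIds(sampleFiles, ["tutorial"]));  // [ "doc2.txt" ]
console.log(selectFileIds(sampleFiles, ["technical"])); // [ "doc1.txt", "doc2.txt" ]
```

Since no disk access is involved after loading, the cost of a filter grows with the number of index entries rather than with the size of the text files.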
