Overview
The corpus registry system allows you to download corpus data from remote sources (URLs) and automatically create local corpus bundles. This is useful for:
- Distributing corpora without bundling them in your package
- Fetching corpora on-demand from CDNs or repositories
- Verifying corpus integrity with SHA-256 checksums
loadCorpusRegistryManifest()
Loads a corpus registry manifest from a JSON file. The manifest describes which corpora to download and their metadata.
Parameters:
- manifestPath: Path to the registry manifest JSON file
Returns: CorpusRegistryManifest object
Throws: Error if the manifest is invalid or entries array is missing
Example
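A minimal sketch of the kind of loading and validation described above, not the library's actual implementation. The names `parseManifest`, `loadManifestSketch`, `RegistryEntry`, and `RegistryManifest` are hypothetical, and the entry shape is an assumption based only on the `url` and `sha256` fields mentioned elsewhere on this page.

```typescript
import { readFileSync } from "node:fs";

// Assumed shapes: only `entries`, `url`, and `sha256` are named in the docs.
interface RegistryEntry {
  url: string;
  sha256?: string;
}

interface RegistryManifest {
  entries: RegistryEntry[];
}

// Parse manifest JSON text and reject it if the entries array is missing.
// Only the entries are kept here; a real manifest may carry more metadata.
function parseManifest(jsonText: string): RegistryManifest {
  const parsed: unknown = JSON.parse(jsonText);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Invalid manifest: expected a JSON object");
  }
  const entries = (parsed as { entries?: unknown }).entries;
  if (!Array.isArray(entries)) {
    throw new Error("Invalid manifest: missing entries array");
  }
  return { entries: entries as RegistryEntry[] };
}

// Read the manifest from disk, then validate it.
function loadManifestSketch(manifestPath: string): RegistryManifest {
  return parseManifest(readFileSync(manifestPath, "utf8"));
}
```

Separating parsing from file access keeps the validation step easy to exercise without touching the filesystem.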
downloadCorpusRegistry()
Downloads all corpus files specified in a registry manifest and creates a local corpus bundle.
Parameters:
- manifestOrPath: Either a loaded manifest object or a path to a manifest file
- outDir: Directory where corpus files and index.json will be saved
- options (optional):
  - fetchBytes: Custom function to fetch file bytes (defaults to fetch())
  - overwrite: Whether to overwrite existing files (currently not enforced)
Returns: Path to the index.json file
Throws:
- Error if download fails
- Error if SHA-256 checksum doesn’t match (when specified in manifest)
- Error if a downloaded file is empty
Example: Basic Download
Example: With Custom Fetch
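As an illustration of what a custom fetcher could look like, the sketch below wraps a fetcher with simple retries. The `FetchBytes` signature (a URL in, a `Uint8Array` out) is an assumption about the `fetchBytes` option, and `withRetry` and `httpFetchBytes` are hypothetical helpers, not part of the library.

```typescript
// Assumed shape of the fetchBytes option; the real signature may differ.
type FetchBytes = (url: string) => Promise<Uint8Array>;

// Wrap any fetcher so that a failed attempt is retried up to `attempts` times.
function withRetry(inner: FetchBytes, attempts: number): FetchBytes {
  return async (url: string): Promise<Uint8Array> => {
    let lastError: unknown;
    for (let i = 0; i < attempts; i++) {
      try {
        return await inner(url);
      } catch (err) {
        lastError = err; // remember the failure and try again
      }
    }
    throw new Error(
      `Failed to fetch ${url} after ${attempts} attempts: ${String(lastError)}`
    );
  };
}

// A plain HTTP fetcher built on the global fetch() (Node 18+ or browsers).
const httpFetchBytes: FetchBytes = async (url) => {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`HTTP ${res.status} for ${url}`);
  }
  return new Uint8Array(await res.arrayBuffer());
};
```

Under these assumptions, something like `withRetry(httpFetchBytes, 3)` could be supplied as the `fetchBytes` option.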
Example: Pass Loaded Manifest
Registry Manifest Format
Type Definitions
Example Manifest
Download Process
When downloadCorpusRegistry() executes:
- Create output directory (if it doesn’t exist)
- For each entry in the manifest:
  - Download file from url
  - Validate SHA-256 checksum (if provided)
  - Sanitize filename (remove unsafe characters)
  - Save file to outDir
- Generate index.json with file mappings
- Return path to index.json
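The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `downloadSketch` is a hypothetical name, the `name` field on each entry is assumed (only `url` and `sha256` appear in the text), and the `{ files: ... }` layout written to index.json is a guess at what "file mappings" means.

```typescript
import { createHash } from "node:crypto";
import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical entry shape: `name` is assumed; `url` and `sha256` are documented.
interface RegistryEntry {
  name: string;
  url: string;
  sha256?: string;
}

type FetchBytes = (url: string) => Promise<Uint8Array>;

async function downloadSketch(
  entries: RegistryEntry[],
  outDir: string,
  fetchBytes: FetchBytes
): Promise<string> {
  // Create the output directory if it doesn't exist.
  await mkdir(outDir, { recursive: true });

  const files: Record<string, string> = {};
  for (const entry of entries) {
    // Download the file and reject empty content.
    const bytes = await fetchBytes(entry.url);
    if (bytes.length === 0) {
      throw new Error(`Downloaded file is empty: ${entry.url}`);
    }

    // Validate the SHA-256 checksum when the entry provides one.
    if (entry.sha256 !== undefined) {
      const actual = createHash("sha256").update(bytes).digest("hex");
      if (actual !== entry.sha256.toLowerCase()) {
        throw new Error(`SHA-256 mismatch for ${entry.url}`);
      }
    }

    // Sanitize the filename, then save the file to outDir.
    const fileName = entry.name.replace(/[^A-Za-z0-9._-]/g, "_");
    await writeFile(join(outDir, fileName), bytes);
    files[entry.name] = fileName;
  }

  // Generate index.json with the file mappings (assumed layout).
  const indexPath = join(outDir, "index.json");
  await writeFile(indexPath, JSON.stringify({ files }, null, 2));

  // Return the path to index.json.
  return indexPath;
}
```

Passing the fetcher in as a parameter mirrors the `fetchBytes` option and lets the flow be exercised without network access.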
Generated Output Structure
Generated index.json
Filename Sanitization
Filenames are automatically sanitized to be filesystem-safe:
- Only alphanumeric characters, dots, hyphens, and underscores are preserved
- All other characters are replaced with underscores
- my corpus.txt → my_corpus.txt
- data@2024.txt → data_2024.txt
- file/name.txt → file_name.txt
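The rule above maps naturally onto a single regular-expression replacement. The sketch below assumes "alphanumeric" means ASCII letters and digits; `sanitizeFileName` is a hypothetical name, not necessarily the library's.

```typescript
// Replace every character that is not an ASCII letter, digit, dot,
// hyphen, or underscore with an underscore.
function sanitizeFileName(name: string): string {
  return name.replace(/[^A-Za-z0-9._-]/g, "_");
}

console.log(sanitizeFileName("my corpus.txt")); // my_corpus.txt
console.log(sanitizeFileName("file/name.txt")); // file_name.txt
```

Because path separators are replaced as well, a sanitized name cannot point outside the output directory through a nested path.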
SHA-256 Verification
When a sha256 field is provided in a registry entry:
- The downloaded file’s checksum is computed
- Comparison is case-insensitive
- Mismatch throws an error and stops the download process
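A minimal sketch of this check using Node's built-in `crypto` module. The helper names `sha256Hex` and `verifySha256` are hypothetical, and the error message wording is illustrative only.

```typescript
import { createHash } from "node:crypto";

// Compute the SHA-256 digest of the bytes as a lowercase hex string.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Compare case-insensitively and throw on mismatch.
function verifySha256(bytes: Uint8Array, expected: string): void {
  const actual = sha256Hex(bytes);
  if (actual !== expected.toLowerCase()) {
    throw new Error(`SHA-256 mismatch: expected ${expected}, got ${actual}`);
  }
}
```

Lowercasing the expected value before comparing is what makes an uppercase checksum in the manifest match the lowercase hex digest produced here.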
Error Handling
Common errors:
- Invalid manifest: Missing or malformed entries array
- Download failure: Network error or HTTP error status
- Empty file: Downloaded content is empty
- Checksum mismatch: SHA-256 doesn’t match expected value
Complete Workflow Example
See Also
- CorpusReader - CorpusReader class methods
- Bundled Corpora - Load local corpus bundles