## Overview

Preprocessing transforms raw documents into chunks before extraction and embedding. REMem provides a flexible `TextPreprocessor` class that supports multiple chunking strategies and custom text preprocessing functions.
## Base Preprocessor Interface

All preprocessors inherit from `BasePreprocessor` (`graph/preprocessing/base.py:20-45`):
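As a rough sketch of what such an interface looks like (the exact method names and signatures are assumptions, not REMem's actual code), the base class declares a single chunking hook that subclasses implement:

```python
from abc import ABC, abstractmethod


class BasePreprocessor(ABC):
    """Turns one raw document into a list of text chunks."""

    @abstractmethod
    def preprocess(self, text: str) -> list[str]:
        """Split `text` into chunks ready for extraction and embedding."""


class PassthroughPreprocessor(BasePreprocessor):
    """Trivial implementation: one chunk per document (the `none` strategy)."""

    def preprocess(self, text: str) -> list[str]:
        return [text]
```

Any custom strategy only has to honor this list-of-strings contract; the rest of the pipeline is agnostic to how the chunks were produced.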
## Built-in Chunking Strategies

Configure via `BaseConfig.preprocess_chunk_func`:
| Strategy | Function | Use Case |
|---|---|---|
| `by_token` | `chunk_by_token_count()` | Fixed token-size chunks with overlap |
| `by_word` | `chunk_by_word_count()` | Word-based chunks respecting sentence boundaries |
| `by_message` | `chunk_by_message_and_token_count()` | Chat/conversation data |
| `by_session` | `chunk_by_session()` | Session-based grouping by date |
| `none` | No chunking | Single chunk per document |
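Selecting a strategy is a matter of setting the config field. A hedged illustration follows: `preprocess_chunk_func` is from the table above, but the other field names on this stand-in config are assumptions about REMem's real `BaseConfig`:

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    """Illustrative subset of the config; real fields may differ."""
    # One of: by_token, by_word, by_message, by_session, none
    preprocess_chunk_func: str = "by_token"
    chunk_token_size: int = 512        # assumed field name
    chunk_overlap_token_size: int = 64 # assumed field name


# Switch to word-based chunking for prose-heavy corpora.
config = BaseConfig(preprocess_chunk_func="by_word")
```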
### Example: Token-based Chunking

From `text_preprocessing.py:28-63`:
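A self-contained sketch of the idea behind `chunk_by_token_count()`: fixed-size windows that slide forward by `chunk_size - overlap` tokens. The real implementation presumably counts model tokens with a tokenizer; whitespace tokens stand in here so the example runs on its own:

```python
def chunk_by_token_count(text: str, chunk_size: int = 512,
                         overlap: int = 64) -> list[str]:
    """Fixed token-size chunks with overlap (whitespace-token sketch)."""
    tokens = text.split()
    if not tokens:
        return []
    step = max(1, chunk_size - overlap)  # guard against overlap >= chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # final window reached the end of the document
    return chunks
```

With `chunk_size=4, overlap=2`, consecutive chunks share their last/first two tokens, which helps extraction recover facts that straddle a chunk boundary.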
### Example: Word-based Chunking

From `text_preprocessing.py:66-133`:
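To show what "respecting sentence boundaries" means in practice, here is a hedged sketch of a `chunk_by_word_count()`-style function: sentences are packed greedily into chunks without splitting any sentence across chunks (the real function's heuristics may differ):

```python
import re


def chunk_by_word_count(text: str, max_words: int = 100) -> list[str]:
    """Pack whole sentences into chunks of at most `max_words` words.

    A single over-long sentence still becomes its own chunk rather
    than being split mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # flush before overflowing
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```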
## TextPreprocessor Class

The main preprocessor (`text_preprocessing.py:207-353`):
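As a condensed sketch of the class's shape (constructor arguments, method names, and the strategy dispatch here are assumptions; the real class at `text_preprocessing.py:207-353` is richer), it pairs a text-cleaning function with a named chunking strategy:

```python
from typing import Callable, Optional


class TextPreprocessor:
    """Condensed sketch: clean text, then chunk it by a named strategy."""

    def __init__(self, chunk_func: str = "by_word",
                 preprocess_func: Optional[Callable[[str], str]] = None,
                 **chunk_kwargs):
        self.chunk_func = chunk_func
        # Default cleaning: collapse runs of whitespace.
        self.preprocess_func = preprocess_func or (lambda t: " ".join(t.split()))
        self.chunk_kwargs = chunk_kwargs

    def process(self, text: str) -> list[str]:
        text = self.preprocess_func(text)
        if self.chunk_func == "none":
            return [text]
        if self.chunk_func == "by_word":
            return self._chunk_by_words(text, **self.chunk_kwargs)
        raise ValueError(f"unknown chunking strategy: {self.chunk_func}")

    @staticmethod
    def _chunk_by_words(text: str, max_words: int = 100) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]
```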
## Creating Custom Chunking Strategies
### 1. Define Your Chunking Function
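A custom strategy is just a function from text to a list of chunk strings (an assumption about REMem's internal contract, matching the built-in strategies above). For example, a paragraph-based chunker:

```python
def chunk_by_paragraph(text: str, min_chars: int = 1) -> list[str]:
    """Split on blank lines, dropping fragments shorter than `min_chars`."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if len(p) >= min_chars]
```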
### 2. Integrate into TextPreprocessor
Extend the `TextPreprocessor` class:
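A minimal sketch of the subclassing step. The `chunk()` hook name and the base-class body here are assumptions made so the snippet runs standalone; in practice you would import and extend REMem's real `TextPreprocessor`:

```python
class TextPreprocessor:
    """Stand-in for REMem's class so this snippet is self-contained."""

    def chunk(self, text: str) -> list[str]:
        return [text]


class ParagraphPreprocessor(TextPreprocessor):
    """Overrides the chunking hook with a paragraph-based strategy."""

    def chunk(self, text: str) -> list[str]:
        return [p.strip() for p in text.split("\n\n") if p.strip()]
```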
### 3. Use Your Custom Preprocessor
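A usage sketch. The `ParagraphPreprocessor` is a hypothetical custom class (inlined here so the snippet runs on its own); its output list feeds the extraction and embedding stages downstream:

```python
class ParagraphPreprocessor:
    """Hypothetical custom preprocessor: one chunk per paragraph."""

    def chunk(self, text: str) -> list[str]:
        return [p.strip() for p in text.split("\n\n") if p.strip()]


doc = "First point.\n\nSecond point."
chunks = ParagraphPreprocessor().chunk(doc)
for i, c in enumerate(chunks):
    print(f"chunk {i}: {c}")
```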
## Custom Text Preprocessing Functions

The default function (`text_preprocessing.py:201-205`):
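Given that it fits in a handful of lines at `text_preprocessing.py:201-205`, a plausible sketch of the default is simple whitespace normalization (the actual function may differ):

```python
def default_preprocess(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return " ".join(text.split())
```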
### Define Custom Normalization
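Any callable with the same text-in, text-out shape can replace the default. A purely illustrative example that strips Markdown markup and lowercases before chunking:

```python
import re


def normalize_markdown(text: str) -> str:
    """Example custom preprocessing: drop markdown syntax, lowercase,
    and collapse whitespace."""
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"[*_`]", "", text)                       # emphasis/code marks
    return " ".join(text.lower().split())
```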
## Async Preprocessing
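As an illustrative sketch (the function names are assumptions, not REMem's actual API), documents can be preprocessed concurrently with `asyncio`; in real use the awaits would cover I/O such as file reads or tokenizer-service calls:

```python
import asyncio


def chunk(text: str) -> list[str]:
    """Stand-in chunker: whitespace tokens grouped in pairs."""
    words = text.split()
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]


async def preprocess_one(doc: str) -> list[str]:
    await asyncio.sleep(0)  # placeholder for real async I/O
    return chunk(doc)


async def preprocess_all(docs: list[str]) -> list[list[str]]:
    # Fan out one task per document; gather preserves input order.
    return await asyncio.gather(*(preprocess_one(d) for d in docs))


results = asyncio.run(preprocess_all(["a b c", "d e"]))
```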
For large-scale processing, run preprocessing concurrently over batches of documents rather than one document at a time.

## Configuration Reference
Token-based chunking is configured through `BaseConfig` (`preprocess_chunk_func` plus the chunk-size and overlap settings).

## Next Steps
- Custom Extraction - Process chunks into structured data
- Custom RAG Strategies - Use chunks in retrieval
- Architecture - Understand preprocessing in the pipeline