Overview
Retrieval provides:
- Multi-format Support: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
- Intelligent Chunking: Automatic document splitting with overlap
- Multiple Search Strategies: Vector, keyword (BM25), hybrid, and front-page search
- Caching: Parsed documents are cached for efficiency
- Token Management: Controls context size for LLM input
Registration
retrieval
Parameters
Keywords for searching relevant content. Use both English and Chinese keywords if documents are multilingual. Separate keywords with commas.
List of file paths to search. Supports local file paths and HTTP(S) URLs. Example:
["path/to/doc.pdf", "https://example.com/paper.pdf"]
Content value for storing data (used internally for data storage operations).
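As a rough illustration, the search parameters described above could be assembled into a JSON payload as follows. The key names `query` and `files` are assumptions made for this sketch; the parameter descriptions above do not name them.

```python
import json

# Hypothetical key names: the parameter descriptions above do not name them.
params = {
    # Comma-separated keywords, in both languages for multilingual documents.
    "query": "attention mechanism, 注意力机制",
    # Local paths and HTTP(S) URLs can be mixed in the same list.
    "files": ["path/to/doc.pdf", "https://example.com/paper.pdf"],
}

# ensure_ascii=False keeps the Chinese keywords readable in the serialized form.
payload = json.dumps(params, ensure_ascii=False)
print(payload)
```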
Parameter Schema
Configuration
max_ref_token: Maximum number of tokens to return from retrieval. Controls the size of retrieved context.
parser_page_size: Target size (in tokens) for each document chunk. Larger values create bigger chunks.
List of search strategies to use. Options:
- 'vector_search': Semantic search using embeddings
- 'keyword_search': BM25-based keyword matching
- 'hybrid_search': Combination of vector and keyword search
- 'front_page_search': Search document metadata
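The configuration options above can be sanity-checked before use. The sketch below is a minimal validator, not part of the tool: `max_ref_token`, `parser_page_size`, and the four strategy names come from this page, while the `search_strategies` key and the example values are assumptions for illustration.

```python
VALID_STRATEGIES = {
    "vector_search",
    "keyword_search",
    "hybrid_search",
    "front_page_search",
}


def validate_config(cfg: dict) -> dict:
    """Return cfg unchanged if it looks valid, otherwise raise ValueError."""
    for key in ("max_ref_token", "parser_page_size"):
        value = cfg.get(key)
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer")
    # "search_strategies" is a hypothetical key name used only in this sketch.
    unknown = set(cfg.get("search_strategies", [])) - VALID_STRATEGIES
    if unknown:
        raise ValueError(f"unknown search strategies: {sorted(unknown)}")
    return cfg


# Illustrative values only; they are not the tool's defaults.
config = validate_config({
    "max_ref_token": 4000,
    "parser_page_size": 500,
    "search_strategies": ["keyword_search", "front_page_search"],
})
```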
Dependencies
Required packages:
- charset-normalizer: Character encoding detection
- jieba: Chinese text segmentation
- pdfminer: PDF parsing
- pdfplumber: PDF table extraction
- rank-bm25: BM25 keyword search
- snowballstemmer: Text stemming
- beautifulsoup4: HTML parsing
- python-docx: DOCX parsing
- python-pptx: PPTX parsing
Usage
Basic Retrieval
With Custom Configuration
Using with Agents
How It Works
The Retrieval tool operates in two stages.
Stage 1: Document Parsing
- File Download: Remote URLs are downloaded to local cache
- Format Detection: File type is determined by extension
- Content Extraction: Text, tables, and structure are extracted
- Chunking: Document is split into manageable chunks
- Caching: Parsed content is stored for future use
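The chunking step can be sketched as a sliding window with overlap. This is a minimal illustration, not the tool's parser: it treats whitespace-separated words as tokens, and the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size words, sharing `overlap` words
    between consecutive chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()  # assumption: one word stands in for one token
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_with_overlap("a b c d e f g h i j", chunk_size=4, overlap=1))
# -> ['a b c d', 'd e f g', 'g h i j']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.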
Stage 2: Search & Retrieval
- Query Processing: Keywords are analyzed
- Chunk Scoring: Each chunk is scored against the query
- Ranking: Top-scoring chunks are selected
- Token Limiting: Results are truncated to max_ref_token
- Formatting: Relevant chunks are returned
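The second stage can be sketched end to end as score, rank, then cut off at a token budget. This is a simplified stand-in: scoring is a plain keyword count rather than any of the real strategies, words stand in for tokens, and whole chunks are dropped once the budget is exceeded. The function name `retrieve` is hypothetical.

```python
def retrieve(chunks: list[str], query: str, max_ref_token: int) -> list[str]:
    """Return the best-matching chunks that fit within max_ref_token."""
    # Query processing: split the comma-separated keywords.
    keywords = [k.strip().lower() for k in query.split(",") if k.strip()]

    # Chunk scoring: count keyword occurrences in each chunk.
    scored = []
    for chunk in chunks:
        text = chunk.lower()
        score = sum(text.count(k) for k in keywords)
        if score > 0:
            scored.append((score, chunk))

    # Ranking: highest score first (ties keep document order).
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Token limiting: stop before the budget would be exceeded.
    selected, used = [], 0
    for _, chunk in scored:
        cost = len(chunk.split())  # assumption: one word is one token
        if used + cost > max_ref_token:
            break
        selected.append(chunk)
        used += cost
    return selected


docs = ["cats purr and cats nap", "dogs bark", "a cat and a dog"]
print(retrieve(docs, "cat, purr", max_ref_token=10))
```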
Search Strategies
Keyword Search (BM25)
Best for:
- Exact term matching
- Technical documents with specific terminology
- Queries that contain unique keywords
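To show why BM25 favors exact and distinctive terms, here is a standard-library sketch of the Okapi BM25 formula. It is a reimplementation for illustration, not the code of the rank-bm25 package listed under Dependencies; the defaults `k1=1.5` and `b=0.75` are common textbook choices, and tokenization is plain lowercase whitespace splitting.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document against lowercase query_terms with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = f + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the bm25 ranking function",
    "vector embeddings capture meaning",
    "bm25 bm25 keyword search",
]
print(bm25_scores(["bm25"], docs))
```

A document that never mentions the query term scores zero, which is why keyword search misses paraphrases that vector search can catch.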
Vector Search
Best for:
- Semantic similarity
- Conceptual queries
- Multi-language documents
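Vector search typically ranks chunks by the cosine similarity between a query embedding and each chunk embedding. The sketch below assumes the embeddings already exist; the three-dimensional vectors are made-up placeholders, and this page does not say which embedding model or similarity measure the tool uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, chunk_vecs):
    """Return chunk indices ordered from most to least similar."""
    sims = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return [i for _, i in sorted(sims, key=lambda pair: pair[0], reverse=True)]


# Placeholder embeddings; a real system would get these from an embedding model.
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
print(rank_by_similarity(query_vec, chunk_vecs))
# -> [0, 2, 1]
```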
Hybrid Search
Combines vector and keyword search, so results benefit from both semantic similarity and exact term matching.
Front Page Search
Searches document titles, headers, and metadata.
Multiple Strategies
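When several strategies each return a ranked list, the lists have to be merged somehow. One common technique is reciprocal rank fusion (RRF), sketched below. This page does not say how the tool merges results, so treat this as an illustrative technique rather than a description of its internals; the chunk IDs and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) to every chunk it contains,
    so chunks ranked well by several strategies rise to the top.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda cid: fused[cid], reverse=True)


keyword_ranking = ["c2", "c1", "c3"]  # hypothetical output of keyword search
vector_ranking = ["c1", "c3", "c2"]   # hypothetical output of vector search
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# -> ['c1', 'c2', 'c3']
```

RRF uses only rank positions, which avoids having to put BM25 scores and cosine similarities on a common scale.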
Return Format
The tool returns a list of relevant chunks.
Example: Document Q&A Agent
Performance Optimization
Caching
Parsed documents are cached automatically, so subsequent queries on the same documents skip the parsing stage and return faster.
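The idea behind a parse cache can be sketched as follows. This is an assumption-laden illustration, not the tool's cache: it keys the cache on a SHA-256 hash of the file contents and stores chunks as JSON, and the names `parse_with_cache`, `cache_dir`, and `parse` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def parse_with_cache(source: Path, cache_dir: Path, parse) -> list[str]:
    """Return parsed chunks for `source`, calling `parse` only on a cache miss."""
    # Hashing the contents means an edited file gets a fresh cache entry.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    chunks = parse(source)  # the expensive step: extraction and chunking
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(chunks, ensure_ascii=False), encoding="utf-8")
    return chunks
```

Calling `parse_with_cache` twice on an unchanged file runs `parse` once; the second call only reads the stored JSON.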
Chunk Size Tuning
Adjust parser_page_size based on your needs:
- Smaller chunks (300-500 tokens): Better precision, more chunks to search
- Larger chunks (800-1200 tokens): More context per chunk, fewer chunks
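The trade-off can be made concrete with some quick arithmetic. The helper below is a back-of-the-envelope estimate only: it ignores chunk overlap, assumes every returned chunk is full-sized, and the 12,000-token document and 4,000-token budget are made-up example figures.

```python
import math


def chunk_stats(total_tokens: int, chunk_size: int, max_ref_token: int) -> dict:
    """Estimate how many chunks a document yields and how many fit the budget."""
    return {
        "num_chunks": math.ceil(total_tokens / chunk_size),
        "chunks_in_budget": max_ref_token // chunk_size,
    }


for size in (400, 1000):
    print(size, chunk_stats(total_tokens=12000, chunk_size=size, max_ref_token=4000))
# 400  -> 30 chunks to search, 10 fit in the budget
# 1000 -> 12 chunks to search, 4 fit in the budget
```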
Token Limits
Set max_ref_token based on your LLM’s context window.
Supported File Types
PDF
Full support including tables and multi-column layouts
Word (DOCX)
Text and tables extracted
PowerPoint (PPTX)
Slide content and tables
Plain Text (TXT)
Direct text processing
HTML
Web pages and documentation
CSV / TSV
Tabular data as markdown tables
Excel (XLSX/XLS)
All sheets with formatting preserved
Troubleshooting
Missing dependencies error
Install the RAG dependencies (the packages listed under Dependencies).
Large documents timeout
Increase the chunk size (parser_page_size) to reduce processing time.
Poor retrieval quality
Try different search strategies.
Also ensure your query contains relevant keywords.
Out of memory
Reduce the token limits, such as max_ref_token.
Related
Doc Parser
Low-level document parsing and chunking
Vector Search
Semantic search implementation
Keyword Search
BM25-based search
Hybrid Search
Combined search strategies