Overview
Doc Parser provides:
- Multi-format Parsing: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
- Intelligent Chunking: Context-aware splitting with overlap
- Table Extraction: Preserves table structure as markdown
- Token Counting: Tracks token usage for each chunk
- Caching: Parsed documents are cached for efficiency
- Page Tracking: Maintains page number metadata
Registration
doc_parser
Parameters
Path to the document to parse. Can be:
- Local file path: "/path/to/document.pdf"
- HTTP(S) URL: "https://example.com/paper.pdf"
Parameter Schema
Configuration
max_ref_token: Maximum total tokens for all chunks. If the document is smaller, it’s returned as a single chunk.
parser_page_size: Target size (in tokens) for each chunk. The chunking algorithm aims for this size, but actual chunk sizes may vary.
Storage path for cached parsed documents. Defaults to $DEFAULT_WORKSPACE/tools/doc_parser.
Return Format
The tool returns a dictionary with the following structure:
Chunk Structure
The text content of the chunk, including page markers like [page: 1].
Number of tokens in this chunk.
Metadata about the chunk:
- source: Original file path or URL
- title: Document title (extracted from first page or filename)
- chunk_id: Sequential chunk identifier (0-indexed)
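As an illustration, a single chunk might look like the dictionary below. This is only a sketch: the key names content, token, and metadata are assumptions made for the example (the descriptions above do not name the fields), and all values are invented.

```python
# Hypothetical chunk. The key names "content", "token" and "metadata"
# are assumed for illustration; the values are made up.
example_chunk = {
    "content": "[page: 1]\nIntroduction to the topic...",
    "token": 12,
    "metadata": {
        "source": "/path/to/document.pdf",
        "title": "Example Document",
        "chunk_id": 0,
    },
}


def describe(chunk: dict) -> str:
    """Return a one-line summary of a chunk."""
    meta = chunk["metadata"]
    return f"chunk {meta['chunk_id']} of {meta['title']!r}: {chunk['token']} tokens"


print(describe(example_chunk))
```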
Usage
Basic Parsing
With Custom Chunk Size
Parsing Remote Documents
Accessing Chunk Content
Chunking Algorithm
The Doc Parser uses an intelligent chunking algorithm:
Small Documents
If total tokens ≤ max_ref_token, the entire document is returned as one chunk.
Large Documents
For documents exceeding max_ref_token:
- Page-Based Chunking: Chunks respect page boundaries
- Paragraph-Aware: Tries not to split mid-paragraph
- Overlap: Last portion of previous chunk is included in next chunk (up to 150 characters)
- Sentence Splitting: Very long paragraphs are split at sentence boundaries
- Page Markers: Each chunk includes [page: N] markers
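The rules above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's actual implementation: count_tokens is a whitespace word count standing in for a real tokenizer, the sentence splitter is naive, the default values are arbitrary placeholders, and the function names are hypothetical.

```python
import re

OVERLAP_CHARS = 150  # overlap limit taken from the list above


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def split_sentences(paragraph: str) -> list[str]:
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def chunk_pages(pages, max_ref_token=4000, parser_page_size=500):
    """Split a list of page texts into chunks that carry [page: N] markers."""
    # Small document: return everything as a single chunk.
    if sum(count_tokens(p) for p in pages) <= max_ref_token:
        return ["\n".join(f"[page: {n}]\n{text}" for n, text in enumerate(pages, 1))]

    chunks, parts, size, last_piece = [], [], 0, ""
    for n, text in enumerate(pages, 1):
        marker, marker_written = f"[page: {n}]", False
        for para in filter(None, (p.strip() for p in text.split("\n"))):
            # Keep paragraphs whole unless one is larger than the target size.
            pieces = [para] if count_tokens(para) <= parser_page_size else split_sentences(para)
            for piece in pieces:
                tokens = count_tokens(piece)
                if size and size + tokens > parser_page_size:
                    # Close the current chunk and start the next one with
                    # the tail of the previous piece as overlap.
                    chunks.append("\n".join(parts))
                    tail = last_piece[-OVERLAP_CHARS:]
                    parts, size, marker_written = [marker, tail], count_tokens(tail), True
                if not marker_written:
                    parts.append(marker)
                    marker_written = True
                parts.append(piece)
                size += tokens
                last_piece = piece
    if size:
        chunks.append("\n".join(parts))
    return chunks


pages = ["alpha beta gamma delta\nepsilon zeta eta theta", "iota kappa lambda mu"]
for chunk in chunk_pages(pages, max_ref_token=5, parser_page_size=5):
    print(repr(chunk))
```

Because the overlap is counted toward the next chunk, chunk sizes in this sketch can exceed the target, which matches the note that the target size is an aim rather than a hard limit.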
Example
Supported File Types
PDF
- Text extraction with layout awareness
- Table detection and conversion to markdown
- Multi-column layout handling
- Font size detection for headers
Word (DOCX)
- Paragraph extraction
- Table conversion to markdown
- Entire document as single page
PowerPoint (PPTX)
- Each slide as separate page
- Text frames and tables extracted
- Slide order preserved
HTML
- BeautifulSoup parsing
- Title extraction
- Clean text without HTML tags
CSV / TSV / Excel
- Tables converted to markdown format
- Each sheet as separate page (Excel)
- Preserves table structure
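To make the table conversion concrete, here is a minimal sketch of turning delimited text into a markdown table using only the standard library. The function name table_to_markdown is hypothetical and this is not the parser's own code; it treats the first row as the header and does not escape pipe characters inside cells.

```python
import csv
import io


def table_to_markdown(text: str, delimiter: str = ",") -> str:
    """Convert CSV/TSV text to a markdown table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(table_to_markdown("name,qty\napple,3\npear,5"))
```

Passing delimiter="\t" handles TSV input in the same way.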
Plain Text
- Direct text processing
- Paragraph splitting on newlines
Advanced Usage
Processing Multiple Documents
Extracting Specific Pages
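Because each chunk carries [page: N] markers, one way to pull out specific pages is to filter chunks by those markers. The sketch below is a minimal illustration rather than the tool's own API; the helper names and the content key are assumptions.

```python
import re

PAGE_MARKER = re.compile(r"\[page: (\d+)\]")


def pages_in_chunk(content: str) -> set[int]:
    """Return the page numbers named by [page: N] markers in a chunk's text."""
    return {int(n) for n in PAGE_MARKER.findall(content)}


def chunks_for_pages(chunks, wanted):
    """Keep chunks that mention at least one of the wanted pages."""
    wanted = set(wanted)
    return [c for c in chunks if pages_in_chunk(c["content"]) & wanted]


# Hypothetical chunks; the "content" key is an assumed field name.
sample = [
    {"content": "[page: 1]\nIntro\n[page: 2]\nMethods"},
    {"content": "[page: 3]\nResults"},
]
print(chunks_for_pages(sample, [2]))
```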
Token Budget Management
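Since every chunk records its token count, a caller can keep a prompt within a fixed budget by accumulating chunks until the budget would be exceeded. The following is a minimal sketch of that idea; the function name and the token key are assumptions, and the counts are invented.

```python
def select_within_budget(chunks, budget):
    """Take chunks in order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk["token"] > budget:
            break
        selected.append(chunk)
        used += chunk["token"]
    return selected, used


# Hypothetical chunks; the "token" key is an assumed field name.
sample_chunks = [{"token": 400}, {"token": 450}, {"token": 300}]
picked, used = select_within_budget(sample_chunks, budget=1000)
print(len(picked), used)
```

Stopping at the first chunk that does not fit preserves document order; skipping it and continuing would pack the budget more tightly at the cost of gaps in the text.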
Caching Behavior
Parsed documents are automatically cached. The cache key is based on:
- File URL/path hash
- parser_page_size setting
Changing parser_page_size creates a new cache entry.
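The effect of keying the cache on both the source and parser_page_size can be sketched as follows. The hash function (SHA-256) and the key layout are assumptions chosen for illustration; the actual key format may differ.

```python
import hashlib


def cache_key(url: str, parser_page_size: int) -> str:
    """Build a cache key from the source path/URL hash and the chunk size."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{url_hash}_{parser_page_size}"


key_a = cache_key("/path/to/document.pdf", 500)
key_b = cache_key("/path/to/document.pdf", 1000)
print(key_a == key_b)
```

The same document parsed with two different chunk sizes yields two different keys, so each setting gets its own cache entry.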
Integration with Retrieval
Doc Parser is used internally by the Retrieval tool:
Data Classes
Chunk
Record
Performance Tips
Chunk Size Selection
Small chunks (300-500 tokens):
- ✅ Better for precise retrieval
- ✅ More granular context
- ❌ More chunks to process
- ❌ Less context per chunk
Large chunks:
- ✅ More context preserved
- ✅ Fewer chunks to manage
- ❌ Less precise retrieval
- ❌ May exceed LLM context limits
Caching Strategy
- Parse documents once during setup
- Reuse the same DocParser instance
- Cache location: $DEFAULT_WORKSPACE/tools/doc_parser
- Clear cache if document content changes
Large Document Handling
For very large documents:
Example: Document Analysis
Troubleshooting
Parsing errors
Ensure required dependencies are installed:
This installs parsers for all supported formats.
Empty chunks
Some documents may have complex layouts. Try:
- Checking if the document is text-based (not scanned images)
- Using a different file format (e.g., export PDF to DOCX)
- Adjusting parser_page_size
Table formatting issues
Tables are converted to markdown. If formatting is poor:
- Export document in a more structured format (Excel for tables)
- Manually convert tables to CSV
Out of memory
For very large documents:
Related
Retrieval
High-level RAG tool that uses DocParser
Simple Doc Parser
Basic parsing without chunking
Storage
Caching system used by DocParser