The Doc Parser tool extracts content from documents in a variety of formats and intelligently splits it into chunks suitable for RAG (Retrieval-Augmented Generation) systems.

Overview

Doc Parser provides:
  • Multi-format Parsing: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
  • Intelligent Chunking: Context-aware splitting with overlap
  • Table Extraction: Preserves table structure as markdown
  • Token Counting: Tracks token usage for each chunk
  • Caching: Parsed documents are cached for efficiency
  • Page Tracking: Maintains page number metadata

Registration

@register_tool('doc_parser')
class DocParser(BaseTool):
    ...
Tool Name: doc_parser
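
Because the class is registered under the name doc_parser, that string can be used wherever Qwen-Agent accepts tool names, such as an agent's function_list. A minimal sketch (the llm config and model name are placeholders for your own settings):

from qwen_agent.agents import Assistant

# 'doc_parser' resolves to the registered DocParser class;
# the llm dict below is a placeholder, not a required configuration.
bot = Assistant(llm={'model': 'qwen-max'}, function_list=['doc_parser'])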

Parameters

url
string
required
Path to the document to parse. Can be:
  • Local file path: "/path/to/document.pdf"
  • HTTP(S) URL: "https://example.com/paper.pdf"

Parameter Schema

{
  "type": "object",
  "properties": {
    "url": {
      "description": "待解析的文件的路径,可以是一个本地路径或可下载的http(s)链接",
      "type": "string"
    }
  },
  "required": ["url"]
}

Configuration

max_ref_token
int
default:4000
Maximum total tokens across all chunks. If the document fits within this limit, it is returned as a single chunk.
parser_page_size
int
default:500
Target size (in tokens) for each chunk. The chunking algorithm aims for this size, but actual chunk sizes may vary.
path
string
Storage path for cached parsed documents. Defaults to $DEFAULT_WORKSPACE/tools/doc_parser.
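
All three options can be passed together via cfg, for example (the cache path below is illustrative):

parser = DocParser(cfg={
    'max_ref_token': 4000,
    'parser_page_size': 500,
    'path': '/tmp/doc_parser_cache'  # illustrative cache location
})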

Return Format

The tool returns a dictionary with the following structure:
{
  "url": "path/to/document.pdf",
  "title": "Document Title",
  "raw": [
    {
      "content": "[page: 1]\nFirst chunk content here...",
      "token": 234,
      "metadata": {
        "source": "path/to/document.pdf",
        "title": "Document Title",
        "chunk_id": 0
      }
    },
    {
      "content": "[page: 1]\nSecond chunk content...",
      "token": 456,
      "metadata": {
        "source": "path/to/document.pdf",
        "title": "Document Title",
        "chunk_id": 1
      }
    }
  ]
}

Chunk Structure

content
string
The text content of the chunk, including page markers like [page: 1].
token
int
Number of tokens in this chunk.
metadata
object
Metadata about the chunk:
  • source: Original file path or URL
  • title: Document title (extracted from first page or filename)
  • chunk_id: Sequential chunk identifier (0-indexed)
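
As an illustration, the full document text can be reassembled by joining chunks in chunk_id order and stripping the [page: N] markers. This is a sketch only; note that the overlap shared between adjacent chunks is left in place here:

import re

def reassemble(result):
    # Sort by chunk_id and join contents; adjacent chunks
    # intentionally share some overlapping text.
    chunks = sorted(result['raw'], key=lambda c: c['metadata']['chunk_id'])
    text = '\n'.join(c['content'] for c in chunks)
    # Drop page markers such as "[page: 3]"
    return re.sub(r'\[page: \d+\]\n?', '', text)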

Usage

Basic Parsing

from qwen_agent.tools import DocParser
import json

# Initialize the parser
parser = DocParser()

# Parse a document
result = parser.call(params=json.dumps({'url': 'document.pdf'}))

print(f"Title: {result['title']}")
print(f"Number of chunks: {len(result['raw'])}")
for chunk in result['raw']:
    print(f"Chunk {chunk['metadata']['chunk_id']}: {chunk['token']} tokens")

With Custom Chunk Size

parser = DocParser(cfg={
    'parser_page_size': 1000,  # Larger chunks
    'max_ref_token': 8000
})

result = parser.call(params=json.dumps({'url': 'large_document.pdf'}))

Parsing Remote Documents

result = parser.call(params=json.dumps({
    'url': 'https://arxiv.org/pdf/1706.03762.pdf'
}))

print(f"Parsed: {result['title']}")
print(f"Chunks: {len(result['raw'])}")

Accessing Chunk Content

result = parser.call(params=json.dumps({'url': 'report.pdf'}))

for chunk in result['raw']:
    print(f"--- Chunk {chunk['metadata']['chunk_id']} ---")
    print(chunk['content'][:200])  # First 200 characters
    print(f"Tokens: {chunk['token']}\n")

Chunking Algorithm

The Doc Parser uses an intelligent chunking algorithm:

Small Documents

If total tokens ≤ max_ref_token, the entire document is returned as one chunk.

Large Documents

For documents exceeding max_ref_token:
  1. Page-Based Chunking: Chunks respect page boundaries
  2. Paragraph-Aware: Avoids splitting mid-paragraph where possible
  3. Overlap: The tail of the previous chunk (up to 150 characters) is carried into the next chunk
  4. Sentence Splitting: Very long paragraphs are split at sentence boundaries
  5. Page Markers: Each chunk includes [page: N] markers

Example

Chunk 0: [page: 1] Introduction paragraph... First section...
Chunk 1: [page: 1] ...First section... (overlap) Second section... [page: 2] ...
Chunk 2: [page: 2] ...Second section... (overlap) Third section...
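
For intuition, here is a highly simplified sketch of this style of overlap chunking. It is not the library's actual implementation, which also handles page boundaries, sentence splitting, and token counts rather than the character counts used below:

def chunk_paragraphs(paragraphs, target_size=500, overlap=150):
    """Greedy paragraph-aware chunking with character-based overlap."""
    chunks, current = [], ''
    for para in paragraphs:
        if current and len(current) + len(para) > target_size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one
            current = current[-overlap:]
        current = (current + '\n' + para).strip()
    if current:
        chunks.append(current)
    return chunks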

Supported File Types

PDF

result = parser.call(params=json.dumps({'url': 'paper.pdf'}))
Features:
  • Text extraction with layout awareness
  • Table detection and conversion to markdown
  • Multi-column layout handling
  • Font size detection for headers

Word (DOCX)

result = parser.call(params=json.dumps({'url': 'report.docx'}))
Features:
  • Paragraph extraction
  • Table conversion to markdown
  • Entire document as single page

PowerPoint (PPTX)

result = parser.call(params=json.dumps({'url': 'presentation.pptx'}))
Features:
  • Each slide as separate page
  • Text frames and tables extracted
  • Slide order preserved

HTML

result = parser.call(params=json.dumps({'url': 'webpage.html'}))
Features:
  • BeautifulSoup parsing
  • Title extraction
  • Clean text without HTML tags

CSV / TSV / Excel

result = parser.call(params=json.dumps({'url': 'data.csv'}))
Features:
  • Tables converted to markdown format
  • Each sheet as separate page (Excel)
  • Preserves table structure

Plain Text

result = parser.call(params=json.dumps({'url': 'notes.txt'}))
Features:
  • Direct text processing
  • Paragraph splitting on newlines

Advanced Usage

Processing Multiple Documents

parser = DocParser()
documents = [
    'paper1.pdf',
    'paper2.pdf',
    'https://arxiv.org/pdf/1234.5678.pdf'
]

parsed_docs = []
for doc_path in documents:
    result = parser.call(params=json.dumps({'url': doc_path}))
    parsed_docs.append(result)
    print(f"Parsed {result['title']}: {len(result['raw'])} chunks")

Extracting Specific Pages

result = parser.call(params=json.dumps({'url': 'book.pdf'}))

# Find chunks from page 5
page_5_chunks = [
    chunk for chunk in result['raw']
    if '[page: 5]' in chunk['content']
]

for chunk in page_5_chunks:
    print(chunk['content'])

Token Budget Management

# Parse with strict token limit
parser = DocParser(cfg={
    'parser_page_size': 400,
    'max_ref_token': 2000
})

result = parser.call(params=json.dumps({'url': 'document.pdf'}))

# Calculate total tokens
total_tokens = sum(chunk['token'] for chunk in result['raw'])
print(f"Total tokens: {total_tokens}")

Caching Behavior

Parsed documents are automatically cached:
parser = DocParser()

# First call - parses document (slow)
import time
start = time.time()
result1 = parser.call(params=json.dumps({'url': 'large.pdf'}))
print(f"First parse: {time.time() - start:.2f}s")

# Second call - loads from cache (fast)
start = time.time()
result2 = parser.call(params=json.dumps({'url': 'large.pdf'}))
print(f"Cached load: {time.time() - start:.2f}s")
Cache key includes:
  • File URL/path hash
  • parser_page_size setting
Changing parser_page_size creates a new cache entry.
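
To force a re-parse after a document's contents change, the cache directory can simply be deleted. A sketch, assuming the path below points at your configured (or default) cache location:

import shutil

# Assumption: this is your DocParser cache directory
# (the default is $DEFAULT_WORKSPACE/tools/doc_parser).
cache_dir = '/path/to/workspace/tools/doc_parser'
shutil.rmtree(cache_dir, ignore_errors=True)  # next call re-parses from scratch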

Integration with Retrieval

Doc Parser is used internally by the Retrieval tool:
from qwen_agent.tools import Retrieval

# Retrieval uses DocParser automatically
retrieval = Retrieval(cfg={
    'parser_page_size': 500,  # Passed to DocParser
    'max_ref_token': 4000
})

results = retrieval.call(params=json.dumps({
    'query': 'neural networks',
    'files': ['paper.pdf']
}))

Data Classes

Chunk

from qwen_agent.tools.doc_parser import Chunk

chunk = Chunk(
    content="Text content here",
    metadata={'source': 'doc.pdf', 'title': 'My Doc', 'chunk_id': 0},
    token=123
)

chunk_dict = chunk.to_dict()

Record

from qwen_agent.tools.doc_parser import Record

record = Record(
    url='document.pdf',
    raw=[chunk1, chunk2],
    title='Document Title'
)

record_dict = record.to_dict()

Performance Tips

Small chunks (300-500 tokens):
  • ✅ Better for precise retrieval
  • ✅ More granular context
  • ❌ More chunks to process
  • ❌ Less context per chunk
Large chunks (800-1200 tokens):
  • ✅ More context preserved
  • ✅ Fewer chunks to manage
  • ❌ Less precise retrieval
  • ❌ May exceed LLM context limits
Caching:
  • Parse documents once during setup
  • Reuse the same DocParser instance
  • Cache location: $DEFAULT_WORKSPACE/tools/doc_parser
  • Clear the cache if document content changes
For very large documents:
parser = DocParser(cfg={
    'parser_page_size': 1000,  # Larger chunks
    'max_ref_token': 10000      # Higher limit
})

Example: Document Analysis

from qwen_agent.tools import DocParser
import json

def analyze_document(file_path):
    """Analyze document structure and content."""
    parser = DocParser()
    result = parser.call(params=json.dumps({'url': file_path}))
    
    print(f"Document: {result['title']}")
    print(f"Source: {result['url']}")
    print(f"Total chunks: {len(result['raw'])}")
    
    total_tokens = sum(chunk['token'] for chunk in result['raw'])
    print(f"Total tokens: {total_tokens}")
    
    avg_tokens = total_tokens / len(result['raw']) if result['raw'] else 0
    print(f"Average tokens per chunk: {avg_tokens:.1f}")
    
    # Find largest chunk
    largest = max(result['raw'], key=lambda c: c['token'])
    print(f"Largest chunk: {largest['token']} tokens (ID: {largest['metadata']['chunk_id']})")
    
    # Show first chunk preview
    if result['raw']:
        first_chunk = result['raw'][0]
        print(f"\nFirst chunk preview:")
        print(first_chunk['content'][:300] + "...")

# Analyze a document
analyze_document('research_paper.pdf')

Troubleshooting

Missing Dependencies

Ensure required dependencies are installed:
pip install "qwen-agent[rag]"
This installs parsers for all supported formats.

Parsing Problems

Some documents may have complex layouts. Try:
  • Checking if the document is text-based (not scanned images)
  • Using a different file format (e.g., export PDF to DOCX)
  • Adjusting parser_page_size

Poor Table Formatting

Tables are converted to markdown. If formatting is poor:
  • Export the document in a more structured format (Excel for tables)
  • Manually convert tables to CSV

Very Large Documents

For very large documents, use smaller chunks and a lower token limit:
parser = DocParser(cfg={
    'parser_page_size': 300,  # Smaller chunks
    'max_ref_token': 2000     # Lower limit
})

Related Tools

  • Retrieval: High-level RAG tool that uses DocParser
  • Simple Doc Parser: Basic parsing without chunking
  • Storage: Caching system used by DocParser