The Doc Parser tool extracts content from documents in a variety of formats and intelligently splits it into chunks suitable for RAG (Retrieval-Augmented Generation) systems.

Overview

Doc Parser provides:
  • Multi-format Parsing: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
  • Intelligent Chunking: Context-aware splitting with overlap
  • Table Extraction: Preserves table structure as markdown
  • Token Counting: Tracks token usage for each chunk
  • Caching: Parsed documents are cached for efficiency
  • Page Tracking: Maintains page number metadata

Registration

@register_tool('doc_parser')
class DocParser(BaseTool):
    ...
Tool Name: doc_parser
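
Because the class is registered under the name doc_parser, that string can be used wherever Qwen-Agent accepts tool names, such as an agent's function_list. A minimal sketch (the llm config and model name are placeholders for your own settings):

from qwen_agent.agents import Assistant

# 'doc_parser' resolves to the registered DocParser class;
# the llm dict below is a placeholder, not a required configuration.
bot = Assistant(llm={'model': 'qwen-max'}, function_list=['doc_parser'])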

Parameters

url
string
required
Path to the document to parse. Can be:
  • Local file path: "/path/to/document.pdf"
  • HTTP(S) URL: "https://example.com/paper.pdf"

Parameter Schema

{
  "type": "object",
  "properties": {
    "url": {
      "description": "待解析的文件的路径,可以是一个本地路径或可下载的http(s)链接",
      "type": "string"
    }
  },
  "required": ["url"]
}

Configuration

max_ref_token
int
default:4000
Maximum total tokens across all chunks. If the document fits within this limit, it is returned as a single chunk.
parser_page_size
int
default:500
Target size (in tokens) for each chunk. The chunking algorithm aims for this size, but actual chunk sizes may vary.
path
string
Storage path for cached parsed documents. Defaults to $DEFAULT_WORKSPACE/tools/doc_parser.
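
All three options can be passed together via cfg, for example (the cache path below is illustrative):

parser = DocParser(cfg={
    'max_ref_token': 4000,
    'parser_page_size': 500,
    'path': '/tmp/doc_parser_cache'  # illustrative cache location
})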

Return Format

The tool returns a dictionary with the following structure:
{
  "url": "path/to/document.pdf",
  "title": "Document Title",
  "raw": [
    {
      "content": "[page: 1]\nFirst chunk content here...",
      "token": 234,
      "metadata": {
        "source": "path/to/document.pdf",
        "title": "Document Title",
        "chunk_id": 0
      }
    },
    {
      "content": "[page: 1]\nSecond chunk content...",
      "token": 456,
      "metadata": {
        "source": "path/to/document.pdf",
        "title": "Document Title",
        "chunk_id": 1
      }
    }
  ]
}

Chunk Structure

content
string
The text content of the chunk, including page markers like [page: 1].
token
int
Number of tokens in this chunk.
metadata
object
Metadata about the chunk:
  • source: Original file path or URL
  • title: Document title (extracted from first page or filename)
  • chunk_id: Sequential chunk identifier (0-indexed)
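
As an illustration, the full document text can be reassembled by joining chunks in chunk_id order and stripping the [page: N] markers. This is a sketch only; note that the overlap shared between adjacent chunks is left in place here:

import re

def reassemble(result):
    # Sort by chunk_id and join contents; adjacent chunks
    # intentionally share some overlapping text.
    chunks = sorted(result['raw'], key=lambda c: c['metadata']['chunk_id'])
    text = '\n'.join(c['content'] for c in chunks)
    # Drop page markers such as "[page: 3]"
    return re.sub(r'\[page: \d+\]\n?', '', text)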

Usage

Basic Parsing

from qwen_agent.tools import DocParser
import json

# Initialize the parser
parser = DocParser()

# Parse a document
result = parser.call(params=json.dumps({'url': 'document.pdf'}))

print(f"Title: {result['title']}")
print(f"Number of chunks: {len(result['raw'])}")
for chunk in result['raw']:
    print(f"Chunk {chunk['metadata']['chunk_id']}: {chunk['token']} tokens")

With Custom Chunk Size

parser = DocParser(cfg={
    'parser_page_size': 1000,  # Larger chunks
    'max_ref_token': 8000
})

result = parser.call(params=json.dumps({'url': 'large_document.pdf'}))

Parsing Remote Documents

result = parser.call(params=json.dumps({
    'url': 'https://arxiv.org/pdf/1706.03762.pdf'
}))

print(f"Parsed: {result['title']}")
print(f"Chunks: {len(result['raw'])}")

Accessing Chunk Content

result = parser.call(params=json.dumps({'url': 'report.pdf'}))

for chunk in result['raw']:
    print(f"--- Chunk {chunk['metadata']['chunk_id']} ---")
    print(chunk['content'][:200])  # First 200 characters
    print(f"Tokens: {chunk['token']}\n")

Chunking Algorithm

The Doc Parser uses an intelligent chunking algorithm:

Small Documents

If total tokens ≤ max_ref_token, the entire document is returned as one chunk.

Large Documents

For documents exceeding max_ref_token:
  1. Page-Based Chunking: Chunks respect page boundaries
  2. Paragraph-Aware: Avoids splitting mid-paragraph where possible
  3. Overlap: The tail of the previous chunk (up to 150 characters) is carried into the next chunk
  4. Sentence Splitting: Very long paragraphs are split at sentence boundaries
  5. Page Markers: Each chunk includes [page: N] markers

Example

Chunk 0: [page: 1] Introduction paragraph... First section...
Chunk 1: [page: 1] ...First section... (overlap) Second section... [page: 2] ...
Chunk 2: [page: 2] ...Second section... (overlap) Third section...
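
For intuition, here is a highly simplified sketch of this style of overlap chunking. It is not the library's actual implementation, which also handles page boundaries, sentence splitting, and token counts rather than the character counts used below:

def chunk_paragraphs(paragraphs, target_size=500, overlap=150):
    """Greedy paragraph-aware chunking with character-based overlap."""
    chunks, current = [], ''
    for para in paragraphs:
        if current and len(current) + len(para) > target_size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one
            current = current[-overlap:]
        current = (current + '\n' + para).strip()
    if current:
        chunks.append(current)
    return chunks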

Supported File Types

PDF

result = parser.call(params=json.dumps({'url': 'paper.pdf'}))
Features:
  • Text extraction with layout awareness
  • Table detection and conversion to markdown
  • Multi-column layout handling
  • Font size detection for headers

Word (DOCX)

result = parser.call(params=json.dumps({'url': 'report.docx'}))
Features:
  • Paragraph extraction
  • Table conversion to markdown
  • Entire document as single page

PowerPoint (PPTX)

result = parser.call(params=json.dumps({'url': 'presentation.pptx'}))
Features:
  • Each slide as separate page
  • Text frames and tables extracted
  • Slide order preserved

HTML

result = parser.call(params=json.dumps({'url': 'webpage.html'}))
Features:
  • BeautifulSoup parsing
  • Title extraction
  • Clean text without HTML tags

CSV / TSV / Excel

result = parser.call(params=json.dumps({'url': 'data.csv'}))
Features:
  • Tables converted to markdown format
  • Each sheet as separate page (Excel)
  • Preserves table structure

Plain Text

result = parser.call(params=json.dumps({'url': 'notes.txt'}))
Features:
  • Direct text processing
  • Paragraph splitting on newlines

Advanced Usage

Processing Multiple Documents

parser = DocParser()
documents = [
    'paper1.pdf',
    'paper2.pdf',
    'https://arxiv.org/pdf/1234.5678.pdf'
]

parsed_docs = []
for doc_path in documents:
    result = parser.call(params=json.dumps({'url': doc_path}))
    parsed_docs.append(result)
    print(f"Parsed {result['title']}: {len(result['raw'])} chunks")

Extracting Specific Pages

result = parser.call(params=json.dumps({'url': 'book.pdf'}))

# Find chunks from page 5
page_5_chunks = [
    chunk for chunk in result['raw']
    if '[page: 5]' in chunk['content']
]

for chunk in page_5_chunks:
    print(chunk['content'])

Token Budget Management

# Parse with strict token limit
parser = DocParser(cfg={
    'parser_page_size': 400,
    'max_ref_token': 2000
})

result = parser.call(params=json.dumps({'url': 'document.pdf'}))

# Calculate total tokens
total_tokens = sum(chunk['token'] for chunk in result['raw'])
print(f"Total tokens: {total_tokens}")

Caching Behavior

Parsed documents are automatically cached:
parser = DocParser()

# First call - parses document (slow)
import time
start = time.time()
result1 = parser.call(params=json.dumps({'url': 'large.pdf'}))
print(f"First parse: {time.time() - start:.2f}s")

# Second call - loads from cache (fast)
start = time.time()
result2 = parser.call(params=json.dumps({'url': 'large.pdf'}))
print(f"Cached load: {time.time() - start:.2f}s")
Cache key includes:
  • File URL/path hash
  • parser_page_size setting
Changing parser_page_size creates a new cache entry.
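
To force a re-parse after a document's contents change, the cache directory can simply be deleted. A sketch, assuming the path below points at your configured (or default) cache location:

import shutil

# Assumption: this is your DocParser cache directory
# (the default is $DEFAULT_WORKSPACE/tools/doc_parser).
cache_dir = '/path/to/workspace/tools/doc_parser'
shutil.rmtree(cache_dir, ignore_errors=True)  # next call re-parses from scratch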

Integration with Retrieval

Doc Parser is used internally by the Retrieval tool:
from qwen_agent.tools import Retrieval

# Retrieval uses DocParser automatically
retrieval = Retrieval(cfg={
    'parser_page_size': 500,  # Passed to DocParser
    'max_ref_token': 4000
})

results = retrieval.call(params=json.dumps({
    'query': 'neural networks',
    'files': ['paper.pdf']
}))

Data Classes

Chunk

from qwen_agent.tools.doc_parser import Chunk

chunk = Chunk(
    content="Text content here",
    metadata={'source': 'doc.pdf', 'title': 'My Doc', 'chunk_id': 0},
    token=123
)

chunk_dict = chunk.to_dict()

Record

from qwen_agent.tools.doc_parser import Record

record = Record(
    url='document.pdf',
    raw=[chunk1, chunk2],
    title='Document Title'
)

record_dict = record.to_dict()

Performance Tips

Small chunks (300-500 tokens):
  • ✅ Better for precise retrieval
  • ✅ More granular context
  • ❌ More chunks to process
  • ❌ Less context per chunk
Large chunks (800-1200 tokens):
  • ✅ More context preserved
  • ✅ Fewer chunks to manage
  • ❌ Less precise retrieval
  • ❌ May exceed LLM context limits
Caching:
  • Parse documents once during setup
  • Reuse the same DocParser instance
  • Cache location: $DEFAULT_WORKSPACE/tools/doc_parser
  • Clear the cache if document content changes
For very large documents:
parser = DocParser(cfg={
    'parser_page_size': 1000,  # Larger chunks
    'max_ref_token': 10000      # Higher limit
})

Example: Document Analysis

from qwen_agent.tools import DocParser
import json

def analyze_document(file_path):
    """Analyze document structure and content."""
    parser = DocParser()
    result = parser.call(params=json.dumps({'url': file_path}))
    
    print(f"Document: {result['title']}")
    print(f"Source: {result['url']}")
    print(f"Total chunks: {len(result['raw'])}")
    
    total_tokens = sum(chunk['token'] for chunk in result['raw'])
    print(f"Total tokens: {total_tokens}")
    
    avg_tokens = total_tokens / len(result['raw']) if result['raw'] else 0
    print(f"Average tokens per chunk: {avg_tokens:.1f}")
    
    # Find largest chunk
    largest = max(result['raw'], key=lambda c: c['token'])
    print(f"Largest chunk: {largest['token']} tokens (ID: {largest['metadata']['chunk_id']})")
    
    # Show first chunk preview
    if result['raw']:
        first_chunk = result['raw'][0]
        print(f"\nFirst chunk preview:")
        print(first_chunk['content'][:300] + "...")

# Analyze a document
analyze_document('research_paper.pdf')

Troubleshooting

Missing Dependencies

Ensure required dependencies are installed:
pip install "qwen-agent[rag]"
This installs parsers for all supported formats.

Parsing Problems

Some documents may have complex layouts. Try:
  • Checking if the document is text-based (not scanned images)
  • Using a different file format (e.g., export PDF to DOCX)
  • Adjusting parser_page_size

Poor Table Formatting

Tables are converted to markdown. If formatting is poor:
  • Export the document in a more structured format (Excel for tables)
  • Manually convert tables to CSV

Very Large Documents

For very large documents, use smaller chunks and a lower token limit:
parser = DocParser(cfg={
    'parser_page_size': 300,  # Smaller chunks
    'max_ref_token': 2000     # Lower limit
})

Related Tools

  • Retrieval: High-level RAG tool that uses DocParser
  • Simple Doc Parser: Basic parsing without chunking
  • Storage: Caching system used by DocParser