Overview

DocParser extracts the content of a document (a local file or a downloadable URL) and splits it into token-bounded chunks for downstream processing.

Class Signature

from qwen_agent.tools import DocParser

class DocParser(BaseTool):
    name = 'doc_parser'
    description = 'Extract and chunk document content'
    parameters = {
        'type': 'object',
        'properties': {
            'url': {
                'description': 'File path or downloadable URL',
                'type': 'string',
            }
        },
        'required': ['url'],
    }

Parameters

url (str, required)
    Document path (local or URL)
parser_page_size (int, default: 500)
    Target chunk size in tokens
max_ref_token (int, default: 4000)
    Maximum tokens before chunking (a document below this size is returned whole)
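
The interaction between the two size parameters can be sketched as follows. This is a minimal illustration, not DocParser's actual implementation: the function name chunk_text is hypothetical, whitespace-separated words stand in for tokens, and the behavior at exactly max_ref_token is an assumption, since this page only says that documents "below" the limit are returned whole.

```python
def chunk_text(text: str, parser_page_size: int = 500, max_ref_token: int = 4000) -> list:
    """Hypothetical sketch of the size-based chunking decision.

    Assumption: one whitespace-separated word counts as one token.
    The real tool uses its own tokenizer and splitting rules.
    """
    words = text.split()

    # Below the threshold: return the whole document as a single chunk.
    if len(words) < max_ref_token:
        return [{'content': text, 'token': len(words)}]

    # Otherwise: cut the document into pieces of at most parser_page_size tokens.
    chunks = []
    for start in range(0, len(words), parser_page_size):
        piece = words[start:start + parser_page_size]
        chunks.append({'content': ' '.join(piece), 'token': len(piece)})
    return chunks


short_doc = 'a short note'
long_doc = ' '.join(['word'] * 10)

# With the defaults, a short document comes back as one chunk.
print(len(chunk_text(short_doc)))  # 1

# Small limits make the split visible on a ten-word document.
sizes = [c['token'] for c in chunk_text(long_doc, parser_page_size=4, max_ref_token=5)]
print(sizes)  # [4, 4, 2]
```

In this sketch parser_page_size is an upper bound on chunk size, so the last chunk may be smaller than the target.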

Usage Example

from qwen_agent.tools import DocParser

tool = DocParser()

# Pass parser_page_size to override the default chunk size (500 tokens) for this call
result = tool.call({
    'url': 'research_paper.pdf'
}, parser_page_size=1000)

print(result)
# Returns:
# {
#     'url': 'research_paper.pdf',
#     'title': 'Research Paper Title',
#     'raw': [
#         {'content': 'Chunk 1...', 'token': 950, 'metadata': {...}},
#         {'content': 'Chunk 2...', 'token': 1000, 'metadata': {...}}
#     ]
# }

Output Format

{
    'url': 'document.pdf',
    'title': 'Document Title',
    'raw': [
        {
            'content': 'Text content...',
            'token': 500,
            'metadata': {
                'source': 'document.pdf',
                'title': 'Document Title',
                'chunk_id': 0
            }
        },
        ...
    ]
}
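
A returned dictionary of this shape can be consumed with ordinary dictionary and list operations. The sketch below works on a hand-written sample rather than a real DocParser result; the helper name summarize_chunks and the sample values are hypothetical, and only the keys shown in the output format above are assumed to exist.

```python
def summarize_chunks(result: dict) -> dict:
    """Hypothetical helper: collect basic statistics from a parsed-document dict."""
    chunks = result.get('raw', [])
    return {
        'title': result.get('title', ''),
        'num_chunks': len(chunks),
        # Sum the per-chunk token counts reported in each chunk's 'token' field.
        'total_tokens': sum(chunk.get('token', 0) for chunk in chunks),
        # chunk_id lives inside each chunk's 'metadata' dictionary.
        'chunk_ids': [chunk.get('metadata', {}).get('chunk_id') for chunk in chunks],
    }


# Hand-written sample following the documented output format (values are made up).
sample = {
    'url': 'document.pdf',
    'title': 'Document Title',
    'raw': [
        {
            'content': 'First chunk text...',
            'token': 500,
            'metadata': {'source': 'document.pdf', 'title': 'Document Title', 'chunk_id': 0},
        },
        {
            'content': 'Second chunk text...',
            'token': 320,
            'metadata': {'source': 'document.pdf', 'title': 'Document Title', 'chunk_id': 1},
        },
    ],
}

summary = summarize_chunks(sample)
print(summary['num_chunks'], summary['total_tokens'])  # 2 820
```

Using .get with defaults keeps the helper tolerant of chunks that lack a field, which may be useful if the output shape differs between versions.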
