The Retrieval tool implements RAG (Retrieval-Augmented Generation) to find and return relevant content from documents based on user queries. It combines document parsing, chunking, and intelligent search strategies.

Overview

Retrieval provides:
  • Multi-format Support: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
  • Intelligent Chunking: Automatic document splitting with overlap
  • Multiple Search Strategies: Vector, keyword (BM25), hybrid, and front-page search
  • Caching: Parsed documents are cached for efficiency
  • Token Management: Controls context size for LLM input

Registration

@register_tool('retrieval')
class Retrieval(BaseTool):
    ...
Tool Name: retrieval

Parameters

query
string
required
Keywords for searching relevant content. Use both English and Chinese keywords if documents are multilingual. Separate keywords with commas.
files
array
required
List of file paths to search. Supports local file paths and HTTP(S) URLs. Example: ["path/to/doc.pdf", "https://example.com/paper.pdf"]
value
string
Content value for storing data (used internally for data storage operations).
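
The query parameter expects comma-separated keywords, ideally in both English and Chinese when documents are multilingual. A tiny helper for assembling such a query (hypothetical, not part of qwen-agent):

```python
def build_query(*keywords: str) -> str:
    """Join keywords into the comma-separated query string the tool expects."""
    return ', '.join(keywords)

# Mix English and Chinese keywords for multilingual documents
query = build_query('machine learning', '机器学习', 'neural networks')
print(query)  # machine learning, 机器学习, neural networks
```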

Parameter Schema

{
  "type": "object",
  "properties": {
    "query": {
      "description": "在这里列出关键词,用逗号分隔,目的是方便在文档中匹配到相关的内容,由于文档可能多语言,关键词最好中英文都有。",
      "type": "string"
    },
    "files": {
      "description": "待解析的文件路径列表,支持本地文件路径或可下载的http(s)链接。",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "value": {
      "description": "数据的内容,仅存数据时需要",
      "type": "string"
    }
  },
  "required": ["query", "files"]
}

Configuration

max_ref_token
int
default:4000
Maximum number of tokens to return from retrieval. Controls the size of retrieved context.
parser_page_size
int
default:500
Target size (in tokens) for each document chunk. Larger values produce fewer, larger chunks with more context each.
rag_searchers
list
default:["keyword_search"]
List of search strategies to use. Options:
  • 'vector_search' - Semantic search using embeddings
  • 'keyword_search' - BM25-based keyword matching
  • 'hybrid_search' - Combination of vector and keyword
  • 'front_page_search' - Search document metadata

Dependencies

RAG support requires additional dependencies. Install with:
pip install "qwen-agent[rag]"
Required packages:
  • charset-normalizer - Character encoding detection
  • jieba - Chinese text segmentation
  • pdfminer - PDF parsing
  • pdfplumber - PDF table extraction
  • rank-bm25 - BM25 keyword search
  • snowballstemmer - Text stemming
  • beautifulsoup4 - HTML parsing
  • python-docx - DOCX parsing
  • python-pptx - PPTX parsing

Usage

Basic Retrieval

from qwen_agent.tools import Retrieval
import json

# Initialize the tool
retrieval = Retrieval()

# Search documents
params = {
    'query': 'machine learning, 机器学习, neural networks',
    'files': ['paper.pdf', 'notes.docx']
}

results = retrieval.call(params=json.dumps(params))
print(results)

With Custom Configuration

retrieval = Retrieval(cfg={
    'max_ref_token': 8000,
    'parser_page_size': 1000,
    'rag_searchers': ['hybrid_search']
})

params = {
    'query': 'transformers, attention mechanism',
    'files': ['https://arxiv.org/pdf/1706.03762.pdf']
}

results = retrieval.call(params=json.dumps(params))

Using with Agents

from qwen_agent.agents import Assistant

bot = Assistant(
    llm={'model': 'qwen-plus-latest'},
    function_list=['retrieval']
)

messages = [
    {
        'role': 'user',
        'content': [
            {'text': '介绍图一'},  # "Describe Figure 1"
            {'file': 'https://arxiv.org/pdf/1706.03762.pdf'}
        ]
    }
]

for response in bot.run(messages=messages):
    print(response)

How It Works

The Retrieval tool operates in two stages:

Stage 1: Document Parsing

  1. File Download: Remote URLs are downloaded to local cache
  2. Format Detection: File type is determined by extension
  3. Content Extraction: Text, tables, and structure are extracted
  4. Chunking: Document is split into manageable chunks
  5. Caching: Parsed content is stored for future use
This stage is handled by the DocParser tool.
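
The chunking step can be sketched in plain Python. This is an illustrative simplification: the real DocParser chunks by tokens rather than characters and preserves document structure, but the overlap mechanics are the same.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of roughly chunk_size units, with `overlap`
    units shared between consecutive chunks so content near a boundary
    appears in both neighbors."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = ''.join(chr(65 + i % 26) for i in range(1200))  # 1200-char sample document
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks))  # 3 overlapping chunks
```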

Stage 2: Search & Retrieval

  1. Query Processing: Keywords are analyzed
  2. Chunk Scoring: Each chunk is scored against the query
  3. Ranking: Top-scoring chunks are selected
  4. Token Limiting: Results are truncated to max_ref_token
  5. Formatting: Relevant chunks are returned
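
A toy version of the search stage, with simple term-overlap scoring standing in for BM25 and a word count standing in for real tokenization (names here are illustrative, not qwen-agent internals):

```python
def search(chunks: list[str], query: str, max_ref_token: int = 4000) -> list[str]:
    """Rank chunks by query-term overlap (a stand-in for BM25 scoring),
    then keep top-ranked chunks until the token budget is exhausted."""
    terms = set(query.lower().replace(',', ' ').split())
    ranked = sorted(
        chunks,
        key=lambda c: sum(t in c.lower().split() for t in terms),
        reverse=True,
    )
    selected, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())  # crude token estimate: one word ~= one token
        if used + n > max_ref_token:
            break
        selected.append(chunk)
        used += n
    return selected

docs = [
    'neural networks learn representations',
    'the stock market closed higher today',
    'transformers use attention mechanisms',
]
print(search(docs, 'neural networks, attention'))
```

The token-budget loop is why `max_ref_token` bounds the total retrieved context rather than the number of chunks.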

Search Strategies

Keyword Search (BM25)

Best for:
  • Exact term matching
  • Technical documents with specific terminology
  • When query contains unique keywords
retrieval = Retrieval(cfg={
    'rag_searchers': ['keyword_search']
})
Vector Search

Best for:
  • Semantic similarity
  • Conceptual queries
  • Multi-language documents
retrieval = Retrieval(cfg={
    'rag_searchers': ['vector_search']
})

Hybrid Search

Combines keyword and vector search for the best of both:
retrieval = Retrieval(cfg={
    'rag_searchers': ['hybrid_search']
})

Front-Page Search

Searches document titles, headers, and metadata:
retrieval = Retrieval(cfg={
    'rag_searchers': ['front_page_search']
})

Multiple Strategies

retrieval = Retrieval(cfg={
    'rag_searchers': ['vector_search', 'keyword_search']
})

Return Format

The tool returns a list of relevant chunks:
[
    {
        'content': 'Chunk text content here...',
        'metadata': {
            'source': 'path/to/document.pdf',
            'title': 'Document Title',
            'chunk_id': 0,
            'page_num': 1
        },
        'token': 234,
        'score': 0.89
    },
    # ... more chunks
]
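
Assuming chunks in the shape above, post-processing is straightforward: filter by score and total the tokens before passing content to an LLM. The field names mirror the example; verify them against your qwen-agent version.

```python
chunks = [
    {'content': 'Attention mechanisms weigh input tokens...',
     'metadata': {'source': 'paper.pdf', 'chunk_id': 0, 'page_num': 1},
     'token': 234, 'score': 0.89},
    {'content': 'Appendix: notation tables...',
     'metadata': {'source': 'paper.pdf', 'chunk_id': 7, 'page_num': 12},
     'token': 180, 'score': 0.12},
]

# Keep only confidently relevant chunks and check the context budget
relevant = [c for c in chunks if c['score'] >= 0.5]
total_tokens = sum(c['token'] for c in relevant)
print(len(relevant), total_tokens)  # 1 234
```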

Example: Document Q&A Agent

from qwen_agent.agents import Assistant

def create_qa_bot(documents):
    """Create a Q&A agent for specific documents."""
    bot = Assistant(
        llm={'model': 'qwen-plus-latest'},
        name='Document Q&A Assistant',
        description='Answer questions based on provided documents',
        system_message='You are a helpful assistant that answers questions based on the given documents. Always cite the source of your information.',
        function_list=['retrieval'],
        files=documents  # attach the documents as the bot's knowledge base
    )
    return bot

# Create bot
bot = create_qa_bot(['research_paper.pdf', 'documentation.html'])

# Interactive Q&A
messages = []
while True:
    query = input('Question: ')
    if query.lower() in ['exit', 'quit']:
        break
    
    messages.append({
        'role': 'user',
        'content': [
            {'text': query},
            {'file': 'research_paper.pdf'},
            {'file': 'documentation.html'}
        ]
    })
    
    response = []
    for response in bot.run(messages=messages):
        pass  # Stream responses
    
    print(f"Answer: {response[-1]['content']}")
    messages.extend(response)

Performance Optimization

Parsed documents are automatically cached. Subsequent queries on the same documents are much faster:
# First call - parses document (slow)
retrieval.call(params=json.dumps({'query': 'AI', 'files': ['large.pdf']}))

# Second call - uses cache (fast)
retrieval.call(params=json.dumps({'query': 'ML', 'files': ['large.pdf']}))
Adjust parser_page_size based on your needs:
  • Smaller chunks (300-500): Better precision, more chunks to search
  • Larger chunks (800-1200): More context per chunk, fewer chunks
retrieval = Retrieval(cfg={'parser_page_size': 800})
Set max_ref_token based on your LLM’s context window:
# For shorter context models
retrieval = Retrieval(cfg={'max_ref_token': 2000})

# For longer context models
retrieval = Retrieval(cfg={'max_ref_token': 8000})

Supported File Types

PDF

Full support including tables and multi-column layouts

Word (DOCX)

Text and tables extracted

PowerPoint (PPTX)

Slide content and tables

Plain Text (TXT)

Direct text processing

HTML

Web pages and documentation

CSV / TSV

Tabular data as markdown tables

Excel (XLSX/XLS)

All sheets with formatting preserved
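
Format detection (step 2 of Stage 1) amounts to a suffix lookup. The dispatch table below is illustrative, not the parser's actual internal mapping:

```python
from pathlib import Path

# Hypothetical dispatch table: file extension -> parser routine name
PARSERS = {
    '.pdf': 'pdf',
    '.docx': 'docx',
    '.pptx': 'pptx',
    '.txt': 'text',
    '.html': 'html',
    '.csv': 'table',
    '.tsv': 'table',
    '.xlsx': 'table',
    '.xls': 'table',
}

def pick_parser(path: str) -> str:
    """Choose a parser by file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f'Unsupported file type: {suffix}')
    return PARSERS[suffix]

print(pick_parser('Report.PDF'))  # pdf
```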

Troubleshooting

Missing dependencies

Install the RAG dependencies:
pip install "qwen-agent[rag]"

Slow parsing

Increase the chunk size to reduce processing time:
retrieval = Retrieval(cfg={'parser_page_size': 1000})

Irrelevant results

Try a different search strategy:
# Use hybrid search for better results
retrieval = Retrieval(cfg={'rag_searchers': ['hybrid_search']})
Also ensure your query contains relevant keywords.

Context too large

Reduce the token limits:
retrieval = Retrieval(cfg={
    'max_ref_token': 2000,
    'parser_page_size': 400
})

Related


Doc Parser

Low-level document parsing and chunking

Vector Search

Semantic search implementation

Keyword Search

BM25-based search

Hybrid Search

Combined search strategies
