The Retrieval tool implements RAG (Retrieval-Augmented Generation) to find and return relevant content from documents based on user queries. It combines document parsing, chunking, and intelligent search strategies.

Overview

Retrieval provides:
  • Multi-format Support: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
  • Intelligent Chunking: Automatic document splitting with overlap
  • Multiple Search Strategies: Vector, keyword (BM25), hybrid, and front-page search
  • Caching: Parsed documents are cached for efficiency
  • Token Management: Controls context size for LLM input

Registration

@register_tool('retrieval')
class Retrieval(BaseTool):
    ...
Tool Name: retrieval

Parameters

query
string
required
Keywords for searching relevant content. Use both English and Chinese keywords if documents are multilingual. Separate keywords with commas.
files
array
required
List of file paths to search. Supports local file paths and HTTP(S) URLs. Example: ["path/to/doc.pdf", "https://example.com/paper.pdf"]
value
string
Content value for storing data (used internally for data storage operations).
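
The query parameter expects comma-separated keywords, ideally in both English and Chinese when documents are multilingual. A tiny helper for assembling such a query (hypothetical, not part of qwen-agent):

```python
def build_query(*keywords: str) -> str:
    """Join keywords into the comma-separated query string the tool expects."""
    return ', '.join(keywords)

# Mix English and Chinese keywords for multilingual documents
query = build_query('machine learning', '机器学习', 'neural networks')
print(query)  # machine learning, 机器学习, neural networks
```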

Parameter Schema

{
  "type": "object",
  "properties": {
    "query": {
      "description": "在这里列出关键词,用逗号分隔,目的是方便在文档中匹配到相关的内容,由于文档可能多语言,关键词最好中英文都有。",
      "type": "string"
    },
    "files": {
      "description": "待解析的文件路径列表,支持本地文件路径或可下载的http(s)链接。",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "value": {
      "description": "数据的内容,仅存数据时需要",
      "type": "string"
    }
  },
  "required": ["query", "files"]
}

Configuration

max_ref_token
int
default:4000
Maximum number of tokens to return from retrieval. Controls the size of retrieved context.
parser_page_size
int
default:500
Target size (in tokens) for each document chunk. Larger values produce fewer, larger chunks with more context each.
rag_searchers
list
default:["keyword_search"]
List of search strategies to use. Options:
  • 'vector_search' - Semantic search using embeddings
  • 'keyword_search' - BM25-based keyword matching
  • 'hybrid_search' - Combination of vector and keyword
  • 'front_page_search' - Search document metadata

Dependencies

RAG support requires additional dependencies. Install with:
pip install "qwen-agent[rag]"
Required packages:
  • charset-normalizer - Character encoding detection
  • jieba - Chinese text segmentation
  • pdfminer - PDF parsing
  • pdfplumber - PDF table extraction
  • rank-bm25 - BM25 keyword search
  • snowballstemmer - Text stemming
  • beautifulsoup4 - HTML parsing
  • python-docx - DOCX parsing
  • python-pptx - PPTX parsing

Usage

Basic Retrieval

from qwen_agent.tools import Retrieval
import json

# Initialize the tool
retrieval = Retrieval()

# Search documents
params = {
    'query': 'machine learning, 机器学习, neural networks',
    'files': ['paper.pdf', 'notes.docx']
}

results = retrieval.call(params=json.dumps(params))
print(results)

With Custom Configuration

retrieval = Retrieval(cfg={
    'max_ref_token': 8000,
    'parser_page_size': 1000,
    'rag_searchers': ['hybrid_search']
})

params = {
    'query': 'transformers, attention mechanism',
    'files': ['https://arxiv.org/pdf/1706.03762.pdf']
}

results = retrieval.call(params=json.dumps(params))

Using with Agents

from qwen_agent.agents import Assistant

bot = Assistant(
    llm={'model': 'qwen-plus-latest'},
    function_list=['retrieval']
)

messages = [
    {
        'role': 'user',
        'content': [
            {'text': '介绍图一'},  # "Describe Figure 1"
            {'file': 'https://arxiv.org/pdf/1706.03762.pdf'}
        ]
    }
]

for response in bot.run(messages=messages):
    print(response)

How It Works

The Retrieval tool operates in two stages:

Stage 1: Document Parsing

  1. File Download: Remote URLs are downloaded to local cache
  2. Format Detection: File type is determined by extension
  3. Content Extraction: Text, tables, and structure are extracted
  4. Chunking: Document is split into manageable chunks
  5. Caching: Parsed content is stored for future use
This stage is handled by the DocParser tool.
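
The chunking step can be sketched in plain Python. This is an illustrative simplification: the real DocParser chunks by tokens rather than characters and preserves document structure, but the overlap mechanics are the same.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of roughly chunk_size units, with `overlap`
    units shared between consecutive chunks so content near a boundary
    appears in both neighbors."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = ''.join(chr(65 + i % 26) for i in range(1200))  # 1200-char sample document
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks))  # 3 overlapping chunks
```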

Stage 2: Search & Retrieval

  1. Query Processing: Keywords are analyzed
  2. Chunk Scoring: Each chunk is scored against the query
  3. Ranking: Top-scoring chunks are selected
  4. Token Limiting: Results are truncated to max_ref_token
  5. Formatting: Relevant chunks are returned
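
A toy version of the search stage, with simple term-overlap scoring standing in for BM25 and a word count standing in for real tokenization (names here are illustrative, not qwen-agent internals):

```python
def search(chunks: list[str], query: str, max_ref_token: int = 4000) -> list[str]:
    """Rank chunks by query-term overlap (a stand-in for BM25 scoring),
    then keep top-ranked chunks until the token budget is exhausted."""
    terms = set(query.lower().replace(',', ' ').split())
    ranked = sorted(
        chunks,
        key=lambda c: sum(t in c.lower().split() for t in terms),
        reverse=True,
    )
    selected, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())  # crude token estimate: one word ~= one token
        if used + n > max_ref_token:
            break
        selected.append(chunk)
        used += n
    return selected

docs = [
    'neural networks learn representations',
    'the stock market closed higher today',
    'transformers use attention mechanisms',
]
print(search(docs, 'neural networks, attention'))
```

The token-budget loop is why `max_ref_token` bounds the total retrieved context rather than the number of chunks.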

Search Strategies

Keyword Search (BM25)

Best for:
  • Exact term matching
  • Technical documents with specific terminology
  • When query contains unique keywords
retrieval = Retrieval(cfg={
    'rag_searchers': ['keyword_search']
})
Vector Search

Best for:
  • Semantic similarity
  • Conceptual queries
  • Multi-language documents
retrieval = Retrieval(cfg={
    'rag_searchers': ['vector_search']
})

Hybrid Search

Combines keyword and vector search for the best of both:
retrieval = Retrieval(cfg={
    'rag_searchers': ['hybrid_search']
})

Front-Page Search

Searches document titles, headers, and metadata:
retrieval = Retrieval(cfg={
    'rag_searchers': ['front_page_search']
})

Multiple Strategies

retrieval = Retrieval(cfg={
    'rag_searchers': ['vector_search', 'keyword_search']
})

Return Format

The tool returns a list of relevant chunks:
[
    {
        'content': 'Chunk text content here...',
        'metadata': {
            'source': 'path/to/document.pdf',
            'title': 'Document Title',
            'chunk_id': 0,
            'page_num': 1
        },
        'token': 234,
        'score': 0.89
    },
    # ... more chunks
]
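
Assuming chunks in the shape above, post-processing is straightforward: filter by score and total the tokens before passing content to an LLM. The field names mirror the example; verify them against your qwen-agent version.

```python
chunks = [
    {'content': 'Attention mechanisms weigh input tokens...',
     'metadata': {'source': 'paper.pdf', 'chunk_id': 0, 'page_num': 1},
     'token': 234, 'score': 0.89},
    {'content': 'Appendix: notation tables...',
     'metadata': {'source': 'paper.pdf', 'chunk_id': 7, 'page_num': 12},
     'token': 180, 'score': 0.12},
]

# Keep only confidently relevant chunks and check the context budget
relevant = [c for c in chunks if c['score'] >= 0.5]
total_tokens = sum(c['token'] for c in relevant)
print(len(relevant), total_tokens)  # 1 234
```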

Example: Document Q&A Agent

from qwen_agent.agents import Assistant

def create_qa_bot(documents):
    """Create a Q&A agent for specific documents."""
    bot = Assistant(
        llm={'model': 'qwen-plus-latest'},
        name='Document Q&A Assistant',
        description='Answer questions based on provided documents',
        system_message='You are a helpful assistant that answers questions based on the given documents. Always cite the source of your information.',
        function_list=['retrieval'],
        files=documents  # attach the documents as the bot's knowledge base
    )
    return bot

# Create bot
bot = create_qa_bot(['research_paper.pdf', 'documentation.html'])

# Interactive Q&A
messages = []
while True:
    query = input('Question: ')
    if query.lower() in ['exit', 'quit']:
        break
    
    messages.append({
        'role': 'user',
        'content': [
            {'text': query},
            {'file': 'research_paper.pdf'},
            {'file': 'documentation.html'}
        ]
    })
    
    response = []
    for response in bot.run(messages=messages):
        pass  # Stream responses
    
    print(f"Answer: {response[-1]['content']}")
    messages.extend(response)

Performance Optimization

Parsed documents are automatically cached. Subsequent queries on the same documents are much faster:
# First call - parses document (slow)
retrieval.call(params=json.dumps({'query': 'AI', 'files': ['large.pdf']}))

# Second call - uses cache (fast)
retrieval.call(params=json.dumps({'query': 'ML', 'files': ['large.pdf']}))
Adjust parser_page_size based on your needs:
  • Smaller chunks (300-500): Better precision, more chunks to search
  • Larger chunks (800-1200): More context per chunk, fewer chunks
retrieval = Retrieval(cfg={'parser_page_size': 800})
Set max_ref_token based on your LLM’s context window:
# For shorter context models
retrieval = Retrieval(cfg={'max_ref_token': 2000})

# For longer context models
retrieval = Retrieval(cfg={'max_ref_token': 8000})

Supported File Types

PDF

Full support including tables and multi-column layouts

Word (DOCX)

Text and tables extracted

PowerPoint (PPTX)

Slide content and tables

Plain Text (TXT)

Direct text processing

HTML

Web pages and documentation

CSV / TSV

Tabular data as markdown tables

Excel (XLSX/XLS)

All sheets with formatting preserved
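
Format detection (step 2 of Stage 1) amounts to a suffix lookup. The dispatch table below is illustrative, not the parser's actual internal mapping:

```python
from pathlib import Path

# Hypothetical dispatch table: file extension -> parser routine name
PARSERS = {
    '.pdf': 'pdf',
    '.docx': 'docx',
    '.pptx': 'pptx',
    '.txt': 'text',
    '.html': 'html',
    '.csv': 'table',
    '.tsv': 'table',
    '.xlsx': 'table',
    '.xls': 'table',
}

def pick_parser(path: str) -> str:
    """Choose a parser by file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f'Unsupported file type: {suffix}')
    return PARSERS[suffix]

print(pick_parser('Report.PDF'))  # pdf
```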

Troubleshooting

Missing dependencies

Install the RAG dependencies:
pip install "qwen-agent[rag]"

Slow parsing

Increase the chunk size to reduce processing time:
retrieval = Retrieval(cfg={'parser_page_size': 1000})

Irrelevant results

Try a different search strategy:
# Use hybrid search for better results
retrieval = Retrieval(cfg={'rag_searchers': ['hybrid_search']})
Also ensure your query contains relevant keywords.

Context too large

Reduce the token limits:
retrieval = Retrieval(cfg={
    'max_ref_token': 2000,
    'parser_page_size': 400
})

Related


Doc Parser

Low-level document parsing and chunking

Vector Search

Semantic search implementation

Keyword Search

BM25-based search

Hybrid Search

Combined search strategies
