
Overview

Qwen models support extended context lengths up to 32K tokens, enabling processing of long documents, extensive conversations, and large codebases. Different model sizes support different context lengths:
| Model | Max Context Length | Special Features |
|-----------|--------------------|------------------------|
| Qwen-1.8B | 32K | System prompt support |
| Qwen-7B | 32K | Extended from 8K |
| Qwen-14B | 8K | Standard context |
| Qwen-72B | 32K | System prompt support |

Context Extension Techniques

Qwen employs several advanced techniques to extend context length effectively:

NTK-Aware Interpolation

NTK (Neural Tangent Kernel) aware interpolation adapts the positional encoding to longer sequences without degrading performance on shorter sequences.
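A minimal sketch of the idea (not Qwen's internal implementation): instead of linearly interpolating positions, the rotary base is raised by `scale**(dim / (dim - 2))`, which slows the low-frequency (long-range) components by roughly the scale factor while leaving the highest-frequency (local) components almost unchanged.

```python
def ntk_scaled_inv_freq(dim, base=10000.0, scale=1.0):
    """RoPE inverse frequencies with an NTK-aware scaled base.

    Raising the rotary base by scale**(dim / (dim - 2)) stretches the
    low-frequency (long-range) rotations by ~`scale` while leaving the
    highest-frequency (local) rotations almost unchanged, so behaviour
    on short sequences is preserved.
    """
    ntk_base = base * scale ** (dim / (dim - 2))
    return [1.0 / ntk_base ** (2 * i / dim) for i in range(dim // 2)]

orig = ntk_scaled_inv_freq(128)            # original base 10000
ext = ntk_scaled_inv_freq(128, scale=4.0)  # ~4x context extension

# The i = 0 (most local) frequency is identical in both lists, while
# the last (most global) frequency is exactly 4x slower after scaling.
```

This is why NTK-aware scaling degrades short-sequence quality far less than plain position interpolation, which slows every frequency uniformly.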

Window Attention

Window attention mechanisms allow the model to efficiently process longer sequences by focusing on relevant segments.
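As a simplified illustration of the mechanism (not Qwen's actual attention kernel), a sliding-window causal mask restricts each query token to its most recent neighbors, making attention cost linear in sequence length:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal attention mask where each token may attend only to itself
    and the `window - 1` tokens immediately before it (True = allowed)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True
    return mask

# With window=3, token 5 sees only tokens 3, 4 and 5; future tokens
# are always masked out (causal).
m = sliding_window_mask(6, window=3)
```

Stacking several such layers still lets information propagate across the full sequence, since each layer widens the effective receptive field by another window.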

LogN Attention Scaling

Logarithmic scaling of attention scores helps maintain stable training and inference across different context lengths.
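A sketch of the scaling factor, assuming a hypothetical training length of 2048 tokens: queries are multiplied by log(seq_len) / log(train_len) once the sequence exceeds the training length, which compensates for the entropy growth of softmax attention over more keys.

```python
import math

def logn_scale(seq_len, train_len=2048):
    """LogN attention scaling factor applied to the queries.

    Sequences no longer than the training length are left untouched;
    longer sequences are scaled by log(seq_len) / log(train_len) to
    keep attention entropy roughly stable as the context grows.
    """
    return max(1.0, math.log(seq_len) / math.log(train_len))

# At 32K tokens with a 2048-token training length, the factor is
# log(2**15) / log(2**11) = 15/11 ~= 1.36.
```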

RoPE with Extended Base

For Qwen-72B, we adapt Rotary Position Embeddings (RoPE) with a larger rotary base to support 32K tokens:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-72B automatically handles 32K context
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-72B-Chat",
    trust_remote_code=True
)

# No special configuration needed for long context
response, _ = model.chat(tokenizer, long_text_query, history=None)

Perplexity Performance

We evaluated Qwen models on the arXiv dataset with different context lengths:
| Context Length | Perplexity |
|----------------|------------|
| 1K | 4.03 |
| 2K | 3.78 |
| 4K | 3.58 |
| 8K | 3.53 |
| 16K | 3.45 |
| 32K | 3.43 |
Qwen-7B maintains strong performance up to 32K tokens with minimal perplexity increase.
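For reference, perplexity is the exponential of the mean negative log-likelihood per token; lower is better. A minimal sketch of the computation (not the evaluation harness used for the numbers above):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    A model that assigns every token probability p uniformly scores a
    perplexity of exactly 1/p, so lower values mean more confident,
    more accurate next-token predictions.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# E.g. assigning each token probability 0.25 gives perplexity 4.0.
```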

Long Context Understanding Evaluation

Qwen-72B-Chat was evaluated on L-Eval benchmark for long text understanding:
| Model | Context Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction |
|-------|----------------|---------|----------|-----|---------|-------|-------|----------|
| Qwen-72B-Chat | 32K | 62.30 | 58.13 | 76.00 | 77.22 | 86.24 | 6.66 | 69.53 |
| GPT-3.5-Turbo-16K | 16K | 54.19 | 60.03 | 69.00 | 61.83 | 78.43 | 11.58 | 63.01 |
| Claude-1.3 | 100K | 60.14 | 66.61 | 84.00 | 72.65 | 75.36 | 6.11 | 63.36 |
Qwen-72B-Chat retrieves information accurately from all positions within its 32K context window, demonstrating robust long-context capability.

Using Long Context in Practice

Processing Long Documents

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Load long document
with open('long_document.txt', 'r') as f:
    document = f.read()

# Ask questions about the document
query = f"""Based on this document:

{document}

Question: What are the main conclusions?"""

response, _ = model.chat(tokenizer, query, history=None)
print(response)

Multi-Document Analysis

# Combine multiple documents
documents = []
for i in range(5):
    with open(f'document_{i}.txt', 'r') as f:
        documents.append(f.read())

combined = "\n\n---\n\n".join([
    f"Document {i+1}:\n{doc}"
    for i, doc in enumerate(documents)
])

query = f"""{combined}

Please provide a comprehensive summary of all documents above."""

response, _ = model.chat(tokenizer, query, history=None)

Extended Conversations

# Maintain long conversation history
history = []

for turn in range(50):  # Many conversation turns
    user_input = input("You: ")  # or however user input is collected
    response, history = model.chat(tokenizer, user_input, history=history)
    print(f"Qwen: {response}")
    
    # Check context length (history is a list of (query, response) pairs)
    context_tokens = sum(
        len(tokenizer.encode(q)) + len(tokenizer.encode(r))
        for q, r in history
    )
    print(f"Context tokens: {context_tokens}")
    
    if context_tokens > 28000:  # Leave buffer before 32K limit
        # Summarize and reset
        summary_prompt = "Please summarize our conversation so far."
        summary, _ = model.chat(tokenizer, summary_prompt, history=history)
        history = [(summary_prompt, summary)]  # Start fresh with the summary

Code Analysis

# Analyze large codebases
import os
import glob

def collect_code_files(directory, extension=".py"):
    """Collect all code files from directory."""
    code_files = []
    for filepath in glob.glob(f"{directory}/**/*{extension}", recursive=True):
        with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
            code_files.append({
                'path': filepath,
                'content': f.read()
            })
    return code_files

# Collect files
files = collect_code_files('./my_project')

# Combine into single context
codebase = "\n\n".join([
    f"# File: {f['path']}\n{f['content']}"
    for f in files
])

query = f"""Analyze this codebase:

{codebase}

Provide:
1. Overview of architecture
2. Main components and their responsibilities
3. Potential improvements
4. Security concerns"""

response, _ = model.chat(tokenizer, query, history=None)

Memory Optimization for Long Context

KV Cache Quantization

Reduce memory usage when processing long contexts:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # Enable KV cache quantization
    use_cache_kernel=True,
    use_flash_attn=False  # Cannot use with KV cache quantization
).eval()
Memory savings with KV cache quantization:

| Sequence Length | Without Quantization | With Quantization | Savings |
|-----------------|----------------------|-------------------|---------|
| 512 | 15.2 GB | 15.0 GB | 200 MB |
| 1024 | 16.3 GB | 15.5 GB | 800 MB |
| 2048 | 17.6 GB | 15.8 GB | 1.8 GB |
| 4096 | 19.5 GB | 16.6 GB | 2.9 GB |
| 8192 | 23.2 GB | 17.6 GB | 5.6 GB |

Batch Size Optimization

| Batch Size | Memory Usage |
|------------|--------------|
| 1 | 16.3 GB |
| 4 | 24.1 GB |
| 16 | 31.7 GB |
| 32 | 48.7 GB |
| 64 | OOM |

Best Practices for Long Context

Chunk Strategically

For extremely long documents, chunk logically and process with overlap
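A minimal overlap-chunking helper. It assumes only a tokenizer exposing `encode`/`decode` (such as the Qwen tokenizer loaded earlier); the toy whitespace tokenizer below is just a stand-in to show the chunk geometry.

```python
def chunk_tokens(tokenizer, text, chunk_size=6000, overlap=500):
    """Split `text` into overlapping token-level chunks so that content
    spanning a chunk boundary appears in two consecutive chunks."""
    tokens = tokenizer.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokenizer.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens
    return chunks

# Toy whitespace "tokenizer" purely to demonstrate the geometry;
# in practice pass the real Qwen tokenizer instead.
class _WordTok:
    def encode(self, t): return t.split()
    def decode(self, toks): return " ".join(toks)

demo = chunk_tokens(_WordTok(), " ".join(str(i) for i in range(100)),
                    chunk_size=40, overlap=10)
# Three chunks covering tokens 0-39, 30-69 and 60-99.
```

Each chunk can then be sent to `model.chat` separately and the per-chunk answers combined or summarized in a final pass.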

Use Summarization

Summarize earlier parts of long conversations to manage context

Monitor Token Count

Track token usage to avoid hitting context limits

Enable KV Quantization

Use KV cache quantization for longer sequences

Token Management

def manage_context(tokenizer, text, max_tokens=30000):
    """
    Ensure text fits within token limit.
    
    Args:
        tokenizer: Qwen tokenizer
        text: Input text
        max_tokens: Maximum allowed tokens
    
    Returns:
        Truncated text if necessary
    """
    tokens = tokenizer.encode(text)
    
    if len(tokens) > max_tokens:
        # Truncate from beginning (keep most recent)
        tokens = tokens[-max_tokens:]
        text = tokenizer.decode(tokens)
        print(f"Warning: Text truncated to {max_tokens} tokens")
    
    return text

# Usage
processed_text = manage_context(tokenizer, very_long_text)
response, _ = model.chat(tokenizer, processed_text, history=None)

Performance Considerations

Important Notes:
  • Memory: Long contexts require significant GPU memory. Consider using multiple GPUs or KV cache quantization
  • Speed: Generation speed decreases with longer contexts due to attention computation
  • Quality: While Qwen maintains strong performance at long contexts, accuracy may vary by task
  • Flash Attention: Using Flash Attention can significantly improve speed and memory efficiency
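For example, the Qwen remote code accepts a `use_flash_attn` flag at load time. This is a configuration sketch only: it requires the `flash-attn` package to be installed, and as noted above it cannot be combined with KV cache quantization.

```python
from transformers import AutoModelForCausalLM

# `use_flash_attn` is read by Qwen's custom model code; it requires the
# flash-attn package and is incompatible with use_cache_quantization.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=True,
).eval()
```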

Supported Models

Long context support by model:
  • Qwen-1.8B: 32K tokens
  • Qwen-7B: 32K tokens (extended from 8K)
  • ⚠️ Qwen-14B: 8K tokens
  • Qwen-72B: 32K tokens

Next Steps

System Prompts

Use system prompts to guide long context processing

Agent Building

Build agents that leverage long context
