Beyond the basic web demo, Qwen can be integrated with Gradio in advanced ways to create sophisticated applications. This guide covers custom implementations, optimization techniques, and production-ready patterns.

Overview

The Qwen web demo (web_demo.py) serves as a foundation for building custom Gradio applications. This page explores advanced patterns, customizations, and best practices for production deployments.

Architecture

The Gradio integration uses several key components:
Gradio Interface
    ├── UI Components (Blocks API)
    │   ├── Chatbot widget
    │   ├── Textbox input
    │   └── Action buttons
    ├── State Management
    │   ├── Conversation history
    │   └── Task state
    ├── Text Processing
    │   ├── Markdown rendering (mdtex2html)
    │   └── Code highlighting
    └── Model Integration
        ├── Streaming generation
        └── Memory management

Custom Text Processing

Markdown Enhancement

The demo uses mdtex2html for enhanced markdown rendering:
web_demo.py:64
def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert(message),
            None if response is None else mdtex2html.convert(response),
        )
    return y

gr.Chatbot.postprocess = postprocess
This enables:
  • LaTeX equation rendering
  • Enhanced table formatting
  • Better code block styling
  • Proper handling of special characters
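The monkey-patch pattern above can be exercised without Gradio. A minimal sketch with a stand-in chatbot class and a hypothetical `fake_convert` in place of `mdtex2html.convert` (neither is part of the demo):

```python
class FakeChatbot:
    """Stand-in for gr.Chatbot, just to illustrate the monkey-patch pattern."""
    def postprocess(self, y):
        return y

def fake_convert(text):
    # Hypothetical substitute for mdtex2html.convert.
    return f"<p>{text}</p>"

def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else fake_convert(message),
            None if response is None else fake_convert(response),
        )
    return y

# Same patch as `gr.Chatbot.postprocess = postprocess` in the demo.
FakeChatbot.postprocess = postprocess

print(FakeChatbot().postprocess([("hi", None)]))  # [('<p>hi</p>', None)]
```

Because Gradio calls `postprocess` on every render, the patch applies to all `Chatbot` instances created afterward.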

Code Block Formatting

The demo includes custom logic to properly format code blocks with syntax highlighting.
web_demo.py:82
if "```" in line:
    count += 1
    items = line.split("`")
    if count % 2 == 1:
        lines[i] = f'<pre><code class="language-{items[-1]}">'
    else:
        lines[i] = f"<br></code></pre>"
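The snippet above relies on surrounding state (`count`, `lines`). A self-contained sketch of the same fence-handling idea (`format_code_fences` is a hypothetical helper, simplified from the demo's `_parse_text`):

```python
def format_code_fences(text: str) -> str:
    # Alternate between opening and closing fences: odd occurrences of ```
    # open a <pre><code> block (capturing the language tag after the
    # backticks), even occurrences close it.
    lines = text.split("\n")
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split("`")
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = "<br></code></pre>"
    return "\n".join(lines)

print(format_code_fences("```python\nprint(1)\n```"))
```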

Special Character Handling

Inside code blocks, special characters are escaped:
web_demo.py:93
line = line.replace("`", r"\`")
line = line.replace("<", "&lt;")
line = line.replace(">", "&gt;")
line = line.replace(" ", "&nbsp;")
line = line.replace("*", "&ast;")
line = line.replace("_", "&lowbar;")
line = line.replace("-", "&#45;")

State Management

Conversation History

Gradio’s State component maintains conversation context:
web_demo.py:173
task_history = gr.State([])
The history structure:
task_history = [
    ("User message 1", "Assistant response 1"),
    ("User message 2", "Assistant response 2"),
    # ...
]

Display vs. Task History

The demo maintains two separate histories:
  1. Chatbot Display (_chatbot): Formatted for UI display
  2. Task History (_task_history): Raw text for model context
web_demo.py:120
def predict(_query, _chatbot, _task_history):
    print(f"User: {_parse_text(_query)}")
    _chatbot.append((_parse_text(_query), ""))  # Display
    
    # ... generation ...
    
    _task_history.append((_query, full_response))  # Raw history
This separation ensures:
  • Clean display with formatting
  • Accurate model context without HTML
  • Independent management of each
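A minimal sketch of the two-history pattern, using `html.escape` as a stand-in for the demo's display formatting (`to_display` is hypothetical, not the actual `_parse_text`):

```python
import html

def to_display(text: str) -> str:
    # Stand-in for the demo's display formatting: escape HTML so
    # model output renders safely in the Chatbot widget.
    return html.escape(text)

chatbot_history = []  # formatted for the UI
task_history = []     # raw text fed back to the model

query = "What does <br> do?"
response = "It inserts a line break."

chatbot_history.append((to_display(query), to_display(response)))
task_history.append((query, response))

print(chatbot_history[0][0])  # What does &lt;br&gt; do?
print(task_history[0][0])     # What does <br> do?
```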

Streaming Implementation

Real-Time Response Generation

The demo implements streaming using Python generators:
web_demo.py:124
for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
    _chatbot[-1] = (_parse_text(_query), _parse_text(response))
    yield _chatbot
    full_response = _parse_text(response)
Each yield statement updates the UI in real-time, creating a smooth streaming effect.

Benefits of Streaming

  • Immediate Feedback: Users see responses start appearing instantly
  • Better UX: Reduces perceived latency
  • Interruptible: Can stop generation if needed
  • Progress Indication: Shows the model is working
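The streaming loop can be simulated without loading a model. A sketch with a hypothetical `fake_chat_stream` in place of `model.chat_stream`, showing how each `yield` hands a fresh chatbot state to the UI:

```python
from typing import Iterator, List, Tuple

def fake_chat_stream(query: str) -> Iterator[str]:
    # Stand-in for model.chat_stream: yields the response as a
    # growing prefix, the way the real streamer does.
    partial = ""
    for token in ["Stream", "ing ", "works."]:
        partial += token
        yield partial

def predict(query: str, chatbot: List[Tuple[str, str]]):
    chatbot.append((query, ""))
    for partial in fake_chat_stream(query):
        chatbot[-1] = (query, partial)
        yield chatbot  # each yield triggers a UI refresh in Gradio

history = []
for state in predict("hi", history):
    pass  # Gradio consumes these intermediate states

print(history[-1][1])  # Streaming works.
```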

UI Components

Custom Branding

The interface includes Qwen branding:
web_demo.py:152
gr.Markdown("""
<p align="center"><img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" style="height: 80px"/><p>""")
gr.Markdown("""<center><font size=8>Qwen-Chat Bot</center>""")
The demo displays links to model resources:
web_demo.py:160
gr.Markdown("""
<center><font size=4>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 </a> | 
<a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp | 
Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 </a> | 
<a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | 
...
</center>""")

Action Buttons

Three main buttons control the interface:
web_demo.py:175
with gr.Row():
    empty_btn = gr.Button("🧹 Clear History (清除历史)")
    submit_btn = gr.Button("🚀 Submit (发送)")
    regen_btn = gr.Button("🤔️ Regenerate (重试)")

Button Event Handlers

web_demo.py:180
submit_btn.click(predict, [query, chatbot, task_history], [chatbot], show_progress=True)
submit_btn.click(reset_user_input, [], [query])
empty_btn.click(reset_state, [chatbot, task_history], outputs=[chatbot], show_progress=True)
regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)

Custom Implementations

Adding System Prompts

Extend the demo to support custom system prompts:
def predict_with_system(_query, _system_prompt, _chatbot, _task_history):
    # Prepend system prompt to history
    if _system_prompt and len(_task_history) == 0:
        _task_history.append((f"<system>{_system_prompt}</system>", "Understood."))
    
    # Continue with normal prediction
    for response in model.chat_stream(tokenizer, _query, history=_task_history):
        # ...
Add to UI:
system_prompt = gr.Textbox(label="System Prompt", placeholder="You are a helpful assistant...")

Multi-Model Support

Allow users to switch between models:
models = {
    "Qwen-7B-Chat": load_model("Qwen/Qwen-7B-Chat"),
    "Qwen-14B-Chat": load_model("Qwen/Qwen-14B-Chat"),
}

def predict_multi_model(_query, _model_name, _chatbot, _task_history):
    model = models[_model_name]
    # Use selected model for generation
Add model selector:
model_dropdown = gr.Dropdown(
    choices=["Qwen-7B-Chat", "Qwen-14B-Chat"],
    value="Qwen-7B-Chat",
    label="Model"
)

Generation Configuration UI

Add controls for generation parameters:
with gr.Accordion("Generation Settings", open=False):
    temperature = gr.Slider(0.1, 2.0, value=0.7, step=0.1, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.8, step=0.05, label="Top-p")
    max_tokens = gr.Slider(128, 4096, value=2048, step=128, label="Max Tokens")

from transformers import GenerationConfig

def predict_with_config(_query, _temp, _top_p, _max_tokens, _chatbot, _task_history):
    # Update generation config
    config = GenerationConfig(
        temperature=_temp,
        top_p=_top_p,
        max_new_tokens=_max_tokens
    )
    for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
        # ...

Export Conversation

Add functionality to export chat history:
import json
from datetime import datetime

def export_conversation(_task_history):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"qwen_chat_{timestamp}.json"
    
    data = {
        "timestamp": timestamp,
        "conversation": [
            {"role": role, "content": content}
            for user_msg, bot_msg in _task_history
            for role, content in (("user", user_msg), ("assistant", bot_msg))
        ]
    }
    
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    
    return filename

export_btn = gr.Button("💾 Export")
export_file = gr.File(label="Exported Conversation")
export_btn.click(export_conversation, [task_history], [export_file])

Performance Optimization

Model Loading

Optimize model loading for faster startup:
# Cache models globally
_model_cache = {}

def get_model(checkpoint_path):
    if checkpoint_path not in _model_cache:
        model = AutoModelForCausalLM.from_pretrained(
            checkpoint_path,
            device_map="auto",
            trust_remote_code=True
        ).eval()
        _model_cache[checkpoint_path] = model
    return _model_cache[checkpoint_path]

Memory Management

Implement aggressive memory management:
import torch
import gc

def aggressive_cleanup():
    """Thorough memory cleanup"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Call after clearing history or long conversations
empty_btn.click(lambda: aggressive_cleanup(), None, None)

Response Caching

Cache common responses to reduce computation:
import hashlib

response_cache = {}

def get_cache_key(query, history):
    content = query + str(history)
    return hashlib.md5(content.encode()).hexdigest()

def predict_with_cache(_query, _chatbot, _task_history):
    cache_key = get_cache_key(_query, _task_history)
    
    if cache_key in response_cache:
        response = response_cache[cache_key]
        _chatbot.append((_parse_text(_query), _parse_text(response)))
        return _chatbot
    
    # Normal generation...
    response_cache[cache_key] = full_response
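Note that the plain dict above grows without bound. A sketch of a bounded, LRU-style alternative (`BoundedCache` is hypothetical, not part of the demo):

```python
from collections import OrderedDict

class BoundedCache:
    """LRU-style cache so stored responses don't grow without limit."""

    def __init__(self, max_entries: int = 256):
        self.max_entries = max_entries
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

cache = BoundedCache(max_entries=2)
cache.put("a", "1")
cache.put("b", "2")
cache.put("c", "3")   # evicts "a"
print(cache.get("a"))  # None
print(cache.get("c"))  # 3
```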

Concurrent Request Handling

Gradio’s queue system handles concurrency, but you can optimize:
demo.queue(
    concurrency_count=4,  # Process up to 4 requests simultaneously
    max_size=20,          # Queue up to 20 requests
).launch(
    # launch options...
)

Production Best Practices

Error Handling

Implement robust error handling:
def predict_safe(_query, _chatbot, _task_history):
    try:
        for response in model.chat_stream(tokenizer, _query, history=_task_history):
            _chatbot[-1] = (_parse_text(_query), _parse_text(response))
            yield _chatbot
    except Exception as e:
        error_msg = f"Error: {str(e)}"
        _chatbot[-1] = (_parse_text(_query), error_msg)
        yield _chatbot
        print(f"Error in generation: {e}")

Logging

Add comprehensive logging:
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('qwen_demo.log'),
        logging.StreamHandler()
    ]
)

def predict_with_logging(_query, _chatbot, _task_history):
    logging.info(f"User query: {_query}")
    
    start_time = time.time()
    for response in model.chat_stream(...):
        # ...
    
    duration = time.time() - start_time
    logging.info(f"Generation completed in {duration:.2f}s")

Rate Limiting

Protect against abuse:
from collections import defaultdict
import time

user_requests = defaultdict(list)
RATE_LIMIT = 10  # requests per minute

def check_rate_limit(user_id):
    now = time.time()
    # Remove old requests
    user_requests[user_id] = [
        t for t in user_requests[user_id] 
        if now - t < 60
    ]
    
    if len(user_requests[user_id]) >= RATE_LIMIT:
        return False
    
    user_requests[user_id].append(now)
    return True
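A quick self-contained check of the limiter above (same logic, with the limit lowered for illustration):

```python
import time
from collections import defaultdict

user_requests = defaultdict(list)
RATE_LIMIT = 3  # lowered from 10 for this demonstration

def check_rate_limit(user_id):
    now = time.time()
    # Drop requests older than the 60-second window
    user_requests[user_id] = [
        t for t in user_requests[user_id]
        if now - t < 60
    ]
    if len(user_requests[user_id]) >= RATE_LIMIT:
        return False
    user_requests[user_id].append(now)
    return True

results = [check_rate_limit("alice") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```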

Health Monitoring

Add health check endpoint:
def health_check():
    try:
        # Simple model test
        test_response = model.chat(tokenizer, "Hi", history=None)[0]
        return "✓ Healthy"
    except Exception as e:
        return f"✗ Unhealthy: {e}"

health_status = gr.Textbox(label="System Health", interactive=False)
health_btn = gr.Button("Check Health")
health_btn.click(health_check, None, health_status)

Integration Examples

With Authentication

demo.queue().launch(
    auth=[('admin', 'password123'), ('user', 'userpass')],
    auth_message="Enter credentials to access Qwen Chat",
    server_port=8000,
)

With Analytics

import analytics

def predict_with_analytics(_query, _chatbot, _task_history):
    # Track usage
    analytics.track('chat_message', {
        'query_length': len(_query),
        'history_length': len(_task_history)
    })
    
    # Normal prediction
    yield from predict(_query, _chatbot, _task_history)

With Database Storage

import sqlite3

def save_conversation(_task_history, user_id):
    conn = sqlite3.connect('conversations.db')
    cursor = conn.cursor()
    
    for user_msg, bot_msg in _task_history:
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'user', user_msg)
        )
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'assistant', bot_msg)
        )
    
    conn.commit()
    conn.close()
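The INSERT statements above assume a `messages` table already exists. A one-time setup sketch (the schema below is an assumption to match those statements, not taken from the demo):

```python
import sqlite3

def init_db(path: str = "conversations.db") -> sqlite3.Connection:
    # Hypothetical schema matching the INSERT statements used by
    # save_conversation; adjust columns to your needs.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT NOT NULL,
               role TEXT NOT NULL,
               content TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    conn.commit()
    return conn
```

Call `init_db()` once at startup, before any handler writes to the database.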

Troubleshooting

If memory usage grows during long conversations, trim the history periodically:
MAX_HISTORY_LENGTH = 20

def trim_history(_task_history):
    if len(_task_history) > MAX_HISTORY_LENGTH:
        _task_history = _task_history[-MAX_HISTORY_LENGTH:]
    return _task_history
If generation is slow, profile it to find the bottleneck:
import time

start = time.time()
response = model.chat(...)
print(f"Generation took: {time.time() - start:.2f}s")
To speed up generation, consider:
  • Using quantized models
  • Enabling Flash Attention
  • Reducing max tokens
  • Batch processing
If streaming output never appears in the UI, make sure the handler yields updates:
for response in model.chat_stream(...):
    _chatbot[-1] = (query, response)
    yield _chatbot  # This is crucial!

Source Code Reference

Key files in the Qwen repository:
  • Main demo: web_demo.py:1
  • Text processing: web_demo.py:78
  • Prediction function: web_demo.py:119
  • UI definition: web_demo.py:151

Next Steps

CLI Demo

Explore the command-line interface

API Reference

Learn about the model API

Deployment Guide

Deploy Qwen in production

Examples

More examples on GitHub
