Beyond the basic web demo, Qwen can be integrated with Gradio in advanced ways to create sophisticated applications. This guide covers custom implementations, optimization techniques, and production-ready patterns.
Overview
The Qwen web demo (web_demo.py) serves as a foundation for building custom Gradio applications. This page explores advanced patterns, customizations, and best practices for production deployments.
Architecture
The Gradio integration uses several key components:
Gradio Interface
├── UI Components (Blocks API)
│   ├── Chatbot widget
│   ├── Textbox input
│   └── Action buttons
├── State Management
│   ├── Conversation history
│   └── Task state
├── Text Processing
│   ├── Markdown rendering (mdtex2html)
│   └── Code highlighting
└── Model Integration
    ├── Streaming generation
    └── Memory management
Custom Text Processing
Markdown Enhancement
The demo uses mdtex2html for enhanced markdown rendering:
def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert(message),
            None if response is None else mdtex2html.convert(response),
        )
    return y

gr.Chatbot.postprocess = postprocess
This enables:
LaTeX equation rendering
Enhanced table formatting
Better code block styling
Proper handling of special characters
Code Block Formatting
The demo includes custom logic to format code blocks with syntax highlighting:
if "```" in line:
count += 1
items = line.split( "`" )
if count % 2 == 1 :
lines[i] = f '<pre><code class="language- { items[ - 1 ] } ">'
else :
lines[i] = f "<br></code></pre>"
Special Character Handling
Inside code blocks, special characters are escaped as HTML entities so they render literally:
line = line.replace("`", r"\`")
line = line.replace("<", "&lt;")
line = line.replace(">", "&gt;")
line = line.replace(" ", "&nbsp;")
line = line.replace("*", "&ast;")
line = line.replace("_", "&lowbar;")
line = line.replace("-", "&#45;")
State Management
Conversation History
Gradio’s State component maintains conversation context:
task_history = gr.State([])
The history structure:
task_history = [
    ("User message 1", "Assistant response 1"),
    ("User message 2", "Assistant response 2"),
    # ...
]
Display vs. Task History
The demo maintains two separate histories:
Chatbot Display (_chatbot): Formatted for UI display
Task History (_task_history): Raw text for model context
def predict(_query, _chatbot, _task_history):
    print(f"User: {_parse_text(_query)}")
    _chatbot.append((_parse_text(_query), ""))  # Display history
    # ... generation ...
    _task_history.append((_query, full_response))  # Raw history
This separation ensures:
Clean display with formatting
Accurate model context without HTML
Independent management of each
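The difference is easy to see with a toy stand-in for _parse_text (here just html.escape plus newline handling; the real function also formats code fences):

```python
import html

def parse_text_for_display(text):
    # Simplified stand-in for the demo's _parse_text (assumption)
    return html.escape(text).replace("\n", "<br>")

query = "What does <br> do?"
display_history = [(parse_text_for_display(query), "")]  # safe for the Chatbot widget
task_history = [(query, "")]                             # raw text for the model

print(display_history[0][0])  # What does &lt;br&gt; do?
print(task_history[0][0])     # What does <br> do?
```

The widget receives escaped HTML it can render safely, while the model sees the user's text unchanged.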
Streaming Implementation
Real-Time Response Generation
The demo implements streaming using Python generators:
for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
    _chatbot[-1] = (_parse_text(_query), _parse_text(response))
    yield _chatbot
full_response = _parse_text(response)
Each yield statement updates the UI in real-time, creating a smooth streaming effect.
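The pattern is easiest to see with a stand-in for chat_stream that yields the growing partial response (fake_chat_stream is hypothetical):

```python
def fake_chat_stream(query):
    # Stand-in for model.chat_stream: each yield is the cumulative partial response
    partial = ""
    for token in ["Streaming ", "works", "!"]:
        partial += token
        yield partial

def predict_demo(query, chatbot):
    chatbot.append((query, ""))
    for response in fake_chat_stream(query):
        chatbot[-1] = (query, response)
        yield list(chatbot)  # each yield triggers a UI refresh in Gradio

updates = list(predict_demo("hi", []))
print(updates[-1])  # [('hi', 'Streaming works!')]
```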
Benefits of Streaming
Immediate Feedback : Users see responses start appearing instantly
Better UX : Reduces perceived latency
Interruptible : Can stop generation if needed
Progress Indication : Shows the model is working
UI Components
Custom Branding
The interface includes Qwen branding:
gr.Markdown( """
<p align="center"><img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" style="height: 80px"/><p>""" )
gr.Markdown( """<center><font size=8>Qwen-Chat Bot</center>""" )
Model Links
The demo displays links to model resources:
gr.Markdown( """
<center><font size=4>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 </a> |
<a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>  |
Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 </a> |
<a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>  |
...
</center>""" )
Three main buttons control the interface:
with gr.Row():
    empty_btn = gr.Button("🧹 Clear History (清除历史)")
    submit_btn = gr.Button("🚀 Submit (发送)")
    regen_btn = gr.Button("🤔️ Regenerate (重试)")

submit_btn.click(predict, [query, chatbot, task_history], [chatbot], show_progress=True)
submit_btn.click(reset_user_input, [], [query])
empty_btn.click(reset_state, [chatbot, task_history], outputs=[chatbot], show_progress=True)
regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
Custom Implementations
Adding System Prompts
Extend the demo to support custom system prompts:
def predict_with_system(_query, _system_prompt, _chatbot, _task_history):
    # Prepend the system prompt to an empty history
    if _system_prompt and len(_task_history) == 0:
        _task_history.append((f"<system>{_system_prompt}</system>", "Understood."))
    # Continue with normal prediction
    for response in model.chat_stream(tokenizer, _query, history=_task_history):
        # ...
Add to UI:
system_prompt = gr.Textbox(label="System Prompt", placeholder="You are a helpful assistant...")
Multi-Model Support
Allow users to switch between models:
models = {
    "Qwen-7B-Chat": load_model("Qwen/Qwen-7B-Chat"),
    "Qwen-14B-Chat": load_model("Qwen/Qwen-14B-Chat"),
}

def predict_multi_model(_query, _model_name, _chatbot, _task_history):
    model = models[_model_name]
    # Use the selected model for generation
Add model selector:
model_dropdown = gr.Dropdown(
    choices=["Qwen-7B-Chat", "Qwen-14B-Chat"],
    value="Qwen-7B-Chat",
    label="Model",
)
Generation Configuration UI
Add controls for generation parameters:
with gr.Accordion("Generation Settings", open=False):
    temperature = gr.Slider(0.1, 2.0, value=0.7, step=0.1, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.8, step=0.05, label="Top-p")
    max_tokens = gr.Slider(128, 4096, value=2048, step=128, label="Max Tokens")

def predict_with_config(_query, _temp, _top_p, _max_tokens, _chatbot, _task_history):
    # Build a generation config from the slider values
    config = GenerationConfig(
        temperature=_temp,
        top_p=_top_p,
        max_new_tokens=_max_tokens,
    )
    for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
        # ...
Export Conversation
Add functionality to export chat history:
import json
from datetime import datetime

def export_conversation(_task_history):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"qwen_chat_{timestamp}.json"
    data = {
        "timestamp": timestamp,
        "conversation": [
            msg
            for user_msg, bot_msg in _task_history
            for msg in (
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": bot_msg},
            )
        ],
    }
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return filename

file_output = gr.File(label="Exported Conversation")
export_btn = gr.Button("💾 Export")
export_btn.click(export_conversation, [task_history], [file_output])
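The pair-to-role flattening can be checked in isolation; each (user, assistant) tuple expands into two role-tagged entries:

```python
task_history = [("Hi", "Hello!"), ("How are you?", "Doing well.")]

# Flatten (user, assistant) pairs into alternating role-tagged messages
conversation = [
    msg
    for user_msg, bot_msg in task_history
    for msg in (
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": bot_msg},
    )
]

print(conversation[0])  # {'role': 'user', 'content': 'Hi'}
```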
Model Loading
Optimize model loading for faster startup:
# Cache models globally
_model_cache = {}

def get_model(checkpoint_path):
    if checkpoint_path not in _model_cache:
        model = AutoModelForCausalLM.from_pretrained(
            checkpoint_path,
            device_map="auto",
            trust_remote_code=True,
        ).eval()
        _model_cache[checkpoint_path] = model
    return _model_cache[checkpoint_path]
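With a stub in place of from_pretrained (load_model_stub is an assumption), repeat calls reduce to a dict lookup:

```python
_model_cache = {}

def load_model_stub(path):
    # Stand-in for the expensive AutoModelForCausalLM.from_pretrained call
    return {"checkpoint": path}

def get_model(checkpoint_path):
    if checkpoint_path not in _model_cache:
        _model_cache[checkpoint_path] = load_model_stub(checkpoint_path)
    return _model_cache[checkpoint_path]

a = get_model("Qwen/Qwen-7B-Chat")
b = get_model("Qwen/Qwen-7B-Chat")
print(a is b)  # True: the second call hits the cache
```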
Memory Management
Implement aggressive memory management:
import torch
import gc

def aggressive_cleanup():
    """Thorough memory cleanup."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Call after clearing history or long conversations
empty_btn.click(lambda: aggressive_cleanup(), None, None)
Response Caching
Cache common responses to reduce computation:
import hashlib

response_cache = {}

def get_cache_key(query, history):
    content = query + str(history)
    return hashlib.md5(content.encode()).hexdigest()

def predict_with_cache(_query, _chatbot, _task_history):
    cache_key = get_cache_key(_query, _task_history)
    if cache_key in response_cache:
        response = response_cache[cache_key]
        _chatbot.append((_parse_text(_query), _parse_text(response)))
        return _chatbot
    # Normal generation...
    response_cache[cache_key] = full_response
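Exercised with a stubbed generator (generate_stub is hypothetical), identical (query, history) pairs skip the model call entirely. Note the cache is unbounded, so production deployments should evict old entries:

```python
import hashlib

response_cache = {}
call_count = 0

def get_cache_key(query, history):
    return hashlib.md5((query + str(history)).encode()).hexdigest()

def generate_stub(query, history):
    # Stand-in for the actual model call; counts how often it runs
    global call_count
    call_count += 1
    return f"echo: {query}"

def cached_generate(query, history):
    key = get_cache_key(query, history)
    if key not in response_cache:
        response_cache[key] = generate_stub(query, history)
    return response_cache[key]

cached_generate("Hi", [])
cached_generate("Hi", [])  # served from cache; the model is not called again
print(call_count)  # 1
```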
Concurrent Request Handling
Gradio’s queue system handles concurrency, but you can optimize:
demo.queue(
    concurrency_count=4,  # Process up to 4 requests simultaneously
    max_size=20,          # Queue up to 20 requests
).launch(
    # launch options...
)
Production Best Practices
Error Handling
Implement robust error handling:
def predict_safe(_query, _chatbot, _task_history):
    _chatbot.append((_parse_text(_query), ""))
    try:
        for response in model.chat_stream(tokenizer, _query, history=_task_history):
            _chatbot[-1] = (_parse_text(_query), _parse_text(response))
            yield _chatbot
    except Exception as e:
        error_msg = f"Error: {e}"
        _chatbot[-1] = (_parse_text(_query), error_msg)
        yield _chatbot
        print(f"Error in generation: {e}")
Logging
Add comprehensive logging:
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('qwen_demo.log'),
        logging.StreamHandler(),
    ],
)

def predict_with_logging(_query, _chatbot, _task_history):
    logging.info(f"User query: {_query}")
    start_time = time.time()
    for response in model.chat_stream(...):
        # ...
    duration = time.time() - start_time
    logging.info(f"Generation completed in {duration:.2f}s")
Rate Limiting
Protect against abuse:
from collections import defaultdict
import time

user_requests = defaultdict(list)
RATE_LIMIT = 10  # requests per minute

def check_rate_limit(user_id):
    now = time.time()
    # Drop requests older than 60 seconds
    user_requests[user_id] = [
        t for t in user_requests[user_id]
        if now - t < 60
    ]
    if len(user_requests[user_id]) >= RATE_LIMIT:
        return False
    user_requests[user_id].append(now)
    return True
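Exercised with a lowered threshold, the limiter rejects every call past the limit within the window (a sketch; wiring user_id to a real session identifier is left to the application):

```python
from collections import defaultdict
import time

user_requests = defaultdict(list)
RATE_LIMIT = 3  # lowered for the example

def check_rate_limit(user_id):
    now = time.time()
    # Drop requests older than 60 seconds, then check the remaining count
    user_requests[user_id] = [t for t in user_requests[user_id] if now - t < 60]
    if len(user_requests[user_id]) >= RATE_LIMIT:
        return False
    user_requests[user_id].append(now)
    return True

results = [check_rate_limit("alice") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Each user gets an independent window, so one heavy user cannot lock out others.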
Health Monitoring
Add health check endpoint:
def health_check():
    try:
        # Simple model test
        test_response = model.chat(tokenizer, "Hi", history=None)[0]
        return "✓ Healthy"
    except Exception as e:
        return f"✗ Unhealthy: {e}"

health_status = gr.Textbox(label="System Health", interactive=False)
health_btn = gr.Button("Check Health")
health_btn.click(health_check, None, health_status)
Integration Examples
With Authentication
demo.queue().launch(
    auth=[('admin', 'password123'), ('user', 'userpass')],
    auth_message="Enter credentials to access Qwen Chat",
    server_port=8000,
)
With Analytics
import analytics

def predict_with_analytics(_query, _chatbot, _task_history):
    # Track usage
    analytics.track('chat_message', {
        'query_length': len(_query),
        'history_length': len(_task_history),
    })
    # Normal prediction
    yield from predict(_query, _chatbot, _task_history)
With Database Storage
import sqlite3

def save_conversation(_task_history, user_id):
    conn = sqlite3.connect('conversations.db')
    cursor = conn.cursor()
    for user_msg, bot_msg in _task_history:
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'user', user_msg),
        )
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'assistant', bot_msg),
        )
    conn.commit()
    conn.close()
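With an in-memory database and an assumed messages schema, the round trip looks like this (a testable variant that takes the connection as a parameter):

```python
import sqlite3

def save_conversation(conn, task_history, user_id):
    cursor = conn.cursor()
    for user_msg, bot_msg in task_history:
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'user', user_msg),
        )
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'assistant', bot_msg),
        )
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE messages (user_id TEXT, role TEXT, content TEXT)')
save_conversation(conn, [("Hi", "Hello!")], "user-1")
rows = conn.execute('SELECT role, content FROM messages').fetchall()
print(rows)  # [('user', 'Hi'), ('assistant', 'Hello!')]
```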
Troubleshooting
Memory leaks in long sessions
Implement periodic history trimming:
MAX_HISTORY_LENGTH = 20

def trim_history(_task_history):
    if len(_task_history) > MAX_HISTORY_LENGTH:
        _task_history = _task_history[-MAX_HISTORY_LENGTH:]
    return _task_history
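Because slicing returns a new list, callers must keep the returned value; a quick check:

```python
MAX_HISTORY_LENGTH = 4

def trim_history(task_history):
    # Slicing creates a new list, so callers must use the return value
    if len(task_history) > MAX_HISTORY_LENGTH:
        task_history = task_history[-MAX_HISTORY_LENGTH:]
    return task_history

history = [(f"q{i}", f"a{i}") for i in range(10)]
history = trim_history(history)
print(len(history), history[0])  # 4 ('q6', 'a6')
```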
Slow generation
Profile your code:
import time

start = time.time()
response = model.chat(...)
print(f"Generation took: {time.time() - start:.2f}s")
To speed things up, consider:
Using quantized models
Enabling Flash Attention
Reducing max tokens
Batch processing
Streaming not updating the UI
Ensure you're yielding updates:
for response in model.chat_stream(...):
    _chatbot[-1] = (query, response)
    yield _chatbot  # This is crucial!
Source Code Reference
Key files in the Qwen repository:
Main demo: web_demo.py:1
Text processing: web_demo.py:78
Prediction function: web_demo.py:119
UI definition: web_demo.py:151
Next Steps
CLI Demo Explore the command-line interface
API Reference Learn about the model API
Deployment Guide Deploy Qwen in production
Examples More examples on GitHub