The Web Demo provides a user-friendly browser interface for interacting with Qwen-Chat models using Gradio. This demo supports multi-turn conversations, message regeneration, and can be easily shared or deployed.

Overview

The web demo (web_demo.py) creates an interactive chat interface with:
  • Modern web-based UI with chat bubbles
  • Markdown and code syntax highlighting
  • Message regeneration capability
  • History management
  • Shareable public links
  • Auto-launch in browser
  • Customizable server settings

Installation

1. Install Required Packages

Install the core dependencies:
pip install torch transformers gradio mdtex2html
2. Verify Gradio Version

The demo requires Gradio 3.x or 4.x:
pip show gradio
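If you prefer to check the version programmatically, a small helper sketch (not part of web_demo.py) can validate the version string against the supported range:

```python
def gradio_version_supported(version: str) -> bool:
    """Return True if a Gradio version string is in the supported 3.x/4.x range."""
    major = int(version.split(".")[0])
    return major in (3, 4)

# Example against the installed package:
# import gradio
# print(gradio_version_supported(gradio.__version__))
```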

Basic Usage

Quick Start

Launch the web demo with default settings:
python web_demo.py
The demo will start on http://127.0.0.1:8000 by default.

Command-Line Options

-c, --checkpoint-path (string, default: "Qwen/Qwen-7B-Chat")
  Model checkpoint name or path from HuggingFace/ModelScope

--cpu-only (flag)
  Run the demo with CPU only (no GPU required)

--share (flag, default: false)
  Create a publicly shareable Gradio link (tunnels through Gradio's servers)

--inbrowser (flag, default: false)
  Automatically open the interface in your default browser

--server-port (integer, default: 8000)
  Port number for the web server

--server-name (string, default: "127.0.0.1")
  Server hostname or IP address (use "0.0.0.0" to allow external access)
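The options above can be reproduced with a short argparse sketch. The actual parser lives at web_demo.py:21 and may differ in detail; this version only mirrors the documented flags and defaults:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented options; the real parser in web_demo.py may differ.
    parser = argparse.ArgumentParser(description="Qwen-Chat web demo")
    parser.add_argument("-c", "--checkpoint-path", type=str,
                        default="Qwen/Qwen-7B-Chat",
                        help="Model checkpoint name or path")
    parser.add_argument("--cpu-only", action="store_true",
                        help="Run the demo with CPU only")
    parser.add_argument("--share", action="store_true",
                        help="Create a publicly shareable Gradio link")
    parser.add_argument("--inbrowser", action="store_true",
                        help="Open the interface in the default browser")
    parser.add_argument("--server-port", type=int, default=8000,
                        help="Port number for the web server")
    parser.add_argument("--server-name", type=str, default="127.0.0.1",
                        help="Server hostname or IP address")
    return parser

args = build_parser().parse_args([])  # parse defaults (no CLI args)
print(args.checkpoint_path, args.server_port)
```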

Usage Examples

# Start with default settings (http://127.0.0.1:8000)
python web_demo.py

# Run on CPU only
python web_demo.py --cpu-only

# Use a different port and open the browser automatically
python web_demo.py --server-port 8080 --inbrowser

# Create a public share link
python web_demo.py --share

Interface Features

Chat Interface

The web demo provides a clean, modern chat interface with:
  1. Chat Display: Shows conversation history with proper formatting
  2. Input Box: Multi-line text input for your messages
  3. Action Buttons:
    • 🚀 Submit: Send your message
    • 🧹 Clear History: Reset the conversation
    • 🤔 Regenerate: Re-generate the last response

Message Formatting

Messages are rendered with full Markdown support:
  • Bold and italic text
  • Lists and bullet points
  • Links and quotes
  • Tables

Key Functions

Message Submission

When you submit a message, the interface:
  1. Displays your message in the chat
  2. Shows a streaming response from the model
  3. Updates the conversation history
web_demo.py:119
def predict(_query, _chatbot, _task_history):
    print(f"User: {_parse_text(_query)}")
    _chatbot.append((_parse_text(_query), ""))
    full_response = ""

    for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
        _chatbot[-1] = (_parse_text(_query), _parse_text(response))
        yield _chatbot
        full_response = _parse_text(response)

    _task_history.append((_query, full_response))

Regenerate Response

Click the “Regenerate” button to get a different response to your last message:
web_demo.py:134
def regenerate(_chatbot, _task_history):
    if not _task_history:
        yield _chatbot
        return
    item = _task_history.pop(-1)
    _chatbot.pop(-1)
    yield from predict(item[0], _chatbot, _task_history)

Clear History

Reset the conversation and free up memory:
web_demo.py:145
def reset_state(_chatbot, _task_history):
    _task_history.clear()
    _chatbot.clear()
    _gc()
    return _chatbot

Deployment Options

Local Network Access

Allow other devices on your network to access the demo:
python web_demo.py --server-name 0.0.0.0 --server-port 8000
Then access from other devices using:
http://<your-machine-ip>:8000

Public Sharing

Public sharing creates a temporary URL (typically expires in 72 hours) that tunnels through Gradio’s servers. Be cautious when sharing sensitive or proprietary models.
Create a public shareable link:
python web_demo.py --share
Output:
Running on local URL:  http://127.0.0.1:8000
Running on public URL: https://xxxxx.gradio.live

This share link expires in 72 hours.
Share the public URL with others - no installation required on their end!

Production Deployment

For production environments, consider containerizing the demo with Docker.
Create a Dockerfile:
FROM python:3.10

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY web_demo.py .

EXPOSE 8000
CMD ["python", "web_demo.py", "--server-name", "0.0.0.0", "--server-port", "8000"]
Build and run:
docker build -t qwen-web-demo .
docker run -p 8000:8000 --gpus all qwen-web-demo
Note that --gpus all requires the NVIDIA Container Toolkit on the host, and GPU inference needs a CUDA-enabled base image rather than plain python:3.10.
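The Dockerfile above copies a requirements.txt that is not shown in this guide; a minimal one matching the install step earlier might look like:

```
torch
transformers
gradio
mdtex2html
```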

Customization

UI Customization

The demo interface is defined using Gradio Blocks:
web_demo.py:151
with gr.Blocks() as demo:
    gr.Markdown("""
<p align="center"><img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" style="height: 80px"/></p>""")
    gr.Markdown("""<center><font size=8>Qwen-Chat Bot</center>""")
    
    chatbot = gr.Chatbot(label='Qwen-Chat', elem_classes="control-height")
    query = gr.Textbox(lines=2, label='Input')
    task_history = gr.State([])
You can customize:
  • Logo and branding
  • Colors and styling (via CSS)
  • Button labels and icons
  • Layout and spacing

Text Processing

The demo includes custom text processing for better display:
web_demo.py:78
def _parse_text(text):
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split("`")
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = "<br></code></pre>"
        # (The remaining branches, omitted here, escape HTML special
        # characters in ordinary lines and join the result into one string.)
This function:
  • Formats code blocks properly
  • Handles special characters in code
  • Preserves whitespace and indentation
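The fence-handling logic can be exercised in isolation. Below is a simplified, self-contained variant (a sketch: it omits the HTML escaping and line-joining details that the real function performs):

```python
def parse_code_fences(text: str) -> str:
    # Simplified take on _parse_text: turn ``` fences into <pre><code> blocks.
    lines = [line for line in text.split("\n") if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            language = line.split("`")[-1]  # text after the backticks, e.g. "python"
            lines[i] = (f'<pre><code class="language-{language}">'
                        if count % 2 == 1 else "<br></code></pre>")
    return "".join(lines)

print(parse_code_fences("```python\nprint(1)\n```"))
# -> <pre><code class="language-python">print(1)<br></code></pre>
```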

Performance Optimization

Memory Management

The demo automatically runs garbage collection when clearing history to free up GPU memory.
web_demo.py:110
def _gc():
    import gc
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

Queueing

Gradio’s queue system is enabled for better concurrency:
web_demo.py:192
demo.queue().launch(
    share=args.share,
    inbrowser=args.inbrowser,
    server_port=args.server_port,
    server_name=args.server_name,
)
This allows:
  • Multiple users to interact simultaneously
  • Requests to be processed in order
  • Better handling of long-running generations

Response Streaming

The demo uses streaming for real-time responses:
web_demo.py:124
for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
    _chatbot[-1] = (_parse_text(_query), _parse_text(response))
    yield _chatbot
Benefits:
  • Users see responses as they’re generated
  • Better perceived performance
  • Can stop generation early if needed
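The yield-based update pattern can be illustrated without loading a model. Here `fake_chat_stream` is a hypothetical stand-in for `model.chat_stream`; everything else follows the shape of the demo's `predict()`:

```python
def fake_chat_stream(query):
    # Stand-in for model.chat_stream: yields progressively longer partial replies.
    words = ["Streaming", "replies", "arrive", "token", "by", "token."]
    partial = []
    for word in words:
        partial.append(word)
        yield " ".join(partial)

def predict_stream(query, chatbot):
    # Same shape as the demo's predict(): append a slot, then re-yield the state.
    chatbot.append((query, ""))
    for response in fake_chat_stream(query):
        chatbot[-1] = (query, response)
        yield chatbot

# Each yielded state would trigger a redraw of the Gradio Chatbot component.
states = list(predict_stream("hello", []))
print(states[-1])  # the final state holds the complete reply
```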

Troubleshooting

If port 8000 is occupied:
# Use a different port
python web_demo.py --server-port 8080

# Or find and kill the process using port 8000
lsof -ti:8000 | xargs kill -9
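To check whether a port is free before launching, a small helper sketch (not part of web_demo.py) can probe it:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    # connect_ex returns 0 when something is already listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) != 0

if port_is_free(8000):
    print("Port 8000 is free")
else:
    print("Port 8000 is in use; try --server-port 8080")
```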
If other devices cannot reach the demo, make sure to:
  1. Use --server-name 0.0.0.0 to bind to all interfaces
  2. Check firewall settings allow the port
  3. Use the correct IP address (not 127.0.0.1)
# Find your IP
hostname -I

# Launch with external access
python web_demo.py --server-name 0.0.0.0
If you see import errors:
# Reinstall gradio
pip uninstall gradio
pip install gradio

# Or use a specific version
pip install gradio==4.0.0
To improve performance:
  1. Use GPU instead of CPU mode
  2. Use quantized models (Int4/Int8)
  3. Reduce max token length in generation config
  4. Enable Flash Attention if available
  5. Clear history regularly

Advanced Features

Custom CSS Styling

Add custom CSS to Gradio interface:
with gr.Blocks(css=".gradio-container {max-width: 1200px}") as demo:
    # Your interface code

Adding Authentication

Protect your demo with a password:
demo.queue().launch(
    auth=("username", "password"),
    server_port=args.server_port,
    server_name=args.server_name,
)

Multiple Concurrent Users

Gradio’s queue handles multiple users automatically, but for heavy loads consider:
  • Running multiple model replicas
  • Batching requests
  • Adding rate limiting
  • Deploying behind a load balancer

Source Code Reference

The web demo implementation can be found at web_demo.py:1 in the Qwen repository. Key components:
  • Argument parsing: web_demo.py:21
  • Model loading: web_demo.py:40
  • Text processing: web_demo.py:78
  • Interface definition: web_demo.py:151
  • Launch configuration: web_demo.py:192

Next Steps

  • CLI Demo: try the command-line interface
  • API Deployment: deploy Qwen as an API service
