Overview
The CLI demo (cli_demo.py) offers an interactive chat experience directly in your terminal with features including:
- Real-time streaming responses
- Conversation history management
- Dynamic generation configuration
- Random seed control for reproducibility
- CPU-only mode support
Installation
Basic Usage
Quick Start
Run the demo with default settings (Qwen-7B-Chat) by invoking the script directly, e.g. `python cli_demo.py`.
Command-Line Options
The CLI demo supports the following arguments:
- Model checkpoint name or path from HuggingFace/ModelScope
- Random seed for reproducible generation
- Run the demo with CPU only (no GPU required)
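The options above could be defined with `argparse` roughly as follows. This is a hypothetical reconstruction, not the demo's actual code: the flag names (`-c/--checkpoint-path`, `-s/--seed`, `--cpu-only`) and defaults are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the flags described above; names and
    # defaults are assumptions, not copied from the actual script.
    parser = argparse.ArgumentParser(description="Qwen CLI chat demo")
    parser.add_argument("-c", "--checkpoint-path", default="Qwen/Qwen-7B-Chat",
                        help="model checkpoint name or path (HuggingFace/ModelScope)")
    parser.add_argument("-s", "--seed", type=int, default=1234,
                        help="random seed for reproducible generation")
    parser.add_argument("--cpu-only", action="store_true",
                        help="run the demo on CPU only (no GPU required)")
    return parser

args = build_parser().parse_args(["--cpu-only", "--seed", "7"])
print(args.checkpoint_path, args.seed, args.cpu_only)
```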
Usage Examples
Interactive Commands
Once the demo is running, you can use these special commands:
Help and Information
| Command | Aliases | Description |
|---|---|---|
| `:help` | `:h` | Display all available commands |
| `:history` | `:his` | Show conversation history |
| `:conf` | - | Show current generation configuration |
| `:seed` | - | Show current random seed |
Session Management
| Command | Aliases | Description |
|---|---|---|
| `:clear` | `:cl` | Clear the screen |
| `:clear-his` | `:clh` | Clear conversation history |
| `:exit` | `:quit`, `:q` | Exit the demo |
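As a rough illustration, the `:`-prefixed commands and their aliases from the tables above could be parsed like this. The real demo's internals may differ; the dispatch table and function here are a sketch, not the actual implementation.

```python
# Sketch of parsing the ':'-prefixed commands listed above; the real demo's
# command processing may be structured differently.
ALIASES = {"h": "help", "his": "history", "cl": "clear",
           "clh": "clear-his", "quit": "exit", "q": "exit"}

def parse_command(line: str):
    """Return (canonical_name, args) for a ':'-command, or None for chat input."""
    if not line.startswith(":"):
        return None
    name, *args = line[1:].split() or [""]
    return ALIASES.get(name, name), args

print(parse_command(":q"))                      # ('exit', [])
print(parse_command(":conf temperature=0.7"))   # ('conf', ['temperature=0.7'])
print(parse_command("hello"))                   # None
```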
Configuration
Configuration changes persist for the current session only.
Common Configuration Parameters
Generation parameters can be adjusted at runtime with `:conf key=value` (for example, `:conf temperature=0.7`) and restored to defaults with `:reset-conf`.
Random Seed Control
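The effect of a fixed seed can be illustrated with a minimal sketch using Python's stdlib `random`. This is for illustration only; the demo itself presumably seeds the full generation stack (Python, NumPy, PyTorch), not just the stdlib RNG.

```python
import random

def set_demo_seed(seed: int) -> None:
    # Illustration only: the real demo seeds the whole generation stack;
    # here we seed just Python's stdlib RNG.
    random.seed(seed)

set_demo_seed(42)
first = [random.randint(0, 99) for _ in range(3)]
set_demo_seed(42)
second = [random.randint(0, 99) for _ in range(3)]
print(first == second)  # True: same seed, same sequence
```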
Check Current Seed
Run `:seed` with no argument to display the seed currently in use; `:seed 42` sets a new one.
Example Session
Here's what a typical interaction looks like (see cli_demo.py:19).
Features
Streaming Responses
The CLI demo uses the `model.chat_stream()` method to provide real-time streaming responses (cli_demo.py:198).
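The streaming loop can be sketched as follows. A stub generator stands in for `model.chat_stream(tokenizer, query, history=history)`, which yields the partial response generated so far on each iteration; the printing pattern is an illustration, not the demo's exact code.

```python
def fake_chat_stream(query, history=None):
    # Stub standing in for model.chat_stream(tokenizer, query, history=history),
    # which yields the partial response generated so far on each iteration.
    text = "Hello! How can I help you today?"
    for i in range(1, len(text) + 1):
        yield text[:i]

printed = ""
for partial in fake_chat_stream("Hi"):
    # print only the newly generated suffix for a live-typing effect
    print(partial[len(printed):], end="", flush=True)
    printed = partial
print()
```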
History Management
Conversations are automatically tracked (cli_demo.py:206):
- View all previous exchanges with `:history`
- Clear history with `:clear-his` to start fresh
- History is preserved across multiple turns
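Qwen's chat APIs keep history as a list of (query, response) pairs; a minimal sketch of how tracking and clearing behave:

```python
# History as a list of (query, response) pairs, the structure Qwen's
# chat APIs use; this sketch mirrors the demo's tracking behavior.
history = []

history.append(("What is 2+2?", "4"))   # one completed turn
history.append(("And doubled?", "8"))   # context carries across turns

print(len(history))   # 2
history.clear()       # what :clear-his effectively does
print(len(history))   # 0
```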
Keyboard Interrupt Handling
Press `Ctrl+C` during generation to interrupt the current response (cli_demo.py:202).
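The interrupt behavior amounts to catching `KeyboardInterrupt` around the streaming loop, so that cancelling one response does not end the session. A sketch under that assumption, with a stub stream simulating the Ctrl+C:

```python
def stream_response(chunks):
    # Consume a streaming response; Ctrl+C (KeyboardInterrupt) aborts the
    # current generation but leaves the chat session running.
    printed = ""
    try:
        for partial in chunks:
            printed = partial
    except KeyboardInterrupt:
        print("\n[Generation interrupted]")
    return printed

def interrupted_stream():
    yield "Hel"
    yield "Hello"
    raise KeyboardInterrupt  # simulates the user pressing Ctrl+C

print(stream_response(interrupted_stream()))  # Hello
```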
Performance Tips
Memory Optimization
The demo automatically manages memory (cli_demo.py:68). Memory is reclaimed when:
- Clearing the screen (`:clear`)
- Clearing history (`:clear-his`)
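A typical cleanup helper for this kind of reclamation looks like the following. This is a generic sketch, not the demo's actual function: it runs Python garbage collection and, if PyTorch with CUDA happens to be available, releases cached GPU memory.

```python
import gc

def free_memory():
    # Drop unreachable Python objects, then release cached GPU memory
    # if PyTorch with CUDA happens to be available.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

free_memory()  # safe to call even on a CPU-only machine
```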
Device Selection
By default the demo runs on the GPU when one is available; pass `--cpu-only` to force CPU execution.
Troubleshooting
Model loading is slow
First-time model loading downloads the model from HuggingFace/ModelScope, which can take time depending on your connection; subsequent runs will use the cached model. You can also download the model manually ahead of time and point the demo at the local path.
Out of memory errors
Try these solutions:
- Use a smaller model (e.g., Qwen-1.8B-Chat instead of Qwen-7B-Chat)
- Enable CPU-only mode with `--cpu-only`
- Use quantized models (Int4 or Int8 versions)
- Clear history frequently with `:clear-his`
Unicode decode errors
The demo handles encoding errors automatically (cli_demo.py:95). If issues persist, check your terminal's encoding settings.
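The general pattern for tolerating badly encoded input is to replace undecodable bytes rather than raising `UnicodeDecodeError`; the demo's exact handling is in the source referenced above, and this sketch just illustrates the idiom:

```python
# Replace undecodable bytes instead of raising UnicodeDecodeError.
data = b"caf\xe9 latte"                       # Latin-1 bytes, invalid as UTF-8
text = data.decode("utf-8", errors="replace")
print(text)   # the bad byte becomes U+FFFD instead of crashing
```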
Generation produces unexpected results
Try:
- Adjusting the temperature: `:conf temperature=0.7` (lower = more focused)
- Changing the random seed: `:seed 42`
- Resetting the configuration: `:reset-conf`
- Clearing history if the context is confusing: `:clear-his`
Source Code Reference
The CLI demo implementation can be found at cli_demo.py:1 in the Qwen repository.
Key components:
- Model loading: cli_demo.py:44
- Main loop: cli_demo.py:105
- Command processing: cli_demo.py:128
- Chat streaming: cli_demo.py:198
Next Steps
Web Demo
Try the Gradio-based web interface
Model API
Integrate Qwen into your applications