Overview
The Qwen Chat API provides methods for conversational interactions with the model. It supports both synchronous and streaming responses, multi-turn conversations with history, and custom system prompts.
chat() Method
Generate a complete response for a user query:
response, updated_history = model.chat(
    tokenizer,
    query="What is quantum computing?",
    history=None,
    system="You are a helpful assistant."
)
print(response)
Parameters
tokenizer
required
Tokenizer instance for encoding/decoding text
query
str
required
User's current message or question
history
list[tuple[str, str]]
default: None
Conversation history as a list of (user_message, assistant_response) tuples:
history = [
    ("Hello", "Hi! How can I help you today?"),
    ("What's the weather?", "I don't have access to weather data.")
]
system
str
default: "You are a helpful assistant."
System prompt defining the assistant’s behavior and role
stop_words_ids
list[list[int]]
default: None
Token ID sequences that trigger generation termination:
stop_words_ids = [
    tokenizer.encode("<|im_end|>"),
    tokenizer.encode("\n\n")
]
Returns
response
str
The model's generated response text
history
list[tuple[str, str]]
Updated conversation history including the current exchange
chat_stream() Method
Generate a streaming response for real-time display:
for partial_response in model.chat_stream(
    tokenizer,
    query="Explain neural networks",
    history=history,
    system="You are a helpful assistant."
):
    print(partial_response, end="", flush=True)
Parameters
Same as the chat() method.
Yields
Incrementally generated response text. Each yield contains the full response up to the current point (not just the delta).
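Because each yield is cumulative, a client that needs only the newly generated text (for example, to forward deltas over a network) can diff consecutive yields. A minimal sketch, using a stand-in generator in place of a real model:

```python
def stream_deltas(chunks):
    """Convert cumulative stream chunks into incremental deltas."""
    previous = ""
    for chunk in chunks:
        # Each chunk repeats everything yielded so far; emit only the new tail
        yield chunk[len(previous):]
        previous = chunk

# Stand-in for model.chat_stream(), which yields cumulative text
fake_stream = ["Neural", "Neural networks", "Neural networks learn."]
print(list(stream_deltas(fake_stream)))  # ['Neural', ' networks', ' learn.']
```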
Multi-turn Conversation Example
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Initialize conversation
history = []
system = "You are a helpful AI assistant."

# First turn
response, history = model.chat(
    tokenizer,
    "Hello! Who are you?",
    history=history,
    system=system
)
print(f"Assistant: {response}")

# Second turn (with context)
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=history,
    system=system
)
print(f"Assistant: {response}")

# History now contains both exchanges
print(f"History length: {len(history)}")
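Since history grows by one tuple per turn and the entire history is re-encoded into the prompt on every call, long sessions can eventually exceed the model's context window. One simple mitigation (a sketch, not part of the Qwen API) is to keep only the most recent turns:

```python
def trim_history(history, max_turns=10):
    """Keep only the most recent conversation turns.

    history: list of (user_message, assistant_response) tuples,
    in the same shape as returned by model.chat().
    """
    if len(history) <= max_turns:
        return history
    return history[-max_turns:]

# Example with a synthetic 12-turn history
history = [(f"question {i}", f"answer {i}") for i in range(12)]
trimmed = trim_history(history, max_turns=10)
print(len(trimmed))    # 10
print(trimmed[0][0])   # question 2
```

Dropping the oldest turns loses context, so an alternative is to summarize them into the system prompt; the right trade-off depends on the application.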
Streaming Response Example
import sys

query = "Write a short poem about AI"
history = []  # start a fresh conversation for this example

for response in model.chat_stream(
    tokenizer,
    query,
    history=history,
    generation_config=model.generation_config  # or a custom GenerationConfig
):
    # Each yield is the full response so far: clear the line and rewrite it
    sys.stdout.write('\r' + ' ' * 80 + '\r')
    sys.stdout.write(response)
    sys.stdout.flush()
print()  # New line after completion
Custom System Prompts
# Technical expert
system = "You are an expert software engineer specializing in Python."
response, history = model.chat(
    tokenizer,
    "How do I optimize this code?",
    system=system
)

# Creative writing
system = "You are a creative writing assistant who helps with storytelling."
response, history = model.chat(
    tokenizer,
    "Help me write a story about space exploration",
    system=system
)
Using Stop Words
# Stop generation at specific sequences
stop_words = ["Observation:", "<|endoftext|>"]
stop_words_ids = [tokenizer.encode(s) for s in stop_words]

response, history = model.chat(
    tokenizer,
    query="Generate a function call",
    stop_words_ids=stop_words_ids
)
Generation with Parameters
response, history = model.chat(
    tokenizer,
    query="Tell me a creative story",
    history=history,
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    max_new_tokens=512
)
Internally, chat messages use the ChatML format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you today?<|im_end|>
The chat() and chat_stream() methods handle this formatting automatically.
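For intuition, the prompt the model ultimately sees can be approximated by assembling the ChatML pieces by hand. This is a sketch of the format shown above, not the model's actual prompt-building code:

```python
def build_chatml(query, history=None, system="You are a helpful assistant."):
    """Assemble a ChatML-style prompt from system, history, and the new query."""
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for user_msg, assistant_msg in history or []:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>")
    # The prompt ends with an open assistant turn for the model to complete
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml("Hello!", history=[("Hi", "Hello! How can I help?")])
print(prompt)
```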