
Overview

The Qwen Chat API provides methods for conversational interactions with the model. It supports both synchronous and streaming responses, multi-turn conversations with history, and custom system prompts.

chat() Method

Generate a complete response for a user query:
response, updated_history = model.chat(
    tokenizer,
    query="What is quantum computing?",
    history=None,
    system="You are a helpful assistant."
)

print(response)

Parameters

tokenizer (AutoTokenizer, required)
Tokenizer instance for encoding/decoding text.

query (str, required)
The user's current message or question.

history (list[tuple[str, str]], default: None)
Conversation history as a list of (user_message, assistant_response) tuples:

history = [
    ("Hello", "Hi! How can I help you today?"),
    ("What's the weather?", "I don't have access to weather data.")
]

system (str, default: "You are a helpful assistant.")
System prompt defining the assistant's behavior and role.

stop_words_ids (list[list[int]], default: None)
Token ID sequences that trigger generation termination:

stop_words_ids = [
    tokenizer.encode("<|im_end|>"),
    tokenizer.encode("\n\n")
]

**gen_kwargs (dict)
Additional generation parameters (see GenerationConfig).

Returns

response (str)
The model's generated response text.

history (list[tuple[str, str]])
Updated conversation history including the current exchange.

chat_stream() Method

Generate a streaming response for real-time display:
for partial_response in model.chat_stream(
    tokenizer,
    query="Explain neural networks",
    history=history,
    system="You are a helpful assistant."
):
    print(partial_response, end="", flush=True)

Parameters

Same as the chat() method.

Yields

partial_response (str)
Incrementally generated response text. Each yield contains the full response generated so far, not just the newest delta.
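
Because each yield is cumulative, printing only the newly generated text means tracking how much has already been shown. A minimal sketch (the helper name is ours, not part of the API; the input list simulates chat_stream() output):

```python
def collect_deltas(stream):
    """Turn a cumulative stream into the list of newly generated fragments."""
    prev_len = 0
    deltas = []
    for full_text in stream:
        deltas.append(full_text[prev_len:])  # only the text added this step
        prev_len = len(full_text)
    return deltas

# Simulated cumulative yields, shaped like chat_stream() output
chunks = ["A neural", "A neural network", "A neural network learns."]
deltas = collect_deltas(chunks)
# deltas == ["A neural", " network", " learns."]
```

Printing each delta with end="" reproduces the full response without rewriting the line on every yield.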

Multi-turn Conversation Example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Initialize conversation
history = []
system = "You are a helpful AI assistant."

# First turn
response, history = model.chat(
    tokenizer,
    "Hello! Who are you?",
    history=history,
    system=system
)
print(f"Assistant: {response}")

# Second turn (with context)
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=history,
    system=system
)
print(f"Assistant: {response}")

# History now contains both exchanges
print(f"History length: {len(history)}")
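
Every exchange kept in history re-enters the prompt on the next turn, so long conversations eventually exceed the model's context window. One simple mitigation is to keep only the most recent exchanges; this helper is a sketch of ours, not part of the API:

```python
def trim_history(history, max_turns=4):
    """Keep only the most recent max_turns (user, assistant) exchanges."""
    return history[-max_turns:]

# Example: a six-turn conversation trimmed to the last four exchanges
history = [(f"question {i}", f"answer {i}") for i in range(6)]
trimmed = trim_history(history, max_turns=4)
# trimmed covers questions 2 through 5 only
```

Passing the trimmed list as the history argument bounds prompt growth at the cost of forgetting the oldest turns.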

Streaming Response Example

import sys

query = "Write a short poem about AI"

for response in model.chat_stream(
    tokenizer,
    query,
    history=history
):
    # Each yield is the full response so far: clear the line, then rewrite it
    sys.stdout.write('\r' + ' ' * 80 + '\r')
    sys.stdout.write(response)
    sys.stdout.flush()

print()  # New line after completion

Custom System Prompts

# Technical expert
system = "You are an expert software engineer specializing in Python."

response, history = model.chat(
    tokenizer,
    "How do I optimize this code?",
    system=system
)

# Creative writing
system = "You are a creative writing assistant who helps with storytelling."

response, history = model.chat(
    tokenizer,
    "Help me write a story about space exploration",
    system=system
)

Using Stop Words

# Stop generation at specific sequences
stop_words = ["Observation:", "<|endoftext|>"]
stop_words_ids = [tokenizer.encode(s) for s in stop_words]

response, history = model.chat(
    tokenizer,
    query="Generate a function call",
    stop_words_ids=stop_words_ids
)
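
If the raw text of a stop sequence leaks into the returned response (behavior can vary by model version), a small post-processing step can truncate at its first occurrence. The helper below is illustrative, not part of the API:

```python
def truncate_at_stop_words(text, stop_words):
    """Cut the text at the earliest occurrence of any stop string."""
    cut = len(text)
    for word in stop_words:
        idx = text.find(word)
        if idx != -1:
            cut = min(cut, idx)  # truncate at the earliest match
    return text[:cut]

raw = "Thought: call the tool\nObservation: tool output"
clean = truncate_at_stop_words(raw, ["Observation:", "<|endoftext|>"])
# clean == "Thought: call the tool\n"
```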

Generation with Parameters

response, history = model.chat(
    tokenizer,
    query="Tell me a creative story",
    history=history,
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    max_new_tokens=512
)
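
To make the sampling parameters above less abstract: top_p keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, then renormalizes over that set. A toy sketch of the filtering step (not Qwen's actual implementation):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest high-probability token set with cumulative mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)  # renormalize over the kept tokens
    return {i: probs[i] / total for i in kept}

# Four candidate tokens: the top two cover only 0.8 < 0.9, so a third is kept
filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
# filtered has keys {0, 1, 2} and its values sum to 1.0
```

Lower top_p narrows the candidate set toward the most likely tokens; higher temperature flattens the distribution before this filtering is applied.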

Chat Message Format

Internally, chat messages use the ChatML format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you today?<|im_end|>
The chat() and chat_stream() methods handle this formatting automatically.
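
For illustration, assembling such a prompt by hand looks roughly like the sketch below. This is a simplified reconstruction, not the model's actual internal code:

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def build_chatml(system, history, query):
    """Assemble a ChatML prompt from a system prompt, history, and a new query."""
    parts = [f"{IM_START}system\n{system}{IM_END}"]
    for user_msg, assistant_msg in history:
        parts.append(f"{IM_START}user\n{user_msg}{IM_END}")
        parts.append(f"{IM_START}assistant\n{assistant_msg}{IM_END}")
    parts.append(f"{IM_START}user\n{query}{IM_END}")
    parts.append(f"{IM_START}assistant\n")  # the model continues from here
    return "\n".join(parts)

prompt = build_chatml("You are a helpful assistant.", [], "Hello!")
```

The trailing open assistant block is what prompts the model to generate its reply; generation stops when it emits <|im_end|>.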
