FastChat is a comprehensive platform for deploying LLMs with a web UI, a REST API, and distributed serving capabilities. Combined with vLLM, it delivers production-grade performance behind an intuitive interface.
Overview
FastChat provides a three-component architecture:

Controller: manages distributed workers and routes requests.
Model Worker: loads and serves the model (can use the vLLM backend).
API/UI Server: provides the web interface or an OpenAI-compatible API.
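The flow above can be sketched as a toy registry: workers register with the controller under a model name, and the API/UI server asks the controller which worker serves a request. This is an illustration only, not FastChat's actual code; the real controller adds heartbeats and queue-aware dispatch.

```python
# Toy sketch of controller/worker registration and routing.
# Illustrative only -- FastChat's real controller also tracks
# heartbeats, worker speed, and queue lengths.

class Controller:
    def __init__(self):
        self.workers = {}  # model name -> list of worker addresses

    def register_worker(self, model, address):
        self.workers.setdefault(model, []).append(address)

    def get_worker(self, model):
        # Return the first registered worker for the model, if any.
        addrs = self.workers.get(model)
        return addrs[0] if addrs else None

controller = Controller()
controller.register_worker("Qwen-7B-Chat", "http://localhost:21002")
print(controller.get_worker("Qwen-7B-Chat"))  # http://localhost:21002
```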
Installation
Install FastChat with the model worker extras, the web UI, and the vLLM backend:
pip install "fschat[model_worker,webui]==0.2.33" "openai<1.0" vllm
FastChat 0.2.33 is the recommended version for stability with Qwen models.
Quick Start
Web UI Deployment
Start the Controller
The controller manages model workers:
python -m fastchat.serve.controller
Runs on http://localhost:21001 by default.
Launch Model Worker
Start a vLLM worker for high-performance serving:
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16
Start Web Server
Launch the Gradio web interface:
python -m fastchat.serve.gradio_web_server
Access the UI at http://localhost:7860.
OpenAI API Deployment
Start Controller
python -m fastchat.serve.controller
Launch vLLM Worker
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16
Start API Server
python -m fastchat.serve.openai_api_server \
--host localhost \
--port 8000
Configuration
Worker Configuration
Single GPU (base configuration):
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16
For multiple GPUs, add --tensor-parallel-size N (for example, 4 for four GPUs). Quantized Int4 checkpoints are served the same way; point --model-path at the quantized model (for example, Qwen/Qwen-7B-Chat-Int4).
Worker Parameters
--model-path: path to the model checkpoint (Hugging Face repo or local path)
--tensor-parallel-size: number of GPUs for tensor parallelism
--dtype: model data type (auto, bfloat16, float16, float32)
--gpu-memory-utilization: fraction of GPU memory to use
--max-num-seqs: maximum number of concurrent sequences
--controller-address: controller address to register with
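The parameters above compose into a launch command. As a quick reference, here is a small hypothetical helper (build_worker_cmd is not part of FastChat) that assembles the argv from a config dict using the flag names listed above:

```python
# Hypothetical helper: build a fastchat.serve.vllm_worker command line
# from a config dict. Flag names match the worker parameters above.

def build_worker_cmd(cfg):
    cmd = ["python", "-m", "fastchat.serve.vllm_worker",
           "--model-path", cfg["model_path"]]
    if cfg.get("trust_remote_code"):
        cmd.append("--trust-remote-code")
    for key, flag in [("tensor_parallel_size", "--tensor-parallel-size"),
                      ("dtype", "--dtype"),
                      ("gpu_memory_utilization", "--gpu-memory-utilization"),
                      ("max_num_seqs", "--max-num-seqs"),
                      ("controller_address", "--controller-address")]:
        if key in cfg:
            cmd += [flag, str(cfg[key])]
    return cmd

cmd = build_worker_cmd({
    "model_path": "Qwen/Qwen-7B-Chat",
    "trust_remote_code": True,
    "tensor_parallel_size": 2,
    "dtype": "bfloat16",
    "gpu_memory_utilization": 0.9,
})
print(" ".join(cmd))
```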
API Server Configuration
python -m fastchat.serve.openai_api_server \
--host 0.0.0.0 \
--port 8000 \
--controller-address http://localhost:21001 \
--api-keys sk-key1 sk-key2
--host (string, default "localhost"): API server bind address
--controller-address: address of the controller service
--api-keys: list of valid API keys for authentication
API Usage
OpenAI Python Client
Basic Chat
The same endpoint also supports streaming responses, stop words, and function calling through the standard OpenAI request parameters.
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # Or your API key if configured

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    temperature=0.7,
    max_tokens=2048,
)
print(response.choices[0].message.content)
Unlike vLLM standalone mode, FastChat handles stop tokens automatically. No need to specify stop_token_ids.
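Streaming uses the same endpoint with stream=True; each chunk then carries an incremental delta rather than a full message. A sketch of accumulating a streamed reply, assuming the standard OpenAI 0.x chunk shape (choices[0].delta.content):

```python
# Sketch: accumulate the incremental deltas of a streamed chat response.
# Assumes the OpenAI 0.x streaming chunk shape: choices[0].delta.content.

def accumulate_stream(chunks):
    """Join the 'delta' content pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

# With a live server (pip install "openai<1.0"):
#   stream = openai.ChatCompletion.create(
#       model="Qwen",
#       messages=[{"role": "user", "content": "Hello"}],
#       stream=True,
#   )
#   print(accumulate_stream(stream))

# Offline demonstration with synthetic chunks:
fake = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(accumulate_stream(fake))  # Hello
```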
Multi-Model Deployment
Deploy multiple models simultaneously:
Start Controller
python -m fastchat.serve.controller
Launch Multiple Workers
Start workers on different ports:
# Worker 1: Qwen-7B
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16 \
--port 21002 \
--worker-address http://localhost:21002
# Worker 2: Qwen-14B
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-14B-Chat \
--trust-remote-code \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--port 21003 \
--worker-address http://localhost:21003
# Worker 3: Qwen-72B
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--port 21004 \
--worker-address http://localhost:21004
Start API Server
python -m fastchat.serve.openai_api_server \
--host 0.0.0.0 \
--port 8000
Model Selection
Clients can specify which model to use:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# List available models
models = openai.Model.list()
print(models)

# Use a specific model
response = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",  # Or "Qwen-14B-Chat", "Qwen-72B-Chat"
    messages=[{"role": "user", "content": "Hello"}],
)
Production Deployment
Systemd Services
Create systemd service files for each component:
fastchat-controller.service
fastchat-worker.service
fastchat-api.service
[Unit]
Description=FastChat Controller
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.controller
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
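The worker and API service files follow the same pattern. A hedged sketch of fastchat-worker.service, assuming the same /opt/qwen layout (adjust the model path and flags to your deployment):

```ini
[Unit]
Description=FastChat vLLM Worker
After=network.target fastchat-controller.service
Wants=fastchat-controller.service

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.vllm_worker \
    --model-path Qwen/Qwen-7B-Chat \
    --trust-remote-code \
    --dtype bfloat16
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```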
Manage services:
# Enable and start all services
sudo systemctl daemon-reload
sudo systemctl enable fastchat-controller fastchat-worker fastchat-api
sudo systemctl start fastchat-controller
sleep 5
sudo systemctl start fastchat-worker
sleep 10
sudo systemctl start fastchat-api
# Check status
sudo systemctl status fastchat-controller
sudo systemctl status fastchat-worker
sudo systemctl status fastchat-api
# View logs
sudo journalctl -u fastchat-worker -f
Docker Compose
Complete deployment with Docker Compose:
version: '3.8'

services:
  controller:
    image: qwenllm/qwen:cu121
    container_name: fastchat-controller
    command: python -m fastchat.serve.controller
    ports:
      - "21001:21001"
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:21001"]
      interval: 30s
      timeout: 10s
      retries: 3

  worker:
    image: qwenllm/qwen:cu121
    container_name: fastchat-worker
    # --worker-address must be reachable from the controller container
    command: >
      python -m fastchat.serve.vllm_worker
      --model-path /models/Qwen-7B-Chat
      --trust-remote-code
      --dtype bfloat16
      --host 0.0.0.0
      --worker-address http://worker:21002
      --controller-address http://controller:21001
    volumes:
      - /path/to/models:/models:ro
    depends_on:
      - controller
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  api-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-api
    command: >
      python -m fastchat.serve.openai_api_server
      --host 0.0.0.0
      --port 8000
      --controller-address http://controller:21001
    ports:
      - "8000:8000"
    depends_on:
      - controller
      - worker
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3

  web-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-web
    command: >
      python -m fastchat.serve.gradio_web_server
      --controller-address http://controller:21001
    ports:
      - "7860:7860"
    depends_on:
      - controller
      - worker
    restart: always
Launch:
docker-compose up -d
# Scale workers (first remove the fixed container_name from the worker
# service; Compose cannot scale a service with a static container name)
docker-compose up -d --scale worker=3
# View logs
docker-compose logs -f worker
Load Balancing
FastChat controller automatically load balances across multiple workers:
# Start controller
python -m fastchat.serve.controller
# Start multiple workers for same model (horizontal scaling)
# Each worker needs its own port so its advertised address is reachable
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--port 21002 \
--worker-address http://localhost:21002
CUDA_VISIBLE_DEVICES=1 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--port 21003 \
--worker-address http://localhost:21003
CUDA_VISIBLE_DEVICES=2 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--port 21004 \
--worker-address http://localhost:21004
# Start API server
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
The controller distributes requests across workers automatically.
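Recent FastChat versions choose among workers with a dispatch policy (lottery or shortest-queue). A simplified shortest-queue sketch, not FastChat's actual implementation, to show the idea: prefer the worker with the lowest queue length relative to its speed.

```python
# Simplified sketch of shortest-queue dispatch: send each request to the
# worker with the fewest requests in flight, weighted by worker speed.
# Illustrative only -- not FastChat's controller code.

def shortest_queue(workers):
    """workers: dict of address -> {'queue': int, 'speed': float}."""
    return min(workers, key=lambda a: workers[a]["queue"] / workers[a]["speed"])

workers = {
    "http://localhost:21002": {"queue": 4, "speed": 1.0},
    "http://localhost:21003": {"queue": 1, "speed": 1.0},
    "http://localhost:21004": {"queue": 3, "speed": 2.0},
}
print(shortest_queue(workers))  # http://localhost:21003
```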
Monitoring
Worker Status
Check registered workers (list_models is a POST endpoint on the controller):
curl -X POST http://localhost:21001/list_models
Health Checks
# API server health
curl http://localhost:8000/v1/models
# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen",
"messages": [{"role": "user", "content": "hello"}],
"max_tokens": 10
}'
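When starting components in order (controller, then workers, then API server), it is more robust to poll until the previous stage is healthy than to sleep for a fixed time. A generic readiness poller with the probe left injectable; the requests-based probe in the comment is an assumption, and any HTTP client works:

```python
import time

# Sketch: poll a health probe until it succeeds or a deadline passes.
# In production the probe could be, e.g.:
#   lambda: requests.get("http://localhost:8000/v1/models", timeout=5).ok

def wait_until_healthy(probe, timeout=60.0, interval=1.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        time.sleep(interval)
    return False

# Offline demonstration: a probe that succeeds on the third attempt.
attempts = {"n": 0}
def flaky_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until_healthy(flaky_probe, timeout=5.0, interval=0.0))  # True
```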
Logging
Enable detailed logging:
# Set log level
export FASTCHAT_LOG_LEVEL=DEBUG
# Run with logging
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code 2>&1 | tee worker.log
Troubleshooting
Worker registration fails
Error: the worker does not appear in the controller.
Solutions:
Check controller is running: curl http://localhost:21001
Verify controller address in worker: --controller-address http://localhost:21001
Check network connectivity between services
Review logs for connection errors
API returns 'No available models'
Error: the API returns an empty model list.
Solutions:
Ensure workers have registered successfully
Check controller status: curl -X POST http://localhost:21001/list_models
Wait for model loading to complete (can take minutes)
Check worker logs for errors
High latency
Issue: responses are slow.
Solutions:
Use vLLM worker instead of model_worker
Increase --max-num-seqs on worker
Add more workers for horizontal scaling
Enable tensor parallelism for large models
Use quantized models (Int4/Int8)
Worker crashes
Error: the worker process exits unexpectedly.
Solutions:
Check GPU memory: nvidia-smi
Reduce --gpu-memory-utilization
Use smaller model or quantized version
Check CUDA compatibility
Review system logs: dmesg | grep -i error
Advanced Features
Custom System Prompts
Set system prompts in the web UI or API:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {
            "role": "system",
            "content": "You are an expert Python programmer. Always provide code examples.",
        },
        {
            "role": "user",
            "content": "How do I read a CSV file?",
        },
    ],
)
print(response.choices[0].message.content)
Conversation History
Maintain multi-turn conversations:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

history = []

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    response = openai.ChatCompletion.create(
        model="Qwen",
        messages=history,
    )
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Multi-turn conversation
print(chat("What is Python?"))
print(chat("What are its main features?"))
print(chat("Give me an example"))
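The history list above grows without bound and will eventually exceed the model's context window. A sketch of capping it, keeping the system message plus the last few exchanges (the turn limit is an arbitrary example):

```python
# Sketch: cap conversation history so the prompt does not grow without
# bound. Keeps the system message (if any) plus the last `max_turns`
# user/assistant exchanges.

def trim_history(history, max_turns=5):
    system = [m for m in history if m["role"] == "system"][:1]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

history = [{"role": "system", "content": "Be concise."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=3)
print(len(trimmed))  # 7: system message + last 3 exchanges
```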
FastChat vs Standalone
| Feature | Standalone vLLM | FastChat + vLLM |
|---|---|---|
| Performance | Same | Same |
| Web UI | No | Yes |
| Multi-model | Manual | Automatic |
| Load Balancing | External | Built-in |
| Setup Complexity | Low | Medium |
| Production Ready | Yes | Yes |
Next Steps
Production Guide: best practices for production deployments
Monitoring: set up comprehensive monitoring
Performance Tuning: advanced optimization techniques
API Reference: complete API documentation