FastChat is a comprehensive platform for deploying LLMs with web UI, REST API, and distributed serving capabilities. When combined with vLLM, it provides production-grade performance with an intuitive interface.

Overview

FastChat provides a three-component architecture:

1. Controller: manages distributed workers and routes requests
2. Model Worker: loads and serves the model (can use the vLLM backend)
3. API/UI Server: provides the web interface or an OpenAI-compatible API

Installation

pip install "fschat[model_worker,webui]==0.2.33" "openai<1.0" vllm
FastChat 0.2.33 is the recommended version for stability with Qwen models.

Quick Start

Web UI Deployment

1. Start the Controller

The controller manages model workers:
python -m fastchat.serve.controller
Runs on http://localhost:21001 by default.
2. Launch Model Worker

Start vLLM worker for high performance:
python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16
3. Start Web Server

Launch the Gradio web interface:
python -m fastchat.serve.gradio_web_server
Access at http://localhost:7860

OpenAI API Deployment

1. Start Controller

python -m fastchat.serve.controller
2. Launch vLLM Worker

python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16
3. Start API Server

python -m fastchat.serve.openai_api_server \
  --host localhost \
  --port 8000

Configuration

Worker Configuration

python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16

Worker Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| --model-path | string | required | Path to model checkpoint (HuggingFace ID or local path) |
| --trust-remote-code | boolean | required | Required for Qwen models |
| --tensor-parallel-size | int | 1 | Number of GPUs for tensor parallelism |
| --dtype | string | auto | Model data type: auto, bfloat16, float16, float32 |
| --gpu-memory-utilization | float | 0.90 | Fraction of GPU memory to use |
| --max-num-seqs | int | 256 | Maximum concurrent sequences |
| --worker-address | string | | Worker listening address |
| --controller-address | string | | Controller address to register with |
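The two memory-related flags interact: vLLM claims roughly --gpu-memory-utilization of each GPU, loads the model weights into that budget, and leaves the remainder for the KV cache. A rough illustration of that arithmetic (the sizes below are assumptions for the example; vLLM profiles actual usage at startup, so real numbers will differ):

```python
# Illustrative only: how --gpu-memory-utilization bounds the KV-cache budget.

def kv_cache_budget_gb(total_gpu_gb: float,
                       weights_gb: float,
                       gpu_memory_utilization: float = 0.90) -> float:
    """Approximate memory left for the KV cache on one GPU."""
    usable = total_gpu_gb * gpu_memory_utilization   # what vLLM may claim
    return max(0.0, usable - weights_gb)             # remainder feeds the KV cache

# Example: Qwen-7B in bfloat16 (~15 GB of weights) on an 80 GB GPU
budget = kv_cache_budget_gb(80.0, 15.0, 0.90)
print(f"{budget:.1f} GB available for KV cache")  # → 57.0 GB
```

A smaller budget means fewer concurrent sequences fit, so lowering --gpu-memory-utilization trades throughput for headroom.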

API Server Configuration

python -m fastchat.serve.openai_api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --controller-address http://localhost:21001 \
  --api-keys sk-key1 sk-key2
| Parameter | Type | Default | Description |
|---|---|---|---|
| --host | string | localhost | API server bind address |
| --port | int | 8000 | API server port |
| --controller-address | string | | Address of the controller service |
| --api-keys | string[] | | List of valid API keys for authentication |

API Usage

OpenAI Python Client

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # Or your API key if configured

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=2048
)

print(response.choices[0].message.content)
Unlike vLLM standalone mode, FastChat handles stop tokens automatically. No need to specify stop_token_ids.

Multi-Model Deployment

Deploy multiple models simultaneously:
1. Start Controller

python -m fastchat.serve.controller
2. Launch Multiple Workers

Start workers on different ports:
# Worker 1: Qwen-7B (1 GPU)
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 21002 \
  --worker-address http://localhost:21002

# Worker 2: Qwen-14B (2 GPUs)
CUDA_VISIBLE_DEVICES=1,2 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-14B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --port 21003 \
  --worker-address http://localhost:21003

# Worker 3: Qwen-72B (4 GPUs)
CUDA_VISIBLE_DEVICES=3,4,5,6 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --port 21004 \
  --worker-address http://localhost:21004
3. Start API Server

python -m fastchat.serve.openai_api_server \
  --host 0.0.0.0 \
  --port 8000

Model Selection

Clients can specify which model to use:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# List available models
models = openai.Model.list()
print(models)

# Use specific model
response = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",  # Or "Qwen-14B-Chat", "Qwen-72B-Chat"
    messages=[{"role": "user", "content": "Hello"}]
)
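A client can also degrade gracefully when its preferred model is not deployed. A small illustrative helper (pick_model is hypothetical, not part of FastChat; the served list would come from openai.Model.list()):

```python
# Hypothetical client-side fallback: use the first preferred model
# that a worker is actually serving.

def pick_model(available: list[str], preferences: list[str]) -> str:
    """Return the first preferred model present in the available list."""
    for name in preferences:
        if name in available:
            return name
    raise RuntimeError(f"none of {preferences} are being served")

served = ["Qwen-7B-Chat", "Qwen-14B-Chat"]  # e.g. parsed from openai.Model.list()
model = pick_model(served, ["Qwen-72B-Chat", "Qwen-14B-Chat", "Qwen-7B-Chat"])
print(model)  # → Qwen-14B-Chat
```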

Production Deployment

Systemd Services

Create systemd service files for each component:
[Unit]
Description=FastChat Controller
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.controller
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
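A matching worker unit might look like the following (a sketch assuming the same user, virtualenv, and paths as the controller unit above; save as fastchat-worker.service):

```ini
[Unit]
Description=FastChat vLLM Worker
After=network.target fastchat-controller.service
Requires=fastchat-controller.service

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat --trust-remote-code --dtype bfloat16
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```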
Manage services:
# Enable and start all services
sudo systemctl daemon-reload
sudo systemctl enable fastchat-controller fastchat-worker fastchat-api
sudo systemctl start fastchat-controller
sleep 5
sudo systemctl start fastchat-worker
sleep 10
sudo systemctl start fastchat-api

# Check status
sudo systemctl status fastchat-controller
sudo systemctl status fastchat-worker
sudo systemctl status fastchat-api

# View logs
sudo journalctl -u fastchat-worker -f

Docker Compose

Complete deployment with Docker Compose:
docker-compose.yml
version: '3.8'

services:
  controller:
    image: qwenllm/qwen:cu121
    container_name: fastchat-controller
    command: python -m fastchat.serve.controller
    ports:
      - "21001:21001"
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:21001"]
      interval: 30s
      timeout: 10s
      retries: 3

  worker:
    image: qwenllm/qwen:cu121
    container_name: fastchat-worker
    command: >
      python -m fastchat.serve.vllm_worker
      --model-path /models/Qwen-7B-Chat
      --trust-remote-code
      --dtype bfloat16
      --controller-address http://controller:21001
    volumes:
      - /path/to/models:/models:ro
    depends_on:
      - controller
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  api-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-api
    command: >
      python -m fastchat.serve.openai_api_server
      --host 0.0.0.0
      --port 8000
      --controller-address http://controller:21001
    ports:
      - "8000:8000"
    depends_on:
      - controller
      - worker
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3

  web-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-web
    command: >
      python -m fastchat.serve.gradio_web_server
      --controller-address http://controller:21001
    ports:
      - "7860:7860"
    depends_on:
      - controller
      - worker
    restart: always
Launch:
docker-compose up -d

# Scale the worker service (first remove the fixed container_name from
# the worker service, since scaled replicas need unique container names)
docker-compose up -d --scale worker=3

# View logs
docker-compose logs -f worker

Load Balancing

FastChat controller automatically load balances across multiple workers:
# Start controller
python -m fastchat.serve.controller

# Start multiple workers for same model (horizontal scaling)
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --port 21002 \
  --worker-address http://localhost:21002

CUDA_VISIBLE_DEVICES=1 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --port 21003 \
  --worker-address http://localhost:21003

CUDA_VISIBLE_DEVICES=2 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --port 21004 \
  --worker-address http://localhost:21004

# Start API server
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
The controller distributes requests across workers automatically.
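The selection logic can be sketched as follows. This is a simplified illustration of queue-based worker selection, not FastChat's actual implementation, and the addresses, speeds, and queue lengths are made-up numbers:

```python
import random

# Simplified sketch of two dispatch strategies a controller could use:
# pick the least-loaded worker, or pick randomly weighted by worker speed.

def shortest_queue(queue_lengths: dict[str, int]) -> str:
    """Pick the worker address with the fewest queued requests."""
    return min(queue_lengths, key=queue_lengths.get)

def lottery(speeds: dict[str, float]) -> str:
    """Pick a worker at random, weighted by its reported speed."""
    addrs = list(speeds)
    return random.choices(addrs, weights=[speeds[a] for a in addrs], k=1)[0]

queues = {
    "http://localhost:21002": 3,
    "http://localhost:21003": 0,
    "http://localhost:21004": 5,
}
print(shortest_queue(queues))  # → http://localhost:21003
```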

Monitoring

Worker Status

Check registered workers via the controller's list_models endpoint (a POST route):
curl -X POST http://localhost:21001/list_models

Health Checks

# API server health
curl http://localhost:8000/v1/models

# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 10
  }'
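For scripted readiness checks, the /v1/models response can be parsed client-side. A small sketch, assuming the standard OpenAI list shape ({"data": [{"id": ...}]}); the sample payload below is illustrative:

```python
# Turn a /v1/models response into a readiness check for a specific model.

def served_model_ids(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style model list response."""
    return [m["id"] for m in models_response.get("data", [])]

def is_ready(models_response: dict, expected: str) -> bool:
    """True once the expected model shows up in /v1/models."""
    return expected in served_model_ids(models_response)

sample = {"object": "list", "data": [{"id": "Qwen-7B-Chat", "object": "model"}]}
print(is_ready(sample, "Qwen-7B-Chat"))  # → True
```

Polling this until it returns True is a simple way to gate traffic while workers finish loading.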

Logging

Enable detailed logging:
# Set log level
export FASTCHAT_LOG_LEVEL=DEBUG

# Run with logging
python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code 2>&1 | tee worker.log

Troubleshooting

Error: Worker not appearing in controller

Solutions:
  • Check the controller is running: curl http://localhost:21001
  • Verify the controller address passed to the worker: --controller-address http://localhost:21001
  • Check network connectivity between services
  • Review logs for connection errors
Error: API returns empty model list

Solutions:
  • Ensure workers have registered successfully
  • Check controller status: curl -X POST http://localhost:21001/list_models
  • Wait for model loading to complete (it can take several minutes)
  • Check worker logs for errors
Issue: High latency in responses

Solutions:
  • Use the vLLM worker instead of the default model_worker
  • Increase --max-num-seqs on the worker
  • Add more workers for horizontal scaling
  • Enable tensor parallelism for large models
  • Use quantized models (Int4/Int8)
Error: Worker process exits unexpectedly

Solutions:
  • Check GPU memory: nvidia-smi
  • Reduce --gpu-memory-utilization
  • Use a smaller model or a quantized version
  • Check CUDA compatibility
  • Review system logs: dmesg | grep -i error

Advanced Features

Custom System Prompts

Set system prompts in the web UI or API:
import openai

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {
            "role": "system", 
            "content": "You are an expert Python programmer. Always provide code examples."
        },
        {
            "role": "user", 
            "content": "How do I read a CSV file?"
        }
    ]
)

Conversation History

Maintain multi-turn conversations:
import openai

history = []

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    
    response = openai.ChatCompletion.create(
        model="Qwen",
        messages=history
    )
    
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message

# Multi-turn conversation
print(chat("What is Python?"))
print(chat("What are its main features?"))
print(chat("Give me an example"))
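An unbounded history will eventually exceed the model's context window. A sketch of client-side trimming (trim_history is a hypothetical helper; it counts messages for simplicity, where a production client would count tokens):

```python
# Drop the oldest turns once the history grows past a budget,
# but always keep the system message.

def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    keep = max(max_messages - len(system), 0)
    return system + (rest[-keep:] if keep else [])

msgs = [{"role": "system", "content": "Be brief."}]
msgs += [{"role": "user", "content": f"q{i}"} for i in range(30)]
trimmed = trim_history(msgs, max_messages=5)
print(len(trimmed), trimmed[0]["role"])  # → 5 system
```

Calling this before each request keeps the payload bounded without losing the system prompt.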

Performance Comparison

FastChat vs Standalone

| Feature | Standalone vLLM | FastChat + vLLM |
|---|---|---|
| Performance | Same | Same |
| Web UI | No | Yes |
| Multi-model | Manual | Automatic |
| Load Balancing | External | Built-in |
| Setup Complexity | Low | Medium |
| Production Ready | Yes | Yes |

Next Steps

Production Guide

Best practices for production deployments

Monitoring

Set up comprehensive monitoring

Performance Tuning

Advanced optimization techniques

API Reference

Complete API documentation
