FastChat is a comprehensive platform for deploying LLMs with a web UI, a REST API, and distributed serving capabilities. Combined with vLLM, it delivers production-grade performance behind an intuitive interface.
Overview
FastChat provides a three-component architecture:

Controller: manages distributed workers and routes requests.
Model Worker: loads and serves the model (can use the vLLM backend).
API/UI Server: provides the web interface or an OpenAI-compatible API.
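The flow above can be sketched as a toy registry: workers register with the controller under a model name, and the API/UI server asks the controller which worker serves a request. This is an illustration only, not FastChat's actual code; the real controller adds heartbeats and queue-aware dispatch.

```python
# Toy sketch of controller/worker registration and routing.
# Illustrative only -- FastChat's real controller also tracks
# heartbeats, worker speed, and queue lengths.

class Controller:
    def __init__(self):
        self.workers = {}  # model name -> list of worker addresses

    def register_worker(self, model, address):
        self.workers.setdefault(model, []).append(address)

    def get_worker(self, model):
        # Return the first registered worker for the model, if any.
        addrs = self.workers.get(model)
        return addrs[0] if addrs else None

controller = Controller()
controller.register_worker("Qwen-7B-Chat", "http://localhost:21002")
print(controller.get_worker("Qwen-7B-Chat"))  # http://localhost:21002
```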
Installation
Install FastChat with the model worker extras, the web UI, and the vLLM backend:
pip install "fschat[model_worker,webui]==0.2.33" "openai<1.0" vllm
FastChat 0.2.33 is the recommended version for stability with Qwen models.
Quick Start
Web UI Deployment
Start the Controller
The controller manages model workers:
python -m fastchat.serve.controller
Runs on http://localhost:21001 by default.
Launch Model Worker
Start a vLLM worker for high-performance serving:
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16
Start Web Server
Launch the Gradio web interface:
python -m fastchat.serve.gradio_web_server
Access the UI at http://localhost:7860.
OpenAI API Deployment
Start Controller
python -m fastchat.serve.controller
Launch vLLM Worker
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16
Start API Server
python -m fastchat.serve.openai_api_server \
--host localhost \
--port 8000
Configuration
Worker Configuration
Single GPU (base configuration):
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16
For multiple GPUs, add --tensor-parallel-size N (for example, 4 for four GPUs). Quantized Int4 checkpoints are served the same way; point --model-path at the quantized model (for example, Qwen/Qwen-7B-Chat-Int4).
Worker Parameters
--model-path: path to the model checkpoint (Hugging Face repo or local path)
--tensor-parallel-size: number of GPUs for tensor parallelism
--dtype: model data type (auto, bfloat16, float16, float32)
--gpu-memory-utilization: fraction of GPU memory to use
--max-num-seqs: maximum number of concurrent sequences
--controller-address: controller address to register with
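The parameters above compose into a launch command. As a quick reference, here is a small hypothetical helper (build_worker_cmd is not part of FastChat) that assembles the argv from a config dict using the flag names listed above:

```python
# Hypothetical helper: build a fastchat.serve.vllm_worker command line
# from a config dict. Flag names match the worker parameters above.

def build_worker_cmd(cfg):
    cmd = ["python", "-m", "fastchat.serve.vllm_worker",
           "--model-path", cfg["model_path"]]
    if cfg.get("trust_remote_code"):
        cmd.append("--trust-remote-code")
    for key, flag in [("tensor_parallel_size", "--tensor-parallel-size"),
                      ("dtype", "--dtype"),
                      ("gpu_memory_utilization", "--gpu-memory-utilization"),
                      ("max_num_seqs", "--max-num-seqs"),
                      ("controller_address", "--controller-address")]:
        if key in cfg:
            cmd += [flag, str(cfg[key])]
    return cmd

cmd = build_worker_cmd({
    "model_path": "Qwen/Qwen-7B-Chat",
    "trust_remote_code": True,
    "tensor_parallel_size": 2,
    "dtype": "bfloat16",
    "gpu_memory_utilization": 0.9,
})
print(" ".join(cmd))
```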
API Server Configuration
python -m fastchat.serve.openai_api_server \
--host 0.0.0.0 \
--port 8000 \
--controller-address http://localhost:21001 \
--api-keys sk-key1 sk-key2
--host (string, default "localhost"): API server bind address
--controller-address: address of the controller service
--api-keys: list of valid API keys for authentication
API Usage
OpenAI Python Client
Basic Chat
The same endpoint also supports streaming responses, stop words, and function calling through the standard OpenAI request parameters.
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # Or your API key if configured

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    temperature=0.7,
    max_tokens=2048,
)
print(response.choices[0].message.content)
Unlike vLLM standalone mode, FastChat handles stop tokens automatically. No need to specify stop_token_ids.
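Streaming uses the same endpoint with stream=True; each chunk then carries an incremental delta rather than a full message. A sketch of accumulating a streamed reply, assuming the standard OpenAI 0.x chunk shape (choices[0].delta.content):

```python
# Sketch: accumulate the incremental deltas of a streamed chat response.
# Assumes the OpenAI 0.x streaming chunk shape: choices[0].delta.content.

def accumulate_stream(chunks):
    """Join the 'delta' content pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

# With a live server (pip install "openai<1.0"):
#   stream = openai.ChatCompletion.create(
#       model="Qwen",
#       messages=[{"role": "user", "content": "Hello"}],
#       stream=True,
#   )
#   print(accumulate_stream(stream))

# Offline demonstration with synthetic chunks:
fake = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(accumulate_stream(fake))  # Hello
```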
Multi-Model Deployment
Deploy multiple models simultaneously:
Start Controller
python -m fastchat.serve.controller
Launch Multiple Workers
Start workers on different ports:
# Worker 1: Qwen-7B
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16 \
--port 21002 \
--worker-address http://localhost:21002
# Worker 2: Qwen-14B
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-14B-Chat \
--trust-remote-code \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--port 21003 \
--worker-address http://localhost:21003
# Worker 3: Qwen-72B
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--port 21004 \
--worker-address http://localhost:21004
Start API Server
python -m fastchat.serve.openai_api_server \
--host 0.0.0.0 \
--port 8000
Model Selection
Clients can specify which model to use:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# List available models
models = openai.Model.list()
print(models)

# Use a specific model
response = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",  # Or "Qwen-14B-Chat", "Qwen-72B-Chat"
    messages=[{"role": "user", "content": "Hello"}],
)
Production Deployment
Systemd Services
Create systemd service files for each component:
fastchat-controller.service
fastchat-worker.service
fastchat-api.service
[Unit]
Description=FastChat Controller
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.controller
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
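The worker and API service files follow the same pattern. A hedged sketch of fastchat-worker.service, assuming the same /opt/qwen layout (adjust the model path and flags to your deployment):

```ini
[Unit]
Description=FastChat vLLM Worker
After=network.target fastchat-controller.service
Wants=fastchat-controller.service

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.vllm_worker \
    --model-path Qwen/Qwen-7B-Chat \
    --trust-remote-code \
    --dtype bfloat16
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```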
Manage services:
# Enable and start all services
sudo systemctl daemon-reload
sudo systemctl enable fastchat-controller fastchat-worker fastchat-api
sudo systemctl start fastchat-controller
sleep 5
sudo systemctl start fastchat-worker
sleep 10
sudo systemctl start fastchat-api
# Check status
sudo systemctl status fastchat-controller
sudo systemctl status fastchat-worker
sudo systemctl status fastchat-api
# View logs
sudo journalctl -u fastchat-worker -f
Docker Compose
Complete deployment with Docker Compose:
version: '3.8'

services:
  controller:
    image: qwenllm/qwen:cu121
    container_name: fastchat-controller
    command: python -m fastchat.serve.controller
    ports:
      - "21001:21001"
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:21001"]
      interval: 30s
      timeout: 10s
      retries: 3

  worker:
    image: qwenllm/qwen:cu121
    container_name: fastchat-worker
    # --worker-address must be reachable from the controller container
    command: >
      python -m fastchat.serve.vllm_worker
      --model-path /models/Qwen-7B-Chat
      --trust-remote-code
      --dtype bfloat16
      --host 0.0.0.0
      --worker-address http://worker:21002
      --controller-address http://controller:21001
    volumes:
      - /path/to/models:/models:ro
    depends_on:
      - controller
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  api-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-api
    command: >
      python -m fastchat.serve.openai_api_server
      --host 0.0.0.0
      --port 8000
      --controller-address http://controller:21001
    ports:
      - "8000:8000"
    depends_on:
      - controller
      - worker
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3

  web-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-web
    command: >
      python -m fastchat.serve.gradio_web_server
      --controller-address http://controller:21001
    ports:
      - "7860:7860"
    depends_on:
      - controller
      - worker
    restart: always
Launch:
docker-compose up -d
# Scale workers (first remove the fixed container_name from the worker
# service; Compose cannot scale a service with a static container name)
docker-compose up -d --scale worker=3
# View logs
docker-compose logs -f worker
Load Balancing
FastChat controller automatically load balances across multiple workers:
# Start controller
python -m fastchat.serve.controller
# Start multiple workers for same model (horizontal scaling)
# Each worker needs its own port so its advertised address is reachable
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--port 21002 \
--worker-address http://localhost:21002
CUDA_VISIBLE_DEVICES=1 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--port 21003 \
--worker-address http://localhost:21003
CUDA_VISIBLE_DEVICES=2 python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code \
--port 21004 \
--worker-address http://localhost:21004
# Start API server
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
The controller distributes requests across workers automatically.
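Recent FastChat versions choose among workers with a dispatch policy (lottery or shortest-queue). A simplified shortest-queue sketch, not FastChat's actual implementation, to show the idea: prefer the worker with the lowest queue length relative to its speed.

```python
# Simplified sketch of shortest-queue dispatch: send each request to the
# worker with the fewest requests in flight, weighted by worker speed.
# Illustrative only -- not FastChat's controller code.

def shortest_queue(workers):
    """workers: dict of address -> {'queue': int, 'speed': float}."""
    return min(workers, key=lambda a: workers[a]["queue"] / workers[a]["speed"])

workers = {
    "http://localhost:21002": {"queue": 4, "speed": 1.0},
    "http://localhost:21003": {"queue": 1, "speed": 1.0},
    "http://localhost:21004": {"queue": 3, "speed": 2.0},
}
print(shortest_queue(workers))  # http://localhost:21003
```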
Monitoring
Worker Status
Check registered workers (list_models is a POST endpoint on the controller):
curl -X POST http://localhost:21001/list_models
Health Checks
# API server health
curl http://localhost:8000/v1/models
# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen",
"messages": [{"role": "user", "content": "hello"}],
"max_tokens": 10
}'
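When starting components in order (controller, then workers, then API server), it is more robust to poll until the previous stage is healthy than to sleep for a fixed time. A generic readiness poller with the probe left injectable; the requests-based probe in the comment is an assumption, and any HTTP client works:

```python
import time

# Sketch: poll a health probe until it succeeds or a deadline passes.
# In production the probe could be, e.g.:
#   lambda: requests.get("http://localhost:8000/v1/models", timeout=5).ok

def wait_until_healthy(probe, timeout=60.0, interval=1.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        time.sleep(interval)
    return False

# Offline demonstration: a probe that succeeds on the third attempt.
attempts = {"n": 0}
def flaky_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until_healthy(flaky_probe, timeout=5.0, interval=0.0))  # True
```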
Logging
Enable detailed logging:
# Set log level
export FASTCHAT_LOG_LEVEL=DEBUG
# Run with logging
python -m fastchat.serve.vllm_worker \
--model-path Qwen/Qwen-7B-Chat \
--trust-remote-code 2>&1 | tee worker.log
Troubleshooting
Worker registration fails
Error: the worker does not appear in the controller.
Solutions:
Check controller is running: curl http://localhost:21001
Verify controller address in worker: --controller-address http://localhost:21001
Check network connectivity between services
Review logs for connection errors
API returns 'No available models'
Error: the API returns an empty model list.
Solutions:
Ensure workers have registered successfully
Check controller status: curl -X POST http://localhost:21001/list_models
Wait for model loading to complete (can take minutes)
Check worker logs for errors
High latency
Issue: responses are slow.
Solutions:
Use vLLM worker instead of model_worker
Increase --max-num-seqs on worker
Add more workers for horizontal scaling
Enable tensor parallelism for large models
Use quantized models (Int4/Int8)
Worker crashes
Error: the worker process exits unexpectedly.
Solutions:
Check GPU memory: nvidia-smi
Reduce --gpu-memory-utilization
Use smaller model or quantized version
Check CUDA compatibility
Review system logs: dmesg | grep -i error
Advanced Features
Custom System Prompts
Set system prompts in the web UI or API:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {
            "role": "system",
            "content": "You are an expert Python programmer. Always provide code examples.",
        },
        {
            "role": "user",
            "content": "How do I read a CSV file?",
        },
    ],
)
print(response.choices[0].message.content)
Conversation History
Maintain multi-turn conversations:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

history = []

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    response = openai.ChatCompletion.create(
        model="Qwen",
        messages=history,
    )
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Multi-turn conversation
print(chat("What is Python?"))
print(chat("What are its main features?"))
print(chat("Give me an example"))
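The history list above grows without bound and will eventually exceed the model's context window. A sketch of capping it, keeping the system message plus the last few exchanges (the turn limit is an arbitrary example):

```python
# Sketch: cap conversation history so the prompt does not grow without
# bound. Keeps the system message (if any) plus the last `max_turns`
# user/assistant exchanges.

def trim_history(history, max_turns=5):
    system = [m for m in history if m["role"] == "system"][:1]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

history = [{"role": "system", "content": "Be concise."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=3)
print(len(trimmed))  # 7: system message + last 3 exchanges
```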
FastChat vs Standalone
| Feature | Standalone vLLM | FastChat + vLLM |
|---|---|---|
| Performance | Same | Same |
| Web UI | No | Yes |
| Multi-model | Manual | Automatic |
| Load Balancing | External | Built-in |
| Setup Complexity | Low | Medium |
| Production Ready | Yes | Yes |
Next Steps
Production Guide: best practices for production deployments
Monitoring: set up comprehensive monitoring
Performance Tuning: advanced optimization techniques
API Reference: complete API documentation