Quick Start
Launch the web server with default settings. The server starts at http://localhost:8000 and prints the URL to the console.
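A minimal launch might look like the following; the `python -m scripts.chat_web` invocation is an assumption based on the script's path:

```shell
# start the server with all defaults (1 GPU, SFT model, port 8000)
python -m scripts.chat_web
```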
Multi-GPU Support
The web server uses data parallelism to distribute requests across multiple GPUs. Each GPU loads a full copy of the model, and incoming requests are distributed to available workers.
Single GPU (default)
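With no flags, a single worker serves all requests (invocation assumed from the script path):

```shell
# default: one GPU, one model replica
python -m scripts.chat_web
```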
Multiple GPUs
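To spread load across several GPUs, pass the GPU count; each GPU then holds its own model replica (invocation assumed from the script path):

```shell
# one model replica per GPU; requests are load-balanced across 4 workers
python -m scripts.chat_web --num-gpus 4
```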
Server Configuration
Model Selection
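The model is chosen with the source, tag, and step flags from the reference table below; the tag and step values here are placeholders, and the invocation is assumed from the script path:

```shell
# serve the SFT model (the default source)
python -m scripts.chat_web --source sft

# serve an RL model at a specific tag and training step (placeholder values)
python -m scripts.chat_web --source rl --model-tag production-v3 --step 20000
```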
Default Generation Parameters
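Server-side defaults apply when a request omits its own sampling parameters; the values shown match the flag defaults in the reference table (invocation assumed from the script path):

```shell
# set the fallback sampling parameters for requests that omit them
python -m scripts.chat_web --temperature 0.8 --top-k 50 --max-tokens 512
```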
Network Configuration
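Port and bind address are controlled with `--port` and `--host` (invocation assumed from the script path):

```shell
# serve on a different port, restricted to localhost
python -m scripts.chat_web --port 8080 --host 127.0.0.1
```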
Device and Precision
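Device and precision are auto-detected by default but can be pinned explicitly (invocation assumed from the script path):

```shell
# force CUDA with bfloat16 precision
python -m scripts.chat_web --device-type cuda --dtype bfloat16
```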
API Endpoints
The server exposes the following REST API endpoints:
Chat UI
Chat Completions (Streaming)
Health Check
Statistics
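As an illustration, the endpoints above could be exercised with curl once the server is running. The paths shown here are assumptions, not confirmed from the source; check the server's actual routes before relying on them:

```shell
# hypothetical endpoint paths -- verify against the server's routes
curl http://localhost:8000/health
curl http://localhost:8000/stats

# streaming chat completion (-N disables curl's output buffering)
curl -N -X POST http://localhost:8000/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "temperature": 0.8}'
```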
Abuse Prevention
The server includes built-in limits to prevent abuse:
- Maximum 500 messages per request
- Maximum 8,000 characters per message
- Maximum 32,000 characters total conversation length
- Temperature clamped to 0.0-2.0
- Top-k clamped to 0-200 (0 disables top-k, using full vocabulary)
- Max tokens clamped to 1-4,096
From scripts/chat_web.py:52-61:
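Since the referenced snippet is not reproduced here, the following is a hedged sketch of what such validation could look like. All names are hypothetical (not taken from scripts/chat_web.py), and messages are plain strings for illustration:

```python
# Hypothetical request-validation sketch matching the limits listed above.
MAX_MESSAGES = 500          # messages per request
MAX_MESSAGE_CHARS = 8_000   # characters per message
MAX_TOTAL_CHARS = 32_000    # characters per conversation

def clamp(value, lo, hi):
    """Restrict value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))

def sanitize(messages, temperature, top_k, max_tokens):
    """Reject oversized conversations and clamp sampling parameters."""
    if len(messages) > MAX_MESSAGES:
        raise ValueError("too many messages")
    total = 0
    for msg in messages:
        if len(msg) > MAX_MESSAGE_CHARS:
            raise ValueError("message too long")
        total += len(msg)
    if total > MAX_TOTAL_CHARS:
        raise ValueError("conversation too long")
    return (
        clamp(temperature, 0.0, 2.0),  # temperature clamped to 0.0-2.0
        clamp(top_k, 0, 200),          # 0 disables top-k (full vocabulary)
        clamp(max_tokens, 1, 4096),    # max tokens clamped to 1-4,096
    )
```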
Complete Examples
Production Multi-GPU Deployment
- RL model tagged “production-v3”
- Temperature 0.7
- Top-k 50
- Max tokens 1024
- Port 8000
- Accessible from all network interfaces
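The settings above combine into a single launch command (invocation assumed from the script path):

```shell
python -m scripts.chat_web \
  --source rl --model-tag production-v3 \
  --temperature 0.7 --top-k 50 --max-tokens 1024 \
  --port 8000 --host 0.0.0.0
```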
Local Development Server
- SFT model
- Temperature 1.0 (more creative)
- Port 8080
- Localhost only
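As one command (invocation assumed from the script path; `--host 127.0.0.1` overrides the all-interfaces default to keep the server local):

```shell
python -m scripts.chat_web --source sft --temperature 1.0 --port 8080 --host 127.0.0.1
```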
CPU-Only Server
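For machines without a GPU, a launch might look like this. The float32 choice is an assumption on my part: the bfloat16 default targets GPUs, and float32 is generally the safer CPU precision:

```shell
# run entirely on CPU (slow; intended for testing, not serving)
python -m scripts.chat_web --device-type cpu --dtype float32
```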
Technical Details
Worker Pool Architecture
The server uses an async worker pool to manage concurrent requests across GPUs. From scripts/chat_web.py:98-148:
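The referenced snippet is not reproduced here; below is an illustrative asyncio sketch of the idea, with hypothetical names rather than the actual scripts/chat_web.py code. Each GPU index sits in a queue; a request checks one out for the duration of generation, then returns it, so at most one request runs per GPU at a time:

```python
import asyncio

class WorkerPool:
    """Hypothetical data-parallel pool: free GPU ids live in a queue."""

    def __init__(self, num_gpus):
        self.free = asyncio.Queue()
        for gpu_id in range(num_gpus):
            self.free.put_nowait(gpu_id)

    async def run(self, fn, *args):
        gpu_id = await self.free.get()        # wait for a free GPU
        try:
            return await fn(gpu_id, *args)    # run the request on that GPU
        finally:
            self.free.put_nowait(gpu_id)      # release the GPU for the next request

async def demo():
    pool = WorkerPool(num_gpus=2)

    async def generate(gpu_id, prompt):
        await asyncio.sleep(0)                # stand-in for model inference
        return f"gpu{gpu_id}:{prompt}"

    # three concurrent requests share two GPUs; the third waits for a free slot
    return await asyncio.gather(*(pool.run(generate, p) for p in "abc"))

print(asyncio.run(demo()))
```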
Streaming with UTF-8 Handling
The server properly handles multi-byte UTF-8 characters (like emojis) by accumulating tokens and only yielding when the decoded string is valid. From scripts/chat_web.py:277-309:
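The referenced code is not shown here; this is a minimal sketch of the buffering technique described above, with hypothetical names rather than the actual implementation. Bytes accumulate until the buffer decodes as valid UTF-8, so a multi-byte character is never emitted half-finished:

```python
def stream_text(byte_chunks):
    """Yield decoded text, buffering incomplete UTF-8 sequences across chunks."""
    buf = b""
    for chunk in byte_chunks:
        buf += chunk
        try:
            text = buf.decode("utf-8")
        except UnicodeDecodeError:
            continue          # incomplete multi-byte sequence; keep buffering
        if text:
            yield text
            buf = b""
    # any leftover bytes form an invalid sequence and are dropped

# The 4-byte emoji U+1F600 arrives split across two chunks but decodes intact:
chunks = [b"hi ", b"\xf0\x9f", b"\x98\x80", b" ok"]
print("".join(stream_text(chunks)))  # -> hi 😀 ok
```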
All Flags Reference
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --num-gpus | -n | int | 1 | Number of GPUs to use |
| --source | -i | str | sft | Model source: sft or rl |
| --temperature | -t | float | 0.8 | Default temperature |
| --top-k | -k | int | 50 | Default top-k sampling |
| --max-tokens | -m | int | 512 | Default max tokens |
| --model-tag | -g | str | None | Model tag to load |
| --step | -s | int | None | Training step to load |
| --port | -p | int | 8000 | Server port |
| --dtype | -d | str | bfloat16 | Precision: float32 or bfloat16 |
| --device-type | | str | auto | Device: cuda, cpu, or mps |
| --host | | str | 0.0.0.0 | Host to bind to |