Overview
The `sglang generate` command runs offline inference on multimodal diffusion models. It provides a convenient way to generate images, videos, or other outputs without starting a server, and is currently supported only for diffusion models.
Basic Usage
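A minimal invocation looks like the following. Only `--model-path` and `--prompt` are required; the model ID shown is just an illustration:

```shell
# Generate a single image from a text prompt.
# --model-path and --prompt are the only required arguments.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "An astronaut riding a horse on the moon"
```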
Required Arguments
Path or name of the diffusion model to use. Can be:
- HuggingFace model ID (e.g., `stabilityai/stable-diffusion-xl-base-1.0`)
- Local path to a model directory
- ModelScope model ID (when using `SGLANG_USE_MODELSCOPE=1`)
Text prompt describing what to generate.
Model Configuration
Path to a JSON or YAML configuration file containing model and generation parameters. When provided, `--model-path` and `--prompt` become optional.
Explicit model ID override (e.g., "Qwen-Image").
Model backend to use. Options:
- `auto`: Automatically select a backend (prefer the sglang native implementation, fall back to diffusers)
- `sglang`: Use sglang's native optimized implementation
- `diffusers`: Use the vanilla diffusers pipeline (supports all diffusers models)
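For example, to force the vanilla diffusers pipeline — note that the flag name `--model-backend` is an assumption based on the description above; verify the exact name with `sglang generate --help`:

```shell
# Force the diffusers backend (--model-backend is an assumed flag name).
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A watercolor painting of a lighthouse" \
  --model-backend diffusers
```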
Trust remote code from HuggingFace.
Model revision (branch/tag name or commit ID).
Sampling Parameters
Generation Settings
Negative prompt to guide what not to generate.
Number of denoising steps. More steps generally produce higher quality but take longer.
Guidance scale for classifier-free guidance. Higher values follow the prompt more closely.
Output height in pixels.
Output width in pixels.
Random seed for reproducibility.
Batch Generation
Number of samples to generate.
Parallelism Options
Number of GPUs to use for inference.
Tensor parallelism size.
Sequence parallelism degree.
Ulysses sequence parallelism degree for long sequences.
Ring sequence parallelism degree.
Data parallelism size (number of data parallel groups).
Number of GPUs in a data parallel group.
Enable classifier-free guidance parallelism.
Attention Backend
Attention backend to use for the model.
Additional configuration for the attention backend (JSON format).
Cache-DIT configuration for diffusers backend.
CPU Offloading
Offload DiT (Diffusion Transformer) model to CPU to save GPU memory.
Enable layer-wise offloading for DiT model.
Offload text encoder to CPU.
Offload image encoder to CPU.
Offload VAE (Variational AutoEncoder) to CPU.
LoRA Adapters
Path to LoRA adapter weights.
Nickname for the LoRA adapter (for swapping adapters in the pipeline).
LoRA scale for merging (e.g., 0.125 for Hyper-SD).
List of module names to apply LoRA to (e.g., “q_proj,k_proj”).
Quantization
Path to pre-quantized transformer weights (single .safetensors file or directory).
Nunchaku SVDQuant configuration for model quantization.
Performance Options
Enable PyTorch compilation for faster inference.
Run warmup iterations before generation.
Number of warmup steps to run.
Disable automatic mixed precision.
Output Options
Directory path to save generated outputs.
Path to dump performance metrics (JSON) for the run.
Advanced Options
Additional keyword arguments to pass to the diffusers pipeline (JSON format). Example:
`--diffusers-kwargs '{"eta": 0.5, "use_karras_sigmas": true}'`
Override paths for specific pipeline components (JSON format). Example:
`--component-paths '{"vae": "path/to/custom/vae"}'`
Override the pipeline class from `model_index.json`.
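Both `--diffusers-kwargs` and `--component-paths` can be combined in one run; the VAE path below is a placeholder:

```shell
# Pass extra pipeline kwargs and swap in a custom VAE.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A misty forest at dawn" \
  --diffusers-kwargs '{"eta": 0.5, "use_karras_sigmas": true}' \
  --component-paths '{"vae": "path/to/custom/vae"}'
```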
Examples
Basic Image Generation
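A sketch of the simplest case, using only the two required arguments:

```shell
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A photo of a corgi wearing sunglasses"
```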
High-Quality Generation with Custom Settings
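A hedged sketch: every flag except `--model-path` and `--prompt` is an assumed name inferred from the sampling-parameter descriptions above; confirm with `sglang generate --help`:

```shell
# Assumed flag names: --negative-prompt, --num-inference-steps,
# --guidance-scale, --height, --width, --seed.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A detailed oil painting of a mountain lake" \
  --negative-prompt "blurry, low quality" \
  --num-inference-steps 50 \
  --guidance-scale 7.5 \
  --height 1024 --width 1024 \
  --seed 42
```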
Multi-GPU Inference
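A sketch of running on several GPUs; the flag names here are assumptions mapped from the parallelism options above (number of GPUs, tensor parallelism size):

```shell
# Assumed flag names: --num-gpus, --tp-size.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A cyberpunk city street at night" \
  --num-gpus 4 \
  --tp-size 2
```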
Batch Generation
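A sketch of generating several samples in one run; the flag names other than `--model-path` and `--prompt` are assumptions based on the batch-generation and seed descriptions above:

```shell
# Assumed flag names: --num-samples, --seed.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A studio portrait of a red fox" \
  --num-samples 4 \
  --seed 7
```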
Using LoRA Adapters
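A sketch of applying a LoRA adapter; the flag names are assumptions based on the LoRA options above, and the 0.125 scale echoes the Hyper-SD example given there:

```shell
# Assumed flag names: --lora-path, --lora-scale.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "An anime-style illustration of a samurai" \
  --lora-path path/to/lora/adapter \
  --lora-scale 0.125
```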
CPU Offloading for Large Models
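A sketch of trading GPU memory for speed on a large model; the offloading flag names are assumptions mapped from the CPU-offloading options above:

```shell
# Assumed flag names: --dit-cpu-offload, --text-encoder-cpu-offload.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A render of a futuristic concept car" \
  --dit-cpu-offload \
  --text-encoder-cpu-offload
```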
Using Configuration File
Create a config file `generation_config.json`:
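A sketch of what such a file might contain; the field names are assumptions mapped from the parameters documented above:

```json
{
  "model_path": "stabilityai/stable-diffusion-xl-base-1.0",
  "prompt": "A snowy cabin in the woods",
  "num_inference_steps": 30,
  "guidance_scale": 7.0,
  "height": 1024,
  "width": 1024,
  "seed": 42
}
```

The file would then be passed via the configuration-file flag (its exact name is not shown in this section; check `sglang generate --help`), after which `--model-path` and `--prompt` become optional.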
Performance Benchmarking
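`--perf-dump-path` is documented in the output options above; the warmup flag name is an assumption based on the performance options:

```shell
# Dump timing metrics to a JSON file (--warmup is an assumed flag name).
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A macro photo of a dewdrop on a leaf" \
  --warmup \
  --perf-dump-path metrics.json
```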
Output
Generated outputs are saved to the specified output directory (default: `outputs/`). The command will display generation progress and save:
- Generated images/videos in the output directory
- Performance metrics (if `--perf-dump-path` is specified)
Limitations
Help
To see all available options, run `sglang generate --help`.
Related Commands
- `sglang serve` - Launch the SGLang server
- `sglang version` - Show version information
