Overview
The `sglang generate` command runs offline inference on multimodal diffusion models. It provides a convenient way to generate images, videos, or other outputs without starting a server, and is currently supported only for diffusion models.
Basic Usage
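A minimal invocation looks like the following. Only `--model-path` and `--prompt` are required; the model ID shown is just an illustration:

```shell
# Generate a single image from a text prompt.
# --model-path and --prompt are the only required arguments.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "An astronaut riding a horse on the moon"
```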
Required Arguments
Path or name of the diffusion model to use. Can be:
- HuggingFace model ID (e.g., `stabilityai/stable-diffusion-xl-base-1.0`)
- Local path to a model directory
- ModelScope model ID (when using `SGLANG_USE_MODELSCOPE=1`)
Text prompt describing what to generate.
Model Configuration
Path to a JSON or YAML configuration file containing model and generation parameters. When provided, `--model-path` and `--prompt` become optional.
Explicit model ID override (e.g., "Qwen-Image").
Model backend to use. Options:
- `auto`: Automatically select a backend (prefer the sglang native implementation, fall back to diffusers)
- `sglang`: Use sglang's native optimized implementation
- `diffusers`: Use the vanilla diffusers pipeline (supports all diffusers models)
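For example, to force the vanilla diffusers pipeline — note that the flag name `--model-backend` is an assumption based on the description above; verify the exact name with `sglang generate --help`:

```shell
# Force the diffusers backend (--model-backend is an assumed flag name).
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A watercolor painting of a lighthouse" \
  --model-backend diffusers
```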
Trust remote code from HuggingFace.
Model revision (branch/tag name or commit ID).
Sampling Parameters
Generation Settings
Negative prompt to guide what not to generate.
Number of denoising steps. More steps generally produce higher quality but take longer.
Guidance scale for classifier-free guidance. Higher values follow the prompt more closely.
Output height in pixels.
Output width in pixels.
Random seed for reproducibility.
Batch Generation
Number of samples to generate.
Parallelism Options
Number of GPUs to use for inference.
Tensor parallelism size.
Sequence parallelism degree.
Ulysses sequence parallelism degree for long sequences.
Ring sequence parallelism degree.
Data parallelism size (number of data parallel groups).
Number of GPUs in a data parallel group.
Enable classifier-free guidance parallelism.
Attention Backend
Attention backend to use for the model.
Additional configuration for the attention backend (JSON format).
Cache-DIT configuration for diffusers backend.
CPU Offloading
Offload DiT (Diffusion Transformer) model to CPU to save GPU memory.
Enable layer-wise offloading for DiT model.
Offload text encoder to CPU.
Offload image encoder to CPU.
Offload VAE (Variational AutoEncoder) to CPU.
LoRA Adapters
Path to LoRA adapter weights.
Nickname for the LoRA adapter (for swapping adapters in the pipeline).
LoRA scale for merging (e.g., 0.125 for Hyper-SD).
List of module names to apply LoRA to (e.g., “q_proj,k_proj”).
Quantization
Path to pre-quantized transformer weights (single .safetensors file or directory).
Nunchaku SVDQuant configuration for model quantization.
Performance Options
Enable PyTorch compilation for faster inference.
Run warmup iterations before generation.
Number of warmup steps to run.
Disable automatic mixed precision.
Output Options
Directory path to save generated outputs.
Path to dump performance metrics (JSON) for the run.
Advanced Options
Additional keyword arguments to pass to the diffusers pipeline (JSON format). Example:
`--diffusers-kwargs '{"eta": 0.5, "use_karras_sigmas": true}'`
Override paths for specific pipeline components (JSON format). Example:
`--component-paths '{"vae": "path/to/custom/vae"}'`
Override the pipeline class from `model_index.json`.
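Both `--diffusers-kwargs` and `--component-paths` can be combined in one run; the VAE path below is a placeholder:

```shell
# Pass extra pipeline kwargs and swap in a custom VAE.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A misty forest at dawn" \
  --diffusers-kwargs '{"eta": 0.5, "use_karras_sigmas": true}' \
  --component-paths '{"vae": "path/to/custom/vae"}'
```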
Examples
Basic Image Generation
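A sketch of the simplest case, using only the two required arguments:

```shell
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A photo of a corgi wearing sunglasses"
```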
High-Quality Generation with Custom Settings
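A hedged sketch: every flag except `--model-path` and `--prompt` is an assumed name inferred from the sampling-parameter descriptions above; confirm with `sglang generate --help`:

```shell
# Assumed flag names: --negative-prompt, --num-inference-steps,
# --guidance-scale, --height, --width, --seed.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A detailed oil painting of a mountain lake" \
  --negative-prompt "blurry, low quality" \
  --num-inference-steps 50 \
  --guidance-scale 7.5 \
  --height 1024 --width 1024 \
  --seed 42
```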
Multi-GPU Inference
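A sketch of running on several GPUs; the flag names here are assumptions mapped from the parallelism options above (number of GPUs, tensor parallelism size):

```shell
# Assumed flag names: --num-gpus, --tp-size.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A cyberpunk city street at night" \
  --num-gpus 4 \
  --tp-size 2
```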
Batch Generation
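A sketch of generating several samples in one run; the flag names other than `--model-path` and `--prompt` are assumptions based on the batch-generation and seed descriptions above:

```shell
# Assumed flag names: --num-samples, --seed.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A studio portrait of a red fox" \
  --num-samples 4 \
  --seed 7
```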
Using LoRA Adapters
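A sketch of applying a LoRA adapter; the flag names are assumptions based on the LoRA options above, and the 0.125 scale echoes the Hyper-SD example given there:

```shell
# Assumed flag names: --lora-path, --lora-scale.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "An anime-style illustration of a samurai" \
  --lora-path path/to/lora/adapter \
  --lora-scale 0.125
```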
CPU Offloading for Large Models
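A sketch of trading GPU memory for speed on a large model; the offloading flag names are assumptions mapped from the CPU-offloading options above:

```shell
# Assumed flag names: --dit-cpu-offload, --text-encoder-cpu-offload.
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A render of a futuristic concept car" \
  --dit-cpu-offload \
  --text-encoder-cpu-offload
```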
Using Configuration File
Create a config file `generation_config.json`:
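A sketch of what such a file might contain; the field names are assumptions mapped from the parameters documented above:

```json
{
  "model_path": "stabilityai/stable-diffusion-xl-base-1.0",
  "prompt": "A snowy cabin in the woods",
  "num_inference_steps": 30,
  "guidance_scale": 7.0,
  "height": 1024,
  "width": 1024,
  "seed": 42
}
```

The file would then be passed via the configuration-file flag (its exact name is not shown in this section; check `sglang generate --help`), after which `--model-path` and `--prompt` become optional.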
Performance Benchmarking
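`--perf-dump-path` is documented in the output options above; the warmup flag name is an assumption based on the performance options:

```shell
# Dump timing metrics to a JSON file (--warmup is an assumed flag name).
sglang generate \
  --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --prompt "A macro photo of a dewdrop on a leaf" \
  --warmup \
  --perf-dump-path metrics.json
```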
Output
Generated outputs are saved to the specified output directory (default: `outputs/`). The command will display generation progress and save:
- Generated images/videos in the output directory
- Performance metrics (if `--perf-dump-path` is specified)
Limitations
Help
To see all available options, run `sglang generate --help`.
Related Commands
- `sglang serve` - Launch the SGLang server
- `sglang version` - Show version information
