WanPipeline is a text-to-video and image-to-video generation pipeline using the WAN (World Action Network) transformer.
WanPipeline
Located in src/maxdiffusion/pipelines/wan/wan_pipeline.py:191
Components
- UMT5 tokenizer for text encoding
- UMT5 text encoder (google/umt5-xxl)
- Video VAE for encoding/decoding video frames
- Cache for VAE intermediate activations
- UniPC multistep scheduler for diffusion
- Scheduler state
- JAX mesh for distributed computation
- CLIP image processor (for I2V models)
- CLIP image encoder (for I2V models)
Methods
encode_prompt
Encodes text prompts using the UMT5 encoder.
Parameters:
- The prompt or prompts to guide video generation
- Negative prompts for classifier-free guidance
- Number of videos to generate per prompt
- Maximum sequence length for the text encoder
Returns:
- Encoded prompt embeddings
- Encoded negative prompt embeddings
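When several videos are generated per prompt, the text embeddings are duplicated along the batch dimension. A minimal NumPy sketch of that duplication (the function name `repeat_embeddings` and the shapes are illustrative, not the pipeline's actual API):

```python
import numpy as np

def repeat_embeddings(prompt_embeds: np.ndarray, num_videos_per_prompt: int) -> np.ndarray:
    """Tile per-prompt embeddings so each prompt yields several videos.

    prompt_embeds: (batch, seq_len, dim) text-encoder output.
    Returns: (batch * num_videos_per_prompt, seq_len, dim).
    """
    batch, seq_len, dim = prompt_embeds.shape
    # Repeat along a new axis, then merge the batch and repeat axes,
    # so all videos for the same prompt are adjacent in the output batch.
    repeated = np.repeat(prompt_embeds[:, None], num_videos_per_prompt, axis=1)
    return repeated.reshape(batch * num_videos_per_prompt, seq_len, dim)

embeds = np.random.rand(2, 512, 4096)  # e.g. seq_len 512, UMT5-XXL hidden size 4096
out = repeat_embeddings(embeds, num_videos_per_prompt=3)
print(out.shape)  # (6, 512, 4096)
```

The same tiling is applied to the negative embeddings for classifier-free guidance.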
encode_image
Encodes images using CLIP encoder (for WAN 2.1 I2V).
Parameters:
- Input image(s) for image-to-video generation
- Number of videos per prompt
Returns:
- CLIP image embeddings
prepare_latents
Prepares initial latent tensors for video generation.
Parameters:
- Batch size
- Temporal downsampling factor
- Spatial downsampling factor
- Height of generated videos
- Width of generated videos
- Number of frames in the video
- Number of channels in latent space
Returns:
- Random latent tensors for video generation
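The latent shape follows directly from these parameters. A sketch of the computation, assuming the downsampling factors typical of Wan-style causal video VAEs (temporal 4, spatial 8, 16 latent channels; treat the defaults and the `(num_frames - 1) // factor + 1` rule as assumptions, not the pipeline's exact code):

```python
def latent_shape(batch_size, num_frames, height, width,
                 num_channels_latents=16,
                 vae_scale_factor_temporal=4,
                 vae_scale_factor_spatial=8):
    """Shape of the initial noise latents for video generation.

    A causal video VAE keeps the first frame and downsamples the rest
    temporally, hence the (num_frames - 1) // factor + 1 latent frames.
    """
    latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
    return (batch_size,
            num_channels_latents,
            latent_frames,
            height // vae_scale_factor_spatial,
            width // vae_scale_factor_spatial)

print(latent_shape(1, 81, 480, 832))  # (1, 16, 21, 60, 104)
```

For example, an 81-frame 480x832 video yields 21 latent frames at 60x104 spatial resolution.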
prepare_latents_i2v_base
Prepares latent conditioning for image-to-video generation.
Parameters:
- Input image tensor
- Number of frames to generate
- Data type for latents
- Optional last frame for video bookending
Returns:
- VAE-encoded latents of the image(s)
- Input to the VAE
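The "bookending" idea can be sketched as assembling a pixel-space video whose first frame is the input image (and whose last frame is the optional final image), with the remaining frames zero-filled before VAE encoding. This NumPy sketch mirrors the concept only, not the method's actual implementation:

```python
import numpy as np

def build_vae_input(image, num_frames, last_image=None):
    """Assemble the pixel-space video fed to the VAE for I2V conditioning.

    image: (batch, channels, height, width) first frame.
    Unconditioned frames are zero-filled; if last_image is given, the
    final frame is set to it ("bookending").
    """
    b, c, h, w = image.shape
    video = np.zeros((b, c, num_frames, h, w), dtype=image.dtype)
    video[:, :, 0] = image
    if last_image is not None:
        video[:, :, -1] = last_image
    return video

img = np.ones((1, 3, 8, 8))
video = build_vae_input(img, num_frames=5)
print(video.shape)  # (1, 3, 5, 8, 8)
```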
Class methods
load_transformer
Loads the WAN transformer model with sharding.
Parameters:
- Array of devices
- JAX mesh
- Random number generators
- Configuration
- Optional checkpoint to restore from
Returns:
- Loaded and sharded transformer model
load_vae
Loads the video VAE with sharding.
Returns:
- Loaded video VAE
- VAE cache for intermediate activations
quantize_transformer
Quantizes the transformer using Qwix.
Parameters:
- Configuration with quantization settings
- Model to quantize
Returns:
- Quantized model
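Qwix's actual API is not reproduced here; the following only illustrates the underlying idea of symmetric per-tensor int8 weight quantization, where each tensor is stored as int8 values plus one float scale:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: int8 values plus a scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(np.max(np.abs(w - w_hat)) < 0.01)  # True: small reconstruction error
```

Real schemes (including FP8 and per-channel variants) refine this, but the storage trade-off is the same: 8-bit values in place of 16- or 32-bit floats.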
Key features
- Video generation: Generates high-quality videos from text or images
- Temporal consistency: Uses 3D attention for temporal coherence
- Flow matching scheduler: Uses flow matching for efficient sampling
- Multiple model variants: Supports WAN 2.1 and 2.2 architectures
- I2V conditioning: Image-to-video via CLIP embeddings (2.1) or VAE latents (2.2)
- Quantization support: FP8/INT8 quantization via Qwix
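The UniPC scheduler used here is a multistep solver, but the core flow-matching update it refines is a first-order integration of the model's predicted velocity field. A sketch of a single Euler step (illustrative only, not the scheduler's code):

```python
import numpy as np

def euler_flow_step(x, velocity, sigma, sigma_next):
    """One first-order Euler step of flow-matching sampling.

    The model predicts a velocity v(x, sigma); integrating
    dx/dsigma = v from sigma to sigma_next gives this update.
    """
    return x + (sigma_next - sigma) * velocity

# Toy check: if the clean sample is zero, then x at sigma=1 is pure noise
# and the velocity equals that noise, so one full step (sigma 1 -> 0)
# carries x exactly to zero.
x0 = np.ones((2, 2))
print(euler_flow_step(x0, x0, 1.0, 0.0))  # [[0. 0.] [0. 0.]]
```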
Model variants
WAN 2.1
- Uses CLIP image encoder for I2V conditioning
- Image embeddings are passed to transformer
WAN 2.2
- Uses VAE latent conditioning for I2V
- No CLIP image encoder required