Stable Video Diffusion (SVD)
SVD is a foundational image-to-video model that converts a static image into a smooth, temporally coherent video sequence.
Architecture
Model Configuration:
- Model channels: 320
- In channels: 8
- Transformer depth: [1, 1, 1, 1, 1, 1, 0, 0]
- Context dimension: 1024
- ADM channels: 768
- Temporal attention: Enabled
- Temporal resblocks: Enabled

Latent Format:
- Same as SD15 (4 channels, 8x spatial downscaling)
- Sigma range: 0.002 to 700.0
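The latent layout above implies a simple shape calculation. A minimal sketch (the helper name `svd_latent_shape` is illustrative, not a ComfyUI API):

```python
def svd_latent_shape(frames: int, height: int, width: int) -> tuple:
    # SVD reuses the SD1.5 latent layout: 4 channels, 8x spatial downscaling.
    return (frames, 4, height // 8, width // 8)

# Default SVD output (14 frames at 1024x576):
print(svd_latent_shape(14, 576, 1024))  # (14, 4, 72, 128)
```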
Usage
Loading the Model:

SVD_img2vid_Conditioning node:
- Input image: Reference frame
- Width/Height: Output resolution (default 1024x576)
- Video frames: Number of frames (default 14)
- Motion bucket ID: Controls motion amount (1-1023, default 127)
- FPS: Frames per second (default 6)
- Augmentation level: Noise added to input (0-10, default 0)
- Use VideoLinearCFGGuidance for linear CFG scaling
- Use VideoTriangleCFGGuidance for triangle wave CFG
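The parameter ranges above can be sanity-checked before queueing a workflow. A hedged sketch (`validate_svd_params` is a hypothetical helper, not a ComfyUI node):

```python
def validate_svd_params(motion_bucket_id=127, fps=6, augmentation_level=0.0,
                        video_frames=14, width=1024, height=576):
    # Mirrors the documented SVD_img2vid_Conditioning defaults and ranges.
    if not 1 <= motion_bucket_id <= 1023:
        raise ValueError("motion_bucket_id must be in 1-1023")
    if not 0.0 <= augmentation_level <= 10.0:
        raise ValueError("augmentation_level must be in 0-10")
    if fps <= 0 or video_frames <= 0:
        raise ValueError("fps and video_frames must be positive")
    return {"motion_bucket_id": motion_bucket_id, "fps": fps,
            "augmentation_level": augmentation_level,
            "video_frames": video_frames, "width": width, "height": height}
```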
Mochi
Architecture
Model Configuration:
- Image model: mochi_preview
- Shift: 6.0, multiplier: 1.0
- Latent format: Mochi (12 channels)
- Memory usage factor: 2.0
- Supported dtypes: BF16, FP32

Latent Format:
- Channels: 12
- Temporal compression: 6x (length = (frames - 1) // 6 + 1)
- Spatial compression: 8x
Text Encoder
- Model: T5-XXL (Mochi variant)
- Tokenizer: MochiT5Tokenizer
Usage
Creating Empty Latent:
- Default resolution: 848x480
- Width and height must be multiples of 16
- Frame count: 7, 13, 19, 25, 31, 37, etc. (7 + 6n)
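The 7 + 6n rule follows directly from the 6x temporal compression. A small sketch of the arithmetic (helper names are illustrative):

```python
def mochi_latent_length(frames: int) -> int:
    # 6x temporal compression: length = (frames - 1) // 6 + 1
    return (frames - 1) // 6 + 1

def is_valid_mochi_frame_count(frames: int) -> bool:
    # Valid counts land exactly on a compression boundary: 7, 13, 19, ... (7 + 6n)
    return frames >= 7 and (frames - 7) % 6 == 0

print(mochi_latent_length(37))         # 7
print(is_valid_mochi_frame_count(25))  # True
print(is_valid_mochi_frame_count(24))  # False
```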
LTX-Video
Architecture
Model Configuration:
- Image model: ltxv
- Shift: 2.37
- Latent format: LTXV (128 channels)
- Memory usage factor: 5.5 (scales with cross-attention dimension)
- Supported dtypes: BF16, FP32

Latent Format:
- Channels: 128
- Temporal compression: 8x (length = (frames - 1) // 8 + 1)
- Spatial compression: 32x (height/32, width/32)
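Combining the numbers above, the full latent shape can be derived from the requested output. A minimal sketch (`ltxv_latent_shape` is illustrative, not a ComfyUI API):

```python
def ltxv_latent_shape(frames: int, height: int, width: int) -> tuple:
    # 128 channels, 8x temporal and 32x spatial compression.
    t = (frames - 1) // 8 + 1
    return (128, t, height // 32, width // 32)

# 97 frames at 768x512 -> 13 latent frames of 24x16:
print(ltxv_latent_shape(97, 512, 768))  # (128, 13, 16, 24)
```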
Text Encoder
- Model: T5-XXL (LTXV variant)
- Tokenizer: LTXVT5Tokenizer
Usage
Empty Latent Creation:

Advanced Features

Adding Guide Frames:

LTXAV (Audio-Video)

Architecture:
- Image model: ltxav
- Latent format: LTXAV
- Memory usage factor: 0.077
- Combined audio-video generation
- Use LTXVConcatAVLatent to combine video and audio latents
- Use LTXVSeparateAVLatent to split them back
Hunyuan Video
Architecture
Model Configuration:
- Image model: hunyuan_video
- Shift: 7.0
- Latent format: HunyuanVideo
- Memory usage factor: 1.8
- Supported dtypes: BF16, FP32
Text Encoder
- Model: Llama-based (multilingual)
- Tokenizer: HunyuanVideoTokenizer
- Excellent Chinese and English support
Variants
Hunyuan Video Base:
- Text-to-video generation
- Standard input channels

Image-to-video variant:
- In channels: 33 (includes image conditioning)
- Converts images to videos

Image-to-video variant (alternate):
- In channels: 32
- Specialized I2V variant

Hunyuan Video 1.5:
- Latest version with improved quality
- Hunyuan Video 1.5 Tutorial
Usage
Model Files:

Wan 2.1
Architecture
Model Configuration:
- Image model: wan2.1
- Latent format: Wan21
- Memory usage factor: 0.9 (scales with model dimension)
- Shift: 8.0
- Supported dtypes: FP16, BF16, FP32
Text Encoder
- Model: UMT5-XXL
- Tokenizer: WanT5Tokenizer
Variants
WAN21_T2V (Text-to-Video):
- Model type: t2v
- Standard text-to-video generation

WAN21_I2V (Image-to-Video):
- Model type: i2v
- In dimension: 36
- Image-to-video conversion

Control variant:
- Model type: i2v
- In dimension: 48
- Advanced control features

Camera variant:
- Model type: camera
- In dimension: 32
- Camera motion control

VACE variant:
- Model type: vace
- Memory factor: 1.2x higher
- Video acceleration/editing

HuMo variant:
- Model type: humo
- Human motion generation

Flow-RVS variant:
- Model type: flow_rvs
- Flow-based reverse sampling
- Image-to-video enabled

SCAIL variant:
- Model type: scail
- Scaling and interpolation
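The variants above can be condensed into a small lookup table. A quick-reference sketch (the table mirrors the list above, nothing more):

```python
# Quick-reference table of the Wan 2.1 variants listed above.
WAN21_VARIANTS = [
    {"model_type": "t2v",      "use": "standard text-to-video"},
    {"model_type": "i2v",      "in_dim": 36, "use": "image-to-video conversion"},
    {"model_type": "i2v",      "in_dim": 48, "use": "advanced control features"},
    {"model_type": "camera",   "in_dim": 32, "use": "camera motion control"},
    {"model_type": "vace",     "use": "video acceleration/editing"},
    {"model_type": "humo",     "use": "human motion generation"},
    {"model_type": "flow_rvs", "use": "flow-based reverse sampling"},
    {"model_type": "scail",    "use": "scaling and interpolation"},
]

def variants_for(model_type: str) -> list:
    # Return every variant entry matching the given model type.
    return [v for v in WAN21_VARIANTS if v["model_type"] == model_type]

print(len(variants_for("i2v")))  # 2
```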
Wan 2.2
Architecture Updates
Model Configuration:
- Latent format: Wan22 (48 output channels)
- Additional specialized models

New Variants

WAN22_T2V:
- Out dimension: 48
- Improved output quality
- Image-to-video enabled

Camera variant:
- Model type: camera_2.2
- In dimension: 36
- Enhanced camera control

S2V variant:
- Model type: s2v
- Speech-driven video generation

Animate variant:
- Model type: animate
- Character animation specialized
Cosmos
Cosmos T2V (Text-to-Video)
Architecture:
- Image model: cosmos
- In channels: 16
- Latent format: Cosmos1CV8x8x8
- Memory usage factor: 1.6
- Sigma data: 0.5, max: 80.0, min: 0.002
- Supported dtypes: BF16, FP16, FP32
- Text encoder: T5-XXL (Cosmos variant)
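The sigma parameters above follow the EDM convention. Assuming Cosmos uses the standard Karras-style preconditioning (an assumption, not confirmed by this page), the coefficients work out as:

```python
import math

SIGMA_DATA, SIGMA_MIN, SIGMA_MAX = 0.5, 0.002, 80.0

def edm_precond(sigma: float):
    # Standard EDM preconditioning coefficients; assumes Cosmos follows
    # the usual formulation given its sigma_data/min/max listed above.
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / math.sqrt(sigma**2 + SIGMA_DATA**2)
    c_in = 1.0 / math.sqrt(sigma**2 + SIGMA_DATA**2)
    return c_skip, c_out, c_in

# At high noise the skip connection vanishes; at zero noise it dominates.
print(edm_precond(SIGMA_MAX)[0] < 0.001)      # True
print(abs(edm_precond(0.0)[0] - 1.0) < 1e-9)  # True
```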
Cosmos I2V (Image-to-Video)
Architecture:
- In channels: 17 (includes image)
- Same base architecture as T2V
Cosmos T2I Predict2 (Text-to-Image)
Architecture:
- Image model: cosmos_predict2
- In channels: 16
- Latent format: Wan21
- Memory usage factor: 1.0 (scales with model channels)
- Sigma data: 1.0
- Supported dtypes: BF16, FP16, FP32
Cosmos I2V Predict2
Architecture:
- In channels: 17
- Image-to-video enabled
Anima
Architecture
Model Configuration:
- Image model: anima
- Shift: 3.0, multiplier: 1.0
- Latent format: Wan21
- Memory usage factor: 0.95 (scales with model channels)
- Supported dtypes: BF16, FP16, FP32

Text Encoder:
- Qwen3-0.6B
- Lightweight but effective
- FP16 uses 1.4x more memory than BF16/FP32
Chroma
Architecture
Model Configuration:
- Image model: chroma
- Latent format: Flux
- Memory usage factor: 3.2
- Multiplier: 1.0
- Supported dtypes: BF16, FP16, FP32

Text Encoder:
- T5-XXL (PixArt variant)
Chroma Radiance (Pixel Space)
Architecture:
- Image model: chroma_radiance
- Latent format: ChromaRadiance
- Memory usage factor: 0.044 (extremely efficient)
- No spatial compression - operates on raw RGB patches
- No VAE required
Model Files
Default Locations
File Naming Conventions
Mochi:
- Diffusion model: mochi_preview_dit.safetensors
- VAE: mochi_preview_vae.safetensors
- Text encoder: t5xxl_fp16.safetensors

LTX-Video:
- Combined file or separate components
- VAE: ltxv_vae.safetensors

Hunyuan Video:
- DiT: hunyuan_video_dit.safetensors
- VAE: hunyuan_video_vae.safetensors
- Text encoder: llama3_8b.safetensors
Performance Optimization
Memory Management
For 12GB GPUs:

Resolution Guidelines
SVD:
- Recommended: 1024x576 or 768x768
- Max: 1280x720 (requires more VRAM)

Mochi:
- Recommended: 848x480
- Max: 1280x720 (with VRAM optimization)

LTX-Video:
- Recommended: 768x512 for longer videos
- Max: 1024x768 for shorter clips

Hunyuan Video:
- Recommended: 720x480 or 960x544
- Supports various aspect ratios
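Each model wants dimensions at a specific multiple (16 for Mochi, 32 for LTX-Video's spatial compression, and so on). A small helper sketch for snapping a requested size down to a valid one (`snap_resolution` is illustrative, not a ComfyUI API):

```python
def snap_resolution(width: int, height: int, multiple: int) -> tuple:
    # Round each dimension down to the nearest required multiple.
    return (width // multiple) * multiple, (height // multiple) * multiple

print(snap_resolution(850, 482, 16))  # (848, 480) -- Mochi-friendly
print(snap_resolution(769, 513, 32))  # (768, 512) -- LTX-Video-friendly
```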
Speed Optimization
Temporal Attention:
- Video models benefit from --use-pytorch-cross-attention
- AMD GPUs: Try TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

Frame Count:
- Start with minimum frames (14 for SVD, 25 for Mochi)
- Increase gradually based on VRAM availability

CFG:
- Lower CFG reduces memory usage
- Use specialized guidance nodes for video (VideoLinearCFGGuidance)
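The per-frame CFG idea behind VideoLinearCFGGuidance and VideoTriangleCFGGuidance can be sketched as simple schedules (a simplified illustration of the concept, not the nodes' exact implementation):

```python
def linear_cfg_schedule(min_cfg: float, max_cfg: float, num_frames: int) -> list:
    # Ramp CFG linearly from min_cfg (first frame) to max_cfg (last frame).
    if num_frames == 1:
        return [max_cfg]
    step = (max_cfg - min_cfg) / (num_frames - 1)
    return [min_cfg + i * step for i in range(num_frames)]

def triangle_cfg_schedule(min_cfg: float, max_cfg: float, num_frames: int) -> list:
    # Triangle wave: rise to max_cfg at the midpoint, then fall back to min_cfg.
    if num_frames == 1:
        return [max_cfg]
    half = (num_frames - 1) / 2
    return [min_cfg + (max_cfg - min_cfg) * (1 - abs(i - half) / half)
            for i in range(num_frames)]

print(linear_cfg_schedule(1.0, 3.0, 5))    # [1.0, 1.5, 2.0, 2.5, 3.0]
print(triangle_cfg_schedule(1.0, 3.0, 5))  # [1.0, 2.0, 3.0, 2.0, 1.0]
```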
Advanced Techniques
Video CFG Scheduling
Linear CFG:

Context Windows
- Use ConditioningSetAreaPercentageVideo for temporal conditioning
- Control spatial and temporal regions independently
Batching
- Video models typically use batch_size=1
- Multiple batches possible with sufficient VRAM
- Consider temporal consistency when batching