VRAM Management
ComfyUI includes smart memory management that automatically optimizes VRAM usage. Understanding the different VRAM modes helps you choose the best configuration.
VRAM State Modes
ComfyUI operates in different VRAM states depending on your hardware:
HIGH_VRAM Mode (Default for GPUs with sufficient VRAM):
- Keeps models in VRAM
- Fastest performance
- Recommended for GPUs with 12GB+ VRAM
NORMAL_VRAM Mode (Default for most GPUs):
- Balanced approach
- Moves models between RAM and VRAM as needed
- Works for most configurations
LOW_VRAM Mode:
- Offloads models more aggressively
- Slower but works with limited VRAM
- Recommended for GPUs with 4-8GB VRAM
NO_VRAM Mode:
- Keeps minimal data in VRAM
- Very slow but works with < 2GB VRAM
- Last resort for extremely limited GPUs
GPU_ONLY Mode:
- Never offload to CPU
- Useful when you want to keep everything on GPU
- Requires sufficient VRAM for your workflow
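The modes above map to ComfyUI launch flags. A sketch of typical invocations, assuming the standard `python main.py` launch:

```bash
# Pick at most one VRAM mode flag; with none, ComfyUI chooses automatically.
python main.py --highvram    # keep models in VRAM (12GB+ GPUs)
python main.py --normalvram  # balanced: swaps models between RAM and VRAM
python main.py --lowvram     # aggressive offloading for 4-8GB GPUs
python main.py --novram      # minimal VRAM use (< 2GB GPUs), very slow
python main.py --gpu-only    # never offload to CPU; needs enough VRAM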
VRAM Reservation
Reserve VRAM for other applications to prevent conflicts. Adjust the amount based on your needs; the defaults are:
- Windows Default: 600MB (due to shared memory overhead)
- Linux Default: 400MB
- 16GB+ GPUs: Additional 100MB reserved automatically
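For example, to reserve 2GB of VRAM for other applications (the `--reserve-vram` flag takes a value in gigabytes):

```bash
# Reserve 2GB of VRAM for other applications (value is in GB)
python main.py --reserve-vram 2.0
```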
Memory Pinning
ComfyUI automatically pins memory for faster transfers between RAM and VRAM on NVIDIA and AMD GPUs. Only disable pinning if you experience crashes or memory issues. Automatic pinning limits:
- Windows: 45% of system RAM (OS limit ~50%)
- Linux: 95% of system RAM
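If pinning causes problems, it can be turned off at launch. The exact flag name below is an assumption based on recent ComfyUI builds; verify it with `python main.py --help` on your version:

```bash
# Disable pinned memory if you see crashes or memory issues
# (flag name is an assumption; verify with --help)
python main.py --disable-pinned-memory
```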
Attention Mechanisms
Different attention implementations offer varying performance characteristics.
PyTorch Attention (Recommended for modern GPUs)
PyTorch’s built-in attention is automatically enabled on NVIDIA GPUs with PyTorch 2.0+. Automatically enabled for:
- NVIDIA GPUs with PyTorch 2.0+
- Intel XPU
- Ascend NPU
- Cambricon MLU
Benefits:
- Flash Attention support on compatible hardware
- Memory efficient attention
- Generally faster on modern GPUs
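On platforms where it is not selected automatically, PyTorch attention can be forced at launch:

```bash
# Force PyTorch scaled dot-product attention
python main.py --use-pytorch-cross-attention
```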
xformers Attention
xformers provides memory-efficient attention but has compatibility issues. Warning: avoid xformers 0.0.18; it causes black images at high resolutions. Note: PyTorch attention is now preferred over xformers on most systems.
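Installing and disabling xformers (the install is a standard pip package; the disable flag is ComfyUI's):

```bash
# Install xformers into the same environment as ComfyUI
pip install xformers

# Disable xformers at launch if you experience issues
python main.py --disable-xformers
```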
Split/Quad Cross Attention
Alternative attention modes for specific hardware:
Split Cross Attention:
- Reduces memory usage
- Slower than standard attention
- Useful for very limited VRAM
Quad Cross Attention:
- Even more memory efficient
- Significantly slower
- For extreme VRAM limitations
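Both modes are selected with launch flags:

```bash
# Memory-saving attention variants (pick one)
python main.py --use-split-cross-attention  # lower memory, slower
python main.py --use-quad-cross-attention   # lowest memory, slowest
```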
Precision and Data Types
Choosing the right precision can significantly impact performance and memory usage.
FP16 Optimization (Recommended)
Half-precision (FP16) reduces memory usage and increases speed on modern GPUs. FP16 accumulation (NVIDIA/AMD only) additionally enables FP16 matrix multiplication accumulation for extra speed on supported GPUs. FP16 is automatically used on:
- NVIDIA GPUs with compute capability 8.0+ (RTX 30/40 series)
- AMD GPUs (all with ROCm)
- Apple Silicon (MPS)
- Intel XPU with FP16 support
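FP16 can be forced, and accumulation enabled, at launch. `--force-fp16` is a standard ComfyUI flag; passing `fp16_accumulation` to `--fast` is an assumption based on newer builds, so verify with `--help`:

```bash
# Force FP16 even where it would not be picked automatically
python main.py --force-fp16

# Enable FP16 matmul accumulation (newer builds; verify with --help)
python main.py --fast fp16_accumulation
```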
BF16 Support
BFloat16 offers better numerical stability than FP16 with similar performance, and is selected automatically when supported and beneficial. Automatically used on:
- NVIDIA GPUs with compute capability 8.0+
- AMD RDNA3+ GPUs
- Apple Silicon (macOS 14+)
- Intel XPU with BF16 support
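BF16 can also be requested explicitly for the diffusion model; `--bf16-unet` is the flag in current builds (verify with `--help`):

```bash
# Run the diffusion model (UNet) in BF16
python main.py --bf16-unet
```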
FP8 Precision (Cutting Edge)
FP8 precision offers significant memory savings on compatible hardware. It is used automatically when models are loaded in FP8 format and the GPU supports FP8 compute. Supported on:
- NVIDIA GPUs: RTX 40 series and H100 (compute capability 8.9+)
- AMD GPUs: RDNA3+ with ROCm 6.4+ and PyTorch 2.7+
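FP8 storage for the diffusion model can also be requested explicitly. These flags exist in current ComfyUI builds, though availability varies by version:

```bash
# Store the diffusion model weights in FP8 (pick one format)
python main.py --fp8_e4m3fn-unet
python main.py --fp8_e5m2-unet
```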
Force FP32 (Troubleshooting)
If you experience quality issues, force full precision. This disables all FP16/BF16 optimizations; use it only for debugging, as it is significantly slower and uses more VRAM.
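Full precision is a single launch flag:

```bash
# Force full FP32 precision (debugging only: slow, high VRAM use)
python main.py --force-fp32
```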
Caching and Execution
Output Caching
ComfyUI only re-executes parts of the workflow that change between runs.
Cache Types:
Classic Cache (Default):
- Caches all node outputs
- Best for most workflows
LRU Cache:
- Least Recently Used cache with size limit
- Value = number of cached items
- Good for memory-constrained systems
RAM-Pressure Cache:
- Dynamically adjusts cache based on RAM usage
- Value = RAM usage threshold (0.0-1.0)
No Caching:
- Disables all caching
- Useful for debugging only
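The cache types map to launch flags. `--cache-lru` and `--cache-none` are standard ComfyUI flags; the RAM-pressure flag name (`--cache-ram`) is an assumption based on newer builds, so verify with `--help`:

```bash
# Classic cache is the default (no flag needed)
python main.py --cache-lru 10   # LRU cache holding up to 10 items
python main.py --cache-none     # disable caching (debugging only)
python main.py --cache-ram 0.8  # assumed flag: RAM-usage threshold cache
```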
Smart Memory Management
Enabled by default, smart memory management automatically handles model loading and unloading. Warning: only disable this if you know what you’re doing; smart memory helps prevent OOM errors.
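For manual control, disable it at launch:

```bash
# Disable smart memory management (manual model control)
python main.py --disable-smart-memory
```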
Async Operations
Async Weight Offloading
Asynchronous offloading improves performance by overlapping memory transfers with computation. It is automatically enabled on NVIDIA and AMD GPUs with 2 streams. Benefits:
- Reduces idle time during model loading
- Improves overall throughput
- Most effective with multiple large models
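Stream count and enablement are controlled at launch. The flag names below are assumptions based on recent builds; verify them with `python main.py --help`:

```bash
# Assumed flags (verify with --help on your version):
python main.py --async-offload 4       # use 4 offload streams
python main.py --disable-async-offload # turn async offloading off
```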
CUDA Optimizations (NVIDIA)
Additional NVIDIA-specific optimizations are available. CUDA malloc can be disabled on older GPUs, and fast mode enables multiple optimizations that may improve performance on some models, including:
- FP16 accumulation
- cuDNN auto-tuning (first run slower, subsequent runs faster)
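Both are launch flags:

```bash
# Disable the CUDA malloc allocator (older GPUs)
python main.py --disable-cuda-malloc

# Enable potentially faster, less-tested optimizations
python main.py --fast
```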
Platform-Specific Optimizations
AMD ROCm Optimizations
AMD-specific settings for better performance include enabling AOTriton (RDNA3+) and TunableOp. With TunableOp:
- First run is very slow (builds optimization database)
- Subsequent runs are faster
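Both are typically enabled through environment variables rather than ComfyUI flags. `PYTORCH_TUNABLEOP_ENABLED` is PyTorch's standard ROCm TunableOp switch; the AOTriton variable is the experimental toggle in recent ROCm PyTorch builds:

```bash
# Enable experimental AOTriton attention on RDNA3+ (ROCm PyTorch)
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# Enable TunableOp (first run builds an optimization database, then caches)
export PYTORCH_TUNABLEOP_ENABLED=1
python main.py
```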
Intel XPU Optimizations
Intel Arc GPU optimizations:
IPEX Optimization (enabled by default):
Automatically optimizes models for Intel XPU; disable it if you experience issues.
OneAPI Device Selector:
Selects a specific Intel device when multiple are available.
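The disable flag exists in current ComfyUI builds, and `ONEAPI_DEVICE_SELECTOR` is oneAPI's standard device-selection variable:

```bash
# Disable IPEX model optimization if it causes problems
python main.py --disable-ipex-optimize

# Pin ComfyUI to a specific Intel device (e.g. first Level Zero GPU)
export ONEAPI_DEVICE_SELECTOR=level_zero:0
python main.py
```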
Apple Silicon (MPS) Optimization
macOS-specific settings: MPS is automatically detected and used on Apple Silicon Macs. If you experience issues you can force CPU mode, or run the VAE on the CPU to reduce VRAM usage. Note: non-blocking transfers are disabled on MPS due to PyTorch limitations.
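Both fallbacks are launch flags:

```bash
# Fall back to CPU entirely if MPS misbehaves
python main.py --cpu

# Keep the VAE on the CPU to reduce unified-memory pressure
python main.py --cpu-vae
```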
Workflow Optimization Tips
Best Practices for Fast Workflows
- Minimize dynamic changes: ComfyUI only re-executes changed nodes. Keep static parts of your workflow unchanged.
- Use appropriate image sizes: Larger images require exponentially more VRAM and time.
- Batch processing: Process multiple images in a single batch when possible.
- Preview settings: Use low-resolution previews during development.
- Model selection: Smaller models (SD 1.5) are faster than larger ones (SDXL, SD3).
- LoRA usage: Multiple LoRAs increase memory usage and loading time.
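The preview method is set at launch with `--preview-method`:

```bash
# Fast, low-quality latent previews during development
python main.py --preview-method latent2rgb

# Let ComfyUI pick the best available preview method
python main.py --preview-method auto
```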
Monitoring and Diagnostics
Performance Monitoring
Enable verbose logging to see detailed performance information. Check VRAM usage:
- ComfyUI logs total VRAM and RAM at startup
- Model loading/unloading is logged
- Performance warnings appear in console
Deterministic Mode
Forcing deterministic algorithms trades speed for reproducibility:
- Ensures reproducible outputs
- Slightly slower
- Disables some optimizations
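Verbose logging and deterministic execution are both launch flags:

```bash
# Detailed logging of model loads/unloads and memory use
python main.py --verbose

# Deterministic algorithms for reproducible outputs (slower)
python main.py --deterministic
```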