VRAM Management
ComfyUI includes smart memory management that automatically optimizes VRAM usage. Understanding the different VRAM modes helps you choose the best configuration.
VRAM State Modes
ComfyUI operates in different VRAM states depending on your hardware:
HIGH_VRAM Mode (Default for GPUs with sufficient VRAM):
- Keeps models in VRAM
- Fastest performance
- Recommended for GPUs with 12GB+ VRAM
NORMAL_VRAM Mode (Default for most GPUs):
- Balanced approach
- Moves models between RAM and VRAM as needed
- Works for most configurations
LOW_VRAM Mode:
- Offloads models more aggressively
- Slower but works with limited VRAM
- Recommended for GPUs with 4-8GB VRAM
NO_VRAM Mode:
- Keeps minimal data in VRAM
- Very slow but works with < 2GB VRAM
- Last resort for extremely limited GPUs
GPU_ONLY Mode:
- Never offload to CPU
- Useful when you want to keep everything on GPU
- Requires sufficient VRAM for your workflow
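The modes above map to ComfyUI launch flags. A sketch of typical invocations, assuming the standard `python main.py` launch:

```bash
# Pick at most one VRAM mode flag; with none, ComfyUI chooses automatically.
python main.py --highvram    # keep models in VRAM (12GB+ GPUs)
python main.py --normalvram  # balanced: swaps models between RAM and VRAM
python main.py --lowvram     # aggressive offloading for 4-8GB GPUs
python main.py --novram      # minimal VRAM use (< 2GB GPUs), very slow
python main.py --gpu-only    # never offload to CPU; needs enough VRAM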
VRAM Reservation
Reserve VRAM for other applications to prevent conflicts. Adjust the amount based on your needs; the defaults are:
- Windows Default: 600MB (due to shared memory overhead)
- Linux Default: 400MB
- 16GB+ GPUs: Additional 100MB reserved automatically
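For example, to reserve 2GB of VRAM for other applications (the `--reserve-vram` flag takes a value in gigabytes):

```bash
# Reserve 2GB of VRAM for other applications (value is in GB)
python main.py --reserve-vram 2.0
```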
Memory Pinning
ComfyUI automatically pins memory for faster transfers between RAM and VRAM on NVIDIA and AMD GPUs. Only disable pinning if you experience crashes or memory issues. Automatic pinning limits:
- Windows: 45% of system RAM (OS limit ~50%)
- Linux: 95% of system RAM
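If pinning causes problems, it can be turned off at launch. The exact flag name below is an assumption based on recent ComfyUI builds; verify it with `python main.py --help` on your version:

```bash
# Disable pinned memory if you see crashes or memory issues
# (flag name is an assumption; verify with --help)
python main.py --disable-pinned-memory
```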
Attention Mechanisms
Different attention implementations offer varying performance characteristics.
PyTorch Attention (Recommended for modern GPUs)
PyTorch’s built-in attention is automatically enabled on NVIDIA GPUs with PyTorch 2.0+. Automatically enabled for:
- NVIDIA GPUs with PyTorch 2.0+
- Intel XPU
- Ascend NPU
- Cambricon MLU
Benefits:
- Flash Attention support on compatible hardware
- Memory efficient attention
- Generally faster on modern GPUs
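On platforms where it is not selected automatically, PyTorch attention can be forced at launch:

```bash
# Force PyTorch scaled dot-product attention
python main.py --use-pytorch-cross-attention
```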
xformers Attention
xformers provides memory-efficient attention but has compatibility issues. Warning: avoid xformers 0.0.18; it causes black images at high resolutions. Note: PyTorch attention is now preferred over xformers on most systems.
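Installing and disabling xformers (the install is a standard pip package; the disable flag is ComfyUI's):

```bash
# Install xformers into the same environment as ComfyUI
pip install xformers

# Disable xformers at launch if you experience issues
python main.py --disable-xformers
```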
Split/Quad Cross Attention
Alternative attention modes for specific hardware:
Split Cross Attention:
- Reduces memory usage
- Slower than standard attention
- Useful for very limited VRAM
Quad Cross Attention:
- Even more memory efficient
- Significantly slower
- For extreme VRAM limitations
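Both modes are selected with launch flags:

```bash
# Memory-saving attention variants (pick one)
python main.py --use-split-cross-attention  # lower memory, slower
python main.py --use-quad-cross-attention   # lowest memory, slowest
```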
Precision and Data Types
Choosing the right precision can significantly impact performance and memory usage.
FP16 Optimization (Recommended)
Half-precision (FP16) reduces memory usage and increases speed on modern GPUs. FP16 accumulation (NVIDIA/AMD only) additionally enables FP16 matrix multiplication accumulation for extra speed on supported GPUs. FP16 is automatically used on:
- NVIDIA GPUs with compute capability 8.0+ (RTX 30/40 series)
- AMD GPUs (all with ROCm)
- Apple Silicon (MPS)
- Intel XPU with FP16 support
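FP16 can be forced, and accumulation enabled, at launch. `--force-fp16` is a standard ComfyUI flag; passing `fp16_accumulation` to `--fast` is an assumption based on newer builds, so verify with `--help`:

```bash
# Force FP16 even where it would not be picked automatically
python main.py --force-fp16

# Enable FP16 matmul accumulation (newer builds; verify with --help)
python main.py --fast fp16_accumulation
```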
BF16 Support
BFloat16 offers better numerical stability than FP16 with similar performance, and is selected automatically when supported and beneficial. Automatically used on:
- NVIDIA GPUs with compute capability 8.0+
- AMD RDNA3+ GPUs
- Apple Silicon (macOS 14+)
- Intel XPU with BF16 support
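BF16 can also be requested explicitly for the diffusion model; `--bf16-unet` is the flag in current builds (verify with `--help`):

```bash
# Run the diffusion model (UNet) in BF16
python main.py --bf16-unet
```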
FP8 Precision (Cutting Edge)
FP8 precision offers significant memory savings on compatible hardware. It is used automatically when models are loaded in FP8 format and the GPU supports FP8 compute. Supported on:
- NVIDIA GPUs: RTX 40 series and H100 (compute capability 8.9+)
- AMD GPUs: RDNA3+ with ROCm 6.4+ and PyTorch 2.7+
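FP8 storage for the diffusion model can also be requested explicitly. These flags exist in current ComfyUI builds, though availability varies by version:

```bash
# Store the diffusion model weights in FP8 (pick one format)
python main.py --fp8_e4m3fn-unet
python main.py --fp8_e5m2-unet
```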
Force FP32 (Troubleshooting)
If you experience quality issues, force full precision. This disables all FP16/BF16 optimizations; use it only for debugging, as it is significantly slower and uses more VRAM.
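Full precision is a single launch flag:

```bash
# Force full FP32 precision (debugging only: slow, high VRAM use)
python main.py --force-fp32
```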
Caching and Execution
Output Caching
ComfyUI only re-executes parts of the workflow that change between runs.
Cache Types:
Classic Cache (Default):
- Caches all node outputs
- Best for most workflows
LRU Cache:
- Least Recently Used cache with size limit
- Value = number of cached items
- Good for memory-constrained systems
RAM-Pressure Cache:
- Dynamically adjusts cache based on RAM usage
- Value = RAM usage threshold (0.0-1.0)
No Caching:
- Disables all caching
- Useful for debugging only
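The cache types map to launch flags. `--cache-lru` and `--cache-none` are standard ComfyUI flags; the RAM-pressure flag name (`--cache-ram`) is an assumption based on newer builds, so verify with `--help`:

```bash
# Classic cache is the default (no flag needed)
python main.py --cache-lru 10   # LRU cache holding up to 10 items
python main.py --cache-none     # disable caching (debugging only)
python main.py --cache-ram 0.8  # assumed flag: RAM-usage threshold cache
```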
Smart Memory Management
Enabled by default, smart memory management automatically handles model loading and unloading. Warning: only disable this if you know what you’re doing; smart memory helps prevent OOM errors.
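For manual control, disable it at launch:

```bash
# Disable smart memory management (manual model control)
python main.py --disable-smart-memory
```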
Async Operations
Async Weight Offloading
Asynchronous offloading improves performance by overlapping memory transfers with computation. It is automatically enabled on NVIDIA and AMD GPUs with 2 streams. Benefits:
- Reduces idle time during model loading
- Improves overall throughput
- Most effective with multiple large models
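Stream count and enablement are controlled at launch. The flag names below are assumptions based on recent builds; verify them with `python main.py --help`:

```bash
# Assumed flags (verify with --help on your version):
python main.py --async-offload 4       # use 4 offload streams
python main.py --disable-async-offload # turn async offloading off
```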
CUDA Optimizations (NVIDIA)
Additional NVIDIA-specific optimizations are available. CUDA malloc can be disabled on older GPUs, and fast mode enables multiple optimizations that may improve performance on some models, including:
- FP16 accumulation
- cuDNN auto-tuning (first run slower, subsequent runs faster)
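Both are launch flags:

```bash
# Disable the CUDA malloc allocator (older GPUs)
python main.py --disable-cuda-malloc

# Enable potentially faster, less-tested optimizations
python main.py --fast
```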
Platform-Specific Optimizations
AMD ROCm Optimizations
AMD-specific settings for better performance include enabling AOTriton (RDNA3+) and TunableOp. With TunableOp:
- First run is very slow (builds optimization database)
- Subsequent runs are faster
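Both are typically enabled through environment variables rather than ComfyUI flags. `PYTORCH_TUNABLEOP_ENABLED` is PyTorch's standard ROCm TunableOp switch; the AOTriton variable is the experimental toggle in recent ROCm PyTorch builds:

```bash
# Enable experimental AOTriton attention on RDNA3+ (ROCm PyTorch)
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# Enable TunableOp (first run builds an optimization database, then caches)
export PYTORCH_TUNABLEOP_ENABLED=1
python main.py
```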
Intel XPU Optimizations
Intel Arc GPU optimizations:
IPEX Optimization (enabled by default):
Automatically optimizes models for Intel XPU; disable it if you experience issues.
OneAPI Device Selector:
Selects a specific Intel device when multiple are available.
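The disable flag exists in current ComfyUI builds, and `ONEAPI_DEVICE_SELECTOR` is oneAPI's standard device-selection variable:

```bash
# Disable IPEX model optimization if it causes problems
python main.py --disable-ipex-optimize

# Pin ComfyUI to a specific Intel device (e.g. first Level Zero GPU)
export ONEAPI_DEVICE_SELECTOR=level_zero:0
python main.py
```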
Apple Silicon (MPS) Optimization
macOS-specific settings: MPS is automatically detected and used on Apple Silicon Macs. If you experience issues you can force CPU mode, or run the VAE on the CPU to reduce VRAM usage. Note: non-blocking transfers are disabled on MPS due to PyTorch limitations.
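Both fallbacks are launch flags:

```bash
# Fall back to CPU entirely if MPS misbehaves
python main.py --cpu

# Keep the VAE on the CPU to reduce unified-memory pressure
python main.py --cpu-vae
```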
Workflow Optimization Tips
Best Practices for Fast Workflows
- Minimize dynamic changes: ComfyUI only re-executes changed nodes. Keep static parts of your workflow unchanged.
- Use appropriate image sizes: Larger images require exponentially more VRAM and time.
- Batch processing: Process multiple images in a single batch when possible.
- Preview settings: Use low-resolution previews during development.
- Model selection: Smaller models (SD 1.5) are faster than larger ones (SDXL, SD3).
- LoRA usage: Multiple LoRAs increase memory usage and loading time.
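The preview method is set at launch with `--preview-method`:

```bash
# Fast, low-quality latent previews during development
python main.py --preview-method latent2rgb

# Let ComfyUI pick the best available preview method
python main.py --preview-method auto
```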
Monitoring and Diagnostics
Performance Monitoring
Enable verbose logging to see detailed performance information. Check VRAM usage:
- ComfyUI logs total VRAM and RAM at startup
- Model loading/unloading is logged
- Performance warnings appear in console
Deterministic Mode
Forcing deterministic algorithms trades speed for reproducibility:
- Ensures reproducible outputs
- Slightly slower
- Disables some optimizations
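Verbose logging and deterministic execution are both launch flags:

```bash
# Detailed logging of model loads/unloads and memory use
python main.py --verbose

# Deterministic algorithms for reproducible outputs (slower)
python main.py --deterministic
```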