
Overview

ComfyUI includes a sophisticated memory management system that automatically handles VRAM allocation, model loading/unloading, and memory optimization across different hardware configurations.

VRAM States

ComfyUI adapts its behavior based on available VRAM through the VRAMState enum:
class VRAMState(Enum):
    DISABLED = 0    # No VRAM present: no need to move models to VRAM
    NO_VRAM = 1     # Very low VRAM: enable all options to save VRAM
    LOW_VRAM = 2    # Limited VRAM: selective model loading
    NORMAL_VRAM = 3 # Standard operation
    HIGH_VRAM = 4   # Keep models in VRAM
    SHARED = 5      # Shared CPU/GPU memory (e.g., integrated graphics)

State Selection

The VRAM state is automatically detected based on:
  • Total VRAM available
  • Command-line arguments (--lowvram, --highvram, --novram, --gpu-only)
  • Hardware type (CPU, CUDA, MPS, XPU, etc.)
python main.py --lowvram
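
The precedence among these inputs can be sketched as follows. This is a simplified stand-in for the detection code, not ComfyUI's implementation; the flag names and the 4 GB threshold are illustrative:

```python
from enum import Enum

class VRAMState(Enum):
    DISABLED = 0
    NO_VRAM = 1
    LOW_VRAM = 2
    NORMAL_VRAM = 3
    HIGH_VRAM = 4
    SHARED = 5

def pick_vram_state(total_vram_mb, lowvram=False, highvram=False,
                    novram=False, cpu_only=False, shared_memory=False):
    """Explicit flags win; otherwise fall back to hardware traits.
    The 4 GB cutoff below is illustrative, not ComfyUI's exact value."""
    if cpu_only:
        return VRAMState.DISABLED   # no VRAM to manage at all
    if shared_memory:
        return VRAMState.SHARED     # integrated GPU sharing system RAM
    if novram:
        return VRAMState.NO_VRAM
    if lowvram:
        return VRAMState.LOW_VRAM
    if highvram:
        return VRAMState.HIGH_VRAM
    if total_vram_mb <= 4096:       # small card: drop to low-VRAM behavior
        return VRAMState.LOW_VRAM
    return VRAMState.NORMAL_VRAM
```

Note that command-line flags always override the size-based heuristic, which is why `--lowvram` works even on large cards.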

Device Management

Supported Devices

ComfyUI supports multiple compute backends. On NVIDIA hardware, get_torch_device() resolves to the current CUDA device:
def get_torch_device():
    return torch.device(torch.cuda.current_device())
CUDA is the primary backend for NVIDIA GPUs, with full feature support.

Getting Memory Information

from comfy import model_management

# Get total memory
total_vram = model_management.get_total_memory()
print(f"Total VRAM: {total_vram / (1024**3):.2f} GB")

# Get free memory
free_vram = model_management.get_free_memory()
print(f"Free VRAM: {free_vram / (1024**3):.2f} GB")

# Get both overall free memory and the portion held by PyTorch
free_total, free_torch = model_management.get_free_memory(
    device, 
    torch_free_too=True
)
Memory reporting differs between hardware backends: CUDA uses torch.cuda.mem_get_info(), while XPU and NPU query their respective APIs.
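
As a hedged illustration of how these numbers might be used, the helper below (not part of the ComfyUI API; names are hypothetical) checks whether a model fits in the currently free VRAM while holding back a reserve:

```python
def fits_in_free_vram(model_bytes, free_bytes, reserved_bytes=400 * 1024 * 1024):
    """Return True if the model fits in free VRAM after the reserve is held back."""
    return model_bytes <= max(0, free_bytes - reserved_bytes)

# 5 GB model, 8 GB free, default 400 MB reserve: fits
print(fits_in_free_vram(5 * 1024**3, 8 * 1024**3))  # True
```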

Model Loading System

LoadedModel Class

The LoadedModel class tracks loaded models and their memory usage:
class LoadedModel:
    def __init__(self, model):
        self.model = model
        self.device = model.load_device
        self.real_model = None
        self.currently_used = True
        
    def model_memory(self):
        """Total model size"""
        return self.model.model_size()
    
    def model_loaded_memory(self):
        """Currently loaded portion"""
        return self.model.loaded_size()
    
    def model_offloaded_memory(self):
        """Portion offloaded to CPU/disk"""
        return self.model.model_size() - self.model.loaded_size()
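
A minimal sketch of how this accounting fits together, using a stand-in model object (DummyModel is hypothetical, for illustration only; it just exposes the size-reporting interface the class above expects):

```python
class DummyModel:
    """Stand-in exposing the size-reporting interface LoadedModel relies on."""
    load_device = "cuda:0"

    def model_size(self):
        return 6 * 1024**3       # 6 GB total

    def loaded_size(self):
        return 2 * 1024**3       # 2 GB currently resident on the GPU

m = DummyModel()
offloaded = m.model_size() - m.loaded_size()
print(offloaded // 1024**3)      # 4 -> 4 GB still offloaded to CPU/disk
```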

Loading Models to GPU

The load_models_gpu() function orchestrates model loading:
def load_models_gpu(
    models, 
    memory_required=0, 
    force_patch_weights=False,
    minimum_memory_required=None, 
    force_full_load=False
):
    # 1. Calculate total memory required
    # 2. Free memory if needed
    # 3. Load models with partial loading if necessary
    # 4. Track loaded models
    pass
The function proceeds in four steps:
  1. Memory Calculation: calculate total VRAM and RAM requirements for all models
  2. Free Memory: unload existing models if there is insufficient memory
  3. Partial Loading: load only essential parts of models in low VRAM scenarios
  4. Track Models: add loaded models to the current_loaded_models list

Partial Model Loading (LowVRAM)

In low VRAM scenarios, ComfyUI loads only portions of models:
if vram_set_state == VRAMState.LOW_VRAM:
    loaded_memory = loaded_model.model_loaded_memory()
    current_free_mem = get_free_memory(torch_dev) + loaded_memory
    
    # Calculate how much of the model to load
    lowvram_model_memory = max(
        0,
        (current_free_mem - minimum_memory_required),
        min(
            current_free_mem * MIN_WEIGHT_MEMORY_RATIO,
            current_free_mem - minimum_inference_memory()
        )
    )
    
    loaded_model.model_load(lowvram_model_memory)
MIN_WEIGHT_MEMORY_RATIO is set to 0.0 for NVIDIA (allowing more aggressive offloading) and 0.4 for other devices.
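
Plugging illustrative numbers into the formula above makes the behavior concrete (the values are examples, not measurements):

```python
MIN_WEIGHT_MEMORY_RATIO = 0.4            # non-NVIDIA default
GB = 1024**3

def minimum_inference_memory():
    # 0.8 GB baseline plus the 400 MB base reserve described below
    return int(0.8 * GB) + 400 * 1024 * 1024

current_free_mem = 6 * GB                # 6 GB free right now
minimum_memory_required = 2 * GB         # inference needs 2 GB of headroom

lowvram_model_memory = max(
    0,
    current_free_mem - minimum_memory_required,
    min(
        current_free_mem * MIN_WEIGHT_MEMORY_RATIO,
        current_free_mem - minimum_inference_memory(),
    ),
)
print(lowvram_model_memory / GB)  # 4.0 -> load about 4 GB of weights
```

Here the dominant term is the free memory minus the required headroom, so roughly 4 GB of weights would be loaded and the rest kept offloaded.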

Memory Freeing Strategy

The free_memory() function intelligently unloads models:
def free_memory(
    memory_required, 
    device, 
    keep_loaded=[], 
    for_dynamic=False, 
    ram_required=0
):
    # 1. Garbage collect
    cleanup_models_gc()
    
    # 2. Find candidate models to unload
    can_unload = []
    for shift_model in current_loaded_models:
        if shift_model not in keep_loaded:
            can_unload.append(shift_model)
    
    # 3. Unload models with the most already-offloaded memory first
    can_unload.sort(key=lambda m: m.model_offloaded_memory(), reverse=True)
    for model in can_unload:
        if get_free_memory(device) < memory_required:
            model.model_unload()
    
    # 4. Clear cache
    soft_empty_cache()

Unloading Priority

Models are prioritized for unloading based on:
  1. Offloaded memory (models with more already offloaded)
  2. Reference count (fewer references = unload first)
  3. Total memory (larger models considered)
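
This priority order can be expressed as a sort key. The sketch below is a simplified stand-in for the real ordering logic; the Candidate class and its fields are illustrative:

```python
def unload_priority(model):
    """Sort descending by offloaded memory, ascending by reference count,
    then descending by total size; front of the list unloads first."""
    return (-model.offloaded, model.refs, -model.total)

class Candidate:
    def __init__(self, name, offloaded, refs, total):
        self.name, self.offloaded, self.refs, self.total = name, offloaded, refs, total

models = [
    Candidate("unet", offloaded=0, refs=3, total=6),
    Candidate("clip", offloaded=2, refs=1, total=2),
    Candidate("vae", offloaded=1, refs=1, total=1),
]
order = [m.name for m in sorted(models, key=unload_priority)]
print(order)  # ['clip', 'vae', 'unet']
```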

Reserved Memory

ComfyUI reserves memory for system operations:
EXTRA_RESERVED_VRAM = 400 * 1024 * 1024  # 400 MB base

if WINDOWS:
    EXTRA_RESERVED_VRAM = 600 * 1024 * 1024  # 600 MB on Windows
    if total_vram > (15 * 1024):  # 16GB+ cards
        EXTRA_RESERVED_VRAM += 100 * 1024 * 1024

def minimum_inference_memory():
    return (1024 * 1024 * 1024) * 0.8 + extra_reserved_memory()
Windows requires more reserved VRAM due to shared memory management. You can override this with --reserve-vram argument.
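
For example, on a Windows machine with a 16 GB card, the reserve works out as follows (mirroring the constants above; total_vram is in MB, matching the comparison in the snippet):

```python
MB = 1024 * 1024
total_vram = 16 * 1024          # MB, i.e. a 16 GB card

extra_reserved = 600 * MB       # Windows base reserve
if total_vram > 15 * 1024:      # 16 GB+ card gets an extra 100 MB
    extra_reserved += 100 * MB

minimum_inference = int(1024 * MB * 0.8) + extra_reserved
print(round(minimum_inference / MB, 1))  # 1519.2 -> ~1.5 GB held back
```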

Async Weight Offloading

ComfyUI supports asynchronous weight offloading with CUDA/XPU streams:
NUM_STREAMS = 2  # Default for NVIDIA and AMD
STREAM_COUNTERS = {}  # per-device round-robin position

if args.async_offload is not None:
    NUM_STREAMS = args.async_offload

def get_offload_stream(device):
    if NUM_STREAMS == 0:
        return None
    
    if device in STREAMS:
        ss = STREAMS[device]
        c = STREAM_COUNTERS.get(device, 0)
        # Make the chosen stream wait for work queued on the current stream
        ss[c].wait_stream(current_stream(device))
        STREAM_COUNTERS[device] = (c + 1) % len(ss)
        return ss[c]
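
The round-robin selection itself can be demonstrated independently of CUDA; this sketch models only the counter bookkeeping, with plain integers standing in for stream objects:

```python
NUM_STREAMS = 2
STREAM_COUNTERS = {}

def next_stream_index(device):
    """Advance the per-device counter and return the slot to use this time."""
    i = STREAM_COUNTERS.get(device, 0)
    STREAM_COUNTERS[device] = (i + 1) % NUM_STREAMS
    return i

picks = [next_stream_index("cuda:0") for _ in range(4)]
print(picks)  # [0, 1, 0, 1] -> transfers alternate between the two streams
```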

Cast Buffers

To reduce memory allocations, ComfyUI maintains reusable cast buffers:
STREAM_CAST_BUFFERS = {}

def get_cast_buffer(offload_stream, device, size, ref):
    cast_buffer = STREAM_CAST_BUFFERS.get(offload_stream, None)
    if cast_buffer is None or cast_buffer.numel() < size:
        # Allocate (or grow) the buffer on the offload stream
        with torch.cuda.stream(offload_stream):
            cast_buffer = torch.empty((size,), dtype=torch.int8, device=device)
            STREAM_CAST_BUFFERS[offload_stream] = cast_buffer
    return cast_buffer

Pinned Memory

For faster CPU-GPU transfers, ComfyUI pins CPU memory:
MAX_PINNED_MEMORY = get_total_memory(torch.device("cpu")) * 0.95

def pin_memory(tensor):
    global TOTAL_PINNED_MEMORY
    if not is_device_cpu(tensor.device):
        return False
    
    size = tensor.nbytes
    if (TOTAL_PINNED_MEMORY + size) > MAX_PINNED_MEMORY:
        return False
    
    ptr = tensor.data_ptr()
    # cudaHostRegister returns 0 (cudaSuccess) when pinning succeeds
    if torch.cuda.cudart().cudaHostRegister(ptr, size, 1) == 0:
        PINNED_MEMORY[ptr] = size
        TOTAL_PINNED_MEMORY += size
        return True
    return False
Pinned memory enables faster DMA transfers but is limited to ~95% of system RAM on Linux and ~45% on Windows.
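
The budget check reduces to simple bookkeeping; the sketch below models it without touching CUDA (the 32 GB system RAM figure is an example):

```python
MAX_PINNED = int(32 * 1024**3 * 0.95)   # e.g. 32 GB system RAM, 95% cap
total_pinned = 0

def try_pin(size):
    """Reserve `size` bytes of pinned budget; refuse once the cap is hit."""
    global total_pinned
    if total_pinned + size > MAX_PINNED:
        return False
    total_pinned += size
    return True

print(try_pin(20 * 1024**3))  # True: within budget
print(try_pin(20 * 1024**3))  # False: would exceed the 95% cap
```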

Device Placement Functions

ComfyUI provides specialized functions for different model components:

UNet/Diffusion Models

def unet_offload_device():
    if vram_state == VRAMState.HIGH_VRAM:
        return get_torch_device()
    else:
        return torch.device("cpu")

def unet_inital_load_device(parameters, dtype):
    torch_dev = get_torch_device()
    if vram_state == VRAMState.HIGH_VRAM or vram_state == VRAMState.SHARED:
        return torch_dev
    
    model_size = dtype_size(dtype) * parameters
    if model_size < get_free_memory(torch_dev):
        return torch_dev
    return torch.device("cpu")
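
A worked example of the size check above. The dtype_size helper and the parameter count are illustrative assumptions (2 bytes per fp16 element, a ~3.5B-parameter model):

```python
def dtype_size(dtype):
    # bytes per element for a few common dtypes (illustrative subset)
    return {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}[dtype]

parameters = 3_500_000_000            # ~3.5B-parameter diffusion model (example)
model_size = dtype_size("fp16") * parameters  # 7 GB of weights
free_vram = 12 * 1024**3              # 12 GB currently free (example)

load_on_gpu = model_size < free_vram
print(load_on_gpu)  # True: the whole model starts on the GPU
```

If free VRAM were below 7 GB here, the model would instead start on the CPU and be moved over (possibly partially) at load time.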

Text Encoders

def text_encoder_device():
    if args.gpu_only:
        return get_torch_device()
    elif vram_state == VRAMState.HIGH_VRAM or vram_state == VRAMState.NORMAL_VRAM:
        if should_use_fp16(prioritize_performance=False):
            return get_torch_device()
    return torch.device("cpu")

VAE

def vae_device():
    if args.cpu_vae:
        return torch.device("cpu")
    return get_torch_device()

def vae_offload_device():
    if args.gpu_only:
        return get_torch_device()
    return torch.device("cpu")

Garbage Collection

ComfyUI monitors for memory leaks and triggers garbage collection:
def cleanup_models_gc():
    do_gc = False
    reset_cast_buffers()
    
    for cur in current_loaded_models:
        if cur.is_dead():
            logging.info(
                f"Potential memory leak detected with model {cur.real_model().__class__.__name__}"
            )
            do_gc = True
            break
    
    if do_gc:
        gc.collect()
        soft_empty_cache()
If you see memory leak warnings, check for circular references in your custom nodes or models.
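
The "dead model" check relies on weak references: if the only thing still pointing at a model is ComfyUI's tracking entry, the weak reference goes dead. A minimal sketch of the idea (not the ComfyUI implementation):

```python
import gc
import weakref

class Model:
    pass

m = Model()
ref = weakref.ref(m)          # tracking code holds something like this
print(ref() is not None)      # True: the model is still alive

del m                         # last strong reference dropped
gc.collect()
print(ref() is None)          # True: an is_dead()-style check now fires
```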

Command-Line Options

  • --lowvram (boolean): enable low VRAM mode for GPUs with limited memory
  • --highvram (boolean): keep models in VRAM (disables offloading)
  • --novram (boolean): minimal VRAM usage (very slow)
  • --gpu-only (boolean): never offload models to CPU
  • --reserve-vram (float): reserve N GB of VRAM for other applications
  • --disable-smart-memory (boolean): disable intelligent memory management
  • --disable-async-offload (boolean): disable async weight offloading
  • --async-offload (integer): number of streams for async offloading (default: 2)
  • --disable-pinned-memory (boolean): disable pinned memory for CPU-GPU transfers

Best Practices

  • Monitor VRAM usage: use get_free_memory() to track available VRAM during execution
  • Avoid memory leaks: ensure proper cleanup in custom nodes by using weak references
  • Use the appropriate VRAM state: let ComfyUI auto-detect, or set it manually to match your workflow
  • Enable async offload: keep async offloading enabled for better performance on NVIDIA/AMD
  • Quantize where possible: reduce model memory footprint with FP8/FP16 quantization
  • Build memory-efficient custom nodes
