
Overview

Quantization in ComfyUI reduces model memory footprint and increases throughput by converting high-precision weights to lower-precision formats such as FP8. It is implemented through a QuantizedTensor subclass of torch.Tensor combined with layout-based quantization strategies.

How Quantization Works

Quantization maps high-precision values (FP16/FP32) to lower precision formats with minimal accuracy loss. The key challenge is managing dynamic range differences:
  • FP16 range: (-65,504, 65,504)
  • FP8 E4M3 range: (-448, 448)
  • FP8 E5M2 range: (-57,344, 57,344)

Scaling Factor Method

ComfyUI uses per-tensor absolute-maximum scaling to map values into the quantized format’s range:
amax = tensor.abs().max()
scale = amax / torch.finfo(low_precision_dtype).max  # e.g. 448 for float8_e4m3fn

# Quantization
tensor_q = (tensor / scale).to(low_precision_dtype)

# De-quantization
tensor_dq = tensor_q.to(torch.float16) * scale
The scaling factor is additional metadata needed to interpret quantized values, making these “derived datatypes”.

Architecture

ComfyUI’s quantization system uses a three-layer architecture:
QuantizedTensor (torch.Tensor subclass)
  ↓ __torch_dispatch__
Two-Level Registry (generic + layout handlers)
  ↓
MixedPrecisionOps + Metadata Detection

QuantizedTensor

The QuantizedTensor class (defined in comfy/quant_ops.py) is a subclass of torch.Tensor that represents quantized data with associated metadata.

Layout Classes

A Layout class defines how a specific quantization format behaves:
from comfy.quant_ops import QuantizedLayout

class MyLayout(QuantizedLayout):
    @classmethod
    def quantize(cls, tensor, **kwargs):
        # Convert to quantized format
        qdata = ...
        params = {'scale': ..., 'orig_dtype': tensor.dtype}
        return qdata, params
    
    @staticmethod
    def dequantize(qdata, scale, orig_dtype, **kwargs):
        return qdata.to(orig_dtype) * scale

Built-in Quantization Formats

ComfyUI supports several quantization formats defined in QUANT_ALGOS:
"float8_e4m3fn": {
    "storage_t": torch.float8_e4m3fn,
    "parameters": {"weight_scale", "input_scale"},
    "comfy_tensor_layout": "TensorCoreFP8E4M3Layout",
}

TensorCore FP8 Layouts

FP8 E4M3 (4-bit exponent, 3-bit mantissa) provides better precision than E5M2 for most deep learning workloads:
# defined in comfy/quant_ops.py
class TensorCoreFP8E4M3Layout(_TensorCoreFP8LayoutBase):
    FP8_DTYPE = torch.float8_e4m3fn
Range: -448 to 448

Operation Registry

ComfyUI uses two registry systems for quantized operations:

Generic Registry

Handles operations common to all quantized formats (.to(), .clone(), .reshape()).

Layout-Specific Registry

Allows fast-path implementations for specific operations:
from comfy.quant_ops import register_layout_op

@register_layout_op(torch.ops.aten.linear.default, MyLayout)
def my_linear(func, args, kwargs):
    # Extract tensors, call optimized kernel
    ...
When torch.nn.functional.linear() is called with QuantizedTensor arguments, __torch_dispatch__ automatically routes to the registered implementation.

Fallback Behavior

For any unsupported operation, QuantizedTensor automatically:
  1. Calls dequantize to convert back to high precision
  2. Dispatches using the high-precision implementation
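The two-level lookup with its dequantize fallback can be sketched as follows (illustrative names; the real registries are in comfy/quant_ops.py):

```python
_LAYOUT_OPS = {}  # (op, layout class) -> fast-path handler

def register_layout_op(op, layout_cls):
    def decorator(fn):
        _LAYOUT_OPS[(op, layout_cls)] = fn
        return fn
    return decorator

def dispatch_op(op, layout_cls, args, dequantize):
    handler = _LAYOUT_OPS.get((op, layout_cls))
    if handler is not None:
        return handler(*args)                       # layout-specific fast path
    return op(*[dequantize(a) for a in args])       # generic dequantize fallback
```

An op registered for a layout takes the fast path; any other op falls back to dequantizing its arguments and running the high-precision implementation.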

Mixed Precision Operations

The MixedPrecisionOps class enables per-layer quantization decisions, allowing different layers to use different precisions.
class MixedPrecisionOps(disable_weight_init):
    _layer_quant_config = {}  # Maps layer names to quantization configs
    _compute_dtype = torch.bfloat16  # Default compute / dequantize precision

How It Works

During model loading, the custom Linear._load_from_state_dict() method inspects each layer:
  1. Not in config: Load weight as regular tensor in _compute_dtype
  2. In config: Load weight as QuantizedTensor with the specified layout (e.g., TensorCoreFP8E4M3Layout)
Not all layers tolerate quantization equally. Sensitive operations like final projections should be kept in higher precision, while compute-heavy matmuls can be quantized.

Checkpoint Format

Quantized checkpoints are stored as standard safetensors files with:
  • Quantized weight tensors (sometimes using different storage datatype, e.g., uint8 for FP8)
  • Associated scaling parameters
  • _quantization_metadata JSON entry

Scaling Parameters

ComfyUI defines 4 possible scaling parameters:
  • weight_scale (tensor): quantization scalers for the weights
  • weight_scale_2 (tensor): global scalers in the context of double scaling
  • pre_quant_scale (tensor): scalers used for smoothing salient weights
  • input_scale (tensor): quantization scalers for the activations

Parameter Requirements by Format

Format          Storage dtype   weight_scale      weight_scale_2   pre_quant_scale   input_scale
float8_e4m3fn   float8_e4m3fn   float32 (scalar)  -                -                 float32 (scalar)

Metadata Structure

The _quantization_metadata contains:
{
  "_quantization_metadata": {
    "format_version": "1.0",
    "layers": {
      "model.layers.0.mlp.up_proj": "float8_e4m3fn",
      "model.layers.0.mlp.down_proj": "float8_e4m3fn",
      "model.layers.1.mlp.up_proj": "float8_e4m3fn"
    }
  }
}
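Since safetensors metadata values are stored as strings, the entry is parsed as JSON to discover which layers are quantized (a hedged sketch; with the safetensors library the header metadata is available via safe_open(...).metadata()):

```python
import json

# Metadata string as it would appear in the checkpoint header
raw = '{"format_version": "1.0", "layers": {"model.layers.0.mlp.up_proj": "float8_e4m3fn"}}'

meta = json.loads(raw)
quantized_layers = dict(meta["layers"])  # layer name -> quantization format
print(quantized_layers)
```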

Creating Quantized Checkpoints

Weight Quantization

Weight quantization computes the scaling factor directly from the weight tensor:
from comfy.quant_ops import TensorCoreFP8E4M3Layout

# Quantize a weight tensor
weight = model.layer.weight
qdata, params = TensorCoreFP8E4M3Layout.quantize(
    weight, 
    scale="recalculate",  # Auto-compute scale
    stochastic_rounding=0  # Deterministic rounding
)

Activation Quantization (Calibration)

Activation quantization requires post-training calibration (PTQ):
  1. Collect Statistics: run inference on N representative samples
  2. Track Activations: record the absolute maximum (amax) of inputs to each quantized layer
  3. Compute Scales: derive input_scale from collected statistics
  4. Store in Checkpoint: save input_scale parameters alongside weights
The calibration dataset should be representative of your target use case. For diffusion models, use a diverse set of prompts and generation parameters.
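The steps above can be sketched with forward hooks (a hedged, generic PTQ sketch, not ComfyUI's calibration code; the 448 divisor assumes FP8 E4M3 targets):

```python
import torch

def calibrate_amax(model, layer_names, sample_batches):
    """Record per-layer input amax with forward hooks, then derive input scales."""
    amax = {name: 0.0 for name in layer_names}
    modules = dict(model.named_modules())
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Track the absolute maximum of this layer's input activations
            amax[name] = max(amax[name], inputs[0].abs().max().item())
        return hook

    for name in layer_names:
        hooks.append(modules[name].register_forward_hook(make_hook(name)))
    with torch.no_grad():
        for batch in sample_batches:   # N representative samples
            model(batch)
    for h in hooks:
        h.remove()
    # input_scale derived from the collected statistics (FP8 E4M3 max = 448)
    return {name: a / 448.0 for name, a in amax.items()}
```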

Stochastic Rounding

ComfyUI supports stochastic rounding for better quantization quality:
qdata, params = TensorCoreFP8E4M3Layout.quantize(
    tensor,
    scale="recalculate",
    stochastic_rounding=42  # Seed for reproducibility
)
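The idea behind stochastic rounding, sketched generically (not ComfyUI's FP8 kernel): round up with probability equal to the fractional distance, which makes rounding unbiased in expectation:

```python
import torch

def stochastic_round(x, generator=None):
    # Round down or up with probability equal to the fractional part,
    # so E[stochastic_round(x)] == x, unlike round-to-nearest.
    low = x.floor()
    frac = x - low
    noise = torch.rand(x.shape, generator=generator, device=x.device)
    return low + (noise < frac).to(x.dtype)
```

For example, round-to-nearest maps a tensor full of 0.25 to all zeros, while stochastic rounding preserves the mean (~0.25) across many elements, which is why it can improve quantization quality.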

Best Practices

  • Choose the Right Format: use E4M3 for most workloads, E5M2 when you need wider dynamic range
  • Mixed Precision: keep sensitive layers in FP16/BF16, quantize compute-heavy layers
  • Calibrate Properly: use diverse, representative data for activation quantization calibration
  • Monitor Quality: test quantized models thoroughly to ensure acceptable accuracy

See Also

  • Memory Management: learn about VRAM optimization and model offloading
  • Custom Nodes: create custom nodes that work with quantized models
