Overview
Quantization in ComfyUI enables you to reduce model memory footprint and increase throughput by converting high-precision weights to lower-precision formats like FP8. This is achieved through a system built around QuantizedTensor and layout-based quantization strategies.
How Quantization Works
Quantization maps high-precision values (FP16/FP32) to lower-precision formats with minimal accuracy loss. The key challenge is managing dynamic range differences:
- FP16 range: (-65,504, 65,504)
- FP8 E4M3 range: (-448, 448)
- FP8 E5M2 range: (-57,344, 57,344)
Scaling Factor Method
ComfyUI uses per-tensor absolute-maximum scaling to map values into the quantized format’s range. The scaling factor is additional metadata needed to interpret quantized values, making these “derived datatypes”.
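As a plain-Python sketch (the function names here are illustrative, not ComfyUI's API), per-tensor absolute-maximum scaling works like this: the scale is derived from the tensor's largest magnitude, and dividing by it maps every value into the target format's range.

```python
# Illustrative sketch of per-tensor absolute-maximum scaling.
# FP8_E4M3_MAX is the E4M3 format's largest representable magnitude.
FP8_E4M3_MAX = 448.0

def absmax_scale(values, fmt_max=FP8_E4M3_MAX):
    amax = max(abs(v) for v in values)
    return amax / fmt_max if amax > 0 else 1.0

def quantize(values, scale):
    # Dividing by the scale maps every value into [-fmt_max, fmt_max].
    return [v / scale for v in values]

def dequantize(qvalues, scale):
    # The stored scale is the metadata needed to recover the original range.
    return [q * scale for q in qvalues]

weights = [0.5, -1200.0, 3.25]   # -1200 would overflow FP8 E4M3 unscaled
scale = absmax_scale(weights)
qweights = quantize(weights, scale)
restored = dequantize(qweights, scale)
```

Note that without the scale, the quantized values alone are meaningless; this is why the scale must travel with the tensor as metadata.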
Architecture
ComfyUI’s quantization system uses a three-layer architecture: the QuantizedTensor class, Layout classes, and operation registries.
QuantizedTensor
The QuantizedTensor class (defined in comfy/quant_ops.py) is a subclass of torch.Tensor that represents quantized data with associated metadata.
Layout Classes
A Layout class defines how a specific quantization format behaves.
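A minimal sketch of the layout pattern, assuming a layout bundles the format's range with quantize/dequantize behavior (the class and method names here are hypothetical, not ComfyUI's actual API):

```python
# Hypothetical layout sketch: the format's constants and conversion logic
# live together, so callers never hard-code format details.
class SimpleFP8Layout:
    FMT_MAX = 448.0  # FP8 E4M3 maximum representable magnitude

    @staticmethod
    def quantize(values):
        amax = max(abs(v) for v in values) or 1.0
        scale = amax / SimpleFP8Layout.FMT_MAX
        # Return the quantized payload plus the scale needed to interpret it.
        return [v / scale for v in values], scale

    @staticmethod
    def dequantize(qvalues, scale):
        return [q * scale for q in qvalues]
```

Keeping this logic in one class per format is what lets the rest of the system treat all quantized tensors uniformly.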
Built-in Quantization Formats
ComfyUI supports several quantization formats, defined in QUANT_ALGOS.
TensorCore FP8 Layouts
- E4M3
- E5M2
FP8 E4M3 (4-bit exponent, 3-bit mantissa) provides better precision for most deep learning workloads. Range: -448 to 448.
Operation Registry
ComfyUI uses two registry systems for quantized operations:
Generic Registry
Handles operations common to all quantized formats (.to(), .clone(), .reshape()).
Layout-Specific Registry
Allows fast-path implementations for specific operations. When torch.nn.functional.linear() is called with QuantizedTensor arguments, __torch_dispatch__ automatically routes to the registered implementation.
Fallback Behavior
For any unsupported operation, QuantizedTensor automatically:
- Calls dequantize to convert back to high precision
- Dispatches using the high-precision implementation
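The two-registry idea with fallback can be sketched in plain Python (illustrative names only; ComfyUI's real mechanism routes through torch's __torch_dispatch__ protocol rather than an explicit dispatch function):

```python
# Simplified sketch of generic vs. layout-specific operation registries.
GENERIC_OPS = {}   # implementations shared by every quantized format
LAYOUT_OPS = {}    # (layout, op) -> layout-specific fast path

def generic_linear(dense_weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in dense_weight]

GENERIC_OPS["linear"] = generic_linear

def fp8_linear(qweight, scale, x):
    # Fast path: fold the scale into the matmul, skipping dequantization.
    return [sum(w * scale * xi for w, xi in zip(row, x)) for row in qweight]

LAYOUT_OPS[("fp8_e4m3", "linear")] = fp8_linear

def dispatch(layout, op, qweight, scale, *args):
    fast = LAYOUT_OPS.get((layout, op))
    if fast is not None:
        return fast(qweight, scale, *args)
    # Fallback: dequantize first, then call the high-precision implementation.
    dense = [[w * scale for w in row] for row in qweight]
    return GENERIC_OPS[op](dense, *args)
```

Both paths produce the same result; the fast path simply avoids materializing the dequantized weight.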
Mixed Precision Operations
The MixedPrecisionOps class enables per-layer quantization decisions, allowing different layers to use different precisions.
How It Works
During model loading, the custom Linear._load_from_state_dict() method inspects each layer:
- Not in config: load the weight as a regular tensor in _compute_dtype
- In config: load the weight as a QuantizedTensor with the specified layout (e.g., TensorCoreFP8Layout)
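The per-layer decision above can be sketched as follows (hypothetical function and layer names; the real hook is ComfyUI's custom Linear._load_from_state_dict()):

```python
# Illustrative per-layer load decision: quantized layers keep their payload
# plus scale metadata, everything else loads as a dense tensor.
def load_weight(name, raw, scale, quant_config, compute_dtype="fp16"):
    if name not in quant_config:
        # Not in config: regular tensor in the compute dtype.
        return {"kind": "dense", "dtype": compute_dtype, "data": raw}
    layout = quant_config[name]["layout"]
    # In config: quantized weight with its layout and scale metadata.
    return {"kind": "quantized", "layout": layout, "data": raw, "scale": scale}

# Hypothetical layer names for illustration only.
cfg = {"blocks.0.attn.qkv": {"layout": "TensorCoreFP8Layout"}}
w1 = load_weight("blocks.0.attn.qkv", b"\x00", 0.02, cfg)
w2 = load_weight("final_layer.proj", b"\x00", None, cfg)
```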
Checkpoint Format
Quantized checkpoints are stored as standard safetensors files with:
- Quantized weight tensors (sometimes using a different storage dtype, e.g., uint8 for FP8)
- Associated scaling parameters
- A _quantization_metadata JSON entry
Scaling Parameters
ComfyUI defines 4 possible scaling parameters:
- weight_scale: quantization scalers for the weights
- weight_scale_2: global scalers in the context of double scaling
- pre_quant_scale: scalers used for smoothing salient weights
- input_scale: quantization scalers for the activations
Parameter Requirements by Format
| Format | Storage dtype | weight_scale | weight_scale_2 | pre_quant_scale | input_scale |
|---|---|---|---|---|---|
| float8_e4m3fn | float32 | float32 (scalar) | - | - | float32 (scalar) |
Metadata Structure
The _quantization_metadata entry contains:
Creating Quantized Checkpoints
Weight Quantization
Weight quantization computes the scaling factor directly from the weight tensor.
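A sketch of this in plain Python (not ComfyUI's code): the scale comes straight from the weight tensor's absolute maximum, and fake_fp8_round is a crude stand-in that only mimics E4M3's 3 mantissa bits, not a real FP8 conversion.

```python
import math

FP8_E4M3_MAX = 448.0

def fake_fp8_round(x, man_bits=3):
    # Crude simulation of limited-mantissa rounding (ignores exponent limits).
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    step = 2.0 ** -(man_bits + 1)     # mantissa quantum at this exponent
    return round(m / step) * step * 2.0 ** e

def quantize_weight(weights):
    amax = max(abs(w) for w in weights)
    weight_scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    qweight = [fake_fpround for fake_fpround in
               (fake_fp8_round(w / weight_scale) for w in weights)]
    return qweight, weight_scale

qw, ws = quantize_weight([0.5, -1.0, 0.3])
```

Dequantizing qw[2] gives about 0.286 rather than 0.3, which illustrates the precision loss the limited mantissa introduces.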
Activation Quantization (Calibration)
Activation quantization requires post-training calibration (PTQ): representative inputs are run through the model to record typical activation ranges.
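A minimal sketch of the calibration idea, assuming a simple absolute-maximum tracker (real PTQ runs the actual model over a calibration dataset; the names here are illustrative):

```python
import random

def calibrate_input_scale(sample_batches, fmt_max=448.0):
    # Track the largest activation magnitude seen across calibration data,
    # then derive the input scale from it.
    amax = 0.0
    for batch in sample_batches:
        amax = max(amax, max(abs(v) for v in batch))
    return amax / fmt_max

# Stand-in calibration data; in practice these are real model activations.
random.seed(0)
batches = [[random.uniform(-10, 10) for _ in range(64)] for _ in range(8)]
input_scale = calibrate_input_scale(batches)
```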
Stochastic Rounding
ComfyUI supports stochastic rounding for better quantization quality.
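Stochastic rounding rounds up with probability equal to the fractional part, so the expected rounded value equals the input and quantization error is unbiased. A sketch (illustrative, not ComfyUI's implementation):

```python
import math
import random

def stochastic_round(x):
    # Round up with probability equal to the fractional part of x.
    lo = math.floor(x)
    frac = x - lo
    return lo + (1 if random.random() < frac else 0)

random.seed(0)
samples = [stochastic_round(2.25) for _ in range(10_000)]
average = sum(samples) / len(samples)  # close to 2.25 in expectation
```

Deterministic rounding would always map 2.25 to 2, introducing a systematic bias that accumulates across many weights; stochastic rounding averages the error away.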
Choose the Right Format
Use E4M3 for most workloads, E5M2 when you need wider dynamic range
Mixed Precision
Keep sensitive layers in FP16/BF16, quantize compute-heavy layers
Calibrate Properly
Use diverse, representative data for activation quantization calibration
Monitor Quality
Test quantized models thoroughly to ensure acceptable accuracy
Related
Memory Management
Learn about VRAM optimization and model offloading
Custom Nodes
Create custom nodes that work with quantized models