Overview
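Instead of launching a separate kernel to post-process the GEMM output, a fused epilogue applies those operations inside the GEMM kernel. Fusion:
- Reduces memory traffic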
- Eliminates kernel launch overhead
- Improves overall throughput
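The memory-traffic benefit can be estimated with a rough model. The sketch below is an illustration rather than anything taken from CUTLASS: it assumes an elementwise epilogue that reads one extra tensor C and writes D, and that an unfused pipeline writes the FP32 accumulator to global memory and reads it back in a second kernel. The function name and parameters are hypothetical.

```python
def epilogue_traffic_bytes(m: int, n: int, out_bytes: int,
                           acc_bytes: int = 4, fused: bool = True) -> int:
    """Estimate global-memory bytes moved by an elementwise epilogue on an M x N output."""
    elems = m * n
    # Both variants read C once and write D once.
    traffic = 2 * elems * out_bytes
    if not fused:
        # Assumption: the unfused GEMM writes the accumulator out and a
        # second kernel reads it back before applying the epilogue.
        traffic += 2 * elems * acc_bytes
    return traffic


fused_bytes = epilogue_traffic_bytes(4096, 4096, out_bytes=2, fused=True)
unfused_bytes = epilogue_traffic_bytes(4096, 4096, out_bytes=2, fused=False)
print(f"fused:   {fused_bytes / 2**20:.0f} MiB")
print(f"unfused: {unfused_bytes / 2**20:.0f} MiB")
```

Under these assumptions, a 4096 x 4096 FP16 output moves 64 MiB when fused and 192 MiB when unfused. Real savings depend on caching and on what the epilogue actually reads and writes.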
Epilogue Fusion Configuration (EFC)
For Blackwell (SM100) and later architectures, CUTLASS provides the Epilogue Fusion Configuration (EFC) framework for defining custom epilogues.
Basic Example: Alpha-Beta Scaling
This example demonstrates a GEMM with a custom epilogue that applies alpha-beta scaling to the result.
Running the Example
Command Line Usage
With NCU Profiling
Advanced Examples
Activation Functions
Common activation functions can be fused into the epilogue.
Multiple Outputs
A single epilogue can generate multiple output tensors.
Complex Expressions
More sophisticated fusion patterns can also be implemented.
Legacy Epilogue Interface (Pre-SM90)
For older architectures, use the high-level Python interface. See examples/python/deprecated/01_epilogue.ipynb for more details.
Supported Operations
The epilogue can include:
Mathematical Operations
- Addition, subtraction, multiplication, division
- Power, exponential, logarithm
- Trigonometric functions (sin, cos, tan)
Activation Functions
- ReLU, Leaky ReLU, PReLU
- Sigmoid, tanh
- GELU, SiLU
- Softmax (with limitations)
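For reference, several of the activations listed above can be written as scalar Python functions. These are the standard mathematical definitions, not the code CUTLASS generates; the default Leaky ReLU slope of 0.01 is an assumption, and a fused kernel may use a different GELU approximation than the exact erf form shown here.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # negative_slope default is an illustrative assumption.
    return x if x >= 0.0 else negative_slope * x


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    # SiLU (also called swish): x * sigmoid(x)
    return x * sigmoid(x)


def gelu(x: float) -> float:
    # Exact GELU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for name, fn in [("relu", relu), ("leaky_relu", leaky_relu),
                 ("sigmoid", sigmoid), ("silu", silu), ("gelu", gelu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Host-side definitions like these are mainly useful as a ground truth when checking the output of a fused kernel.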
Memory Operations
- Load from multiple input tensors
- Store to multiple output tensors
- Conditional stores
Type Conversions
- Mixed precision computations
- Type casting between FP32, FP16, BF16, INT8
Performance Considerations
Memory Bandwidth
Epilogue fusion is most beneficial when:
Register Usage
Monitor register usage with NCU.
Data Type Support
Supported Input Types (A, B)
- FP16, BF16
- TF32
- INT8, UINT8
- FP8 (E4M3FN, E5M2)
Supported Accumulator Types
- FP32 (for all floating-point inputs)
- FP16 (for FP16 and FP8 inputs)
- INT32 (for INT8/UINT8 inputs)
Supported Output Types (C, D)
- FP32, FP16, BF16
- INT32, INT8, UINT8
- FP8 (E4M3FN, E5M2) with FP32 accumulator
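The input and accumulator pairings listed above can be captured in a small lookup table for validating a configuration before launching a kernel. This is a sketch built only from the lists in this section; the string names and the `is_supported` helper are hypothetical and are not CUTLASS identifiers.

```python
# Accumulator types allowed for each input type, per the lists above.
# Keys and values are plain illustrative strings, not CUTLASS type names.
ALLOWED_ACCUMULATORS = {
    "FP16": {"FP32", "FP16"},
    "BF16": {"FP32"},
    "TF32": {"FP32"},
    "E4M3FN": {"FP32", "FP16"},
    "E5M2": {"FP32", "FP16"},
    "INT8": {"INT32"},
    "UINT8": {"INT32"},
}


def is_supported(input_type: str, accumulator_type: str) -> bool:
    """Return True if the input/accumulator pairing appears in the table."""
    return accumulator_type in ALLOWED_ACCUMULATORS.get(input_type, set())


print(is_supported("BF16", "FP32"))
print(is_supported("INT8", "FP32"))
```

The table does not encode output-type rules, such as FP8 outputs requiring an FP32 accumulator; those would need a separate check.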
Constraints
Debugging Tips
Enable Verification
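One common way to verify a fused epilogue is to compare the kernel output against a host-side reference. The sketch below assumes the alpha-beta epilogue computes D = alpha * (A @ B) + beta * C with FP32 accumulation, which is the conventional form and is not confirmed by this page. The function name and the tolerances mentioned in the comments are illustrative assumptions.

```python
import numpy as np


def reference_alpha_beta(a, b, c, alpha: float, beta: float) -> np.ndarray:
    """Host reference: D = alpha * (A @ B) + beta * C, accumulated in FP32."""
    acc = a.astype(np.float32) @ b.astype(np.float32)
    return alpha * acc + beta * c.astype(np.float32)


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16)).astype(np.float16)
b = rng.standard_normal((16, 4)).astype(np.float16)
c = rng.standard_normal((8, 4)).astype(np.float16)

expected = reference_alpha_beta(a, b, c, alpha=2.0, beta=0.5)
print(expected.shape, expected.dtype)

# With the kernel result copied back to the host as `d_gpu` (hypothetical name),
# a tolerance-based comparison could look like:
# np.testing.assert_allclose(d_gpu.astype(np.float32), expected, rtol=1e-2, atol=1e-2)
```

Tolerances should be chosen to match the input and output precision; FP16 and FP8 results generally need looser bounds than FP32.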
Print Intermediate Values
Check Memory Alignment
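A quick host-side alignment check can rule out misaligned buffers before looking deeper. The sketch below tests whether a NumPy array's base address is a multiple of a given byte count; the 16-byte default is an assumption for illustration, since the alignment a particular kernel requires is not stated here.

```python
import numpy as np


def is_aligned(array: np.ndarray, alignment: int = 16) -> bool:
    """Return True if the array's base address is a multiple of `alignment` bytes."""
    return array.ctypes.data % alignment == 0


# Build one aligned and one deliberately shifted view over the same buffer.
raw = np.zeros(64, dtype=np.uint8)
start = (-raw.ctypes.data) % 16           # offset to the next 16-byte boundary
aligned = raw[start:start + 32].view(np.float32)
shifted = raw[start + 4:start + 36].view(np.float32)  # 4 bytes past the boundary

print(is_aligned(aligned), is_aligned(shifted))
```

The same idea applies to device pointers: compare the raw address against the alignment the kernel expects, and also check slices and offsets, since a view into an aligned allocation is not necessarily aligned itself.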
Examples in the Repository
Find complete working examples:
- Custom epilogue: examples/python/CuTeDSL/blackwell/epilogue/custom_epilogue_dense_gemm.py
- Activation fusion: examples/python/CuTeDSL/blackwell/epilogue/activation_custom_epilogue_dense_gemm.py
- Synthetic examples: examples/python/CuTeDSL/blackwell/epilogue/synthetic_custom_epilogue_dense_gemm.py
- Legacy interface: examples/python/deprecated/01_epilogue.ipynb
Next Steps
- Basic GEMM: Master basic GEMM operations first
- Grouped GEMM: Combine custom epilogues with grouped operations