Supported Architectures
CUTLASS supports the following NVIDIA GPU architectures:

| Architecture | Compute Capability | Key Features | CUTLASS Support |
|---|---|---|---|
| Volta | SM70, SM72 | First-gen Tensor Cores | Basic tensor ops |
| Turing | SM75 | Enhanced Tensor Cores | INT8/INT4 operations |
| Ampere | SM80, SM86 | TF32, BF16, async copy | Full tensor core suite |
| Ada | SM89 | FP8 support | FP8 tensor operations |
| Hopper | SM90 | WGMMA, TMA, FP8 | Warpgroup operations |
| Blackwell | SM100, SM103, SM120 | UMMA, TMEM, enhanced FP8 | Next-gen operations |
CUTLASS requires CUDA 11.4 or later, with CUDA 12.8 recommended for best performance. Volta architecture (SM70) is the minimum supported compute capability.
Architecture-Specific Features
Data Type Support by Architecture
In brief: FP16 is supported from Volta onward; INT8/INT4 from Turing; TF32 and BF16 from Ampere; FP8 from Ada and Hopper; and FP4/MX block-scaled formats from Blackwell.
Tensor Core Evolution
Ampere (SM80/86)
- TF32 (TensorFloat-32) for FP32 acceleration
- BF16 (BFloat16) native support
- FP64 tensor cores
- Asynchronous copy (cp.async)
- INT8/INT4/INT1 operations
Hopper (SM90)
- Warpgroup Matrix Multiply-Accumulate (WGMMA)
- Tensor Memory Accelerator (TMA)
- FP8 (E4M3, E5M2) native support
- Enhanced FP64 tensor cores
- Thread block clusters
Blackwell (SM100/120)
- Universal Matrix Multiply-Accumulate (UMMA)
- Tensor Memory (TMEM)
- Block-scaled data types (FP4, MXFP4/6/8)
- Enhanced sparse operations
- Advanced scheduling primitives
Memory Hierarchy
Different architectures provide different shared memory capacities (roughly 96 KB per SM on Volta, up to 164 KB on A100-class Ampere, and 228 KB on Hopper).
Target Architecture Compilation
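A sketch of selecting target architectures at build time. `CUTLASS_NVCC_ARCHS` is the CMake option CUTLASS's build system uses for this; the `nvcc` flags are standard CUDA, and the include path is a placeholder:

```bash
# Configure CUTLASS for specific compute capabilities via CMake
# (semicolon-separated list; "90a" requests architecture-accelerated Hopper PTX)
cmake .. -DCUTLASS_NVCC_ARCHS="80;86;90a"

# Or compile a single translation unit directly with nvcc
nvcc -arch=sm_90a -std=c++17 -I/path/to/cutlass/include gemm.cu -o gemm
```

Listing several architectures produces a fat binary that runs on any of them, at the cost of longer build times and a larger executable.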
Architecture-Accelerated Features
Starting with CUDA 12.0, certain Hopper and Blackwell features require architecture-accelerated PTX, indicated by the “a” suffix (e.g., sm_90a, sm_100a). Kernels compiled with sm_100a (datacenter) are not compatible with RTX 50 series GPUs (SM120); they require separate compilation targets.
Checking Architecture at Runtime
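One way to perform such a check is with the CUDA runtime API. A minimal sketch (`cudaGetDeviceProperties` and its `major`/`minor`/`sharedMemPerMultiprocessor` fields are standard CUDA; error handling is elided, and the dispatch thresholds are illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query the compute capability of device 0
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, /*device=*/0);

    int cc = props.major * 10 + props.minor;  // e.g. 90 for SM90 (Hopper)
    std::printf("Compute capability: SM%d\n", cc);
    std::printf("Shared memory per SM: %zu bytes\n",
                props.sharedMemPerMultiprocessor);

    if (cc >= 90) {
        std::printf("Hopper+ path: WGMMA/TMA kernels available\n");
    } else if (cc >= 80) {
        std::printf("Ampere path: cp.async and TF32 available\n");
    } else {
        std::printf("Pre-Ampere fallback path\n");
    }
    return 0;
}
```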
CUTLASS provides utilities to query the current architecture.
Architecture Selection in Templates
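As a sketch, an architecture tag such as `cutlass::arch::Sm80` selects architecture-specialized kernels in the CUTLASS 2.x device-level GEMM API. The template parameter order shown follows that API; remaining parameters have version-dependent defaults:

```cpp
#include <cutlass/gemm/device/gemm.h>

// FP16 inputs, FP32 accumulation, targeting Ampere (SM80) tensor cores.
// Swapping cutlass::arch::Sm80 for Sm75 or Sm70 retargets the same code.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,  // element/layout of A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // element/layout of B
    cutlass::half_t, cutlass::layout::ColumnMajor,  // element/layout of C
    float,                                          // accumulator type
    cutlass::arch::OpClassTensorOp,                 // use tensor cores
    cutlass::arch::Sm80>;                           // architecture tag
```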
CUTLASS uses architecture tags to specialize templates.
Performance Characteristics
Each architecture has different peak theoretical performance:
- Ampere A100: Up to 312 TFLOPS (FP16 Tensor Core)
- Hopper H100: Up to 1,979 TFLOPS (FP8 Tensor Core)
- Blackwell B200: Enhanced performance with UMMA instructions
Compiler Preprocessor Macros
Architecture-specific code uses compile-time guards.
Next Steps
Explore architecture-specific features:
Ampere (SM80/86)
TF32, BF16, async copy operations
Hopper (SM90)
WGMMA, TMA, FP8 support
Blackwell (SM100/120)
UMMA, TMEM, block-scaled types