CUTLASS provides comprehensive support for NVIDIA GPU architectures from Volta through Blackwell, with optimized implementations for each generation’s unique features.

Supported Architectures

CUTLASS supports the following NVIDIA GPU architectures:
Architecture | Compute Capability  | Key Features             | CUTLASS Support
Volta        | SM70, SM72          | First-gen Tensor Cores   | Basic tensor ops
Turing       | SM75                | Enhanced Tensor Cores    | INT8/INT4 operations
Ampere       | SM80, SM86          | TF32, BF16, async copy   | Full tensor core suite
Ada          | SM89                | FP8 support              | FP8 tensor operations
Hopper       | SM90                | WGMMA, TMA, FP8          | Warpgroup operations
Blackwell    | SM100, SM103, SM120 | UMMA, TMEM, enhanced FP8 | Next-gen operations
CUTLASS requires CUDA 11.4 or later, with CUDA 12.8 recommended for best performance. Volta architecture (SM70) is the minimum supported compute capability.
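The SM70 floor can be expressed as a simple compile-time check. The sketch below is plain standalone C++, not a CUTLASS API; `encode_cc` and `is_supported` are hypothetical helpers that mirror CUTLASS's two-digit compute-capability encoding (7.0 becomes 70, 8.6 becomes 86).

```cpp
#include <cassert>

// Hypothetical helper: convert a (major, minor) compute capability pair
// into the two-digit integer encoding CUTLASS uses (7.0 -> 70, 8.6 -> 86).
constexpr int encode_cc(int major, int minor) { return major * 10 + minor; }

// SM70 (Volta) is the minimum compute capability CUTLASS supports.
constexpr int kMinSupportedCC = 70;

constexpr bool is_supported(int major, int minor) {
  return encode_cc(major, minor) >= kMinSupportedCC;
}
```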

Architecture-Specific Features

Data Type Support by Architecture

// From include/cutlass/arch/arch.h
struct Sm80 {
  static int const kMinComputeCapability = 80;
};

struct Sm90 {
  static int const kMinComputeCapability = 90;
};

struct Sm100 {
  static int const kMinComputeCapability = 100;
};
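These tags can gate features at compile time through `kMinComputeCapability`. The following is a standalone sketch using stand-in structs that mirror the CUTLASS tags above; the `supports_warpgroup_mma` gate is illustrative (warpgroup MMA is a Hopper SM90+ feature), not a function from the library.

```cpp
#include <cassert>

// Stand-ins mirroring the CUTLASS arch tags shown above.
struct Sm80  { static int const kMinComputeCapability = 80; };
struct Sm90  { static int const kMinComputeCapability = 90; };
struct Sm100 { static int const kMinComputeCapability = 100; };

// Illustrative compile-time gate: warpgroup MMA (WGMMA) requires SM90+.
template <typename ArchTag>
constexpr bool supports_warpgroup_mma() {
  return ArchTag::kMinComputeCapability >= 90;
}
```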

Tensor Core Evolution

Ampere (SM80/86)
  • TF32 (TensorFloat-32) for FP32 acceleration
  • BF16 (BFloat16) native support
  • FP64 tensor cores
  • Asynchronous copy (cp.async)
  • INT8/INT4/INT1 operations
Hopper (SM90)
  • Warpgroup Matrix Multiply-Accumulate (WGMMA)
  • Tensor Memory Accelerator (TMA)
  • FP8 (E4M3, E5M2) native support
  • Enhanced FP64 tensor cores
  • Thread block clusters
Blackwell (SM100/103/120)
  • Universal Matrix Multiply-Accumulate (UMMA)
  • Tensor Memory (TMEM)
  • Block-scaled data types (FP4, MXFP4/6/8)
  • Enhanced sparse operations
  • Advanced scheduling primitives
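To make the FP8 formats above concrete, here is an illustrative decoder for E4M3 (1 sign, 4 exponent, 3 mantissa bits, exponent bias 7). Per the OCP FP8 convention, E4M3 has no infinities and encodes NaN as exponent and mantissa all ones. This is an expository sketch, not CUTLASS's `float_e4m3_t` implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one FP8 E4M3 byte to float (bias 7, no infinities).
float decode_e4m3(uint8_t bits) {
  int sign = (bits >> 7) & 0x1;
  int exp  = (bits >> 3) & 0xF;
  int man  = bits & 0x7;
  float s = sign ? -1.0f : 1.0f;
  if (exp == 0xF && man == 0x7) return NAN;             // NaN encoding
  if (exp == 0) return s * std::ldexp(man / 8.0f, -6);  // subnormal
  return s * std::ldexp(1.0f + man / 8.0f, exp - 7);    // normal
}
```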

Memory Hierarchy

Different architectures provide different shared memory capacities:
// From include/cutlass/arch/arch.h
namespace cutlass::arch {
  constexpr int sm100_smem_capacity_bytes = 232448;  // 227 KB
  constexpr int sm120_smem_capacity_bytes = 101376;  // 99 KB
}
Shared memory allocation must be carefully managed, especially on architectures with limited capacity. Use dynamic shared memory or optimize tile sizes accordingly.
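A quick way to manage this budget is to compute a tile's shared-memory footprint up front. The sketch below uses the capacities quoted above; `tile_smem_bytes` is a hypothetical helper assuming a multi-stage pipeline holding A and B tiles of a BLK_M x BLK_N x BLK_K GEMM with 2-byte elements (e.g. half or bfloat16).

```cpp
#include <cassert>
#include <cstddef>

// Shared-memory capacities quoted above (include/cutlass/arch/arch.h).
constexpr int sm100_smem_capacity_bytes = 232448;  // 227 KB
constexpr int sm120_smem_capacity_bytes = 101376;  //  99 KB

// Bytes of shared memory for pipelined A and B tiles of a
// BLK_M x BLK_N x BLK_K GEMM mainloop.
constexpr std::size_t tile_smem_bytes(int blk_m, int blk_n, int blk_k,
                                      int stages = 2, int elem_bytes = 2) {
  std::size_t a_tile = std::size_t(blk_m) * blk_k;
  std::size_t b_tile = std::size_t(blk_n) * blk_k;
  return std::size_t(stages) * elem_bytes * (a_tile + b_tile);
}
```

A 128x128x64 double-buffered half-precision tile needs 64 KB and fits both capacities, while deeper pipelines or larger tiles can exceed the SM120 budget and must be trimmed.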

Target Architecture Compilation

Architecture-Accelerated Features

Starting with CUDA 12.0, certain Hopper and Blackwell features require architecture-accelerated PTX, indicated by the “a” suffix:
# For Hopper GH100 (required for WGMMA, TMA)
cmake .. -DCUTLASS_NVCC_ARCHS="90a"

# For Blackwell datacenter GPUs
cmake .. -DCUTLASS_NVCC_ARCHS="100a"

# For Blackwell GeForce RTX 50 series
cmake .. -DCUTLASS_NVCC_ARCHS="120"
Kernels compiled with sm_100a (datacenter) are not compatible with RTX 50 series GPUs (SM120). They require separate compilation targets.
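When one binary must run on several of these targets, `CUTLASS_NVCC_ARCHS` accepts a semicolon-separated list, so each architecture gets its own kernels in the fatbinary. A configuration sketch:

```shell
# Build kernels for Hopper datacenter, Blackwell datacenter, and
# Blackwell GeForce targets in a single build (semicolon-separated list).
cmake .. -DCUTLASS_NVCC_ARCHS="90a;100a;120"
```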

Checking Architecture at Runtime

CUTLASS provides utilities to query the current architecture:
#include "cutlass/arch/arch.h"

// Inside device code (LaneId and SmId are CUTLASS_DEVICE functions):

// Get lane ID within a warp
int lane_id = cutlass::arch::LaneId();

// Get SM number
int sm_id = cutlass::arch::SmId();

// Architecture tags for template specialization
using ArchTag = cutlass::arch::Sm80;  // or Sm90, Sm100
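On the host side, the queried compute capability is typically mapped to the arch tag used to instantiate kernels. The helper below is a hypothetical dispatch sketch in plain C++ (the major/minor pair would come from `cudaGetDeviceProperties`), not a CUTLASS API.

```cpp
#include <cassert>
#include <string>

// Hypothetical host-side helper: map a device's compute capability to
// the name of the CUTLASS arch tag to instantiate kernels with.
std::string arch_tag_for(int major, int minor) {
  int cc = major * 10 + minor;
  if (cc >= 120) return "Sm120";
  if (cc >= 100) return "Sm100";
  if (cc >= 90)  return "Sm90";
  if (cc >= 80)  return "Sm80";
  if (cc >= 75)  return "Sm75";
  return "Sm70";
}
```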

Architecture Selection in Templates

CUTLASS uses architecture tags to specialize templates:
// Example from include/cutlass/arch/mma_sm80.h
template <>
struct Mma<
  gemm::GemmShape<16, 8, 8>,
  32,
  bfloat16_t,
  layout::RowMajor,
  bfloat16_t,
  layout::ColumnMajor,
  float,
  layout::RowMajor,
  OpMultiplyAdd> {
  
  using ArchTag = arch::Sm80;  // Specifies architecture requirement
  
  // ... implementation
};
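The mechanism can be seen in a stripped-down analogue: a primary template represents "no implementation for this architecture," and a specialization supplies one for a particular tag, which the compiler selects automatically. The types below are stand-ins, not actual CUTLASS definitions.

```cpp
#include <cassert>

struct Sm80 {};
struct Sm90 {};

// Primary template: no implementation available for this architecture.
template <typename ArchTag>
struct Mma {
  static constexpr bool kAvailable = false;
};

// Specialization for SM80, analogous to the bf16 Mma specialization above.
template <>
struct Mma<Sm80> {
  static constexpr bool kAvailable = true;
  static constexpr int kM = 16, kN = 8, kK = 8;  // instruction shape
};
```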

Performance Characteristics

Each architecture has different peak theoretical performance:
  • Ampere A100: Up to 312 TFLOPS (FP16 Tensor Core)
  • Hopper H100: Up to 1,979 TFLOPS (FP8 Tensor Core)
  • Blackwell B200: Enhanced performance with UMMA instructions
When tile shapes, data types, and pipeline stages are matched to the target architecture, CUTLASS kernels can approach these peak throughputs.

Compiler Preprocessor Macros

Architecture-specific code uses compile-time guards:
// Ampere
#if defined(CUTLASS_ARCH_MMA_SM80_ENABLED)
  // Ampere-specific code
#endif

// Hopper
#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
  // Hopper WGMMA code
#endif

// Blackwell
#if defined(CUTE_ARCH_FFMA2_SM100_ENABLED)
  // Blackwell-specific code (FFMA2 paired FP32 FMA)
#endif
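These guards are usually paired with a portable fallback so the same translation unit compiles for every target. The sketch below uses `EXAMPLE_SM80_ENABLED` as a stand-in macro for a real guard such as `CUTLASS_ARCH_MMA_SM80_ENABLED`.

```cpp
#include <cassert>

// Stand-in for a real architecture guard macro; defined here so the
// accelerated branch is taken in this example.
#define EXAMPLE_SM80_ENABLED 1

int selected_path() {
#if defined(EXAMPLE_SM80_ENABLED)
  return 80;  // architecture-accelerated path
#else
  return 0;   // portable fallback path
#endif
}
```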

Next Steps

Explore architecture-specific features:
  • Ampere (SM80/86): TF32, BF16, async copy operations
  • Hopper (SM90): WGMMA, TMA, FP8 support
  • Blackwell (SM100/120): UMMA, TMEM, block-scaled types
