Introduction
CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM), convolution, and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN.

Core Components
The CUTLASS C++ API is organized into several key namespaces and components.

Namespace Organization
Device-Level APIs
CUTLASS provides device-level operator templates that can be instantiated and launched from host code:

- Primary GEMM device operator for matrix multiplication
- Universal GEMM supporting batched, grouped, and split-K modes
- Implicit GEMM convolution device operator
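As a sketch of the first of these, the following instantiates cutlass::gemm::device::Gemm for a single-precision, column-major GEMM and launches it with the function call operator. This assumes CUTLASS 2.x conventions; all template parameters beyond the element types and layouts are left at their defaults.

```cpp
#include "cutlass/gemm/device/gemm.h"

// A column-major SGEMM; defaults select a SIMT (CUDA core) kernel.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C

cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const *A, int lda,
                          float const *B, int ldb,
                          float beta, float *C, int ldc) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},           // problem size
                       {A, lda}, {B, ldb},  // TensorRefs for A and B
                       {C, ldc},            // source C
                       {C, ldc},            // destination D
                       {alpha, beta});      // epilogue scalars
  return gemm_op(args);  // construct, initialize, and run in one call
}
```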
Template Design Pattern
CUTLASS uses extensive compile-time template metaprogramming to optimize kernels.

Hierarchical Decomposition
CUTLASS organizes computation at multiple levels:

- Device level - kernel launch and grid management
- Threadblock level - cooperative thread array (CTA) tiling
- Warp level - warp-level matrix operations
- Thread level - per-thread computation and data movement
- Instruction level - hardware-accelerated instructions (Tensor Cores)
Tile Shapes
Performance is controlled by tile sizes at each level:

- Threadblock-level tile size (e.g., GemmShape<128, 128, 32>)
- Warp-level tile size (e.g., GemmShape<64, 64, 32>)
- Instruction-level tile size (e.g., GemmShape<16, 8, 16> for Tensor Cores)

Common Data Types
CUTLASS defines several core data structures:

- GemmCoord - an (M, N, K) coordinate describing a GEMM problem size
- TensorRef - a non-owning reference pairing a data pointer with a layout
- Status - an enumeration of result codes such as Status::kSuccess
API Usage Pattern
The typical workflow for using CUTLASS device operators:

1. Prepare arguments - construct an Arguments structure with problem size, tensor references, and parameters
2. Initialize - call initialize() with the arguments and any required workspace
3. Run - call run() to launch the kernel
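A sketch of this workflow, assuming Gemm is an instantiated cutlass::gemm::device::Gemm, args is a populated Gemm::Arguments, and stream is a cudaStream_t; cutlass::device_memory::allocation is a helper from the CUTLASS utilities (cutlass/util/device_memory.h):

```cpp
Gemm gemm_op;

// Query and allocate any workspace the kernel needs.
size_t workspace_size = Gemm::get_workspace_size(args);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

// Verify the operator supports this problem, then initialize and run.
cutlass::Status status = gemm_op.can_implement(args);
if (status == cutlass::Status::kSuccess) {
  status = gemm_op.initialize(args, workspace.get(), stream);
}
if (status == cutlass::Status::kSuccess) {
  status = gemm_op.run(stream);
}
```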
The initialize() and run() steps can be combined using the function call operator: status = kernel(args, workspace, stream);

File Organization
Key header files in the CUTLASS C++ API live under the include/cutlass/ directory.

Architecture Support
CUTLASS supports multiple NVIDIA GPU architectures.

Operator Classes
Kernels can target different compute capabilities through their operator class:

- CUDA cores (SIMT) - available on all architectures
- Tensor Cores - available on Volta and later
- WMMA Tensor Cores - Volta-specific
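The operator class and target architecture are selected through template arguments alongside the tile shapes. The following is a sketch assuming the CUTLASS 2.x template parameter order of cutlass::gemm::device::Gemm, using the example tile sizes shown earlier:

```cpp
#include "cutlass/gemm/device/gemm.h"

// Tensor Core GEMM targeting Ampere (SM80).
using GemmTensorOp = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    float, cutlass::layout::RowMajor,               // C
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
    cutlass::arch::Sm80,                            // target architecture
    cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>>;           // instruction tile
```

Substituting cutlass::arch::OpClassSimt selects the CUDA-core (SIMT) path instead.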
Next Steps
- GEMM API - learn about the GEMM API and usage patterns
- CuTe Library - explore the modern CuTe tensor abstraction
- Epilogue Operations - add custom operations to your kernels
- Convolution API - perform efficient convolution operations