
Introduction

CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM), convolution, and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN.

Core Components

The CUTLASS C++ API is organized into several key namespaces and components:

Namespace Organization

cutlass::                   // Top-level namespace
cutlass::gemm::             // GEMM operations
cutlass::conv::             // Convolution operations
cutlass::epilogue::         // Epilogue operations
cutlass::arch::             // Architecture-specific components
cute::                      // CuTe library (modern tensor abstraction)

Device-Level APIs

CUTLASS provides device-level operator templates that can be instantiated and launched from host code:
cutlass::gemm::device::Gemm (template class)
    Primary GEMM device operator for matrix multiplication
cutlass::gemm::device::GemmUniversal (template class)
    Universal GEMM supporting batched, grouped, and split-K modes
cutlass::conv::device::ImplicitGemmConvolution (template class)
    Implicit GEMM convolution device operator
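As an illustration, the primary operator can be instantiated with only element and layout types, leaving the remaining template parameters at their library defaults (a minimal sketch; the exact defaults vary between CUTLASS versions):

```cpp
// Minimal sketch: a single-precision, column-major GEMM operator.
// Remaining template parameters (accumulator type, architecture,
// tile shapes, epilogue) fall back to library defaults.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C
```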

Template Design Pattern

CUTLASS uses extensive compile-time template metaprogramming to optimize kernels:
// Element types
using ElementA = float;
using ElementB = float;
using ElementC = float;
using ElementAccumulator = float;

// Layout types
using LayoutA = cutlass::layout::ColumnMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::ColumnMajor;

// Architecture tag
using ArchTag = cutlass::arch::Sm80;

// Operator class
using OperatorClass = cutlass::arch::OpClassTensorOp;
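These aliases typically compose directly into a device operator. A sketch of one way they fit together, following the CUTLASS 2.x device::Gemm template parameter order (the tile and instruction shapes here are illustrative and must be valid for the chosen architecture):

```cpp
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape        = cutlass::gemm::GemmShape<64, 64, 32>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;  // TF32 Tensor Core MMA on SM80

using Gemm = cutlass::gemm::device::Gemm<
    ElementA, LayoutA,
    ElementB, LayoutB,
    ElementC, LayoutC,
    ElementAccumulator,
    OperatorClass, ArchTag,
    ThreadblockShape, WarpShape, InstructionShape>;
```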

Hierarchical Decomposition

CUTLASS organizes computation at multiple levels:

Device Level - Kernel launch and grid management
Threadblock Level - Cooperative thread array (CTA) tiling
Warp Level - Warp-level matrix operations
Thread Level - Per-thread computation and data movement
Instruction Level - Hardware-accelerated instructions (Tensor Cores)

Tile Shapes

Performance is controlled by tile sizes at each level:
ThreadblockShape (GemmShape) - Threadblock-level tile size (e.g., GemmShape<128, 128, 32>)
WarpShape (GemmShape) - Warp-level tile size (e.g., GemmShape<64, 64, 32>)
InstructionShape (GemmShape) - Instruction-level tile size (e.g., GemmShape<16, 8, 16> for Tensor Cores)

Common Data Types

CUTLASS defines several core data structures:

GemmCoord

cutlass::gemm::GemmCoord problem_size(M, N, K);
Represents the dimensions of a GEMM problem: C (M×N) = A (M×K) × B (K×N).
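Its components are read back through accessor functions (a small sketch; m(), n(), and k() are the accessors in the CUTLASS API):

```cpp
cutlass::gemm::GemmCoord problem_size(1024, 512, 256);
int m = problem_size.m();   // 1024: rows of C (and A)
int n = problem_size.n();   // 512: columns of C (and B)
int k = problem_size.k();   // 256: reduction dimension shared by A and B
```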

TensorRef

cutlass::TensorRef<ElementType, LayoutType> tensor_ref(ptr, leading_dimension);
A lightweight reference to a tensor in memory with layout information.

Status

cutlass::Status status = kernel(args);
if (status != cutlass::Status::kSuccess) {
  // Handle error
}
Return codes for CUTLASS operations.

API Usage Pattern

The typical workflow for using CUTLASS device operators:
1. Define the kernel type - Instantiate a CUTLASS template with desired types and parameters
2. Create kernel instance - Construct an instance of the kernel operator
3. Prepare arguments - Construct an Arguments structure with problem size, tensor references, and parameters
4. Check feasibility - Call can_implement() to verify the kernel can execute the problem
5. Get workspace size - Call get_workspace_size() if the kernel requires temporary storage
6. Initialize kernel - Call initialize() with arguments and workspace
7. Execute kernel - Call run() or the function call operator to launch the kernel

The initialize() and run() steps can be combined using the function call operator: status = kernel(args, workspace, stream);
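Put together, the steps above look roughly like the following sketch against a single-precision device::Gemm (the pointers d_A/d_B/d_C, leading dimensions, and alpha/beta are placeholders; the Arguments layout follows the CUTLASS 2.x API):

```cpp
// 1. Define the kernel type (library defaults fill in the rest).
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor>;

// 2. Create a kernel instance.
Gemm gemm_op;

// 3. Prepare arguments: problem size, tensor refs, and epilogue scalars.
Gemm::Arguments args(
    {M, N, K},            // GemmCoord problem size
    {d_A, lda},           // TensorRef for A
    {d_B, ldb},           // TensorRef for B
    {d_C, ldc},           // TensorRef for C (source)
    {d_C, ldc},           // TensorRef for D (destination)
    {alpha, beta});       // epilogue: D = alpha * AB + beta * C

// 4. Check feasibility.
cutlass::Status status = gemm_op.can_implement(args);
if (status != cutlass::Status::kSuccess) { /* handle unsupported problem */ }

// 5-6. Query workspace and initialize (often zero bytes for a plain GEMM).
size_t workspace_bytes = Gemm::get_workspace_size(args);
// ... allocate workspace_bytes of device memory if nonzero ...

// 7. Initialize and run, combined via the function call operator.
status = gemm_op(args, /*workspace=*/nullptr, /*stream=*/nullptr);
```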

File Organization

Key header files in the CUTLASS C++ API:
include/cutlass/
├── cutlass.h                    # Core definitions
├── gemm/
│   ├── gemm.h                  # GEMM type definitions
│   ├── device/
│   │   ├── gemm.h              # Device-level GEMM
│   │   └── gemm_universal.h    # Universal GEMM
│   └── kernel/                 # Kernel implementations
├── conv/
│   ├── convolution.h           # Convolution types
│   └── device/                 # Device-level convolution
├── epilogue/
│   └── thread/                 # Epilogue operations
└── layout/
    └── matrix.h                # Layout types

include/cute/
├── tensor.hpp                  # CuTe tensor abstraction
├── layout.hpp                  # CuTe layouts
└── algorithm/                  # CuTe algorithms

Architecture Support

CUTLASS supports multiple NVIDIA GPU architectures, selected through architecture tags such as cutlass::arch::Sm70 (Volta), cutlass::arch::Sm75 (Turing), cutlass::arch::Sm80 (Ampere), and cutlass::arch::Sm90 (Hopper).

Operator Classes

Operator class tags select the type of compute units a kernel targets:
cutlass::arch::OpClassSimt (tag) - CUDA cores (SIMT), available on all architectures
cutlass::arch::OpClassTensorOp (tag) - Tensor Cores, available on Volta and later
cutlass::arch::OpClassWmmaTensorOp (tag) - Tensor Cores accessed through the CUDA WMMA API, available on Volta and later

Next Steps

GEMM API - Learn about the GEMM API and usage patterns
CuTe Library - Explore the modern CuTe tensor abstraction
Epilogue Operations - Add custom operations to your kernels
Convolution API - Perform efficient convolution operations
