
Introduction

CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM), convolution, and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN.

Core Components

The CUTLASS C++ API is organized into several key namespaces and components:

Namespace Organization

cutlass::                   // Top-level namespace
cutlass::gemm::             // GEMM operations
cutlass::conv::             // Convolution operations
cutlass::epilogue::         // Epilogue operations
cutlass::arch::             // Architecture-specific components
cute::                      // CuTe library (modern tensor abstraction)

Device-Level APIs

CUTLASS provides device-level operator templates that can be instantiated and launched from host code:
cutlass::gemm::device::Gemm (template class)
    Primary GEMM device operator for matrix multiplication
cutlass::gemm::device::GemmUniversal (template class)
    Universal GEMM supporting batched, grouped, and split-K modes
cutlass::conv::device::ImplicitGemmConvolution (template class)
    Implicit GEMM convolution device operator
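As an illustration, the primary operator can be instantiated with only element and layout types, leaving the remaining template parameters at their library defaults (a minimal sketch; the exact defaults vary between CUTLASS versions):

```cpp
// Minimal sketch: a single-precision, column-major GEMM operator.
// Remaining template parameters (accumulator type, architecture,
// tile shapes, epilogue) fall back to library defaults.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C
```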

Template Design Pattern

CUTLASS uses extensive compile-time template metaprogramming to optimize kernels:
// Element types
using ElementA = float;
using ElementB = float;
using ElementC = float;
using ElementAccumulator = float;

// Layout types
using LayoutA = cutlass::layout::ColumnMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::ColumnMajor;

// Architecture tag
using ArchTag = cutlass::arch::Sm80;

// Operator class
using OperatorClass = cutlass::arch::OpClassTensorOp;
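These aliases typically compose directly into a device operator. A sketch of one way they fit together, following the CUTLASS 2.x device::Gemm template parameter order (the tile and instruction shapes here are illustrative and must be valid for the chosen architecture):

```cpp
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape        = cutlass::gemm::GemmShape<64, 64, 32>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;  // TF32 Tensor Core MMA on SM80

using Gemm = cutlass::gemm::device::Gemm<
    ElementA, LayoutA,
    ElementB, LayoutB,
    ElementC, LayoutC,
    ElementAccumulator,
    OperatorClass, ArchTag,
    ThreadblockShape, WarpShape, InstructionShape>;
```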

Hierarchical Decomposition

CUTLASS organizes computation at multiple levels:

Device Level - Kernel launch and grid management
Threadblock Level - Cooperative thread array (CTA) tiling
Warp Level - Warp-level matrix operations
Thread Level - Per-thread computation and data movement
Instruction Level - Hardware-accelerated instructions (Tensor Cores)

Tile Shapes

Performance is controlled by tile sizes at each level:
ThreadblockShape (GemmShape) - Threadblock-level tile size (e.g., GemmShape<128, 128, 32>)
WarpShape (GemmShape) - Warp-level tile size (e.g., GemmShape<64, 64, 32>)
InstructionShape (GemmShape) - Instruction-level tile size (e.g., GemmShape<16, 8, 16> for Tensor Cores)

Common Data Types

CUTLASS defines several core data structures:

GemmCoord

cutlass::gemm::GemmCoord problem_size(M, N, K);
Represents the dimensions of a GEMM problem: C (M×N) = A (M×K) × B (K×N).
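Its components are read back through accessor functions (a small sketch; m(), n(), and k() are the accessors in the CUTLASS API):

```cpp
cutlass::gemm::GemmCoord problem_size(1024, 512, 256);
int m = problem_size.m();   // 1024: rows of C (and A)
int n = problem_size.n();   // 512: columns of C (and B)
int k = problem_size.k();   // 256: reduction dimension shared by A and B
```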

TensorRef

cutlass::TensorRef<ElementType, LayoutType> tensor_ref(ptr, leading_dimension);
A lightweight reference to a tensor in memory with layout information.

Status

cutlass::Status status = kernel(args);
if (status != cutlass::Status::kSuccess) {
  // Handle error
}
Return codes for CUTLASS operations.

API Usage Pattern

The typical workflow for using CUTLASS device operators:
1. Define the kernel type - Instantiate a CUTLASS template with desired types and parameters
2. Create kernel instance - Construct an instance of the kernel operator
3. Prepare arguments - Construct an Arguments structure with problem size, tensor references, and parameters
4. Check feasibility - Call can_implement() to verify the kernel can execute the problem
5. Get workspace size - Call get_workspace_size() if the kernel requires temporary storage
6. Initialize kernel - Call initialize() with arguments and workspace
7. Execute kernel - Call run() or the function call operator to launch the kernel

The initialize() and run() steps can be combined using the function call operator: status = kernel(args, workspace, stream);
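Put together, the steps above look roughly like the following sketch against a single-precision device::Gemm (the pointers d_A/d_B/d_C, leading dimensions, and alpha/beta are placeholders; the Arguments layout follows the CUTLASS 2.x API):

```cpp
// 1. Define the kernel type (library defaults fill in the rest).
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor>;

// 2. Create a kernel instance.
Gemm gemm_op;

// 3. Prepare arguments: problem size, tensor refs, and epilogue scalars.
Gemm::Arguments args(
    {M, N, K},            // GemmCoord problem size
    {d_A, lda},           // TensorRef for A
    {d_B, ldb},           // TensorRef for B
    {d_C, ldc},           // TensorRef for C (source)
    {d_C, ldc},           // TensorRef for D (destination)
    {alpha, beta});       // epilogue: D = alpha * AB + beta * C

// 4. Check feasibility.
cutlass::Status status = gemm_op.can_implement(args);
if (status != cutlass::Status::kSuccess) { /* handle unsupported problem */ }

// 5-6. Query workspace and initialize (often zero bytes for a plain GEMM).
size_t workspace_bytes = Gemm::get_workspace_size(args);
// ... allocate workspace_bytes of device memory if nonzero ...

// 7. Initialize and run, combined via the function call operator.
status = gemm_op(args, /*workspace=*/nullptr, /*stream=*/nullptr);
```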

File Organization

Key header files in the CUTLASS C++ API:
include/cutlass/
├── cutlass.h                    # Core definitions
├── gemm/
│   ├── gemm.h                  # GEMM type definitions
│   ├── device/
│   │   ├── gemm.h              # Device-level GEMM
│   │   └── gemm_universal.h    # Universal GEMM
│   └── kernel/                 # Kernel implementations
├── conv/
│   ├── convolution.h           # Convolution types
│   └── device/                 # Device-level convolution
├── epilogue/
│   └── thread/                 # Epilogue operations
└── layout/
    └── matrix.h                # Layout types

include/cute/
├── tensor.hpp                  # CuTe tensor abstraction
├── layout.hpp                  # CuTe layouts
└── algorithm/                  # CuTe algorithms

Architecture Support

CUTLASS supports multiple NVIDIA GPU architectures, selected through architecture tags such as cutlass::arch::Sm70 (Volta), cutlass::arch::Sm75 (Turing), cutlass::arch::Sm80 (Ampere), and cutlass::arch::Sm90 (Hopper).

Operator Classes

Operator class tags select the type of compute units a kernel targets:
cutlass::arch::OpClassSimt (tag) - CUDA cores (SIMT), available on all architectures
cutlass::arch::OpClassTensorOp (tag) - Tensor Cores, available on Volta and later
cutlass::arch::OpClassWmmaTensorOp (tag) - Tensor Cores accessed through the CUDA WMMA API, available on Volta and later

Next Steps

GEMM API - Learn about the GEMM API and usage patterns
CuTe Library - Explore the modern CuTe tensor abstraction
Epilogue Operations - Add custom operations to your kernels
Convolution API - Perform efficient convolution operations
