This guide will help you compile and run your first CUTLASS GEMM kernel in both C++ and Python.

Prerequisites

Before you begin, ensure you have:
  • NVIDIA GPU with compute capability 7.0+ (Volta or newer)
  • CUDA Toolkit 11.4 or higher (12.8+ recommended)
  • C++17 compatible compiler (GCC 7.5+, Clang 7+, or MSVC)
  • CMake 3.19+ (for C++ examples)
  • Python 3.8+ (for Python interface)
For best performance and latest features, use CUDA Toolkit 12.8 or higher with a Hopper (H100) or Blackwell (B200/RTX 50 series) GPU.
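Before going further, you can sanity-check your environment from Python. A minimal sketch (only the Python version is asserted; the nvcc check just reports whether the compiler is on your PATH):

```python
# Environment sanity check for the prerequisites above.
import shutil
import sys

assert sys.version_info >= (3, 8), "CUTLASS Python interface requires Python 3.8+"
print("Python OK:", sys.version.split()[0])
print("nvcc on PATH:", shutil.which("nvcc") is not None)
```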

C++ quick start

CUTLASS is a header-only library for C++, making it easy to integrate into your projects.
1. Clone the repository

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
export CUTLASS_PATH=$(pwd)
2. Set up environment variables

export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
# Example: export CUDACXX=/usr/local/cuda/bin/nvcc
3. Create a basic GEMM program

Create a file named my_gemm.cu:
#include <iostream>
#include "cutlass/gemm/device/gemm.h"

int main() {
  // Matrix dimensions: M x K * K x N = M x N
  int M = 1024;
  int N = 1024;
  int K = 1024;

  // Define the GEMM operation
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using CutlassGemm = cutlass::gemm::device::Gemm<
    float,        // Data type of A matrix
    ColumnMajor,  // Layout of A matrix
    float,        // Data type of B matrix
    ColumnMajor,  // Layout of B matrix
    float,        // Data type of C matrix
    ColumnMajor   // Layout of C matrix
  >;

  // Allocate device memory
  float *A, *B, *C;
  size_t size_A = M * K * sizeof(float);
  size_t size_B = K * N * sizeof(float);
  size_t size_C = M * N * sizeof(float);
  
  cudaMalloc(&A, size_A);
  cudaMalloc(&B, size_B);
  cudaMalloc(&C, size_C);

  // Initialize matrices (simplified - you'd fill with real data)
  cudaMemset(A, 0, size_A);
  cudaMemset(B, 0, size_B);
  cudaMemset(C, 0, size_C);

  // Set up GEMM arguments: D = alpha * A * B + beta * C
  float alpha = 1.0f;
  float beta = 0.0f;
  
  CutlassGemm gemm_op;
  CutlassGemm::Arguments args(
    {M, N, K},     // Problem dimensions
    {A, M},        // TensorRef for A (pointer + leading dimension; M for a column-major M x K matrix)
    {B, K},        // TensorRef for B (leading dimension K)
    {C, M},        // TensorRef for C (leading dimension M)
    {C, M},        // TensorRef for D (output, written in place over C)
    {alpha, beta}  // Epilogue scalars
  );

  // Launch the GEMM kernel
  cutlass::Status status = gemm_op(args);
  
  if (status != cutlass::Status::kSuccess) {
    std::cerr << "CUTLASS GEMM kernel failed" << std::endl;
    return -1;
  }

  // Wait for completion
  cudaDeviceSynchronize();

  std::cout << "GEMM completed successfully!" << std::endl;

  // Cleanup
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);

  return 0;
}
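The `{pointer, stride}` pairs passed to `Arguments` are TensorRefs: a raw pointer plus a leading dimension. For a column-major M x K matrix, element (row, col) lives at flat offset row + col * ld with ld >= M, which is why the corrected arguments use M, K, and M as the strides. A small Python sketch of that addressing (illustration only, not CUTLASS API):

```python
# Column-major addressing: column j of an M x K matrix occupies flat
# indices [j * ld, j * ld + M), so element (i, j) sits at i + j * ld.
def col_major_offset(row: int, col: int, ld: int) -> int:
    return row + col * ld

M, K = 4, 3
ld = M                                 # tightly packed columns
flat = list(range(M * K))              # flat[k] == k, for easy checking
assert flat[col_major_offset(2, 1, ld)] == 6   # row 2 of column 1
print("offset of (2, 1):", col_major_offset(2, 1, ld))
```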
4. Compile the program

nvcc -I${CUTLASS_PATH}/include \
     -std=c++17 \
     -gencode arch=compute_80,code=sm_80 \
     my_gemm.cu -o my_gemm
Replace compute_80,sm_80 with your GPU’s compute capability:
  • Ampere (A100): compute_80,sm_80
  • Ada (RTX 40xx): compute_89,sm_89
  • Hopper (H100): compute_90a,sm_90a (note the “a” suffix)
  • Blackwell (B200): compute_100a,sm_100a
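The table above can be expressed as a small helper for build scripts. This is an illustrative sketch (the function name is ours; the "a"-suffix cases are taken from the list above):

```python
# Map a compute capability string (e.g. from `nvidia-smi --query-gpu=compute_cap`)
# to the matching nvcc -gencode value. Hopper (9.0) and Blackwell (10.0) parts
# take the "a" suffix to enable their architecture-specific CUTLASS kernels.
def gencode_flag(compute_cap: str) -> str:
    major, minor = compute_cap.split(".")
    arch = f"{major}{minor}"
    suffix = "a" if arch in ("90", "100") else ""
    return f"arch=compute_{arch}{suffix},code=sm_{arch}{suffix}"

print(gencode_flag("8.0"))   # arch=compute_80,code=sm_80
print(gencode_flag("9.0"))   # arch=compute_90a,code=sm_90a
```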
5. Run the program

./my_gemm
You should see:
GEMM completed successfully!
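The kernel you just ran computes D = alpha * A * B + beta * C. A tiny NumPy sketch of those semantics, with small shapes for illustration (the program above used M = N = K = 1024 with alpha = 1 and beta = 0):

```python
import numpy as np

# Reference semantics of the GEMM: D = alpha * (A @ B) + beta * C.
M, N, K = 4, 3, 5
alpha, beta = 1.0, 0.0
A = np.ones((M, K), dtype=np.float32)
B = np.ones((K, N), dtype=np.float32)
C = np.zeros((M, N), dtype=np.float32)

D = alpha * (A @ B) + beta * C
print(D[0, 0])   # each entry sums K products of ones -> 5.0
```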

Building with CMake

For larger projects, use CMake:
CMakeLists.txt
cmake_minimum_required(VERSION 3.19)
project(MyGEMM LANGUAGES CXX CUDA)

set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CXX_STANDARD 17)

find_package(CUDAToolkit REQUIRED)

set(CUTLASS_PATH "path/to/cutlass" CACHE PATH "CUTLASS root directory")
include_directories(${CUTLASS_PATH}/include)

add_executable(my_gemm my_gemm.cu)
set_target_properties(my_gemm PROPERTIES CUDA_ARCHITECTURES "80")
Build:
mkdir build && cd build
cmake .. -DCUTLASS_PATH=${CUTLASS_PATH}
make
./my_gemm

Python quick start

The CUTLASS Python interface provides a high-level API for running CUTLASS kernels from Python.
1. Install the CUTLASS Python package

pip install nvidia-cutlass
Or install from source:
cd ${CUTLASS_PATH}
pip install .
2. Create a Python GEMM script

Create a file named my_gemm.py:
import cutlass
import numpy as np

# Create a GEMM operation plan
# This will use FP16 (half precision) with row-major layout
plan = cutlass.op.Gemm(
    element=np.float16,
    layout=cutlass.LayoutType.RowMajor
)

# Create input matrices
M, N, K = 1024, 1024, 1024

# Initialize with random values
A = np.random.randn(M, K).astype(np.float16)
B = np.random.randn(K, N).astype(np.float16)
C = np.zeros((M, N), dtype=np.float16)
D = np.zeros((M, N), dtype=np.float16)  # Output

# Run the GEMM: D = A @ B + C
plan.run(A, B, C, D)

print("GEMM completed successfully!")
print(f"Output shape: {D.shape}")
print(f"Output sample: {D[0, :5]}")

# Verify with NumPy
reference = A @ B + C
error = np.max(np.abs(D - reference))
print(f"Max error vs NumPy: {error}")
3. Run the Python script

python my_gemm.py
Expected output (exact values will vary with the random inputs):
GEMM completed successfully!
Output shape: (1024, 1024)
Output sample: [-12.5  8.25 -15.75 ...]
Max error vs NumPy: 0.0625

Advanced Python example

Here’s a more advanced example with mixed precision and custom epilogue:
import cutlass
import numpy as np

# FP16 inputs, FP32 accumulation
plan = cutlass.op.Gemm(
    element_A=np.float16,
    element_B=np.float16,
    element_C=np.float32,
    element_D=np.float32,
    element_accumulator=np.float32,
    layout=cutlass.LayoutType.RowMajor
)

# Create matrices
M, N, K = 2048, 2048, 2048
A = np.random.randn(M, K).astype(np.float16)
B = np.random.randn(K, N).astype(np.float16)
C = np.random.randn(M, N).astype(np.float32)
D = np.zeros((M, N), dtype=np.float32)

# Run with custom alpha and beta: D = 2.0 * A @ B + 1.5 * C
plan.run(A, B, C, D, alpha=2.0, beta=1.5)

print(f"Completed mixed-precision GEMM: {M}x{N}x{K}")
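Accumulating in FP32 matters because an FP16 running sum stalls once it grows large. A quick NumPy illustration (not CUTLASS code):

```python
import numpy as np

# FP16 represents integers exactly only up to 2048; above that the spacing
# between representable values is 2, so adding 1.0 no longer changes the sum.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(3000):
    acc16 = np.float16(acc16 + np.float16(1.0))
    acc32 = np.float32(acc32 + np.float32(1.0))

print(float(acc16))  # 2048.0 -- the FP16 accumulator saturates
print(float(acc32))  # 3000.0 -- FP32 accumulation stays exact
```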

CuTe DSL quick start

CuTe DSL allows you to write CUDA kernels directly in Python.
1. Set up CuTe DSL

cd ${CUTLASS_PATH}/python/CuTeDSL
./setup.sh
2. Write a simple kernel

import cutlass.cute as cute
import numpy as np

# Define a simple elementwise addition kernel
@cute.kernel
def add_kernel(A, B, C):
    # Get thread index
    tid = cute.threadIdx.x + cute.blockIdx.x * cute.blockDim.x
    
    # Perform addition
    if tid < A.size:
        C[tid] = A[tid] + B[tid]

# Allocate and run
N = 1024
A = np.arange(N, dtype=np.float32)
B = np.arange(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

# Launch kernel
threads_per_block = 256
blocks = (N + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](A, B, C)

print(f"Sum computed: {C[:5]}")
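The grid size above uses ceiling division so every element is covered by at least one thread. As a standalone check:

```python
# Ceiling division: the smallest block count whose combined threads cover n elements.
def num_blocks(n: int, threads_per_block: int) -> int:
    return (n + threads_per_block - 1) // threads_per_block

assert num_blocks(1024, 256) == 4   # divides evenly
assert num_blocks(1025, 256) == 5   # one extra block for the remainder
print(num_blocks(1024, 256), num_blocks(1025, 256))
```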

Next steps

  • Installation guide: detailed installation instructions for all platforms
  • C++ examples: explore 100+ example kernels
  • Python examples: Jupyter notebooks with Python examples
  • Performance profiling: learn to profile and optimize kernels

Common issues

Out-of-memory errors
Reduce matrix dimensions or batch size. CUTLASS kernels can be memory-intensive for large problems.

Compilation errors about unsupported C++ features
Ensure you’re compiling with C++17 or later:
nvcc -std=c++17 ...

Kernel fails to launch or returns an error status
Verify your GPU’s compute capability matches the -gencode flags. For Hopper and Blackwell, use the “a” suffix (e.g., sm_90a).

Python import or runtime errors
Ensure your cuda-python version matches your CUDA Toolkit version:
pip install cuda-python==12.8.0  # Match your CUDA version
