CUTLASS provides both a C++ header-only library and a Python package. Choose the installation method that best fits your workflow.

System requirements

Minimum requirements

  • GPU: NVIDIA GPU with compute capability 7.0 or higher (Volta architecture or newer)
  • CUDA Toolkit: Version 11.4 or higher
  • Compiler: C++17-compatible compiler
    • GCC 7.5.0 or higher (GCC 8.5.0 has known issues; use 7.5 or 9+)
    • Clang 7.0 or higher
    • MSVC 2019 or higher
  • CMake: Version 3.19 or higher (for building examples and tests)
  • Python: Version 3.8 or higher (for the Python interface)

Recommended requirements

  • CUDA Toolkit: Version 12.8 or higher
  • GPU: Hopper (H100, H200) or Blackwell (B200, B300, RTX 50 series)
  • Compiler: GCC 11.2+ or Clang 14+
  • Python: Version 3.9 or higher
CUTLASS 4.4.1 is optimized for CUDA 12.8+ and performs best on Hopper and Blackwell architectures with access to the latest Tensor Core features.
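
As a quick sanity check, the version minimums above can be scripted. The sketch below compares dotted version strings against the stated minimums; meets_minimum is a hypothetical helper for illustration, not part of CUTLASS:

```python
# Hedged sketch: compare dotted version strings against CUTLASS's
# stated minimums (e.g. CUDA 11.4+, Python 3.8+). Not a CUTLASS API.
def meets_minimum(version: str, minimum: str) -> bool:
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(version) >= parse(minimum)

print(meets_minimum("12.8", "11.4"))  # CUDA Toolkit: True
print(meets_minimum("3.7", "3.8"))    # Python: False (too old)
```

Tuple comparison handles multi-digit components correctly (e.g. "11.10" > "11.4"), which naive string comparison would get wrong.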

Supported platforms

Operating systems

Operating System   Compiler     Status
Ubuntu 18.04       GCC 7.5.0    Supported
Ubuntu 20.04       GCC 10.3.0   Supported
Ubuntu 22.04       GCC 11.2.0   Supported
Windows            MSVC 2019+   Known issues
Windows builds have known issues in CUTLASS 4.x. The CUTLASS team is working on fixes. Linux is recommended for production use.

GPU architectures

Architecture      Compute Capability   GPUs                   Min CUDA
Volta             7.0                  V100, Titan V          11.4
Turing            7.5                  RTX 20 series, T4      11.4
Ampere            8.0, 8.6             A100, RTX 30 series    11.4
Ada               8.9                  RTX 40 series, L40     11.8
Hopper            9.0                  H100, H200             11.8
Blackwell SM100   10.0                 B200                   12.8
Blackwell SM103   10.3                 B300                   13.0
Blackwell SM120   12.0                 RTX 50 series          12.8
Hopper and Blackwell architectures require the “a” suffix for architecture-accelerated features (e.g., sm_90a, sm_100a) to enable advanced Tensor Core instructions.

C++ installation

CUTLASS is a header-only library: no compilation or installation step is required to use it in your projects.
1. Clone the repository

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
Or download a specific release:
wget https://github.com/NVIDIA/cutlass/archive/refs/tags/v4.4.1.tar.gz
tar -xzf v4.4.1.tar.gz
cd cutlass-4.4.1
2. Set environment variables

Set CUTLASS_PATH for easy reference:
export CUTLASS_PATH=$(pwd)
Set CUDACXX to point to your NVCC compiler:
export CUDACXX=/usr/local/cuda/bin/nvcc
Add to your .bashrc or .zshrc for persistence:
echo 'export CUTLASS_PATH=/path/to/cutlass' >> ~/.bashrc
echo 'export CUDACXX=/usr/local/cuda/bin/nvcc' >> ~/.bashrc
source ~/.bashrc
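
To confirm the variables resolve to real paths, a small hedged check can help; check_var is an illustrative helper, not a CUTLASS tool:

```shell
# Hedged sketch: warn if the exported paths do not exist on this machine.
check_var() {
  name="$1"; val="$2"
  if [ -n "$val" ] && [ -e "$val" ]; then
    echo "$name ok: $val"
  else
    echo "$name missing or invalid: '$val'"
  fi
}
check_var CUTLASS_PATH "$CUTLASS_PATH"
check_var CUDACXX "$CUDACXX"
```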
3. Include in your project

Add CUTLASS headers to your include path.
Direct compilation:
nvcc -I${CUTLASS_PATH}/include -std=c++17 your_code.cu -o your_program
CMake project:
CMakeLists.txt
set(CUTLASS_PATH "/path/to/cutlass" CACHE PATH "CUTLASS root")
include_directories(${CUTLASS_PATH}/include)
In your source code:
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"
4. Verify installation (optional)

Build and run CUTLASS examples to verify your setup:
cd ${CUTLASS_PATH}
mkdir build && cd build

# Build for your GPU architecture (example: Ampere A100)
cmake .. -DCUTLASS_NVCC_ARCHS=80

# Build a basic example
make 00_basic_gemm

# Run the example
./examples/00_basic_gemm/00_basic_gemm
Expected output:
CUTLASS GEMM passed!

CMake integration

For projects using CMake, you can use CUTLASS as an imported target:
find_package(NvidiaCutlass REQUIRED)

add_executable(my_app main.cu)
target_link_libraries(my_app PRIVATE nvidia::cutlass::cutlass)
Or install CUTLASS system-wide:
cd ${CUTLASS_PATH}/build
cmake .. -DCUTLASS_ENABLE_HEADERS_ONLY=ON
sudo make install
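
Putting the imported-target route together, a minimal consuming project might look like the following sketch; my_cutlass_app, my_app, and main.cu are placeholder names, and the system-wide install step must have been run first:

```cmake
cmake_minimum_required(VERSION 3.19)
project(my_cutlass_app LANGUAGES CXX CUDA)

# CUTLASS requires a C++17 host and device compiler
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)

# Imported target provided by the CUTLASS install
find_package(NvidiaCutlass REQUIRED)

add_executable(my_app main.cu)
target_link_libraries(my_app PRIVATE nvidia::cutlass::cutlass)
```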

Architecture-specific builds

To compile for specific GPU architectures and reduce build time:
# Single architecture (Ampere A100)
cmake .. -DCUTLASS_NVCC_ARCHS=80

# Multiple architectures
cmake .. -DCUTLASS_NVCC_ARCHS="80;89;90a"

# Hopper with architecture-accelerated features
cmake .. -DCUTLASS_NVCC_ARCHS=90a

# Blackwell datacenter
cmake .. -DCUTLASS_NVCC_ARCHS=100a
Always use the “a” suffix for Hopper (90a) and Blackwell (100a, 103a) when using Tensor Core features. Without the suffix, kernels will fail at runtime.

Python installation

The CUTLASS Python interface is distributed as the nvidia-cutlass package.
1. Install from PyPI

The easiest way to install:
pip install nvidia-cutlass
The package name is nvidia-cutlass. Other packages named cutlass are not affiliated with NVIDIA CUTLASS.
2. Install matching cuda-python

Ensure cuda-python version matches your CUDA Toolkit:
# For CUDA 11.8
pip install cuda-python==11.8.0

# For CUDA 12.0
pip install cuda-python==12.0.0

# For CUDA 12.8
pip install cuda-python==12.8.0
Check your CUDA version:
nvcc --version
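
Only the major.minor pair needs to line up; the patch level reported by nvcc can differ from the cuda-python patch level. A small sketch of that comparison (versions_match is a hypothetical helper, not part of either package):

```python
# Hedged sketch: cuda-python should match the CUDA Toolkit on
# major.minor; patch versions are allowed to differ.
def versions_match(toolkit: str, cuda_python: str) -> bool:
    return toolkit.split(".")[:2] == cuda_python.split(".")[:2]

print(versions_match("12.8", "12.8.0"))  # True
print(versions_match("12.8", "12.0.0"))  # False
```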
3. Verify installation

Test the installation:
import cutlass
import numpy as np

# Check version
print(f"CUTLASS version: {cutlass.__version__}")

# Run a simple GEMM
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
A = np.ones((128, 128), dtype=np.float16)
B = np.ones((128, 128), dtype=np.float16)
C = np.zeros((128, 128), dtype=np.float16)
D = np.zeros((128, 128), dtype=np.float16)

plan.run(A, B, C, D)
print("CUTLASS Python interface working!")
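
To double-check the result on the host, you can compare D against a plain NumPy reference. For the all-ones inputs above, every entry of A @ B + C should be exactly 128, which float16 represents without rounding; this check needs only NumPy, no GPU:

```python
import numpy as np

# CPU reference for the GEMM above: accumulate in float32, then cast
# back to float16, mirroring higher-precision accumulation.
A = np.ones((128, 128), dtype=np.float16)
B = np.ones((128, 128), dtype=np.float16)
C = np.zeros((128, 128), dtype=np.float16)

D_ref = (A.astype(np.float32) @ B.astype(np.float32) + C).astype(np.float16)
assert np.all(D_ref == 128)
print("reference check passed")
```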

Install from source

For development or to use the latest features:
1. Clone the repository

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
2. Set optional environment variables

export CUTLASS_PATH=$(pwd)
export CUDA_INSTALL_PATH=/usr/local/cuda
If not set, these will be inferred automatically.
3. Install the package

For regular installation:
pip install .
For development (changes reflect immediately):
pip install -e .
4. Install additional dependencies

The CUTLASS Python interface requires:
pip install "cuda-python>=11.8.0" networkx numpy pydot scipy treelib
These are installed automatically with pip install nvidia-cutlass.

Python requirements

The CUTLASS Python interface has the following dependencies:
cuda-python>=11.8.0
networkx
numpy
pydot
scipy
treelib
Compatibility:
  • Python: 3.8, 3.9, 3.10, 3.11
  • CUDA: 11.8, 12.0, 12.1, 12.8, 13.0+
  • Platforms: Linux (Ubuntu 18.04+), Windows (experimental)

CuTe DSL installation

CuTe DSL is a Python interface for writing CUDA kernels.
1. Navigate to the CuTe DSL directory

cd ${CUTLASS_PATH}/python/CuTeDSL
2. Run the setup script

For CUDA 12.x:
./setup.sh
For CUDA 13.1:
./setup.sh --cu13
3. Verify installation

import cutlass
import cutlass.cute as cute

print(f"CuTe DSL version: {cutlass.__version__}")
print(f"CUDA version: {cutlass.CUDA_VERSION}")
CuTe DSL requires CUTLASS C++ headers to be available. Set CUTLASS_PATH environment variable or install from the CUTLASS repository.

CuTe DSL features

  • CUDA Toolkit 13.1 support with GB300 (SM103) support
  • Ahead-of-Time (AoT) compilation for faster kernel loading
  • JAX integration for use with JAX workflows
  • Experimental API with fragment-free programming model
  • Automatic TMA descriptor generation for Hopper/Blackwell

Docker installation

The easiest way to get started with a complete environment:
# Pull NVIDIA PyTorch container with CUDA and development tools
docker pull nvcr.io/nvidia/pytorch:23.08-py3

# Run container with GPU access
docker run --gpus all -it --rm \
  -v /path/to/cutlass:/workspace/cutlass \
  nvcr.io/nvidia/pytorch:23.08-py3

# Inside container, install CUTLASS Python
cd /workspace/cutlass
pip install .
This container includes:
  • CUDA Toolkit
  • cuDNN and cuBLAS libraries
  • Python with PyTorch
  • Development tools (gcc, cmake, etc.)

Environment setup

Add these to your shell configuration for convenience:
~/.bashrc
# CUTLASS paths
export CUTLASS_PATH=/path/to/cutlass
export CUDA_INSTALL_PATH=/usr/local/cuda
export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

# Add CUDA to path
export PATH=${CUDA_INSTALL_PATH}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_INSTALL_PATH}/lib64:$LD_LIBRARY_PATH

# For multi-GPU systems, specify visible devices
export CUDA_VISIBLE_DEVICES=0

Build configuration

For CMake builds, create a configuration file:
cutlass_config.cmake
# Target architectures
set(CUTLASS_NVCC_ARCHS "80;89;90a" CACHE STRING "")

# Enable examples and tests
set(CUTLASS_ENABLE_EXAMPLES ON CACHE BOOL "")
set(CUTLASS_ENABLE_TESTS ON CACHE BOOL "")

# Disable profiler for faster builds
set(CUTLASS_ENABLE_PROFILER OFF CACHE BOOL "")
Use with:
cmake .. -C cutlass_config.cmake

Troubleshooting

nvcc not found

Ensure CUDA is installed and nvcc is in your PATH:
which nvcc
nvcc --version
If not found, install CUDA from NVIDIA’s website.
cuda-python version mismatch

Verify cuda-python matches your CUDA version:
import cuda
print(cuda.__version__)
Reinstall with correct version:
pip uninstall cuda-python
pip install cuda-python==12.8.0  # Match your CUDA version
GCC compatibility issues

Check your GCC version:
gcc --version
GCC 8.5.0 has known issues. Use GCC 7.5 or 9+:
# Install alternative GCC version
sudo apt-get install gcc-9 g++-9

# Use with NVCC
nvcc -ccbin g++-9 ...
CMake version too old

Install a newer CMake:
# Remove old version
sudo apt-get remove cmake

# Install from Kitware
wget https://github.com/Kitware/CMake/releases/download/v3.27.0/cmake-3.27.0-linux-x86_64.sh
sudo sh cmake-3.27.0-linux-x86_64.sh --prefix=/usr/local --skip-license
Architecture mismatch

Ensure you’re compiling for the correct compute capability:
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
Use matching architecture flag:
  • 8.0 → -gencode arch=compute_80,code=sm_80
  • 9.0 → -gencode arch=compute_90a,code=sm_90a (note the “a”)
  • 10.0 → -gencode arch=compute_100a,code=sm_100a
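
The mapping above can be scripted. This hedged sketch (arch_flag is an illustrative function, not an NVIDIA tool) appends the “a” suffix for the Hopper and Blackwell capabilities that require it:

```shell
# Hedged sketch: turn a compute capability from nvidia-smi into the
# matching -gencode flag, adding "a" for Hopper/Blackwell parts.
arch_flag() {
  cc="$(echo "$1" | tr -d '.')"    # "9.0" -> "90"
  case "$cc" in
    90|100|103|120) suffix="a" ;;  # capabilities assumed to need "a"
    *)              suffix=""  ;;
  esac
  echo "-gencode arch=compute_${cc}${suffix},code=sm_${cc}${suffix}"
}
arch_flag 8.0   # -gencode arch=compute_80,code=sm_80
arch_flag 9.0   # -gencode arch=compute_90a,code=sm_90a
```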
Limited Windows support

Windows support is currently limited. For production use:
  • Use WSL2 with Ubuntu
  • Use Docker Desktop with Linux containers
  • Use a Linux system or VM

Next steps

  • Quick start guide: Build your first GEMM kernel
  • C++ examples: Explore example kernels
  • Python examples: Python notebooks and scripts
  • Performance guide: Profile and optimize kernels

Getting help

If you encounter issues, check the Troubleshooting section above, search the existing issues in the NVIDIA/cutlass GitHub repository, or open a new one there.
