CUTLASS provides both a C++ header-only library and a Python package. Choose the installation method that best fits your workflow.

System requirements

Minimum requirements

  • GPU: NVIDIA GPU with compute capability 7.0 or higher (Volta architecture or newer)
  • CUDA Toolkit: Version 11.4 or higher
  • Compiler: C++17-compatible compiler
    • GCC 7.5.0 or higher (GCC 8.5.0 has known issues; use 7.5 or 9+)
    • Clang 7.0 or higher
    • MSVC 2019 or higher
  • CMake: Version 3.19 or higher (for building examples and tests)
  • Python: Version 3.8 or higher (for the Python interface)

Recommended requirements

  • CUDA Toolkit: Version 12.8 or higher
  • GPU: Hopper (H100, H200) or Blackwell (B200, B300, RTX 50 series)
  • Compiler: GCC 11.2+ or Clang 14+
  • Python: Version 3.9 or higher
CUTLASS 4.4.1 is optimized for CUDA 12.8+ and performs best on Hopper and Blackwell architectures with access to the latest Tensor Core features.
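
As a quick sanity check, the version minimums above can be scripted. The sketch below compares dotted version strings against the stated minimums; meets_minimum is a hypothetical helper for illustration, not part of CUTLASS:

```python
# Hedged sketch: compare dotted version strings against CUTLASS's
# stated minimums (e.g. CUDA 11.4+, Python 3.8+). Not a CUTLASS API.
def meets_minimum(version: str, minimum: str) -> bool:
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(version) >= parse(minimum)

print(meets_minimum("12.8", "11.4"))  # CUDA Toolkit: True
print(meets_minimum("3.7", "3.8"))    # Python: False (too old)
```

Tuple comparison handles multi-digit components correctly (e.g. "11.10" > "11.4"), which naive string comparison would get wrong.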

Supported platforms

Operating systems

Operating System   Compiler     Status
Ubuntu 18.04       GCC 7.5.0    Supported
Ubuntu 20.04       GCC 10.3.0   Supported
Ubuntu 22.04       GCC 11.2.0   Supported
Windows            MSVC 2019+   Known issues
Windows builds have known issues in CUTLASS 4.x. The CUTLASS team is working on fixes. Linux is recommended for production use.

GPU architectures

Architecture      Compute Capability   GPUs                   Min CUDA
Volta             7.0                  V100, Titan V          11.4
Turing            7.5                  RTX 20 series, T4      11.4
Ampere            8.0, 8.6             A100, RTX 30 series    11.4
Ada               8.9                  RTX 40 series, L40     11.8
Hopper            9.0                  H100, H200             11.8
Blackwell SM100   10.0                 B200                   12.8
Blackwell SM103   10.3                 B300                   13.0
Blackwell SM120   12.0                 RTX 50 series          12.8
Hopper and Blackwell architectures require the “a” suffix for architecture-accelerated features (e.g., sm_90a, sm_100a) to enable advanced Tensor Core instructions.

C++ installation

CUTLASS is a header-only library: no compilation or installation step is required to use it in your projects.
1. Clone the repository

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
Or download a specific release:
wget https://github.com/NVIDIA/cutlass/archive/refs/tags/v4.4.1.tar.gz
tar -xzf v4.4.1.tar.gz
cd cutlass-4.4.1
2. Set environment variables

Set CUTLASS_PATH for easy reference:
export CUTLASS_PATH=$(pwd)
Set CUDACXX to point to your NVCC compiler:
export CUDACXX=/usr/local/cuda/bin/nvcc
Add to your .bashrc or .zshrc for persistence:
echo 'export CUTLASS_PATH=/path/to/cutlass' >> ~/.bashrc
echo 'export CUDACXX=/usr/local/cuda/bin/nvcc' >> ~/.bashrc
source ~/.bashrc
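
To confirm the variables resolve to real paths, a small hedged check can help; check_var is an illustrative helper, not a CUTLASS tool:

```shell
# Hedged sketch: warn if the exported paths do not exist on this machine.
check_var() {
  name="$1"; val="$2"
  if [ -n "$val" ] && [ -e "$val" ]; then
    echo "$name ok: $val"
  else
    echo "$name missing or invalid: '$val'"
  fi
}
check_var CUTLASS_PATH "$CUTLASS_PATH"
check_var CUDACXX "$CUDACXX"
```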
3. Include in your project

Add CUTLASS headers to your include path.
Direct compilation:
nvcc -I${CUTLASS_PATH}/include -std=c++17 your_code.cu -o your_program
CMake project:
CMakeLists.txt
set(CUTLASS_PATH "/path/to/cutlass" CACHE PATH "CUTLASS root")
include_directories(${CUTLASS_PATH}/include)
In your source code:
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"
4. Verify installation (optional)

Build and run CUTLASS examples to verify your setup:
cd ${CUTLASS_PATH}
mkdir build && cd build

# Build for your GPU architecture (example: Ampere A100)
cmake .. -DCUTLASS_NVCC_ARCHS=80

# Build a basic example
make 00_basic_gemm

# Run the example
./examples/00_basic_gemm/00_basic_gemm
Expected output:
CUTLASS GEMM passed!

CMake integration

For projects using CMake, you can use CUTLASS as an imported target:
find_package(NvidiaCutlass REQUIRED)

add_executable(my_app main.cu)
target_link_libraries(my_app PRIVATE nvidia::cutlass::cutlass)
Or install CUTLASS system-wide:
cd ${CUTLASS_PATH}/build
cmake .. -DCUTLASS_ENABLE_HEADERS_ONLY=ON
sudo make install
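
Putting the imported-target route together, a minimal consuming project might look like the following sketch; my_cutlass_app, my_app, and main.cu are placeholder names, and the system-wide install step must have been run first:

```cmake
cmake_minimum_required(VERSION 3.19)
project(my_cutlass_app LANGUAGES CXX CUDA)

# CUTLASS requires a C++17 host and device compiler
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)

# Imported target provided by the CUTLASS install
find_package(NvidiaCutlass REQUIRED)

add_executable(my_app main.cu)
target_link_libraries(my_app PRIVATE nvidia::cutlass::cutlass)
```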

Architecture-specific builds

To compile for specific GPU architectures and reduce build time:
# Single architecture (Ampere A100)
cmake .. -DCUTLASS_NVCC_ARCHS=80

# Multiple architectures
cmake .. -DCUTLASS_NVCC_ARCHS="80;89;90a"

# Hopper with architecture-accelerated features
cmake .. -DCUTLASS_NVCC_ARCHS=90a

# Blackwell datacenter
cmake .. -DCUTLASS_NVCC_ARCHS=100a
Always use the “a” suffix for Hopper (90a) and Blackwell (100a, 103a) when using Tensor Core features. Without the suffix, kernels will fail at runtime.

Python installation

The CUTLASS Python interface is distributed as the nvidia-cutlass package.
1. Install from PyPI

The easiest way to install:
pip install nvidia-cutlass
The package name is nvidia-cutlass. Other packages named cutlass are not affiliated with NVIDIA CUTLASS.
2. Install matching cuda-python

Ensure cuda-python version matches your CUDA Toolkit:
# For CUDA 11.8
pip install cuda-python==11.8.0

# For CUDA 12.0
pip install cuda-python==12.0.0

# For CUDA 12.8
pip install cuda-python==12.8.0
Check your CUDA version:
nvcc --version
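
Only the major.minor pair needs to line up; the patch level reported by nvcc can differ from the cuda-python patch level. A small sketch of that comparison (versions_match is a hypothetical helper, not part of either package):

```python
# Hedged sketch: cuda-python should match the CUDA Toolkit on
# major.minor; patch versions are allowed to differ.
def versions_match(toolkit: str, cuda_python: str) -> bool:
    return toolkit.split(".")[:2] == cuda_python.split(".")[:2]

print(versions_match("12.8", "12.8.0"))  # True
print(versions_match("12.8", "12.0.0"))  # False
```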
3. Verify installation

Test the installation:
import cutlass
import numpy as np

# Check version
print(f"CUTLASS version: {cutlass.__version__}")

# Run a simple GEMM
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
A = np.ones((128, 128), dtype=np.float16)
B = np.ones((128, 128), dtype=np.float16)
C = np.zeros((128, 128), dtype=np.float16)
D = np.zeros((128, 128), dtype=np.float16)

plan.run(A, B, C, D)
print("CUTLASS Python interface working!")
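
To double-check the result on the host, you can compare D against a plain NumPy reference. For the all-ones inputs above, every entry of A @ B + C should be exactly 128, which float16 represents without rounding; this check needs only NumPy, no GPU:

```python
import numpy as np

# CPU reference for the GEMM above: accumulate in float32, then cast
# back to float16, mirroring higher-precision accumulation.
A = np.ones((128, 128), dtype=np.float16)
B = np.ones((128, 128), dtype=np.float16)
C = np.zeros((128, 128), dtype=np.float16)

D_ref = (A.astype(np.float32) @ B.astype(np.float32) + C).astype(np.float16)
assert np.all(D_ref == 128)
print("reference check passed")
```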

Install from source

For development or to use the latest features:
1. Clone the repository

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
2. Set optional environment variables

export CUTLASS_PATH=$(pwd)
export CUDA_INSTALL_PATH=/usr/local/cuda
If not set, these will be inferred automatically.
3. Install the package

For regular installation:
pip install .
For development (changes reflect immediately):
pip install -e .
4. Install additional dependencies

The CUTLASS Python interface requires:
pip install "cuda-python>=11.8.0" networkx numpy pydot scipy treelib
These are installed automatically with pip install nvidia-cutlass.

Python requirements

The CUTLASS Python interface has the following dependencies:
cuda-python>=11.8.0
networkx
numpy
pydot
scipy
treelib
Compatibility:
  • Python: 3.8, 3.9, 3.10, 3.11
  • CUDA: 11.8, 12.0, 12.1, 12.8, 13.0+
  • Platforms: Linux (Ubuntu 18.04+), Windows (experimental)

CuTe DSL installation

CuTe DSL is a Python interface for writing CUDA kernels.
1. Navigate to the CuTe DSL directory

cd ${CUTLASS_PATH}/python/CuTeDSL
2. Run the setup script

For CUDA 12.x:
./setup.sh
For CUDA 13.1:
./setup.sh --cu13
3. Verify installation

import cutlass
import cutlass.cute as cute

print(f"CuTe DSL version: {cutlass.__version__}")
print(f"CUDA version: {cutlass.CUDA_VERSION}")
CuTe DSL requires CUTLASS C++ headers to be available. Set CUTLASS_PATH environment variable or install from the CUTLASS repository.

CuTe DSL features

  • CUDA Toolkit 13.1 support with GB300 (SM103) support
  • Ahead-of-Time (AoT) compilation for faster kernel loading
  • JAX integration for use with JAX workflows
  • Experimental API with fragment-free programming model
  • Automatic TMA descriptor generation for Hopper/Blackwell

Docker installation

The easiest way to get started with a complete environment:
# Pull NVIDIA PyTorch container with CUDA and development tools
docker pull nvcr.io/nvidia/pytorch:23.08-py3

# Run container with GPU access
docker run --gpus all -it --rm \
  -v /path/to/cutlass:/workspace/cutlass \
  nvcr.io/nvidia/pytorch:23.08-py3

# Inside container, install CUTLASS Python
cd /workspace/cutlass
pip install .
This container includes:
  • CUDA Toolkit
  • cuDNN and cuBLAS libraries
  • Python with PyTorch
  • Development tools (gcc, cmake, etc.)

Environment setup

Add these to your shell configuration for convenience:
~/.bashrc
# CUTLASS paths
export CUTLASS_PATH=/path/to/cutlass
export CUDA_INSTALL_PATH=/usr/local/cuda
export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

# Add CUDA to path
export PATH=${CUDA_INSTALL_PATH}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_INSTALL_PATH}/lib64:$LD_LIBRARY_PATH

# For multi-GPU systems, specify visible devices
export CUDA_VISIBLE_DEVICES=0

Build configuration

For CMake builds, create a configuration file:
cutlass_config.cmake
# Target architectures
set(CUTLASS_NVCC_ARCHS "80;89;90a" CACHE STRING "")

# Enable examples and tests
set(CUTLASS_ENABLE_EXAMPLES ON CACHE BOOL "")
set(CUTLASS_ENABLE_TESTS ON CACHE BOOL "")

# Disable profiler for faster builds
set(CUTLASS_ENABLE_PROFILER OFF CACHE BOOL "")
Use with:
cmake .. -C cutlass_config.cmake

Troubleshooting

nvcc not found

Ensure CUDA is installed and nvcc is in your PATH:
which nvcc
nvcc --version
If not found, install CUDA from NVIDIA’s website.
cuda-python version mismatch

Verify cuda-python matches your CUDA version:
import cuda
print(cuda.__version__)
Reinstall with correct version:
pip uninstall cuda-python
pip install cuda-python==12.8.0  # Match your CUDA version
GCC compatibility issues

Check your GCC version:
gcc --version
GCC 8.5.0 has known issues. Use GCC 7.5 or 9+:
# Install alternative GCC version
sudo apt-get install gcc-9 g++-9

# Use with NVCC
nvcc -ccbin g++-9 ...
CMake version too old

Install a newer CMake:
# Remove old version
sudo apt-get remove cmake

# Install from Kitware
wget https://github.com/Kitware/CMake/releases/download/v3.27.0/cmake-3.27.0-linux-x86_64.sh
sudo sh cmake-3.27.0-linux-x86_64.sh --prefix=/usr/local --skip-license
Architecture mismatch

Ensure you’re compiling for the correct compute capability:
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
Use matching architecture flag:
  • 8.0 → -gencode arch=compute_80,code=sm_80
  • 9.0 → -gencode arch=compute_90a,code=sm_90a (note the “a”)
  • 10.0 → -gencode arch=compute_100a,code=sm_100a
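
The mapping above can be scripted. This hedged sketch (arch_flag is an illustrative function, not an NVIDIA tool) appends the “a” suffix for the Hopper and Blackwell capabilities that require it:

```shell
# Hedged sketch: turn a compute capability from nvidia-smi into the
# matching -gencode flag, adding "a" for Hopper/Blackwell parts.
arch_flag() {
  cc="$(echo "$1" | tr -d '.')"    # "9.0" -> "90"
  case "$cc" in
    90|100|103|120) suffix="a" ;;  # capabilities assumed to need "a"
    *)              suffix=""  ;;
  esac
  echo "-gencode arch=compute_${cc}${suffix},code=sm_${cc}${suffix}"
}
arch_flag 8.0   # -gencode arch=compute_80,code=sm_80
arch_flag 9.0   # -gencode arch=compute_90a,code=sm_90a
```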
Limited Windows support

Windows support is currently limited. For production use:
  • Use WSL2 with Ubuntu
  • Use Docker Desktop with Linux containers
  • Use a Linux system or VM

Next steps

  • Quick start guide: Build your first GEMM kernel
  • C++ examples: Explore example kernels
  • Python examples: Python notebooks and scripts
  • Performance guide: Profile and optimize kernels

Getting help

If you encounter issues, check the Troubleshooting section above, search the existing issues in the NVIDIA/cutlass GitHub repository, or open a new one there.
