System requirements
Minimum requirements
- GPU: NVIDIA GPU with compute capability 7.0 or higher (Volta architecture or newer)
- CUDA Toolkit: Version 11.4 or higher
- Compiler: C++17-compatible compiler
  - GCC 7.5.0 or higher (GCC 8.5.0 has known issues; use 7.5 or 9+)
  - Clang 7.0 or higher
  - MSVC 2019 or higher
- CMake: Version 3.19 or higher (for building examples and tests)
- Python: Version 3.8 or higher (for Python interface)
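The GCC constraint in the list above can be expressed as a small check. This is a minimal sketch: the names `parse_version` and `gcc_is_supported` are illustrative and not part of CUTLASS, and it encodes only what the list states (7.5.0 or higher, with 8.5.0 excluded). Other 8.x releases are treated as acceptable because only 8.5.0 is named.

```python
import re

MIN_GCC = (7, 5, 0)          # minimum from the requirements list
KNOWN_BAD_GCC = {(8, 5, 0)}  # release the list flags as having known issues


def parse_version(text):
    """Turn a dotted version string such as '11.2.0' into a tuple of ints."""
    parts = re.findall(r"\d+", text)
    if not parts:
        raise ValueError(f"no version number found in {text!r}")
    # Pad to three components so '9.4' compares cleanly with '7.5.0'.
    padded = (parts + ["0", "0"])[:3]
    return tuple(int(p) for p in padded)


def gcc_is_supported(version_text):
    """Return True when the GCC version meets the documented minimum."""
    version = parse_version(version_text)
    return version >= MIN_GCC and version not in KNOWN_BAD_GCC


if __name__ == "__main__":
    for candidate in ("7.5.0", "8.5.0", "11.2.0", "7.4.0"):
        print(candidate, gcc_is_supported(candidate))
```

The version string would typically come from the compiler itself, for example the output of `gcc -dumpfullversion`.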
Recommended requirements
- CUDA Toolkit: Version 12.8 or higher
- GPU: Hopper (H100, H200) or Blackwell (B200, B300, RTX 50 series)
- Compiler: GCC 11.2+ or Clang 14+
- Python: Version 3.9 or higher
CUTLASS 4.4.1 is optimized for CUDA 12.8+ and performs best on Hopper and Blackwell architectures with access to the latest Tensor Core features.
Supported platforms
Operating systems
| Operating System | Compiler | Status |
|---|---|---|
| Ubuntu 18.04 | GCC 7.5.0 | Supported |
| Ubuntu 20.04 | GCC 10.3.0 | Supported |
| Ubuntu 22.04 | GCC 11.2.0 | Supported |
| Windows | MSVC 2019+ | Known issues |
GPU architectures
| Architecture | Compute Capability | GPUs | Min CUDA |
|---|---|---|---|
| Volta | 7.0 | V100, Titan V | 11.4 |
| Turing | 7.5 | RTX 20 series, T4 | 11.4 |
| Ampere | 8.0, 8.6 | A100, RTX 30 series | 11.4 |
| Ada | 8.9 | RTX 40 series, L40 | 11.8 |
| Hopper | 9.0 | H100, H200 | 11.8 |
| Blackwell SM100 | 10.0 | B200 | 12.8 |
| Blackwell SM103 | 10.3 | B300 | 13.0 |
| Blackwell SM120 | 12.0 | RTX 50 series | 12.8 |
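For scripted environment checks, the table can be turned into a lookup. The sketch below copies the compute capability and minimum CUDA columns verbatim; the names `MIN_CUDA` and `toolkit_is_new_enough` are hypothetical, and versions are represented as `(major, minor)` tuples as an illustrative choice.

```python
# Minimum CUDA Toolkit per compute capability, copied from the table above.
MIN_CUDA = {
    (7, 0): (11, 4),   # Volta
    (7, 5): (11, 4),   # Turing
    (8, 0): (11, 4),   # Ampere
    (8, 6): (11, 4),   # Ampere
    (8, 9): (11, 8),   # Ada
    (9, 0): (11, 8),   # Hopper
    (10, 0): (12, 8),  # Blackwell SM100
    (10, 3): (13, 0),  # Blackwell SM103
    (12, 0): (12, 8),  # Blackwell SM120
}


def toolkit_is_new_enough(compute_capability, cuda_version):
    """Check a (major, minor) CUDA version against the table's minimum."""
    if compute_capability not in MIN_CUDA:
        raise ValueError(
            f"compute capability {compute_capability} is not in the table"
        )
    # Tuples compare element by element, so (12, 8) >= (11, 8) is True.
    return cuda_version >= MIN_CUDA[compute_capability]


if __name__ == "__main__":
    print(toolkit_is_new_enough((9, 0), (12, 8)))
    print(toolkit_is_new_enough((10, 3), (12, 8)))
```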
Hopper and Blackwell architectures require the “a” suffix for architecture-accelerated features (e.g., sm_90a, sm_100a) to enable advanced Tensor Core instructions.
C++ installation
CUTLASS is a header-only library: no compilation or installation is required to use it in your projects.
Set environment variables
- Set CUTLASS_PATH for easy reference.
- Set CUDACXX to point to your NVCC compiler.
- Add both to your .bashrc or .zshrc for persistence.
Include in your project
Add the CUTLASS headers to your include path, either on the command line for direct compilation or in your CMake project, and then include them in your source code.
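To illustrate the direct-compilation route, the sketch below assembles an nvcc command line that puts the CUTLASS headers on the include path. The helper name `nvcc_command` and the file names are hypothetical. It assumes CUTLASS_PATH points at a CUTLASS checkout whose headers live under `include/`, that CUDACXX (if set) names the NVCC binary, and it uses only the standard nvcc options `-I`, `-std` and `-o`.

```python
import os
import shlex


def nvcc_command(source, output, cutlass_path=None, compiler=None):
    """Build an nvcc argument list that adds the CUTLASS headers via -I."""
    # Fall back to the environment variables described above when no
    # explicit values are passed in.
    root = cutlass_path or os.environ.get("CUTLASS_PATH")
    if not root:
        raise RuntimeError("set CUTLASS_PATH or pass cutlass_path explicitly")
    nvcc = compiler or os.environ.get("CUDACXX", "nvcc")
    include_dir = os.path.join(root, "include")
    # CUTLASS needs C++17, per the minimum requirements.
    return [nvcc, "-std=c++17", f"-I{include_dir}", source, "-o", output]


if __name__ == "__main__":
    cmd = nvcc_command("gemm.cu", "gemm", cutlass_path="/opt/cutlass")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

A real build would normally also pass an architecture flag for the target GPU; that is left out here to keep the include-path step in focus.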
CMake integration
For projects using CMake, you can use CUTLASS as an imported target.
Architecture-specific builds
To reduce build time, compile only for the GPU architectures you need.
Python installation
The CUTLASS Python interface is distributed as the nvidia-cutlass package.
Install from PyPI
The easiest way to install is from PyPI with pip.
The package name is nvidia-cutlass. Other packages named cutlass are not affiliated with NVIDIA CUTLASS.
Install matching cuda-python
Ensure the cuda-python version matches your CUDA Toolkit. Check your CUDA version to confirm.
Install from source
For development or to use the latest features, install from source.
Python requirements
The CUTLASS Python interface has the following dependencies:
- Python: 3.8, 3.9, 3.10, 3.11
- CUDA: 11.8, 12.0, 12.1, 12.8, 13.0+
- Platforms: Linux (Ubuntu 18.04+), Windows (experimental)
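One way to check these requirements programmatically is sketched below. The function names are illustrative, not part of the CUTLASS Python interface. The supported sets are copied from the list above, with “13.0+” read as any version at or above 13.0. The CUDA version parser assumes the usual `nvcc --version` output, which contains text of the form `release 12.8`.

```python
import re
import sys

SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10), (3, 11)}
SUPPORTED_CUDA = {(11, 8), (12, 0), (12, 1), (12, 8)}
CUDA_OPEN_ENDED_FROM = (13, 0)  # the list's "13.0+" entry


def python_is_supported(version_info=sys.version_info):
    """Compare the interpreter's (major, minor) with the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' text found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def cuda_is_supported(cuda_version):
    """Check a (major, minor) CUDA version against the supported list."""
    return cuda_version in SUPPORTED_CUDA or cuda_version >= CUDA_OPEN_ENDED_FROM


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 12.8, V12.8.61"
    print(python_is_supported(), cuda_is_supported(parse_nvcc_release(sample)))
```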
CuTe DSL installation
CuTe DSL is a Python interface for writing CUDA kernels.
CuTe DSL requires CUTLASS C++ headers to be available. Set the CUTLASS_PATH environment variable or install from the CUTLASS repository.
CuTe DSL features
- CUDA Toolkit 13.1 support, including GB300 (SM103)
- Ahead-of-Time (AoT) compilation for faster kernel loading
- JAX integration for use with JAX workflows
- Experimental API with fragment-free programming model
- Automatic TMA descriptor generation for Hopper/Blackwell
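Because CuTe DSL depends on the CUTLASS C++ headers, it can help to confirm that CUTLASS_PATH resolves to a usable checkout before using it. The sketch below is not part of CuTe DSL: the function name `find_cutlass_headers` is hypothetical, and it treats `include/cutlass/cutlass.h` as a marker that the headers are present.

```python
import os
from pathlib import Path


def find_cutlass_headers(environ=os.environ):
    """Return the CUTLASS include directory, or None if it cannot be found."""
    root = environ.get("CUTLASS_PATH")
    if not root:
        return None
    include_dir = Path(root) / "include"
    # cutlass/cutlass.h is used here as a marker that the headers exist.
    if (include_dir / "cutlass" / "cutlass.h").is_file():
        return include_dir
    return None


if __name__ == "__main__":
    headers = find_cutlass_headers()
    if headers is None:
        print("CUTLASS headers not found; check CUTLASS_PATH")
    else:
        print(f"CUTLASS headers found in {headers}")
```

Passing the environment as a parameter keeps the check easy to exercise in isolation, for example with a plain dictionary.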
Docker installation
Docker is the easiest way to get started with a complete environment, which includes:
- CUDA Toolkit
- cuDNN and cuBLAS libraries
- Python with PyTorch
- Development tools (gcc, cmake, etc.)
Environment setup
Recommended environment variables
Add the environment variables you rely on, such as CUTLASS_PATH and CUDACXX, to your shell configuration file (~/.bashrc) for convenience.
Build configuration
For CMake builds, create a configuration file named cutlass_config.cmake.
Troubleshooting
CUDA Toolkit not found
Python import errors
Verify that cuda-python matches your CUDA version, then reinstall it with the correct version.
Compiler version issues
Check your GCC version. GCC 8.5.0 has known issues; use GCC 7.5 or 9+.
CMake version too old
Install a newer CMake; version 3.19 or higher is required.
Architecture mismatch errors
Ensure you’re using the correct compute capability, then use the matching architecture flag:
- 8.0 → -gencode arch=compute_80,code=sm_80
- 9.0 → -gencode arch=compute_90a,code=sm_90a (note the “a”)
- 10.0 → -gencode arch=compute_100a,code=sm_100a
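The mapping above can be sketched as a small helper. The function name `gencode_flag` is hypothetical, and the default rule is an assumption drawn from this page: compute capabilities 9.0 and above (Hopper and Blackwell) get the “a” suffix, while earlier ones do not. It does not validate that the compute capability exists.

```python
def gencode_flag(major, minor, accelerated=None):
    """Return the -gencode argument for a compute capability such as 9.0."""
    if accelerated is None:
        # Assumption from this page: Hopper (9.x) and Blackwell (10.x, 12.x)
        # use the architecture-accelerated "a" targets.
        accelerated = major >= 9
    arch = f"{major}{minor}" + ("a" if accelerated else "")
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


if __name__ == "__main__":
    for capability in ((8, 0), (9, 0), (10, 0)):
        print(capability, gencode_flag(*capability))
```

Passing `accelerated=False` yields the plain target for cases where the architecture-accelerated features are not wanted.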
Windows build failures
Windows support is currently limited. For production use:
- Use WSL2 with Ubuntu
- Use Docker Desktop with Linux containers
- Use a Linux system or VM
Next steps
- Quick start guide: Build your first GEMM kernel
- C++ examples: Explore example kernels
- Python examples: Python notebooks and scripts
- Performance guide: Profile and optimize kernels
Getting help
If you encounter issues:
- Check the CUTLASS GitHub Issues
- Review the CUTLASS documentation
- Ask questions on NVIDIA Developer Forums