Quick start
Get up and running with CUTLASS in minutes
Installation
Install CUTLASS for C++ or Python
C++ documentation
Explore the C++ API and templates
Python DSL
Write CUDA kernels in Python with CuTe DSL
What is CUTLASS?
CUTLASS decomposes GPU linear algebra operations into reusable, modular software components at different levels of the parallelization hierarchy. Primitives for each level can be specialized and tuned via custom tile sizes, data types, and algorithmic policies. This flexibility makes them easy to use as building blocks within custom kernels and applications.
Two programming models
CUTLASS offers two complementary approaches.

CUTLASS C++ Templates - Template abstractions providing extensive support for:
- Mixed-precision computations (FP64, FP32, TF32, FP16, BF16, FP8, FP4, INT8, INT4)
- Block-scaled data types (NVIDIA NVFP4, OCP MXFP4/MXFP6/MXFP8)
- Specialized data-movement (async copy) and multiply-accumulate abstractions
- Support for Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures

CuTe Python DSL - A Python-native programming model that lets you:
- Write kernels in Python without performance compromises
- Achieve orders-of-magnitude faster compile times than C++
- Integrate natively with deep learning frameworks
- Use intuitive metaprogramming without deep C++ expertise
- Stay fully consistent with the CuTe C++ abstractions
Key features
Peak performance
Achieves nearly optimal utilization of theoretical peak throughput across all supported GPU architectures
Hierarchical decomposition
Modular components at thread, warp, threadblock, and device levels
Extensive data type support
From FP64 to binary 1-bit types, including block-scaled formats
CuTe layout algebra
Powerful abstractions for describing and manipulating tensors of threads and data
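To make the layout idea concrete, here is an illustrative pure-Python sketch of the concept behind CuTe's layout algebra: a layout pairs a shape with strides and maps a logical coordinate to a linear offset. This is exposition only, not the actual CuTe API; the function name `make_layout` merely mirrors CuTe's terminology.

```python
# Illustrative sketch of the idea behind CuTe layouts: a (shape, stride)
# pair defines a function from n-D logical coordinates to linear offsets.
# Plain Python for exposition only -- not the CuTe API itself.

def make_layout(shape, stride):
    """Return a function mapping an n-D coordinate to a linear offset."""
    def layout(*coord):
        assert len(coord) == len(shape)
        for c, s in zip(coord, shape):
            assert 0 <= c < s, "coordinate out of bounds"
        # Offset is the stride-weighted sum of the coordinate components.
        return sum(c * d for c, d in zip(coord, stride))
    return layout

# A 4x8 column-major layout: consecutive rows are adjacent in memory.
col_major = make_layout((4, 8), (1, 4))
# A 4x8 row-major layout: consecutive columns are adjacent in memory.
row_major = make_layout((4, 8), (8, 1))

print(col_major(2, 3))  # 2*1 + 3*4 = 14
print(row_major(2, 3))  # 2*8 + 3*1 = 19
```

The same (shape, stride) vocabulary describes both data tensors and the assignment of threads to data, which is what lets CuTe compose and transform the two uniformly.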
Tensor Core acceleration
Optimized support for programmable Tensor Cores on modern NVIDIA GPUs
Header-only library
Easy integration - just point your compiler at the include directory
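As a hedged sketch of what "point your compiler at the include directory" looks like in practice, the build fragment below assumes a CUTLASS checkout at `./cutlass` and a hypothetical source file `my_gemm.cu` targeting an Ampere-class GPU (sm_80); adjust paths and architecture for your setup.

```bash
# Hypothetical build line; assumes CUTLASS is cloned at ./cutlass.
# CUTLASS 3.x requires C++17 or newer.
nvcc -std=c++17 \
     -I./cutlass/include \
     -I./cutlass/tools/util/include \
     -arch=sm_80 \
     my_gemm.cu -o my_gemm
```

No library to build or link: the headers under `include/` (and the optional utilities under `tools/util/include/`) are the entire dependency.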
Performance
CUTLASS primitives are extremely efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical throughput across various data types and GPU architectures. On NVIDIA Blackwell SM100 GPUs, CUTLASS 3.8 achieves 90%+ of theoretical peak performance across FP64, TF32, FP16, BF16, FP8, and INT8 operations.
Who uses CUTLASS?
CUTLASS is designed for:
- ML/AI developers building custom operators for deep learning frameworks
- HPC researchers implementing specialized linear algebra kernels
- Performance engineers optimizing GPU-accelerated applications
- Students and researchers learning GPU programming and optimization
- Framework developers integrating high-performance primitives into libraries
Architecture support
CUTLASS supports NVIDIA GPUs from compute capability 7.0 onwards:

| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Volta | 7.0 | V100, Titan V |
| Turing | 7.5 | RTX 20 series, T4 |
| Ampere | 8.0, 8.6 | A100, RTX 30 series |
| Ada | 8.9 | RTX 40 series, L40 |
| Hopper | 9.0 | H100, H200 |
| Blackwell | 10.0, 10.3, 11.0, 12.0 | B200, B300, RTX 50 series |
What’s included
The CUTLASS project includes:
- Header-only template library - Core CUTLASS and CuTe abstractions
- 100+ examples - Demonstrating various operations and optimizations
- Python interface - High-level API for compiling and running kernels from Python
- CuTe Python DSL - Write CUDA kernels in Python
- Profiler tool - Command-line utility for benchmarking kernels
- Comprehensive documentation - Guides, API references, and tutorials
- Unit tests - Extensive test suite ensuring correctness
Getting started
Ready to start using CUTLASS? Choose your path:

Quick start guide
Jump right in with a working GEMM example
Installation guide
Set up CUTLASS on your system