CUTLASS is a collection of CUDA C++ and Python template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement, enabling developers to achieve peak performance on NVIDIA GPUs.

Quick start

Get up and running with CUTLASS in minutes

Installation

Install CUTLASS for C++ or Python

C++ documentation

Explore the C++ API and templates

Python DSL

Write CUDA kernels in Python with CuTe DSL

What is CUTLASS?

CUTLASS decomposes GPU linear algebra operations into reusable, modular software components at different levels of the parallelization hierarchy. Primitives for different levels can be specialized and tuned via custom tiling sizes, data types, and algorithmic policies. This flexibility simplifies their use as building blocks within custom kernels and applications.
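The idea of tuning components through compile-time tiling policies can be illustrated with a small CPU sketch. This is hypothetical illustrative code, not CUTLASS itself: the tile sizes are template parameters, analogous to how CUTLASS specializes its threadblock-, warp-, and thread-level primitives.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical CPU sketch (not CUTLASS code): a GEMM whose tiling policy is
// a compile-time parameter, mirroring how CUTLASS lets you specialize tile
// sizes via templates. All matrices are row-major, stored in std::vector.
template <int TileM, int TileN, int TileK>
void tiled_gemm(int M, int N, int K,
                const std::vector<float>& A,   // M x K
                const std::vector<float>& B,   // K x N
                std::vector<float>& C) {       // M x N, accumulated into
  // Outer loops walk tiles (the "threadblock level" of the hierarchy);
  // inner loops perform the per-element multiply-accumulate
  // (the "thread level").
  for (int m0 = 0; m0 < M; m0 += TileM)
    for (int n0 = 0; n0 < N; n0 += TileN)
      for (int k0 = 0; k0 < K; k0 += TileK)
        for (int m = m0; m < std::min(m0 + TileM, M); ++m)
          for (int n = n0; n < std::min(n0 + TileN, N); ++n)
            for (int k = k0; k < std::min(k0 + TileK, K); ++k)
              C[m * N + n] += A[m * K + k] * B[k * N + n];
}
```

Because the tile shape is a template parameter, different instantiations (for example `tiled_gemm<2, 2, 2>` versus `tiled_gemm<4, 4, 1>`) compile to different loop structures from the same source, which is the flexibility the paragraph above describes.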

Two programming models

CUTLASS offers two complementary approaches.

CUTLASS C++ Templates - template abstractions providing extensive support for:
  • Mixed-precision computations (FP64, FP32, TF32, FP16, BF16, FP8, FP4, INT8, INT4)
  • Block-scaled data types (NVIDIA NVFP4, OCP MXFP4/MXFP6/MXFP8)
  • Specialized data-movement (async copy) and multiply-accumulate abstractions
  • Support for the Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures

CuTe DSL (Python) - a Python-native interface for writing high-performance CUDA kernels:
  • Write kernels in Python without compromising performance
  • Compile times orders of magnitude faster than C++
  • Native integration with deep learning frameworks
  • Intuitive metaprogramming without deep C++ expertise
  • Fully consistent with the CuTe C++ abstractions

Key features

Peak performance

Achieves nearly optimal utilization of theoretical peak throughput across all supported GPU architectures

Hierarchical decomposition

Modular components at thread, warp, threadblock, and device levels

Extensive data type support

From FP64 to binary 1-bit types, including block-scaled formats

CuTe layout algebra

Powerful abstractions for describing and manipulating tensors of threads and data
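The core idea behind CuTe's layout algebra is that a layout is a (shape, stride) pair mapping a logical coordinate to a linear offset. The sketch below is a simplified stand-in, not the actual CuTe API: a two-dimensional layout where `offset(i, j) = i * stride[0] + j * stride[1]`.

```cpp
#include <cassert>

// Simplified sketch of CuTe's layout concept (not the actual CuTe API):
// a layout is a (shape, stride) pair that maps a logical coordinate
// to a linear memory offset.
struct Layout2D {
  int shape[2];
  int stride[2];
  // Coordinate (i, j) -> linear offset.
  int operator()(int i, int j) const { return i * stride[0] + j * stride[1]; }
  int size() const { return shape[0] * shape[1]; }
};

// The same 4x8 tensor viewed row-major or column-major differs only in its
// strides; code written against the layout works unchanged with either.
constexpr Layout2D row_major_4x8{{4, 8}, {8, 1}};
constexpr Layout2D col_major_4x8{{4, 8}, {1, 4}};
```

Separating shape from stride this way is what lets the same abstraction describe both data tensors in memory and logical tensors of threads.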

Tensor Core acceleration

Optimized support for programmable Tensor Cores on modern NVIDIA GPUs

Header-only library

Easy integration - just point your compiler at the include directory

Performance

CUTLASS primitives are extremely efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical throughput across various data types and GPU architectures.
On NVIDIA Blackwell SM100 GPUs, CUTLASS 3.8 achieves 90%+ of theoretical peak performance across FP64, TF32, FP16, BF16, FP8, and INT8 operations.
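Utilization figures like the one above are simply achieved throughput divided by the architecture's theoretical peak, where a GEMM performs 2·M·N·K floating-point operations (one multiply and one add per inner-product term). The numbers in the usage example below are hypothetical, chosen only to show the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Total floating-point operations in a GEMM: one multiply and one add for
// each of the M*N*K inner-product terms.
double gemm_flops(std::int64_t M, std::int64_t N, std::int64_t K) {
  return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
         static_cast<double>(K);
}

// Fraction of theoretical peak throughput actually achieved.
double utilization(double achieved_tflops, double peak_tflops) {
  return achieved_tflops / peak_tflops;
}
```

For example, with a hypothetical peak of 100 TFLOP/s, a kernel sustaining 90 TFLOP/s sits at 0.9 utilization, i.e. the "90%+" regime described above.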

Who uses CUTLASS?

CUTLASS is designed for:
  • ML/AI developers building custom operators for deep learning frameworks
  • HPC researchers implementing specialized linear algebra kernels
  • Performance engineers optimizing GPU-accelerated applications
  • Students and researchers learning GPU programming and optimization
  • Framework developers integrating high-performance primitives into libraries

Architecture support

CUTLASS supports NVIDIA GPUs from compute capability 7.0 onwards:
| Architecture | Compute capability | Example GPUs |
| --- | --- | --- |
| Volta | 7.0 | V100, Titan V |
| Turing | 7.5 | RTX 20 series, T4 |
| Ampere | 8.0, 8.6 | A100, RTX 30 series |
| Ada | 8.9 | RTX 40 series, L40 |
| Hopper | 9.0 | H100, H200 |
| Blackwell | 10.0, 10.3, 11.0, 12.0 | B200, B300, RTX 50 series |

What’s included

The CUTLASS project includes:
  • Header-only template library - Core CUTLASS and CuTe abstractions
  • 100+ examples - Demonstrating various operations and optimizations
  • Python interface - High-level API for compiling and running kernels from Python
  • CuTe Python DSL - Write CUDA kernels in Python
  • Profiler tool - Command-line utility for benchmarking kernels
  • Comprehensive documentation - Guides, API references, and tutorials
  • Unit tests - Extensive test suite ensuring correctness

Getting started

Ready to start using CUTLASS? Choose your path:

Quick start guide

Jump right in with a working GEMM example

Installation guide

Set up CUTLASS on your system

License

CUTLASS is released by NVIDIA Corporation as open source software under the BSD 3-Clause License.
