Overview
CUTLASS provides two main Python interfaces for GEMM operations:
- CUTLASS Python Interface: High-level API for ease of use
- CuTe DSL: Low-level Python DSL for maximum control and performance
High-Level CUTLASS Python Interface
The CUTLASS Python interface prioritizes ease of use with a simple, high-level API.
Installation
Install via PyPI with `pip install nvidia-cutlass`.
Basic GEMM Example
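Here's a simple example using the high-level interface. The sketch below assumes the `nvidia-cutlass` package and a CUDA-capable GPU; the exact API surface (e.g. `cutlass.op.Gemm`) may differ slightly between releases:

```python
# Sketch: D = A @ B + C through the high-level CUTLASS Python interface.
# Requires a CUDA GPU; API names may vary by release.
import numpy as np
import cutlass

M, N, K = 128, 128, 64
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)
D = np.zeros((M, N), dtype=np.float32)

# Declare a GEMM plan; CUTLASS selects a sensible default kernel
# configuration for the detected GPU.
plan = cutlass.op.Gemm(element=np.float32, layout=cutlass.LayoutType.RowMajor)

# JIT-compile (on first use) and run the kernel.
plan.run(A, B, C, D)
```

The first call pays a one-time JIT compilation cost; compiled kernels are cached and reused on subsequent runs.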
Features
The high-level interface provides:
- Simple API: Requires only a few parameters to get started
- Sensible defaults: Automatically selects reasonable kernel configurations
- Configuration enumeration: Lists available configurations for your hardware
- Descriptive exceptions: Python-friendly error messages instead of C++ compile errors
- Framework integration: Easy export to PyTorch CUDA extensions
Supported Operations
- Standard GEMMs
- GEMMs with fused elementwise epilogues (e.g., ReLU)
- Stream K swizzling (pre-SM90)
- Grouped GEMM (pre-SM90)
CuTe DSL GEMM Examples
For advanced users who need maximum control, CUTLASS provides the CuTe DSL, a Python-embedded domain-specific language for writing high-performance kernels.
Simple SIMT GEMM (Ampere)
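Here's an example of a dense FP32 GEMM using SIMT operations. The snippet is a structural sketch (pseudocode) of the shipped `sgemm.py`: the decorator and helper names follow the CuTe DSL, but the real kernel adds the multi-stage pipeline, vectorized copies, and padding described below:

```python
# Structural sketch of a CuTe DSL SIMT GEMM
# (see examples/python/CuTeDSL/ampere/sgemm.py for the full kernel).
import cutlass
import cutlass.cute as cute

@cute.kernel
def sgemm_kernel(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    # Each thread block owns one (bM, bN) tile of C; each thread
    # accumulates a small fragment of that tile in registers.
    tidx, _, _ = cute.arch.thread_idx()
    bidx, bidy, _ = cute.arch.block_idx()
    # 1. Copy A/B tiles from global into shared memory (vectorized copies).
    # 2. Loop over K, issuing FPU multiply-accumulates from shared memory.
    # 3. Write the accumulated fragment back to global memory.
    ...

@cute.jit
def sgemm(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    # Launch one CTA per output tile.
    sgemm_kernel(mA, mB, mC).launch(grid=[..., ..., 1], block=[256, 1, 1])
```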
Blackwell GEMM with TMA
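For the latest Blackwell architecture, CUTLASS provides highly optimized kernels using the Tensor Memory Accelerator (TMA). The outline below is pseudocode of the main pipeline stages in the shipped `dense_gemm.py`, not a runnable kernel:

```python
# Pseudocode outline of a Blackwell CuTe DSL dense GEMM
# (see examples/python/CuTeDSL/blackwell/dense_gemm.py for the real kernel).
import cutlass.cute as cute

@cute.kernel
def dense_gemm_kernel(mA, mB, mC):
    # 1. TMA bulk-copies A/B tiles from global into shared memory,
    #    multicasting across the CTA cluster to reduce L2 traffic.
    # 2. tcgen05.mma instructions consume the shared-memory tiles and
    #    accumulate into tensor memory (optionally as a 2-CTA MMA).
    # 3. A multi-stage pipeline overlaps the TMA loads for stage i+1
    #    with the MMA work of stage i.
    # 4. The epilogue moves accumulators from tensor memory back out
    #    to global memory.
    ...
```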
Key Features
SIMT GEMM (Ampere):
- FPU-based matrix multiply-accumulate
- Multi-stage pipeline for latency hiding
- Vectorized memory copies
- Bank conflict reduction with padding
Blackwell GEMM:
- Tensor Memory Accelerator (TMA) for efficient memory operations
- tcgen05.mma instructions for matrix operations
- TMA multicast with cluster for reduced L2 traffic
- Support for 2-CTA instructions
- Multi-stage pipeline
Profiling
Using NVIDIA Nsight Compute
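Profile your GEMM kernels with NCU. A typical invocation looks like the following; the kernel-name filter and output file name are illustrative:

```shell
# Collect the full metric set for kernels matching "gemm" and write the
# report to gemm.ncu-rep (open it in the Nsight Compute UI afterwards).
ncu --set full -k "regex:gemm" -o gemm \
    python examples/python/CuTeDSL/ampere/sgemm.py
```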
Performance Tips
- Choose appropriate tile sizes: Balance occupancy and shared memory usage
- Use TMA on Hopper/Blackwell: Significantly reduces memory access overhead
- Enable clustering: Improves L2 cache utilization
- Profile different configurations: Use the profiler to find optimal parameters
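The first tip can be made concrete with a back-of-the-envelope check. The helper below is not part of CUTLASS; it simply estimates the shared memory a multi-stage GEMM pipeline needs for a given tile shape, which you can compare against your GPU's per-SM capacity (roughly 164 KiB on A100, 228 KiB on Hopper/Blackwell):

```python
def smem_bytes(tile_m, tile_n, tile_k, stages, elem_bytes=2):
    """Shared memory used by the A and B tiles of each pipeline stage."""
    a_tile = tile_m * tile_k * elem_bytes  # one A tile per stage
    b_tile = tile_n * tile_k * elem_bytes  # one B tile per stage
    return (a_tile + b_tile) * stages

# A 128x128x64 FP16 tile with a 4-stage pipeline:
print(smem_bytes(128, 128, 64, stages=4))  # 131072 bytes = 128 KiB
```

At 128 KiB this configuration fits on Hopper/Blackwell SMs but leaves little room on earlier parts, so the same tile shape is not automatically portable across architectures.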
Data Type Support
High-Level Interface
- FP32, FP16, BF16
- INT8, INT4
- TF32 (on Ampere+)
CuTe DSL
Ampere SIMT:
- FP32
Blackwell:
- FP16, BF16, TF32
- INT8, UINT8
- FP8 (E4M3FN, E5M2)
- Mixed precision accumulation (FP32, FP16, INT32)
Next Steps
Grouped GEMM
Learn how to perform batched GEMMs with different problem sizes
Custom Epilogue
Fuse custom operations into the GEMM epilogue
Example Code
Find complete working examples in the CUTLASS repository:
- High-level interface: python/README.md
- SIMT GEMM: examples/python/CuTeDSL/ampere/sgemm.py
- Blackwell GEMM: examples/python/CuTeDSL/blackwell/dense_gemm.py
- Hopper GEMM: examples/python/CuTeDSL/hopper/dense_gemm.py