Python Packages
cutlass_cppgen
High-level Python interface for compiling and running CUTLASS kernels
CuTe DSL
Python DSL for writing custom CUDA kernels using CuTe abstractions
CUTLASS Python Interface (cutlass_cppgen)
The CUTLASS Python interface enables you to compile and run CUTLASS operations from Python with minimal configuration.
Key Features
- High-level interfaces requiring only a few parameters
- Automatic selection of sensible default configurations
- Enumeration of known working configurations
- Descriptive Python exceptions instead of C++ compile errors
- Easy export to framework extensions (PyTorch CUDA extensions)
Quick Example
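The example code for this section did not survive extraction. As a placeholder sketch: the commented lines below illustrate the style of high-level call described above (the `cutlass.op.Gemm` and `LayoutType` names are assumptions and require the `nvidia-cutlass` package plus a supported GPU), followed by a plain-NumPy statement of the computation a GEMM plan performs.

```python
import numpy as np

# Hypothetical sketch of the high-level interface (assumes the nvidia-cutlass
# package and an NVIDIA GPU; exact names are illustrative, not verified):
#
# import cutlass
# plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
# plan.run(A, B, C, D)  # D = alpha * (A @ B) + beta * C

# Reference statement of what such a GEMM computes, in plain NumPy:
def reference_gemm(A, B, C, alpha=1.0, beta=0.0):
    """Compute D = alpha * (A @ B) + beta * C."""
    return alpha * (A @ B) + beta * C
```

The high-level call replaces the exhaustive template configuration required in C++; defaults for tile shapes and instruction selection are chosen automatically.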
Supported Operations
- GEMMs - General matrix multiplication
- Fused Epilogues - GEMMs with elementwise operations (e.g., ReLU) for pre-SM90
- Stream K - Stream K swizzling for pre-SM90 kernels
- Grouped GEMM - Multiple GEMMs in a single kernel for pre-SM90
Design Philosophy
Goals
Ease of Use
Present high-level interfaces that require minimal parameters and automatically select sensible defaults
Discoverability
Enumerate configurations known to work in a given setting
Better Error Messages
Emit descriptive Python runtime exceptions instead of C++ compile-time errors where possible
Framework Integration
Simplify exporting CUTLASS kernels to deep learning framework extensions
Non-Goals
The CUTLASS Python interface does not intend to:
- Select optimal kernel configurations. Default selections may not achieve the highest performance. Users should:
  - Profile different parameter combinations, or
  - Use optimized libraries such as cuBLAS
- Act as a fast container. It does not minimize Python overhead. For deployment:
  - Use the emitted C++ code directly, or
  - Use the framework extension emitters
- Be a JIT compilation engine. It enables CUTLASS in Python but does not aim to be a general Python-to-CUDA JIT framework.
Comparison to PyCUTLASS
The CUTLASS Python interface builds on PyCUTLASS but provides a higher-level API:

| Feature | PyCUTLASS | CUTLASS Python Interface |
|---|---|---|
| Configuration | Exhaustive template parameters | Minimal high-level parameters |
| Flexibility | Maximum (similar to C++ API) | Focused on common use cases |
| Ease of Use | Requires detailed knowledge | Simplified with smart defaults |
| Learning Curve | Steep | Gentle |
CuTe DSL
The CuTe DSL (Domain-Specific Language) is a Python-based framework for writing high-performance CUDA kernels using CuTe's layout algebra and tensor abstractions.
Learn More
Explore the CuTe DSL documentation for kernel development.
Key Capabilities
- Write CUDA kernels in Python using the @cute.kernel decorator
- Express complex tensor layouts with layout algebra
- Utilize hardware features (Tensor Cores, TMA, async pipelines)
- JIT compilation to optimized PTX/SASS
- Integration with PyTorch, JAX, and NumPy
Quick Example
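The original example code here was also lost in extraction. As a hedged sketch: the commented lines show the decorator-based kernel style implied by the capability list above (the `cutlass.cute` module and `@cute.kernel` names are assumptions and require the CuTe DSL package plus a supported GPU), alongside a NumPy statement of the elementwise computation such a kernel would express.

```python
import numpy as np

# Hypothetical sketch of a CuTe DSL kernel (assumes the cutlass.cute module
# and a supported NVIDIA GPU; names and signatures are illustrative):
#
# import cutlass.cute as cute
#
# @cute.kernel
# def add_kernel(gA, gB, gC):
#     # per-thread elementwise add over CuTe tensor partitions
#     ...
#
# The kernel would be JIT-compiled to PTX/SASS and launched on tensors
# sourced from PyTorch, JAX, or NumPy.

# The computation such a kernel expresses, stated in plain NumPy:
def elementwise_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a + b
```

The DSL's value is that the layout algebra, not the Python host code, determines how this computation is partitioned across threads and hardware units.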
Installation
- cutlass_cppgen (PyPI)
- CuTe DSL
- From Source
- Docker
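For the PyPI routes listed above, installation is typically a single pip command per package. The package names below are assumptions based on NVIDIA's published wheels; verify them against the official documentation before use.

```shell
# Python interface (cutlass_cppgen); historically published as nvidia-cutlass
pip install nvidia-cutlass

# CuTe DSL
pip install nvidia-cutlass-dsl
```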
Note: any packages named cutlass (without the nvidia- prefix) are not affiliated with NVIDIA CUTLASS.
Requirements
CUDA
- CUDA 11.8, 12.0, 12.1+
- Matching cuda-python version
Python
- Python 3.8, 3.9, 3.10+
- PyTorch (optional, for integration)
GPU
- Ampere (SM80+) for basic features
- Hopper (SM90) for advanced features
- Blackwell for latest features
Environment
- CUTLASS_PATH (optional)
- CUDA_INSTALL_PATH (optional)
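When building from source, these variables can be exported before running Python; the paths below are placeholders, not defaults.

```shell
export CUTLASS_PATH=/path/to/cutlass        # location of the CUTLASS repository checkout
export CUDA_INSTALL_PATH=/usr/local/cuda    # location of the CUDA Toolkit installation
```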
Next Steps
Quickstart
Get started with CUTLASS Python in minutes
CuTe DSL
Learn the CuTe DSL for custom kernels
Examples
Explore example kernels and notebooks
PyTorch Integration
Integrate with PyTorch workflows