## Core Operations

CUTLASS Python provides several operation types:

- `Gemm` - General Matrix Multiply (GEMM) operations
- `Conv2d` - 2D convolution operations
- `GroupedGemm` - Batched/grouped GEMM operations
## Installation

The CUTLASS Python interface is available through the `cutlass` package.
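On PyPI the package is published as `nvidia-cutlass`; installing it makes the `cutlass` module importable:

```shell
pip install nvidia-cutlass
```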
## Basic Usage Pattern

All CUTLASS Python operations follow a consistent pattern:

1. Create an operation object with data types and layouts
2. Compile the underlying CUDA kernel (optional; it can also happen implicitly on first run)
3. Run the operation with input tensors
## Simple Example
## Decoupled Compilation
You can separate kernel compilation from execution.

## Key Concepts
## Data Types

CUTLASS supports various data types through the `cutlass.DataType` enum:

- `DataType.f16` - FP16 (half precision)
- `DataType.f32` - FP32 (single precision)
- `DataType.f64` - FP64 (double precision)
- `DataType.bf16` - BFloat16
- `DataType.e4m3` - FP8 E4M3
- `DataType.e5m2` - FP8 E5M2
- `DataType.s8` - INT8
- `DataType.s32` - INT32
Equivalent framework types are also accepted (e.g. `torch.float32`, `numpy.float16`).
## Layout Types

Matrix layouts are specified using `cutlass.LayoutType`:

- `LayoutType.RowMajor` - Row-major layout (C/C++ default)
- `LayoutType.ColumnMajor` - Column-major layout (Fortran/BLAS default)
## Compute Capability
The API automatically detects your GPU’s compute capability, but you can override it.

## Activation Functions
Activation functions can be fused into the epilogue.

## Asynchronous Execution
Operations can run asynchronously with explicit synchronization.

## Error Handling

The API performs validation and raises exceptions for:

- Incompatible tensor shapes
- Mismatched data types
- Invalid layouts
- Unsupported compute capabilities
## Performance Considerations
## Memory Management

The API integrates with existing tensor libraries:

- PyTorch: Uses torch CUDA tensors directly
- NumPy: Automatically transfers to/from GPU
- CuPy: Uses cupy arrays directly
- RMM: Optional support for RAPIDS Memory Manager
## Logging
Enable detailed logging for debugging.

## Next Steps
- Learn about GEMM operations in detail
- Explore layout types
- Review utility functions