The CUTLASS Python API provides a high-level interface for constructing, compiling, and running CUDA kernels without having to specify most configuration parameters by hand: the API automatically selects sensible defaults for any template parameters you leave unset.

Core Operations

CUTLASS Python provides several operation types:
  • Gemm - General Matrix Multiply operations
  • Conv2d - 2D Convolution operations
  • GroupedGemm - Batched/grouped GEMM operations

Installation

The CUTLASS Python interface is distributed on PyPI as the nvidia-cutlass package (pip install nvidia-cutlass) and imported as cutlass:
import cutlass
from cutlass.op import Gemm

Basic Usage Pattern

All CUTLASS Python operations follow a consistent pattern:
  1. Create an operation object with data types and layouts
  2. Compile the underlying CUDA kernel (optional; if skipped, compilation happens implicitly on the first run)
  3. Run the operation with input tensors

Simple Example

import torch
import cutlass
from cutlass.op import Gemm

# Create input tensors
A = torch.randn((128, 256), device='cuda', dtype=torch.float16)
B = torch.randn((256, 64), device='cuda', dtype=torch.float16)
C = torch.zeros((128, 64), device='cuda', dtype=torch.float16)
D = torch.zeros((128, 64), device='cuda', dtype=torch.float16)

# Create the GEMM plan; data types and layouts are inferred from the example tensors
plan = Gemm(A=A, B=B, C=C, D=D)
plan.run()
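For reference, the GEMM above computes D = alpha * (A @ B) + beta * C. The following is a plain NumPy sketch of that math with the same shapes as the example, not part of the CUTLASS API:

```python
import numpy as np

# NumPy reference for the GEMM computed above: D = alpha * (A @ B) + beta * C
rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256)).astype(np.float32)
B = rng.standard_normal((256, 64)).astype(np.float32)
C = np.zeros((128, 64), dtype=np.float32)
alpha, beta = 1.0, 1.0

D_ref = alpha * (A @ B) + beta * C
print(D_ref.shape)  # (128, 64)
```

With C zeroed as in the example, the result reduces to the plain matrix product A @ B.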

Decoupled Compilation

You can separate kernel compilation from execution:
import numpy as np
import cutlass
from cutlass.op import Gemm

# Create operation with data types
plan = Gemm(
    element=np.float32, 
    layout=cutlass.LayoutType.RowMajor
)

# Compile kernel once
plan.compile()

# Run many times with different tensors; A_batch, B_batch, C_batch, and D_batch
# are assumed to be pre-allocated lists of same-shape row-major FP32 tensors
for i in range(100):
    plan.run(A_batch[i], B_batch[i], C_batch[i], D_batch[i])

Key Concepts

Data Types

CUTLASS supports various data types through the cutlass.DataType enum:
  • DataType.f16 - FP16 (half precision)
  • DataType.f32 - FP32 (single precision)
  • DataType.f64 - FP64 (double precision)
  • DataType.bf16 - BFloat16
  • DataType.e4m3 - FP8 E4M3
  • DataType.e5m2 - FP8 E5M2
  • DataType.s8 - INT8
  • DataType.s32 - INT32
You can also use native tensor types (e.g., torch.float32, numpy.float16).
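As a quick cross-reference between the enum members and their native equivalents, the byte widths line up as follows (an illustration using NumPy only; bf16, e4m3, and e5m2 have no standard NumPy counterpart):

```python
import numpy as np

# Byte widths of the NumPy dtypes corresponding to common DataType members
native = {"f16": np.float16, "f32": np.float32, "f64": np.float64,
          "s8": np.int8, "s32": np.int32}
widths = {name: np.dtype(dt).itemsize for name, dt in native.items()}
print(widths)  # {'f16': 2, 'f32': 4, 'f64': 8, 's8': 1, 's32': 4}
```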

Layout Types

Matrix layouts are specified using cutlass.LayoutType:
  • LayoutType.RowMajor - Row-major layout (C/C++ default)
  • LayoutType.ColumnMajor - Column-major layout (Fortran/BLAS default)
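The difference is which dimension is contiguous in memory, which NumPy can illustrate directly (RowMajor corresponds to C order, ColumnMajor to Fortran order):

```python
import numpy as np

# A 2x3 FP32 matrix in both layouts; strides are in bytes
a_row = np.zeros((2, 3), dtype=np.float32, order='C')  # RowMajor
a_col = np.zeros((2, 3), dtype=np.float32, order='F')  # ColumnMajor

print(a_row.strides)  # (12, 4): elements of a row are adjacent
print(a_col.strides)  # (4, 8): elements of a column are adjacent
```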

Compute Capability

The API automatically detects your GPU’s compute capability, but you can override it:
import torch
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    cc=90,         # Generate code targeting SM90 (H100)
    kernel_cc=80   # Use the SM80 kernel implementation
)

Activation Functions

Activation functions can be fused into epilogues:
import torch
import cutlass
from cutlass.op import Gemm
from cutlass import epilogue

plan = Gemm(element=torch.float32)
plan.activation = epilogue.relu

# Available activations:
# - epilogue.relu
# - epilogue.gelu
# - epilogue.sigmoid
# - epilogue.tanh
# - epilogue.silu
# - epilogue.hardswish
# - epilogue.leaky_relu
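For clarity about what the fused epilogues compute, here are NumPy reference definitions for two of the activations above (illustrative sketches, not the CUTLASS implementations):

```python
import numpy as np

def relu_ref(x):
    # relu(x) = max(x, 0)
    return np.maximum(x, 0.0)

def silu_ref(x):
    # silu(x) = x * sigmoid(x), simplified to x / (1 + e^-x)
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu_ref(x))  # [0. 0. 3.]
```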

Asynchronous Execution

Operations can run asynchronously with explicit synchronization:
plan = Gemm(element=torch.float32)

# Launch kernel asynchronously
args = plan.run(A, B, C, D, sync=False)

# Do other work...

# Wait for completion
args.sync()

Error Handling

The API performs validation and raises exceptions for:
  • Incompatible tensor shapes
  • Mismatched data types
  • Invalid layouts
  • Unsupported compute capabilities
try:
    plan = Gemm(A=A, B=B, C=C, D=D)
    plan.run()
except Exception as e:
    print(f"CUTLASS error: {e}")

Performance Considerations

The high-level Python API prioritizes ease of use over optimal performance. For production workloads requiring maximum performance, consider:
  1. Explicitly specifying tile descriptions and kernel schedules
  2. Using the lower-level cutlass.backend API
  3. Tuning kernel parameters for your specific workload

Memory Management

The API integrates with existing tensor libraries:
  • PyTorch: Uses torch CUDA tensors directly
  • NumPy: Automatically transfers to/from GPU
  • CuPy: Uses cupy arrays directly
  • RMM: Optional support for RAPIDS Memory Manager
# M and K are the GEMM problem dimensions from your workload

# Using PyTorch tensors (recommended: data is already on the GPU)
A_torch = torch.randn((M, K), device='cuda')
plan = Gemm(A=A_torch, ...)

# Using NumPy arrays (incurs a host-to-device copy)
A_numpy = np.random.randn(M, K).astype(np.float32)
plan = Gemm(A=A_numpy, ...)

Logging

Enable detailed logging for debugging:
import logging
import cutlass

cutlass.set_log_level(logging.DEBUG)
