The CUTLASS Python API provides a high-level interface for constructing, compiling, and running CUDA kernels without having to specify most configuration parameters by hand: the API automatically selects sensible defaults for any template parameters you leave unset.

Core Operations

CUTLASS Python provides several operation types:
  • Gemm - General Matrix Multiply operations
  • Conv2d - 2D Convolution operations
  • GroupedGemm - Batched/grouped GEMM operations

Installation

The CUTLASS Python interface is distributed on PyPI as the nvidia-cutlass package (pip install nvidia-cutlass) and imported as cutlass:
import cutlass
from cutlass.op import Gemm

Basic Usage Pattern

All CUTLASS Python operations follow a consistent pattern:
  1. Create an operation object with data types and layouts
  2. Compile the underlying CUDA kernel (optional; if skipped, compilation happens implicitly on the first run)
  3. Run the operation with input tensors

Simple Example

import torch
import cutlass
from cutlass.op import Gemm

# Create input tensors
A = torch.randn((128, 256), device='cuda', dtype=torch.float16)
B = torch.randn((256, 64), device='cuda', dtype=torch.float16)
C = torch.zeros((128, 64), device='cuda', dtype=torch.float16)
D = torch.zeros((128, 64), device='cuda', dtype=torch.float16)

# Create the GEMM plan; data types and layouts are inferred from the example tensors
plan = Gemm(A=A, B=B, C=C, D=D)
plan.run()
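For reference, the GEMM above computes D = alpha * (A @ B) + beta * C. The following is a plain NumPy sketch of that math with the same shapes as the example, not part of the CUTLASS API:

```python
import numpy as np

# NumPy reference for the GEMM computed above: D = alpha * (A @ B) + beta * C
rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256)).astype(np.float32)
B = rng.standard_normal((256, 64)).astype(np.float32)
C = np.zeros((128, 64), dtype=np.float32)
alpha, beta = 1.0, 1.0

D_ref = alpha * (A @ B) + beta * C
print(D_ref.shape)  # (128, 64)
```

With C zeroed as in the example, the result reduces to the plain matrix product A @ B.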

Decoupled Compilation

You can separate kernel compilation from execution:
import numpy as np
import cutlass
from cutlass.op import Gemm

# Create operation with data types
plan = Gemm(
    element=np.float32, 
    layout=cutlass.LayoutType.RowMajor
)

# Compile kernel once
plan.compile()

# Run many times with different tensors; A_batch, B_batch, C_batch, and D_batch
# are assumed to be pre-allocated lists of same-shape row-major FP32 tensors
for i in range(100):
    plan.run(A_batch[i], B_batch[i], C_batch[i], D_batch[i])

Key Concepts

Data Types

CUTLASS supports various data types through the cutlass.DataType enum:
  • DataType.f16 - FP16 (half precision)
  • DataType.f32 - FP32 (single precision)
  • DataType.f64 - FP64 (double precision)
  • DataType.bf16 - BFloat16
  • DataType.e4m3 - FP8 E4M3
  • DataType.e5m2 - FP8 E5M2
  • DataType.s8 - INT8
  • DataType.s32 - INT32
You can also use native tensor types (e.g., torch.float32, numpy.float16).
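As a quick cross-reference between the enum members and their native equivalents, the byte widths line up as follows (an illustration using NumPy only; bf16, e4m3, and e5m2 have no standard NumPy counterpart):

```python
import numpy as np

# Byte widths of the NumPy dtypes corresponding to common DataType members
native = {"f16": np.float16, "f32": np.float32, "f64": np.float64,
          "s8": np.int8, "s32": np.int32}
widths = {name: np.dtype(dt).itemsize for name, dt in native.items()}
print(widths)  # {'f16': 2, 'f32': 4, 'f64': 8, 's8': 1, 's32': 4}
```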

Layout Types

Matrix layouts are specified using cutlass.LayoutType:
  • LayoutType.RowMajor - Row-major layout (C/C++ default)
  • LayoutType.ColumnMajor - Column-major layout (Fortran/BLAS default)
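The difference is which dimension is contiguous in memory, which NumPy can illustrate directly (RowMajor corresponds to C order, ColumnMajor to Fortran order):

```python
import numpy as np

# A 2x3 FP32 matrix in both layouts; strides are in bytes
a_row = np.zeros((2, 3), dtype=np.float32, order='C')  # RowMajor
a_col = np.zeros((2, 3), dtype=np.float32, order='F')  # ColumnMajor

print(a_row.strides)  # (12, 4): elements of a row are adjacent
print(a_col.strides)  # (4, 8): elements of a column are adjacent
```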

Compute Capability

The API automatically detects your GPU’s compute capability, but you can override it:
import torch
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    cc=90,         # Generate code targeting SM90 (H100)
    kernel_cc=80   # Use the SM80 kernel implementation
)

Activation Functions

Activation functions can be fused into epilogues:
import torch
import cutlass
from cutlass.op import Gemm
from cutlass import epilogue

plan = Gemm(element=torch.float32)
plan.activation = epilogue.relu

# Available activations:
# - epilogue.relu
# - epilogue.gelu
# - epilogue.sigmoid
# - epilogue.tanh
# - epilogue.silu
# - epilogue.hardswish
# - epilogue.leaky_relu
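For clarity about what the fused epilogues compute, here are NumPy reference definitions for two of the activations above (illustrative sketches, not the CUTLASS implementations):

```python
import numpy as np

def relu_ref(x):
    # relu(x) = max(x, 0)
    return np.maximum(x, 0.0)

def silu_ref(x):
    # silu(x) = x * sigmoid(x), simplified to x / (1 + e^-x)
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu_ref(x))  # [0. 0. 3.]
```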

Asynchronous Execution

Operations can run asynchronously with explicit synchronization:
plan = Gemm(element=torch.float32)

# Launch kernel asynchronously
args = plan.run(A, B, C, D, sync=False)

# Do other work...

# Wait for completion
args.sync()

Error Handling

The API performs validation and raises exceptions for:
  • Incompatible tensor shapes
  • Mismatched data types
  • Invalid layouts
  • Unsupported compute capabilities
try:
    plan = Gemm(A=A, B=B, C=C, D=D)
    plan.run()
except Exception as e:
    print(f"CUTLASS error: {e}")

Performance Considerations

The high-level Python API prioritizes ease of use over optimal performance. For production workloads requiring maximum performance, consider:
  1. Explicitly specifying tile descriptions and kernel schedules
  2. Using the lower-level cutlass.backend API
  3. Tuning kernel parameters for your specific workload

Memory Management

The API integrates with existing tensor libraries:
  • PyTorch: Uses torch CUDA tensors directly
  • NumPy: Automatically transfers to/from GPU
  • CuPy: Uses cupy arrays directly
  • RMM: Optional support for RAPIDS Memory Manager
# M and K are the GEMM problem dimensions from your workload

# Using PyTorch tensors (recommended: data is already on the GPU)
A_torch = torch.randn((M, K), device='cuda')
plan = Gemm(A=A_torch, ...)

# Using NumPy arrays (incurs a host-to-device copy)
A_numpy = np.random.randn(M, K).astype(np.float32)
plan = Gemm(A=A_numpy, ...)

Logging

Enable detailed logging for debugging:
import logging
import cutlass

cutlass.set_log_level(logging.DEBUG)
