CUTLASS provides two powerful Python packages for GPU kernel development:

Python Packages

cutlass_cppgen

High-level Python interface for compiling and running CUTLASS kernels

CuTe DSL

Python DSL for writing custom CUDA kernels using CuTe abstractions

CUTLASS Python Interface (cutlass_cppgen)

The CUTLASS Python interface enables you to compile and run CUTLASS operations from Python with minimal configuration.

Key Features

  • High-level interfaces requiring only a few parameters
  • Automatic selection of sensible default configurations
  • Enumeration of known working configurations
  • Descriptive Python exceptions instead of C++ compile errors
  • Easy export to framework extensions (PyTorch CUDA extensions)

Quick Example

import cutlass
import numpy as np

# Plan an fp16 GEMM with all operands in row-major layout
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
A, B, C, D = [np.ones((1024, 1024), dtype=np.float16) for _ in range(4)]
plan.run(A, B, C, D)  # D = alpha * (A @ B) + beta * C
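The semantics of `plan.run` can be sanity-checked against a plain NumPy reference. The sketch below is not part of the CUTLASS API: `gemm_reference` is a hypothetical helper that assumes the usual D = alpha * (A @ B) + beta * C GEMM definition, with fp32 accumulation for fp16 inputs.

```python
import numpy as np

def gemm_reference(A, B, C, alpha=1.0, beta=0.0):
    # Plain NumPy reference for D = alpha * (A @ B) + beta * C,
    # accumulating in float32 when the inputs are float16.
    acc = A.astype(np.float32) @ B.astype(np.float32)
    return (alpha * acc + beta * C.astype(np.float32)).astype(A.dtype)

A, B, C = [np.ones((64, 64), dtype=np.float16) for _ in range(3)]
D_ref = gemm_reference(A, B, C, beta=1.0)
# Each entry: dot product of 64 ones (= 64) plus beta * 1 = 65
```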

Supported Operations

  • GEMMs - General matrix multiplication
  • Fused Epilogues - GEMMs with elementwise operations (e.g., ReLU) for pre-SM90
  • Stream K - Stream K swizzling for pre-SM90 kernels
  • Grouped GEMM - Multiple GEMMs in a single kernel for pre-SM90

Design Philosophy

Goals

  • Present high-level interfaces that require minimal parameters and automatically select sensible defaults
  • Enumerate configurations known to work in a given setting
  • Emit descriptive Python runtime exceptions instead of C++ compile-time errors where possible
  • Simplify exporting CUTLASS kernels to deep learning framework extensions

Non-Goals

The CUTLASS Python interface does not intend to:
  1. Select optimal kernel configurations - Default selections may not achieve highest performance. Users should:
    • Profile different parameter combinations, or
    • Use optimized libraries like cuBLAS
  2. Act as a fast container - Does not minimize Python overhead. For deployment:
    • Use the emitted C++ code directly, or
    • Use framework extension emitters
  3. Be a JIT compilation engine - Enables CUTLASS in Python but doesn’t aim to be a Python-to-CUDA JIT framework

Comparison to PyCUTLASS

The CUTLASS Python interface builds on PyCUTLASS but provides a higher-level API:
| Feature        | PyCUTLASS                      | CUTLASS Python Interface       |
|----------------|--------------------------------|--------------------------------|
| Configuration  | Exhaustive template parameters | Minimal high-level parameters  |
| Flexibility    | Maximum (similar to C++ API)   | Focused on common use cases    |
| Ease of Use    | Requires detailed knowledge    | Simplified with smart defaults |
| Learning Curve | Steep                          | Gentle                         |

CuTe DSL

The CuTe DSL (Domain-Specific Language) is a Python-based framework for writing high-performance CUDA kernels using CuTe’s layout algebra and tensor abstractions.

Learn More

Explore the CuTe DSL documentation for kernel development

Key Capabilities

  • Write CUDA kernels in Python using @cute.kernel decorator
  • Express complex tensor layouts with layout algebra
  • Utilize hardware features (Tensor Cores, TMA, async pipelines)
  • JIT compilation to optimized PTX/SASS
  • Integration with PyTorch, JAX, and NumPy

Quick Example

import cutlass
import cutlass.cute as cute

@cute.kernel
def elementwise_add(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    
    # Partition work across threads
    tiled_copy = cute.make_tiled_copy_tv(...)
    thr_copy = tiled_copy.get_slice(tidx)
    
    # Load, compute, store
    frgA = cute.make_fragment_like(thr_copy.partition_S(gA[bidx]))
    frgB = cute.make_fragment_like(thr_copy.partition_S(gB[bidx]))
    cute.copy(thr_copy, gA[bidx], frgA)
    cute.copy(thr_copy, gB[bidx], frgB)
    
    result = frgA.load() + frgB.load()
    frgC = cute.make_fragment_like(frgA)  # register fragment for the result
    frgC.store(result)
    cute.copy(thr_copy, frgC, gC[bidx])

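Functionally, the kernel above is a plain elementwise add; the tiling and per-thread partitioning are performance details that do not change the result. A NumPy one-liner therefore makes a convenient correctness reference while developing the CuTe version (a sketch; `elementwise_add_reference` is a hypothetical helper, not part of the DSL):

```python
import numpy as np

def elementwise_add_reference(gA, gB):
    # Same result the tiled CuTe kernel produces: gC = gA + gB
    return gA + gB

gA = np.arange(6, dtype=np.float32).reshape(2, 3)
gB = np.full((2, 3), 10.0, dtype=np.float32)
gC = elementwise_add_reference(gA, gB)
# gC == [[10, 11, 12], [13, 14, 15]]
```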
Installation

pip install nvidia-cutlass
Any packages named cutlass (without nvidia- prefix) are not affiliated with NVIDIA CUTLASS.

Requirements

CUDA

  • CUDA 11.8, 12.0, 12.1+
  • Matching cuda-python version

Python

  • Python 3.8, 3.9, 3.10+
  • PyTorch (optional, for integration)

GPU

  • Ampere (SM80+) for basic features
  • Hopper (SM90) for advanced features
  • Blackwell (SM100) for the latest features

Environment

  • CUTLASS_PATH (optional) - path to a CUTLASS source checkout to compile against
  • CUDA_INSTALL_PATH (optional) - location of the CUDA Toolkit installation
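A typical shell setup might look like the following. The paths are hypothetical examples, and both variables can be omitted when the defaults bundled with the pip package suffice:

```shell
# Hypothetical example paths; adjust to your system.
export CUTLASS_PATH=/opt/cutlass            # CUTLASS source tree to build against
export CUDA_INSTALL_PATH=/usr/local/cuda    # CUDA Toolkit installation
```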

Next Steps

Quickstart

Get started with CUTLASS Python in minutes

CuTe DSL

Learn the CuTe DSL for custom kernels

Examples

Explore example kernels and notebooks

PyTorch Integration

Integrate with PyTorch workflows
