CUTLASS provides two powerful Python packages for GPU kernel development:

Python Packages

cutlass_cppgen

High-level Python interface for compiling and running CUTLASS kernels

CuTe DSL

Python DSL for writing custom CUDA kernels using CuTe abstractions

CUTLASS Python Interface (cutlass_cppgen)

The CUTLASS Python interface enables you to compile and run CUTLASS operations from Python with minimal configuration.

Key Features

  • High-level interfaces requiring only a few parameters
  • Automatic selection of sensible default configurations
  • Enumeration of known working configurations
  • Descriptive Python exceptions instead of C++ compile errors
  • Easy export to framework extensions (PyTorch CUDA extensions)

Quick Example

import cutlass
import numpy as np

# Plan an fp16 GEMM with all operands in row-major layout
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
A, B, C, D = [np.ones((1024, 1024), dtype=np.float16) for _ in range(4)]
plan.run(A, B, C, D)  # D = alpha * (A @ B) + beta * C
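The semantics of `plan.run` can be sanity-checked against a plain NumPy reference. The sketch below is not part of the CUTLASS API: `gemm_reference` is a hypothetical helper that assumes the usual D = alpha * (A @ B) + beta * C GEMM definition, with fp32 accumulation for fp16 inputs.

```python
import numpy as np

def gemm_reference(A, B, C, alpha=1.0, beta=0.0):
    # Plain NumPy reference for D = alpha * (A @ B) + beta * C,
    # accumulating in float32 when the inputs are float16.
    acc = A.astype(np.float32) @ B.astype(np.float32)
    return (alpha * acc + beta * C.astype(np.float32)).astype(A.dtype)

A, B, C = [np.ones((64, 64), dtype=np.float16) for _ in range(3)]
D_ref = gemm_reference(A, B, C, beta=1.0)
# Each entry: dot product of 64 ones (= 64) plus beta * 1 = 65
```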

Supported Operations

  • GEMMs - General matrix multiplication
  • Fused Epilogues - GEMMs with elementwise operations (e.g., ReLU) for pre-SM90
  • Stream K - Stream K swizzling for pre-SM90 kernels
  • Grouped GEMM - Multiple GEMMs in a single kernel for pre-SM90

Design Philosophy

Goals

  • Present high-level interfaces that require minimal parameters and automatically select sensible defaults
  • Enumerate configurations known to work in a given setting
  • Emit descriptive Python runtime exceptions instead of C++ compile-time errors where possible
  • Simplify exporting CUTLASS kernels to deep learning framework extensions

Non-Goals

The CUTLASS Python interface does not intend to:
  1. Select optimal kernel configurations - Default selections may not achieve highest performance. Users should:
    • Profile different parameter combinations, or
    • Use optimized libraries like cuBLAS
  2. Act as a fast container - Does not minimize Python overhead. For deployment:
    • Use the emitted C++ code directly, or
    • Use framework extension emitters
  3. Be a JIT compilation engine - Enables CUTLASS in Python but doesn’t aim to be a Python-to-CUDA JIT framework

Comparison to PyCUTLASS

The CUTLASS Python interface builds on PyCUTLASS but provides a higher-level API:
| Feature        | PyCUTLASS                      | CUTLASS Python Interface       |
|----------------|--------------------------------|--------------------------------|
| Configuration  | Exhaustive template parameters | Minimal high-level parameters  |
| Flexibility    | Maximum (similar to C++ API)   | Focused on common use cases    |
| Ease of Use    | Requires detailed knowledge    | Simplified with smart defaults |
| Learning Curve | Steep                          | Gentle                         |

CuTe DSL

The CuTe DSL (Domain-Specific Language) is a Python-based framework for writing high-performance CUDA kernels using CuTe’s layout algebra and tensor abstractions.

Learn More

Explore the CuTe DSL documentation for kernel development

Key Capabilities

  • Write CUDA kernels in Python using @cute.kernel decorator
  • Express complex tensor layouts with layout algebra
  • Utilize hardware features (Tensor Cores, TMA, async pipelines)
  • JIT compilation to optimized PTX/SASS
  • Integration with PyTorch, JAX, and NumPy

Quick Example

import cutlass
import cutlass.cute as cute

@cute.kernel
def elementwise_add(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    
    # Partition work across threads
    tiled_copy = cute.make_tiled_copy_tv(...)
    thr_copy = tiled_copy.get_slice(tidx)
    
    # Load, compute, store
    frgA = cute.make_fragment_like(thr_copy.partition_S(gA[bidx]))
    frgB = cute.make_fragment_like(thr_copy.partition_S(gB[bidx]))
    cute.copy(thr_copy, gA[bidx], frgA)
    cute.copy(thr_copy, gB[bidx], frgB)
    
    result = frgA.load() + frgB.load()
    frgC = cute.make_fragment_like(frgA)  # register fragment for the result
    frgC.store(result)
    cute.copy(thr_copy, frgC, gC[bidx])

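Functionally, the kernel above is a plain elementwise add; the tiling and per-thread partitioning are performance details that do not change the result. A NumPy one-liner therefore makes a convenient correctness reference while developing the CuTe version (a sketch; `elementwise_add_reference` is a hypothetical helper, not part of the DSL):

```python
import numpy as np

def elementwise_add_reference(gA, gB):
    # Same result the tiled CuTe kernel produces: gC = gA + gB
    return gA + gB

gA = np.arange(6, dtype=np.float32).reshape(2, 3)
gB = np.full((2, 3), 10.0, dtype=np.float32)
gC = elementwise_add_reference(gA, gB)
# gC == [[10, 11, 12], [13, 14, 15]]
```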
Installation

pip install nvidia-cutlass
Any packages named cutlass (without nvidia- prefix) are not affiliated with NVIDIA CUTLASS.

Requirements

CUDA

  • CUDA 11.8, 12.0, 12.1+
  • Matching cuda-python version

Python

  • Python 3.8, 3.9, 3.10+
  • PyTorch (optional, for integration)

GPU

  • Ampere (SM80+) for basic features
  • Hopper (SM90) for advanced features
  • Blackwell (SM100) for the latest features

Environment

  • CUTLASS_PATH (optional) - path to a CUTLASS source checkout to compile against
  • CUDA_INSTALL_PATH (optional) - location of the CUDA Toolkit installation
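A typical shell setup might look like the following. The paths are hypothetical examples, and both variables can be omitted when the defaults bundled with the pip package suffice:

```shell
# Hypothetical example paths; adjust to your system.
export CUTLASS_PATH=/opt/cutlass            # CUTLASS source tree to build against
export CUDA_INSTALL_PATH=/usr/local/cuda    # CUDA Toolkit installation
```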

Next Steps

Quickstart

Get started with CUTLASS Python in minutes

CuTe DSL

Learn the CuTe DSL for custom kernels

Examples

Explore example kernels and notebooks

PyTorch Integration

Integrate with PyTorch workflows
