
Introduction

CuTe (CUDA Templates) is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. It provides a composable, structured approach to writing high-performance CUDA kernels.
CuTe is the foundation for CUTLASS 3.x kernels and provides a more modern, composable alternative to the CUTLASS 2.x hierarchy.

Core Concepts

CuTe is built around three fundamental abstractions:

Layout

Describes the mapping from logical coordinates to linear indices

Tensor

Combines data (pointer) with a Layout to provide structured access

Algorithms

Operations on Tensors (copy, fill, GEMM, etc.)

Layouts

Layouts define the relationship between multi-dimensional coordinates and linear memory addresses.

Layout Definition

include/cute/layout.hpp:98
template <class Shape, class Stride = LayoutLeft::Apply<Shape>>
struct Layout {
  CUTE_HOST_DEVICE constexpr
  Layout(Shape const& shape = {}, Stride const& stride = {});
  
  template <int... I>
  CUTE_HOST_DEVICE constexpr
  decltype(auto) shape() const;
  
  template <int... I>
  CUTE_HOST_DEVICE constexpr
  decltype(auto) stride() const;
};

Creating Layouts

using namespace cute;

// 1D layout: 8 elements
auto layout_1d = make_layout(8);

// 2D layout: 4×8 column-major
auto layout_2d = make_layout(
  make_shape(4, 8),           // Shape: (4, 8)
  make_stride(1, 4)           // Stride: (1, 4) - column-major
);

// 2D layout: 4×8 row-major
auto layout_2d_row = make_layout(
  make_shape(4, 8),
  make_stride(8, 1)           // Stride: (8, 1) - row-major
);

// Hierarchical layout: (4,8):(1,4) with each mode split in two,
// giving ((2,2),(4,2)):((1,2),(4,16))
auto layout_hier = make_layout(
  make_shape(make_shape(2, 2), make_shape(4, 2)),
  make_stride(make_stride(1, 2), make_stride(4, 16))
);

Layout Operations

make_layout(shape, stride): Creates a Layout from a shape and stride
size(layout): Returns the total number of elements in the layout
rank(layout): Returns the number of modes (dimensions) in the layout
depth(layout): Returns the hierarchical depth of the layout
coalesce(layout): Simplifies the layout by merging adjacent modes with compatible strides

Tensors

Tensors combine a pointer with a Layout to provide structured multi-dimensional views of data.

Tensor Definition

include/cute/tensor_impl.hpp
template <class Engine, class Layout>
struct Tensor {
  using iterator     = typename Engine::iterator;
  using value_type   = typename Engine::value_type;
  using element_type = typename Engine::element_type;
  using reference    = typename Engine::reference;
};

Creating Tensors

using namespace cute;

// Create tensor from global memory pointer
float* gmem_ptr = /* ... */;
auto gmem_tensor = make_tensor(
  make_gmem_ptr(gmem_ptr),
  make_layout(make_shape(M, N), make_stride(1, M))  // Column-major
);

// Indexing: gmem_tensor(i, j) accesses element at row i, column j
float value = gmem_tensor(2, 3);

Tensor Operations

make_tensor(pointer, layout): Creates a Tensor from a pointer and layout
local_tile(tensor, tile_shape, coord): Extracts a tile from a tensor at the given coordinate
local_partition(tensor, thread_layout, thread_id): Partitions a tensor across threads according to a thread layout
make_tensor_like(tensor): Creates a new tensor with the same shape and layout (useful for accumulators)

Practical Example: GEMM Kernel

From examples/cute/tutorial/sgemm_1.cu:
template <class ProblemShape, class CtaTiler,
          class TA, class AStride,
          class TB, class BStride,
          class TC, class CStride,
          class Alpha, class Beta>
__global__ void gemm_device(
    ProblemShape shape_MNK, CtaTiler cta_tiler,
    TA const* A, AStride dA,
    TB const* B, BStride dB,
    TC* C, CStride dC,
    Alpha alpha, Beta beta) {
  
  using namespace cute;
  
  // Create full tensor views
  Tensor mA = make_tensor(make_gmem_ptr(A), 
                          select<0,2>(shape_MNK), dA);  // (M,K)
  Tensor mB = make_tensor(make_gmem_ptr(B), 
                          select<1,2>(shape_MNK), dB);  // (N,K)
  Tensor mC = make_tensor(make_gmem_ptr(C), 
                          select<0,1>(shape_MNK), dC);  // (M,N)
  
  // Get this CTA's tile
  auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _);
  Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1, X,_1>{});
  Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X,_1,_1>{});
  Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1,_1, X>{});
}

Algorithms

CuTe provides high-level algorithms for common operations.

Copy Operations

include/cute/algorithm/copy.hpp
// Simple copy
copy(src_tensor, dst_tensor);

// Async copy (Ampere+): pass a cp.async Copy_Atom; commit/wait with
// cp_async_fence() / cp_async_wait<N>()
copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>{}, src_tensor, dst_tensor);

// Copy with predication
copy_if(predicate, src_tensor, dst_tensor);

// Cooperative copy across thread group
cooperative_copy<NumThreads>(tid, src_tensor, dst_tensor);

Fill and Clear

Fill Operations
// Fill with value
fill(tensor, 3.14f);

// Clear to zero
clear(tensor);

GEMM Algorithm

include/cute/algorithm/gemm.hpp
// Register-level GEMM: C += A * B
gemm(A_tensor, B_tensor, C_tensor);

// Cooperative GEMM across thread group
cooperative_gemm<NumThreads>(tid, A_tensor, B_tensor, C_tensor);

AXPBY (Linear Combination)

AXPBY
// Y = alpha * X + beta * Y
axpby(alpha, X_tensor, beta, Y_tensor);

Special Layouts and Patterns

Swizzling

Swizzled layouts reduce shared memory bank conflicts:
Swizzled Layout
using namespace cute;

// Create swizzled layout for shared memory
auto smem_layout = composition(
  Swizzle<3,0,3>{},              // XOR swizzle pattern
  make_layout(make_shape(128, 32), make_stride(1, 128))
);

Blocked Layouts

Blocked Layouts
// Tile a (4,8) block across a (2,2) arrangement of blocks
auto blocked = blocked_product(
  make_layout(make_shape(4, 8)),    // Block (tile) shape
  make_layout(make_shape(2, 2))     // Arrangement of blocks per mode
);
// Resulting shape: ((4,2),(8,2))

Type System

Compile-Time Integers

Integral Constants
using namespace cute;

// Compile-time integer
Int<4> static_four{};

// Arithmetic at compile time
auto result = Int<4>{} * Int<8>{};  // Int<32>

// Dynamic extents are ordinary runtime integers
int n = 16;
auto shape = make_shape(Int<4>{}, n, Int<8>{});

// The underscore (_) is a wildcard used in coordinates for slicing and
// tiling (as in make_coord), not a dynamic extent

Tuples

Tuple Types
using namespace cute;

// Create tuple
auto t = make_tuple(1, 2, 3);

// Access elements
auto first = get<0>(t);   // 1
auto second = get<1>(t);  // 2

// Hierarchical tuples
auto nested = make_tuple(
  make_tuple(1, 2),
  make_tuple(3, 4)
);

Atom Types

Atoms describe hardware-specific instruction patterns.

Copy Atoms

Copy Atoms
#include <cute/atom/copy_atom.hpp>

// cp.async - asynchronous 128-bit global-to-shared copy (Ampere)
using GmemLoadAtom = Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>;

// LDSM - Shared memory load for matrix
using SmemLoadAtom = Copy_Atom<SM75_U32x4_LDSM_N, half_t>;

MMA Atoms

MMA Atoms
#include <cute/atom/mma_atom.hpp>

// Tensor Core MMA atom (Ampere)
using MmaAtom = MMA_Atom<SM80_16x8x16_F16F16F16F16_TN>;

// Tensor Core MMA atom (Volta)
using VoltaMma = MMA_Atom<SM70_8x8x4_F32F16F16F32_TN>;

// SIMT FMA atom (scalar fused multiply-add)
using SimtMma = MMA_Atom<UniversalFMA<float>>;

Debugging and Visualization

Printing
using namespace cute;

// Print layout
auto layout = make_layout(make_shape(4, 8), make_stride(1, 4));
print(layout);
// Output: (4,8):(1,4)

// Print tensor
Tensor tensor = make_tensor(ptr, layout);
print_tensor(tensor);

// Print in LaTeX format for documentation
print_latex(layout);

Compile-Time Assertions

Static Assertions
CUTE_STATIC_ASSERT_V(size(layout) == Int<32>{});
CUTE_STATIC_ASSERT_V(rank(layout) == Int<2>{});
CUTE_STATIC_ASSERT(is_static<decltype(layout)>::value);

Key Functions Reference

Shape and Stride Utilities

make_shape(...): Creates a Shape tuple from its arguments
make_stride(...): Creates a Stride tuple from its arguments
make_coord(...): Creates a coordinate tuple for indexing
size<I>(layout): Returns the size of mode I (or total size if I is omitted)
shape<I>(layout): Returns the shape of mode I (or full shape if I is omitted)
stride<I>(layout): Returns the stride of mode I (or full stride if I is omitted)

Composition and Manipulation

composition(layout_a, layout_b): Composes two layouts as layout_a ∘ layout_b, mapping c to layout_a(layout_b(c))
complement(layout, shape): Returns the complementary layout within a given shape
logical_divide(layout, tile): Divides a layout into tiles
zipped_divide(layout, tile): Divides and interleaves (for thread partitioning)

Best Practices

Use Static Shapes When Possible

Static shapes (Int<N>) enable compile-time optimizations and better code generation.

Leverage Layout Composition

Build complex layouts by composing simpler ones; this is more maintainable and often more efficient.

Partition Before Loop

Partition tensors across threads outside loops to avoid recomputation.

Use Typed Tensors

Let CuTe’s type system catch shape mismatches at compile time.

Advanced Topics

TMA (Tensor Memory Accelerator)

Hopper architecture’s hardware-accelerated tensor loads:
TMA Copy
#include <cute/atom/copy_atom.hpp>

// TMA descriptor-based copy
using TmaLoadAtom = Copy_Atom<SM90_TMA_LOAD, float>;

// Use in kernel with TMA descriptor

Warp-Specialized Kernels

Different warps perform different roles:
Warp Specialization
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;

if (warp_id == 0) {
  // Producer warp: load data
} else {
  // Consumer warp: compute
}

See Also

CuTe Tutorials

Examples in examples/cute/tutorial/ demonstrate progressive complexity

GEMM API

High-level GEMM API built on CuTe (CUTLASS 3.x)

Architecture Guide

Learn about architecture-specific features

Performance Guide

Optimize CuTe-based kernels
