
Introduction

CuTe (CUDA Templates) is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. It provides a composable, structured approach to writing high-performance CUDA kernels.
CuTe is the foundation for CUTLASS 3.x kernels and provides a more modern, composable alternative to the CUTLASS 2.x hierarchy.

Core Concepts

CuTe is built around three fundamental abstractions:

Layout

Describes the mapping from logical coordinates to linear indices

Tensor

Combines data (pointer) with a Layout to provide structured access

Algorithms

Operations on Tensors (copy, fill, GEMM, etc.)

Layouts

Layouts define the relationship between multi-dimensional coordinates and linear memory addresses.

Layout Definition

include/cute/layout.hpp:98
template <class Shape, class Stride = LayoutLeft::Apply<Shape>>
struct Layout {
  CUTE_HOST_DEVICE constexpr
  Layout(Shape const& shape = {}, Stride const& stride = {});
  
  template <int... I>
  CUTE_HOST_DEVICE constexpr
  decltype(auto) shape() const;
  
  template <int... I>
  CUTE_HOST_DEVICE constexpr
  decltype(auto) stride() const;
};

Creating Layouts

using namespace cute;

// 1D layout: 8 elements
auto layout_1d = make_layout(8);

// 2D layout: 4×8 column-major
auto layout_2d = make_layout(
  make_shape(4, 8),           // Shape: (4, 8)
  make_stride(1, 4)           // Stride: (1, 4) - column-major
);

// 2D layout: 4×8 row-major
auto layout_2d_row = make_layout(
  make_shape(4, 8),
  make_stride(8, 1)           // Stride: (8, 1) - row-major
);

// Hierarchical layout: (4,8):(1,4) with each mode split in two,
// giving ((2,2),(4,2)):((1,2),(4,16))
auto layout_hier = make_layout(
  make_shape(make_shape(2, 2), make_shape(4, 2)),
  make_stride(make_stride(1, 2), make_stride(4, 16))
);

Layout Operations

make_layout(shape, stride): Creates a Layout from a shape and stride
size(layout): Returns the total number of elements in the layout
rank(layout): Returns the number of modes (dimensions) in the layout
depth(layout): Returns the hierarchical depth of the layout
coalesce(layout): Simplifies the layout by merging adjacent modes with compatible strides

Tensors

Tensors combine a pointer with a Layout to provide structured multi-dimensional views of data.

Tensor Definition

include/cute/tensor_impl.hpp
template <class Engine, class Layout>
struct Tensor {
  using iterator     = typename Engine::iterator;
  using value_type   = typename Engine::value_type;
  using element_type = typename Engine::element_type;
  using reference    = typename Engine::reference;
};

Creating Tensors

using namespace cute;

// Create tensor from global memory pointer
float* gmem_ptr = /* ... */;
auto gmem_tensor = make_tensor(
  make_gmem_ptr(gmem_ptr),
  make_layout(make_shape(M, N), make_stride(1, M))  // Column-major
);

// Indexing: gmem_tensor(i, j) accesses element at row i, column j
float value = gmem_tensor(2, 3);

Tensor Operations

make_tensor(pointer, layout): Creates a Tensor from a pointer and layout
local_tile(tensor, tile_shape, coord): Extracts a tile from a tensor at the given coordinate
local_partition(tensor, thread_layout, thread_id): Partitions a tensor across threads according to a thread layout
make_tensor_like(tensor): Creates a new tensor with the same shape and layout (useful for accumulators)

Practical Example: GEMM Kernel

From examples/cute/tutorial/sgemm_1.cu:
template <class ProblemShape, class CtaTiler,
          class TA, class AStride,
          class TB, class BStride,
          class TC, class CStride,
          class Alpha, class Beta>
__global__ void gemm_device(
    ProblemShape shape_MNK, CtaTiler cta_tiler,
    TA const* A, AStride dA,
    TB const* B, BStride dB,
    TC* C, CStride dC,
    Alpha alpha, Beta beta) {
  
  using namespace cute;
  
  // Create full tensor views
  Tensor mA = make_tensor(make_gmem_ptr(A), 
                          select<0,2>(shape_MNK), dA);  // (M,K)
  Tensor mB = make_tensor(make_gmem_ptr(B), 
                          select<1,2>(shape_MNK), dB);  // (N,K)
  Tensor mC = make_tensor(make_gmem_ptr(C), 
                          select<0,1>(shape_MNK), dC);  // (M,N)
  
  // Get this CTA's tile
  auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _);
  Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1, X,_1>{});
  Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X,_1,_1>{});
  Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1,_1, X>{});
}

Algorithms

CuTe provides high-level algorithms for common operations.

Copy Operations

include/cute/algorithm/copy.hpp
// Simple copy
copy(src_tensor, dst_tensor);

// Async copy (Ampere+): pass a cp.async Copy_Atom; commit/wait with
// cp_async_fence() / cp_async_wait<N>()
copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>{}, src_tensor, dst_tensor);

// Copy with predication
copy_if(predicate, src_tensor, dst_tensor);

// Cooperative copy across thread group
cooperative_copy<NumThreads>(tid, src_tensor, dst_tensor);

Fill and Clear

Fill Operations
// Fill with value
fill(tensor, 3.14f);

// Clear to zero
clear(tensor);

GEMM Algorithm

include/cute/algorithm/gemm.hpp
// Register-level GEMM: C += A * B
gemm(A_tensor, B_tensor, C_tensor);

// Cooperative GEMM across thread group
cooperative_gemm<NumThreads>(tid, A_tensor, B_tensor, C_tensor);

AXPBY (Linear Combination)

AXPBY
// Y = alpha * X + beta * Y
axpby(alpha, X_tensor, beta, Y_tensor);

Special Layouts and Patterns

Swizzling

Swizzled layouts reduce shared memory bank conflicts:
Swizzled Layout
using namespace cute;

// Create swizzled layout for shared memory
auto smem_layout = composition(
  Swizzle<3,0,3>{},              // XOR swizzle pattern
  make_layout(make_shape(128, 32), make_stride(1, 128))
);

Blocked Layouts

Blocked Layouts
// Tile a (4,8) block across a (2,2) arrangement of blocks
auto blocked = blocked_product(
  make_layout(make_shape(4, 8)),    // Block (tile) shape
  make_layout(make_shape(2, 2))     // Arrangement of blocks per mode
);
// Resulting shape: ((4,2),(8,2))

Type System

Compile-Time Integers

Integral Constants
using namespace cute;

// Compile-time integer
Int<4> static_four{};

// Arithmetic at compile time
auto result = Int<4>{} * Int<8>{};  // Int<32>

// Dynamic extents are ordinary runtime integers
int n = 16;
auto shape = make_shape(Int<4>{}, n, Int<8>{});

// The underscore (_) is a wildcard used in coordinates for slicing and
// tiling (as in make_coord), not a dynamic extent

Tuples

Tuple Types
using namespace cute;

// Create tuple
auto t = make_tuple(1, 2, 3);

// Access elements
auto first = get<0>(t);   // 1
auto second = get<1>(t);  // 2

// Hierarchical tuples
auto nested = make_tuple(
  make_tuple(1, 2),
  make_tuple(3, 4)
);

Atom Types

Atoms describe hardware-specific instruction patterns.

Copy Atoms

Copy Atoms
#include <cute/atom/copy_atom.hpp>

// cp.async - asynchronous 128-bit global-to-shared copy (Ampere)
using GmemLoadAtom = Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>;

// LDSM - Shared memory load for matrix
using SmemLoadAtom = Copy_Atom<SM75_U32x4_LDSM_N, half_t>;

MMA Atoms

MMA Atoms
#include <cute/atom/mma_atom.hpp>

// Tensor Core MMA atom (Ampere)
using MmaAtom = MMA_Atom<SM80_16x8x16_F16F16F16F16_TN>;

// Tensor Core MMA atom (Volta)
using VoltaMma = MMA_Atom<SM70_8x8x4_F32F16F16F32_TN>;

// SIMT FMA atom (scalar fused multiply-add)
using SimtMma = MMA_Atom<UniversalFMA<float>>;

Debugging and Visualization

Printing
using namespace cute;

// Print layout
auto layout = make_layout(make_shape(4, 8), make_stride(1, 4));
print(layout);
// Output: (4,8):(1,4)

// Print tensor
Tensor tensor = make_tensor(ptr, layout);
print_tensor(tensor);

// Print in LaTeX format for documentation
print_latex(layout);

Compile-Time Assertions

Static Assertions
CUTE_STATIC_ASSERT_V(size(layout) == Int<32>{});
CUTE_STATIC_ASSERT_V(rank(layout) == Int<2>{});
CUTE_STATIC_ASSERT(is_static<decltype(layout)>::value);

Key Functions Reference

Shape and Stride Utilities

make_shape(...): Creates a Shape tuple from its arguments
make_stride(...): Creates a Stride tuple from its arguments
make_coord(...): Creates a coordinate tuple for indexing
size<I>(layout): Returns the size of mode I (or total size if I is omitted)
shape<I>(layout): Returns the shape of mode I (or full shape if I is omitted)
stride<I>(layout): Returns the stride of mode I (or full stride if I is omitted)

Composition and Manipulation

composition(layout_a, layout_b): Composes two layouts as layout_a ∘ layout_b, mapping c to layout_a(layout_b(c))
complement(layout, shape): Returns the complementary layout within a given shape
logical_divide(layout, tile): Divides a layout into tiles
zipped_divide(layout, tile): Divides and interleaves (for thread partitioning)

Best Practices

Use Static Shapes When Possible

Static shapes (Int<N>) enable compile-time optimizations and better code generation.

Leverage Layout Composition

Build complex layouts by composing simpler ones; this is more maintainable and often more efficient.

Partition Before Loop

Partition tensors across threads outside loops to avoid recomputation.

Use Typed Tensors

Let CuTe’s type system catch shape mismatches at compile time.

Advanced Topics

TMA (Tensor Memory Accelerator)

Hopper architecture’s hardware-accelerated tensor loads:
TMA Copy
#include <cute/atom/copy_atom.hpp>

// TMA descriptor-based copy
using TmaLoadAtom = Copy_Atom<SM90_TMA_LOAD, float>;

// Use in kernel with TMA descriptor

Warp-Specialized Kernels

Different warps perform different roles:
Warp Specialization
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;

if (warp_id == 0) {
  // Producer warp: load data
} else {
  // Consumer warp: compute
}

See Also

CuTe Tutorials

Examples in examples/cute/tutorial/ demonstrate progressive complexity

GEMM API

High-level GEMM API built on CuTe (CUTLASS 3.x)

Architecture Guide

Learn about architecture-specific features

Performance Guide

Optimize CuTe-based kernels
