Memory Layouts
Memory layout is critical for GPU performance. This page explains how CUTLASS represents and optimizes data layouts for different memory hierarchies and access patterns.
Why Layouts Matter
GPU performance depends heavily on memory access patterns:
Coalesced Access: Threads in a warp accessing consecutive memory locations
Bank Conflicts: Multiple threads accessing the same shared memory bank
Cache Utilization: Data reuse within L1/L2 caches
Vectorization: Loading/storing multiple elements per instruction
Performance Impact:
Well-chosen layouts can yield 10-100× speedups over naive implementations.
Layout Fundamentals
A layout is a mathematical function mapping logical coordinates to linear memory offsets.
Definition
In CUTLASS/CuTe, a layout consists of:
Shape: Logical dimensions (e.g., (M, N) for a matrix)
Stride: Memory offset between consecutive elements along each dimension
namespace cutlass::layout {

class RowMajor {
  static int const kRank = 2;        // 2D tensor
  static int const kStrideRank = 1;  // 1 stride parameter

  using Index = int32_t;
  using LongIndex = int64_t;
  using TensorCoord = MatrixCoord;   // (row, column)
  using Stride = Coord<kStrideRank, LongIndex>;

  Stride stride_;  // Leading dimension

  // Map coordinate to offset
  CUTLASS_HOST_DEVICE
  LongIndex operator()(MatrixCoord const &coord) const {
    return LongIndex(coord.row()) * stride_[0] + coord.column();
  }
};

} // namespace cutlass::layout
Reference: include/cutlass/layout/matrix.h:58
Common Matrix Layouts
Row-Major Layout
Elements in the same row are contiguous in memory:
class RowMajor {
// Matrix: M rows × N columns
// Stride: leading dimension (typically N)
// offset(i, j) = i * stride + j
};
// Example: 4×3 matrix in row-major
// Logical:          Memory:
// [a00 a01 a02]     [a00 a01 a02 a10 a11 a12 a20 a21 a22 a30 a31 a32]
// [a10 a11 a12]      ^---------^ ^---------^ ^---------^ ^---------^
// [a20 a21 a22]         row 0       row 1       row 2       row 3
// [a30 a31 a32]
Use Case: Default for C/C++ multi-dimensional arrays
Column-Major Layout
Elements in the same column are contiguous in memory:
class ColumnMajor {
// Matrix: M rows × N columns
// Stride: leading dimension (typically M)
// offset(i, j) = i + j * stride
};
// Example: 4×3 matrix in column-major
// Logical:          Memory:
// [a00 a01 a02]     [a00 a10 a20 a30 a01 a11 a21 a31 a02 a12 a22 a32]
// [a10 a11 a12]      ^-------------^ ^-------------^ ^-------------^
// [a20 a21 a22]         column 0        column 1        column 2
// [a30 a31 a32]
Use Case: Fortran, cuBLAS, optimal for column-wise operations
Reference: include/cutlass/layout/matrix.h:150
Row-Major vs Column-Major Performance
Neither layout is universally faster: traversal along the contiguous dimension coalesces across threads, while traversal along the strided dimension does not. Choose the layout that matches the dominant access direction of the kernel and of the libraries it interoperates with (e.g., cuBLAS expects column-major).
Tensor Layouts in CuTe
CuTe provides a more general layout representation:
template <class Shape, class Stride>
struct Layout {
  Shape shape_;    // Hierarchical shape
  Stride stride_;  // Hierarchical stride

  // Map coordinate to offset
  template <class Coord>
  CUTE_HOST_DEVICE
  auto operator()(Coord const& coord) const {
    return crd2idx(coord, shape_, stride_);
  }
};
// Create layouts
auto layout = make_layout(
  make_shape(128, 128),  // Shape: 128×128
  make_stride(128, 1)    // Stride: row-major
);
Reference: include/cute/layout.hpp:98
Hierarchical Layouts
CuTe supports nested layouts for expressing complex patterns:
// Hierarchical shape: 2D grid of 2D tiles
auto layout = make_layout(
  make_shape(
    make_shape(8, 8),    // 8×8 tiles in threadblock
    make_shape(32, 32)   // Each tile is 32×32 elements
  ),
  make_stride(
    make_stride(32, 256),  // Outer stride: between tiles
    make_stride(1, 8)      // Inner stride: within tiles
  )
);
// Total shape: (8*32) × (8*32) = 256×256
// Tiled access pattern for cache locality
Swizzled Layouts
Swizzling reorders data to avoid shared memory bank conflicts:
Bank Conflict Problem
Shared memory is organized into banks (32 on modern GPUs):
// Without swizzling: threads access same bank
// Thread 0: smem[0] -> bank 0
// Thread 1: smem[32] -> bank 0 // CONFLICT!
// Thread 2: smem[64] -> bank 0 // CONFLICT!
// ...
// With swizzling: threads access different banks
// Thread 0: smem[0] -> bank 0
// Thread 1: smem[33] -> bank 1
// Thread 2: smem[66] -> bank 2
// ...
Swizzle Functions
// XOR-based swizzle for shared memory layouts (simplified illustration;
// CuTe's actual swizzle takes three parameters, Swizzle<B, M, S>)
template <int Bits>
struct Swizzle {
  // Swizzle the offset
  CUTE_HOST_DEVICE
  int operator()(int offset) const {
    return (offset >> Bits) ^ offset;
  }
};
// Example: swizzle for bank conflict avoidance
using SmemLayoutA = decltype(
  composition(
    Swizzle<4>{},  // XOR bits 4+ into bits 0-3
    make_layout(make_shape(128, 64), make_stride(64, 1))
  )
);
CUTLASS automatically applies appropriate swizzling for shared memory layouts. You rarely need to implement swizzling manually!
Tensor Core Layouts
Tensor Cores require specific layouts for input fragments:
Fragment Layouts
Data is distributed across threads in a warp:
// For SM80 16×8×8 MMA instruction:
// - 32 threads per warp
// - Each thread holds part of the matrix
// A Matrix (16×8): RowMajor
// FragmentA per thread: 4 elements
// Distribution: Thread i holds elements at positions:
// [i, i+32, i+64, i+96] (modulo matrix size)
// B Matrix (8×8): ColumnMajor
// FragmentB per thread: 2 elements
// Distribution: Thread i holds elements at positions:
// [i, i+32] (modulo matrix size)
CUTLASS handles this automatically:
using MmaAtom = MMA_Atom<SM80_16x8x8_F32F16F16F32_TN>;
// Partition automatically computes per-thread fragments
auto tCrA = mma.partition_fragment_A(gA);
auto tCrB = mma.partition_fragment_B(gB);
TiledMMA Layouts
Combine multiple MMA atoms into larger tile layouts:
using TiledMma = TiledMMA<
  MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>,
  Layout<Shape<_2, _4, _1>>,  // 2×4 arrangement of atoms
  Layout<Shape<_1, _2, _1>>   // Thread value layout
>;
// Effective tile shape: (2*16) × (4*8) × 16 = 32×32×16
Layout Algebra
CuTe provides powerful layout manipulation operations:
Logical Divide
Split a layout into hierarchical tiles:
auto layout = make_layout(make_shape(128, 128));
// Divide into 32×32 tiles
auto tiled = logical_divide(layout, make_shape(32, 32));
// Result shape: ((32, 4), (32, 4))
// Each mode splits into (elements per tile, number of tiles):
// 32×32 elements per tile, arranged in a 4×4 grid of tiles
Composition
Compose two layouts:
auto layoutA = make_layout(make_shape(8, 8));
auto layoutB = make_layout(make_shape(64), make_stride(8));
// composed(c) = layoutA(layoutB(c)):
// layoutB maps the coordinate first, then layoutA maps the result
auto composed = composition(layoutA, layoutB);
Complement
Find the “unused” dimensions:
auto layout = make_layout(
  make_shape(8, 16),
  make_stride(16, 1)
);
// Find orthogonal layout (remaining coordinates)
auto comp = complement(layout, 128);
Coalesce
Simplify nested layouts:
// Flatten adjacent modes with compatible strides
auto layout = make_layout(
  make_shape(make_shape(8, 4), 16),
  make_stride(make_stride(1, 8), 32)
);
auto flat = coalesce(layout);
// Result: shape (32, 16), stride (1, 32)
Common Layout Transformations
Pitched Linear Layout
For non-matrix data (e.g., images):
class PitchLinearShape {
  // (contiguous, strided) coordinates
  // Example: image with row padding
  //   contiguous = width
  //   strided = height
  //   stride = padded_width
};
class PitchLinear {
  Stride stride_;
  CUTLASS_HOST_DEVICE
  LongIndex operator()(PitchLinearCoord const &coord) const {
    return coord.contiguous() + coord.strided() * stride_[0];
  }
};
Alignment Requirements
GPU memory operations have alignment requirements:
Vectorized Access
// 128-bit (16-byte) vectorized access
using AccessType = cutlass::Array<half_t, 8>;  // 8 × 2 bytes = 16 bytes
// Pointer must be 16-byte aligned
static_assert(alignof(AccessType) == 16);
// Stride must be a multiple of the vector length
static_assert(kStrideA % 8 == 0);
Global Memory
// Coalesced access requires:
// 1. Adjacent threads access adjacent addresses
// 2. First address is aligned to the access size
// 3. Per-thread access size is 1, 2, 4, 8, or 16 bytes
// Good: Coalesced 128-bit access
float4* ptr = aligned_allocate<float4>(n);
float4 val = ptr[thread_id];  // Each thread loads 16 bytes
// Bad: Uncoalesced access
float* ptr = allocate(n);
float val = ptr[thread_id * stride];  // Non-unit stride
Shared Memory
// Shared memory must be 128-byte aligned for TMA (SM90+)
__shared__ alignas(128) float smem[M * N];
// Bank conflict free access requires:
// - Stride not a multiple of 32 (for 32-bit elements)
// - Use swizzled layouts
CUTLASS templates automatically handle alignment. Use cutlass::AlignedBuffer for manual allocations:
cutlass::AlignedBuffer<float, 1024, 16> buffer;  // 16-byte aligned
Layout Examples
Example 1: Shared Memory Layout
// Shared memory for 128×64 tile with swizzling
using SmemLayoutA = decltype(
  composition(
    Swizzle<3, 3, 3>{},  // XOR swizzle pattern
    make_layout(
      make_shape(_128{}, _64{}),
      make_stride(_64{}, _1{})  // Row-major: rows are 64 elements apart
    )
  )
);
__shared__ float smem_A[cosize(SmemLayoutA{})];
auto tensorA = make_tensor(make_smem_ptr(smem_A), SmemLayoutA{});
Example 2: Thread Partitioning
// Partition global tensor across threadblock
auto gA = make_tensor(
  make_gmem_ptr(A_ptr),
  make_layout(make_shape(M, K), make_stride(K, 1))
);
// Each thread gets a slice
auto thread_layout = make_layout(
  make_shape(32),  // 32 threads per warp
  make_stride(4)   // Each thread strides by 4
);
auto tAgA = local_partition(gA, thread_layout, thread_id);
// Each thread now accesses a strided slice of gA
Example 3: Tensor Core Layout
// MMA atom defines its own layout requirements
using MMA = MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>;
// Partition shared memory according to the MMA layout
auto sA = make_tensor(smem_ptr, smem_layout);
auto tCsA = mma.partition_A(sA);
// Each thread now has the correct fragment layout
auto tCrA = make_fragment_like(tCsA);
copy(tCsA, tCrA);  // Load from smem to registers
Layout Optimization Checklist:
Global Memory:
✓ Use coalesced access patterns (adjacent threads → adjacent memory)
✓ Align pointers to 128 bits (16 bytes)
✓ Vectorize loads/stores (128-bit preferred)
Shared Memory:
✓ Apply swizzling to avoid bank conflicts
✓ Pad dimensions so strides do not map threads to the same bank
✓ Reuse data across thread iterations
Registers:
✓ Minimize register pressure (allow more occupancy)
✓ Use correct fragment layouts for Tensor Cores
✓ Prefer compile-time known indices
CuTe Layouts:
✓ Use static shapes when possible (better optimization)
✓ Let CuTe handle index computation (fewer bugs)
✓ Print layouts during development for debugging
Debugging Layouts
#include "cute/util/print.hpp"
if (thread0()) {
  // Print layout shape and stride
  print("Layout: "); print(layout); print("\n");
  // Print a specific coordinate mapping
  auto coord = make_coord(10, 20);
  print("coord "); print(coord);
  print(" -> offset "); print(layout(coord));
  print("\n");
  // Print a human-readable layout visualization
  print_layout(layout);
}
Advanced: TMA Layouts (SM90+)
Tensor Memory Accelerator requires special descriptor layouts:
// Create TMA descriptor for bulk tensor copy
auto tma_load = make_tma_copy(
  SM90_TMA_LOAD{},
  make_tensor(
    make_gmem_ptr(global_ptr),
    make_layout(make_shape(128, 128), make_stride(128, 1))
  ),
  make_layout(make_shape(64, 64))  // Tile size
);
// Copy an entire 64×64 tile in hardware
tma_load.copy(dst_tensor);
Next Steps
CuTe Library Deep dive into CuTe layout algebra
Tensor Cores Understand Tensor Core fragment layouts
GEMM Operations Apply layouts to GEMM kernels
Examples See layouts in real code