Memory Layouts
Memory layout is critical for GPU performance. This page explains how CUTLASS represents and optimizes data layouts for different memory hierarchies and access patterns.
Why Layouts Matter
GPU performance depends heavily on memory access patterns:
Coalesced Access: Threads in a warp accessing consecutive memory locations
Bank Conflicts: Multiple threads accessing the same shared memory bank
Cache Utilization: Data reuse within L1/L2 caches
Vectorization: Loading/storing multiple elements per instruction
Performance Impact:
Well-chosen layouts can yield 10-100× speedups over naive implementations.
Layout Fundamentals
A layout is a mathematical function mapping logical coordinates to linear memory offsets.
Definition
In CUTLASS/CuTe, a layout consists of:
Shape: Logical dimensions (e.g., (M, N) for a matrix)
Stride: Memory offset between consecutive elements along each dimension
namespace cutlass::layout {

class RowMajor {
  static int const kRank = 2;        // 2D tensor
  static int const kStrideRank = 1;  // 1 stride parameter

  using Index = int32_t;
  using LongIndex = int64_t;
  using TensorCoord = MatrixCoord;   // (row, column)
  using Stride = Coord<kStrideRank, LongIndex>;

  Stride stride_;  // Leading dimension

  // Map coordinate to offset
  CUTLASS_HOST_DEVICE
  LongIndex operator()(MatrixCoord const &coord) const {
    return LongIndex(coord.row()) * stride_[0] + coord.column();
  }
};

} // namespace cutlass::layout
Reference: include/cutlass/layout/matrix.h:58
Common Matrix Layouts
Row-Major Layout
Elements in the same row are contiguous in memory:
class RowMajor {
// Matrix: M rows × N columns
// Stride: leading dimension (typically N)
// offset(i, j) = i * stride + j
};
// Example: 4×3 matrix in row-major
// Logical:          Memory:
// [a00 a01 a02]     [a00 a01 a02 a10 a11 a12 a20 a21 a22 a30 a31 a32]
// [a10 a11 a12]      ^---------^ ^---------^ ^---------^ ^---------^
// [a20 a21 a22]         row 0       row 1       row 2       row 3
// [a30 a31 a32]
Use Case: Default for C/C++ multi-dimensional arrays
Column-Major Layout
Elements in the same column are contiguous in memory:
class ColumnMajor {
// Matrix: M rows × N columns
// Stride: leading dimension (typically M)
// offset(i, j) = i + j * stride
};
// Example: 4×3 matrix in column-major
// Logical:          Memory:
// [a00 a01 a02]     [a00 a10 a20 a30 a01 a11 a21 a31 a02 a12 a22 a32]
// [a10 a11 a12]      ^-------------^ ^-------------^ ^-------------^
// [a20 a21 a22]         column 0        column 1        column 2
// [a30 a31 a32]
Use Case: Fortran, cuBLAS, optimal for column-wise operations
Reference: include/cutlass/layout/matrix.h:150
Row-Major vs Column-Major Performance
Neither layout is universally faster: traversal along the contiguous dimension coalesces across threads, while traversal along the strided dimension does not. Choose the layout that matches the dominant access direction of the kernel and of the libraries it interoperates with (e.g., cuBLAS expects column-major).
Tensor Layouts in CuTe
CuTe provides a more general layout representation:
template <class Shape, class Stride>
struct Layout {
  Shape shape_;    // Hierarchical shape
  Stride stride_;  // Hierarchical stride

  // Map coordinate to offset
  template <class Coord>
  CUTE_HOST_DEVICE
  auto operator()(Coord const& coord) const {
    return crd2idx(coord, shape_, stride_);
  }
};
// Create layouts
auto layout = make_layout(
  make_shape(128, 128),  // Shape: 128×128
  make_stride(128, 1)    // Stride: row-major
);
Reference: include/cute/layout.hpp:98
Hierarchical Layouts
CuTe supports nested layouts for expressing complex patterns:
// Hierarchical shape: 2D grid of 2D tiles
auto layout = make_layout(
  make_shape(
    make_shape(8, 8),    // 8×8 tiles in threadblock
    make_shape(32, 32)   // Each tile is 32×32 elements
  ),
  make_stride(
    make_stride(32, 256),  // Outer stride: between tiles
    make_stride(1, 8)      // Inner stride: within tiles
  )
);
// Total shape: (8*32) × (8*32) = 256×256
// Tiled access pattern for cache locality
Swizzled Layouts
Swizzling reorders data to avoid shared memory bank conflicts:
Bank Conflict Problem
Shared memory is organized into banks (32 on modern GPUs):
// Without swizzling: threads access same bank
// Thread 0: smem[0] -> bank 0
// Thread 1: smem[32] -> bank 0 // CONFLICT!
// Thread 2: smem[64] -> bank 0 // CONFLICT!
// ...
// With swizzling: threads access different banks
// Thread 0: smem[0] -> bank 0
// Thread 1: smem[33] -> bank 1
// Thread 2: smem[66] -> bank 2
// ...
Swizzle Functions
// XOR-based swizzle for shared memory layouts (simplified illustration;
// CuTe's actual swizzle takes three parameters, Swizzle<B, M, S>)
template <int Bits>
struct Swizzle {
  // Swizzle the offset
  CUTE_HOST_DEVICE
  int operator()(int offset) const {
    return (offset >> Bits) ^ offset;
  }
};
// Example: swizzle for bank conflict avoidance
using SmemLayoutA = decltype(
  composition(
    Swizzle<4>{},  // XOR bits 4+ into bits 0-3
    make_layout(make_shape(128, 64), make_stride(64, 1))
  )
);
CUTLASS automatically applies appropriate swizzling for shared memory layouts. You rarely need to implement swizzling manually!
Tensor Core Layouts
Tensor Cores require specific layouts for input fragments:
Fragment Layouts
Data is distributed across threads in a warp:
// For SM80 16×8×8 MMA instruction:
// - 32 threads per warp
// - Each thread holds part of the matrix
// A Matrix (16×8): RowMajor
// FragmentA per thread: 4 elements
// Distribution: Thread i holds elements at positions:
// [i, i+32, i+64, i+96] (modulo matrix size)
// B Matrix (8×8): ColumnMajor
// FragmentB per thread: 2 elements
// Distribution: Thread i holds elements at positions:
// [i, i+32] (modulo matrix size)
CUTLASS handles this automatically:
using MmaAtom = MMA_Atom<SM80_16x8x8_F32F16F16F32_TN>;
// Partition automatically computes per-thread fragments
auto tCrA = mma.partition_fragment_A(gA);
auto tCrB = mma.partition_fragment_B(gB);
TiledMMA Layouts
Combine multiple MMA atoms into larger tile layouts:
using TiledMma = TiledMMA<
  MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>,
  Layout<Shape<_2, _4, _1>>,  // 2×4 arrangement of atoms
  Layout<Shape<_1, _2, _1>>   // Thread value layout
>;
// Effective tile shape: (2*16) × (4*8) × 16 = 32×32×16
Layout Algebra
CuTe provides powerful layout manipulation operations:
Logical Divide
Split a layout into hierarchical tiles:
auto layout = make_layout(make_shape(128, 128));
// Divide into 32×32 tiles
auto tiled = logical_divide(layout, make_shape(32, 32));
// Result shape: ((32, 4), (32, 4))
// Each mode splits into (elements per tile, number of tiles):
// 32×32 elements per tile, arranged in a 4×4 grid of tiles
Composition
Compose two layouts:
auto layoutA = make_layout(make_shape(8, 8));
auto layoutB = make_layout(make_shape(64), make_stride(8));
// composed(c) = layoutA(layoutB(c)):
// layoutB maps the coordinate first, then layoutA maps the result
auto composed = composition(layoutA, layoutB);
Complement
Find the “unused” dimensions:
auto layout = make_layout(
  make_shape(8, 16),
  make_stride(16, 1)
);
// Find orthogonal layout (remaining coordinates)
auto comp = complement(layout, 128);
Coalesce
Simplify nested layouts:
// Flatten adjacent modes with compatible strides
auto layout = make_layout(
  make_shape(make_shape(8, 4), 16),
  make_stride(make_stride(1, 8), 32)
);
auto flat = coalesce(layout);
// Result: shape (32, 16), stride (1, 32)
Common Layout Transformations
Pitched Linear Layout
For non-matrix data (e.g., images):
class PitchLinearShape {
  // (contiguous, strided) coordinates
  // Example: image with row padding
  //   contiguous = width
  //   strided = height
  //   stride = padded_width
};
class PitchLinear {
  Stride stride_;
  CUTLASS_HOST_DEVICE
  LongIndex operator()(PitchLinearCoord const &coord) const {
    return coord.contiguous() + coord.strided() * stride_[0];
  }
};
Alignment Requirements
GPU memory operations have alignment requirements:
Vectorized Access
// 128-bit (16-byte) vectorized access
using AccessType = cutlass::Array<half_t, 8>;  // 8 × 2 bytes = 16 bytes
// Pointer must be 16-byte aligned
static_assert(alignof(AccessType) == 16);
// Stride must be a multiple of the vector length
static_assert(kStrideA % 8 == 0);
Global Memory
// Coalesced access requires:
// 1. Adjacent threads access adjacent addresses
// 2. First address is aligned to the access size
// 3. Per-thread access size is 1, 2, 4, 8, or 16 bytes
// Good: Coalesced 128-bit access
float4* ptr = aligned_allocate<float4>(n);
float4 val = ptr[thread_id];  // Each thread loads 16 bytes
// Bad: Uncoalesced access
float* ptr = allocate(n);
float val = ptr[thread_id * stride];  // Non-unit stride
Shared Memory
// Shared memory must be 128-byte aligned for TMA (SM90+)
__shared__ alignas(128) float smem[M * N];
// Bank conflict free access requires:
// - Stride not a multiple of 32 (for 32-bit elements)
// - Use swizzled layouts
CUTLASS templates automatically handle alignment. Use cutlass::AlignedBuffer for manual allocations:
cutlass::AlignedBuffer<float, 1024, 16> buffer;  // 16-byte aligned
Layout Examples
Example 1: Shared Memory Layout
// Shared memory for 128×64 tile with swizzling
using SmemLayoutA = decltype(
  composition(
    Swizzle<3, 3, 3>{},  // XOR swizzle pattern
    make_layout(
      make_shape(_128{}, _64{}),
      make_stride(_64{}, _1{})  // Row-major: rows are 64 elements apart
    )
  )
);
__shared__ float smem_A[cosize(SmemLayoutA{})];
auto tensorA = make_tensor(make_smem_ptr(smem_A), SmemLayoutA{});
Example 2: Thread Partitioning
// Partition global tensor across threadblock
auto gA = make_tensor(
  make_gmem_ptr(A_ptr),
  make_layout(make_shape(M, K), make_stride(K, 1))
);
// Each thread gets a slice
auto thread_layout = make_layout(
  make_shape(32),  // 32 threads per warp
  make_stride(4)   // Each thread strides by 4
);
auto tAgA = local_partition(gA, thread_layout, thread_id);
// Each thread now accesses a strided slice of gA
Example 3: Tensor Core Layout
// MMA atom defines its own layout requirements
using MMA = MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>;
// Partition shared memory according to the MMA layout
auto sA = make_tensor(smem_ptr, smem_layout);
auto tCsA = mma.partition_A(sA);
// Each thread now has the correct fragment layout
auto tCrA = make_fragment_like(tCsA);
copy(tCsA, tCrA);  // Load from smem to registers
Layout Optimization Checklist:
Global Memory:
✓ Use coalesced access patterns (adjacent threads → adjacent memory)
✓ Align pointers to 128 bits (16 bytes)
✓ Vectorize loads/stores (128-bit preferred)
Shared Memory:
✓ Apply swizzling to avoid bank conflicts
✓ Pad dimensions so strides do not map threads to the same bank
✓ Reuse data across thread iterations
Registers:
✓ Minimize register pressure (allow more occupancy)
✓ Use correct fragment layouts for Tensor Cores
✓ Prefer compile-time known indices
CuTe Layouts:
✓ Use static shapes when possible (better optimization)
✓ Let CuTe handle index computation (fewer bugs)
✓ Print layouts during development for debugging
Debugging Layouts
#include "cute/util/print.hpp"
if (thread0()) {
  // Print layout shape and stride
  print("Layout: "); print(layout); print("\n");
  // Print a specific coordinate mapping
  auto coord = make_coord(10, 20);
  print("coord "); print(coord);
  print(" -> offset "); print(layout(coord));
  print("\n");
  // Print a human-readable layout visualization
  print_layout(layout);
}
Advanced: TMA Layouts (SM90+)
Tensor Memory Accelerator requires special descriptor layouts:
// Create TMA descriptor for bulk tensor copy
auto tma_load = make_tma_copy(
  SM90_TMA_LOAD{},
  make_tensor(
    make_gmem_ptr(global_ptr),
    make_layout(make_shape(128, 128), make_stride(128, 1))
  ),
  make_layout(make_shape(64, 64))  // Tile size
);
// Copy an entire 64×64 tile in hardware
tma_load.copy(dst_tensor);
Next Steps
CuTe Library Deep dive into CuTe layout algebra
Tensor Cores Understand Tensor Core fragment layouts
GEMM Operations Apply layouts to GEMM kernels
Examples See layouts in real code