CUTLASS supports various memory layouts for tensors. The layout determines how multi-dimensional tensors are stored in linear memory.
LayoutType Enum
The cutlass.LayoutType enum defines available memory layouts:
from cutlass import LayoutType
Basic Layouts
LayoutType.RowMajor
Row-major layout (C/C++ convention). Consecutive elements in a row are stored contiguously in memory.
import torch
import cutlass
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    layout=cutlass.LayoutType.RowMajor
)
For a matrix with shape (M, N), element at position (i, j) is at memory offset: i * N + j
Use case: Standard for C/C++ applications and most PyTorch operations.
LayoutType.ColumnMajor
Column-major layout (Fortran/BLAS convention). Consecutive elements in a column are stored contiguously in memory.
import torch
import cutlass
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    layout=cutlass.LayoutType.ColumnMajor
)
For a matrix with shape (M, N), element at position (i, j) is at memory offset: j * M + i
Use case: Interoperability with BLAS libraries, Fortran code, and column-major frameworks.
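The two offset formulas above can be verified directly with NumPy, using C-ordered (row-major) and Fortran-ordered (column-major) arrays. A standalone sketch; the matrix size is arbitrary:

```python
import numpy as np

M, N = 3, 4  # example shape

# Row-major (C order): element (i, j) lives at linear offset i * N + j
A_row = np.arange(M * N).reshape(M, N, order='C')
assert A_row.flatten(order='C')[1 * N + 2] == A_row[1, 2]

# Column-major (F order): element (i, j) lives at linear offset j * M + i
A_col = np.asfortranarray(A_row)
assert A_col.flatten(order='F')[2 * M + 1] == A_col[1, 2]
```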
Interleaved Layouts
Interleaved layouts pack elements from a fixed number of adjacent rows or columns together, which can improve memory access patterns for small data types.
LayoutType.ColumnMajorInterleaved2
Column-major with 2-way interleaving.
layout = cutlass.LayoutType.ColumnMajorInterleaved2
Use case: Optimized access for 2-byte data types.
LayoutType.RowMajorInterleaved2
Row-major with 2-way interleaving.
layout = cutlass.LayoutType.RowMajorInterleaved2
LayoutType.ColumnMajorInterleaved32
Column-major with 32-way interleaving.
layout = cutlass.LayoutType.ColumnMajorInterleaved32
Use case: INT4/INT8 operations on Tensor Cores.
LayoutType.RowMajorInterleaved32
Row-major with 32-way interleaving.
layout = cutlass.LayoutType.RowMajorInterleaved32
LayoutType.ColumnMajorInterleaved64
Column-major with 64-way interleaving.
layout = cutlass.LayoutType.ColumnMajorInterleaved64
LayoutType.RowMajorInterleaved64
Row-major with 64-way interleaving.
layout = cutlass.LayoutType.RowMajorInterleaved64
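The exact element mappings for interleaved layouts are defined in the CUTLASS C++ headers. As an illustration only, a k-way row-major interleaving can be sketched as the following index function (an assumed formula for exposition, not taken from the library):

```python
def row_major_interleaved_offset(i, j, rows, cols, k):
    # Assumed illustrative mapping: rows are processed in groups of k;
    # within a group, k elements of the same column are stored contiguously.
    return (i // k) * (cols * k) + j * k + (i % k)

M, N, k = 8, 4, 2
offsets = {row_major_interleaved_offset(i, j, M, N, k)
           for i in range(M) for j in range(N)}
# The mapping is a bijection onto [0, M*N): every element gets a unique slot,
# which is the defining property any valid interleaved layout must satisfy.
assert offsets == set(range(M * N))
```

This also makes the divisibility requirement concrete: if the row count is not a multiple of k, the final group of rows is only partially filled.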
Tensor Layouts
Tensor layouts are used for convolution operations and multi-dimensional tensors.
LayoutType.TensorNHWC
Tensor layout with dimensions ordered as (N, H, W, C) - commonly used in computer vision.
- N: Batch size
- H: Height
- W: Width
- C: Channels
layout = cutlass.LayoutType.TensorNHWC
Use case: Standard for image processing and CNNs in frameworks like TensorFlow.
LayoutType.TensorNCHW
Tensor layout with dimensions ordered as (N, C, H, W).
layout = cutlass.LayoutType.TensorNCHW
Use case: Standard for PyTorch CNNs and cuDNN operations.
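The difference between NHWC and NCHW comes down to the linear offset function. A minimal sketch using standard strided-layout arithmetic (not CUTLASS API):

```python
def nhwc_offset(n, h, w, c, N, H, W, C):
    # Channels are innermost: all C values of one pixel are contiguous.
    return ((n * H + h) * W + w) * C + c

def nchw_offset(n, c, h, w, N, C, H, W):
    # Spatial dims are innermost: one full H*W plane per channel is contiguous.
    return ((n * C + c) * H + h) * W + w

N, H, W, C = 1, 4, 4, 3
# NHWC: adjacent channels are 1 element apart
assert nhwc_offset(0, 0, 0, 1, N, H, W, C) - nhwc_offset(0, 0, 0, 0, N, H, W, C) == 1
# NCHW: adjacent channels are a full H*W plane apart
assert nchw_offset(0, 1, 0, 0, N, C, H, W) - nchw_offset(0, 0, 0, 0, N, C, H, W) == H * W
```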
LayoutType.TensorNDHWC
5D tensor layout for 3D convolutions: (N, D, H, W, C).
layout = cutlass.LayoutType.TensorNDHWC
Use case: 3D convolutions in video processing and volumetric data.
LayoutType.TensorNWC
3D tensor layout: (N, W, C).
layout = cutlass.LayoutType.TensorNWC
Use case: 1D convolutions and sequence processing.
Layout Selection Guide
Rule of Thumb: Use the layout that matches your input data to avoid expensive transpose operations.
| Scenario | Recommended Layout | Reason |
|---|---|---|
| PyTorch matrices | RowMajor | Default PyTorch layout |
| NumPy matrices | RowMajor | Default NumPy layout |
| BLAS/LAPACK interop | ColumnMajor | BLAS convention |
| INT8 GEMM | ColumnMajorInterleaved32 | Optimized Tensor Core access |
| CNN inputs (TF) | TensorNHWC | TensorFlow default |
| CNN inputs (PyTorch) | TensorNCHW | PyTorch default |
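The first two rows of the table can be checked empirically: NumPy (like PyTorch) allocates row-major, C-contiguous storage by default, which is why RowMajor avoids a transpose for such arrays. A small sketch:

```python
import numpy as np

# Default allocation is C-contiguous (row-major), not Fortran-contiguous.
A = np.zeros((2, 3))
assert A.flags['C_CONTIGUOUS'] and not A.flags['F_CONTIGUOUS']

# Row-major strides: moving down one row skips a full row of elements.
assert A.strides == (3 * A.itemsize, A.itemsize)
```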
Alignment Requirements
Some layouts require specific alignment:
- Interleaved layouts require dimensions divisible by the interleaving factor (e.g., 32 for Interleaved32)
- Tensor Core operations may require 8-byte or 16-byte alignment
# Ensure M, N, K are multiples of 8 for optimal FP16 performance
M = ((M + 7) // 8) * 8
N = ((N + 7) // 8) * 8
K = ((K + 7) // 8) * 8
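The rounding pattern above generalizes to any alignment requirement, including the interleaving-factor divisibility noted earlier. A small helper (the name `round_up` is illustrative, not part of the CUTLASS API):

```python
def round_up(x, multiple):
    # Pad a dimension up to the next multiple, e.g. the interleave factor
    # or a Tensor Core alignment requirement.
    return ((x + multiple - 1) // multiple) * multiple

assert round_up(100, 32) == 128  # Interleaved32 layouts need multiples of 32
assert round_up(128, 32) == 128  # already aligned: unchanged
assert round_up(13, 8) == 16     # FP16 Tensor Core example from above
```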
Layout Conversion
PyTorch Transpose
Convert between row-major and column-major in PyTorch:
import torch

M, K = 128, 64  # example sizes

# Row-major tensor
A_row = torch.randn((M, K), device='cuda')

# Column-major tensor with the same logical shape (M, K):
# .t() creates a transposed view, .contiguous() copies it to new memory,
# and the final .t() views it back as (M, K) with column-major strides (1, M)
A_col = A_row.t().contiguous().t()
NumPy Transpose
import numpy as np
# Row-major (C order)
A_c = np.array([[1, 2, 3], [4, 5, 6]], order='C')
# Column-major (Fortran order)
A_f = np.array([[1, 2, 3], [4, 5, 6]], order='F')
# Convert
A_f_from_c = np.asfortranarray(A_c)
A_c_from_f = np.ascontiguousarray(A_f)
Layout in GEMM Operations
Example: Row-Major GEMM
import torch
import cutlass
from cutlass.op import Gemm
plan = Gemm(
    element=torch.float32,
    layout_A=cutlass.LayoutType.RowMajor,
    layout_B=cutlass.LayoutType.RowMajor,
    layout_C=cutlass.LayoutType.RowMajor
)

M, N, K = 128, 128, 64  # example problem size

A = torch.randn((M, K), device='cuda')  # Row-major
B = torch.randn((K, N), device='cuda')  # Row-major
C = torch.zeros((M, N), device='cuda')  # Row-major
D = torch.zeros((M, N), device='cuda')  # Row-major

plan.run(A, B, C, D)
Example: Mixed Layouts
import torch
import cutlass
from cutlass.op import Gemm
# A is column-major, B is row-major, C is row-major
plan = Gemm(
    element=torch.float32,
    layout_A=cutlass.LayoutType.ColumnMajor,
    layout_B=cutlass.LayoutType.RowMajor,
    layout_C=cutlass.LayoutType.RowMajor
)

M, N, K = 128, 128, 64  # example problem size

# Logical shape (M, K) with column-major strides
A_col = torch.randn((M, K), device='cuda').t().contiguous().t()
B_row = torch.randn((K, N), device='cuda')
C_row = torch.zeros((M, N), device='cuda')
D_row = torch.zeros((M, N), device='cuda')

plan.run(A_col, B_row, C_row, D_row)
Layout Naming Convention
CUTLASS uses shorthand notation in kernel names:
| Layout | Shorthand | Example |
|---|---|---|
| ColumnMajor | n | cutlass_gemm_n |
| RowMajor | t | cutlass_gemm_t |
| ColumnMajorInterleaved32 | n32 | cutlass_gemm_n32 |
| RowMajorInterleaved32 | t32 | cutlass_gemm_t32 |
| TensorNHWC | nhwc | cutlass_conv_nhwc |
C++ Mapping
Python layout types map directly to C++ CUTLASS layout types:
// Python: cutlass.LayoutType.RowMajor
// C++: cutlass::layout::RowMajor
// Python: cutlass.LayoutType.ColumnMajor
// C++: cutlass::layout::ColumnMajor
// Python: cutlass.LayoutType.TensorNHWC
// C++: cutlass::layout::TensorNHWC
Source Code References
See Also