CUTLASS supports various memory layouts for tensors. The layout determines how multi-dimensional tensors are stored in linear memory.

LayoutType Enum

The cutlass.LayoutType enum defines available memory layouts:
from cutlass import LayoutType

Basic Layouts

LayoutType.RowMajor

Row-major layout (C/C++ convention). Consecutive elements in a row are stored contiguously in memory.
import torch
import cutlass
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    layout=cutlass.LayoutType.RowMajor
)
For a matrix with shape (M, N), the element at position (i, j) is at memory offset i * N + j.

Use case: Standard for C/C++ applications and most PyTorch operations.
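
As a quick sanity check of the formula, the offset can be compared against a C-ordered NumPy array (sizes here are illustrative):

import numpy as np

M, N = 4, 3
A = np.arange(M * N, dtype=np.float32).reshape(M, N)  # C order = row-major

i, j = 2, 1
# ravel(order='C') flattens in memory order for a C-ordered array
assert A.ravel(order='C')[i * N + j] == A[i, j]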

LayoutType.ColumnMajor

Column-major layout (Fortran/BLAS convention). Consecutive elements in a column are stored contiguously in memory.
import torch
import cutlass
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    layout=cutlass.LayoutType.ColumnMajor
)
For a matrix with shape (M, N), the element at position (i, j) is at memory offset j * M + i.

Use case: Interoperability with BLAS libraries, Fortran code, and column-major frameworks.
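
The analogous check for column-major order uses a Fortran-ordered array (again with illustrative sizes):

import numpy as np

M, N = 4, 3
A = np.asfortranarray(np.arange(M * N, dtype=np.float32).reshape(M, N))

i, j = 2, 1
# ravel(order='F') flattens in memory order for a Fortran-ordered array
assert A.ravel(order='F')[j * M + i] == A[i, j]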

Interleaved Layouts

Interleaved layouts pack multiple elements together for improved memory access patterns with certain data types.
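
As a rough sketch of the index math, K-way column-major interleaving groups rows into bands of K and stores each column's band elements contiguously. The formula below follows the layout described by CUTLASS's ColumnMajorInterleaved and should be treated as an illustration, not a verbatim port:

def column_major_interleaved_offset(i, j, cols, interleave):
    # Rows are grouped into bands of `interleave` rows. Within a band,
    # the `interleave` elements of each column sit next to each other,
    # and columns follow one another, so a band spans cols * interleave
    # elements.
    band, row_in_band = divmod(i, interleave)
    return band * cols * interleave + j * interleave + row_in_band

# 4x4 matrix with 2-way interleaving: offsets per (row, column)
for i in range(4):
    print([column_major_interleaved_offset(i, j, cols=4, interleave=2)
           for j in range(4)])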

LayoutType.ColumnMajorInterleaved2

Column-major with 2-way interleaving.
layout = cutlass.LayoutType.ColumnMajorInterleaved2
Use case: Optimized access for 2-byte data types.

LayoutType.RowMajorInterleaved2

Row-major with 2-way interleaving.
layout = cutlass.LayoutType.RowMajorInterleaved2

LayoutType.ColumnMajorInterleaved32

Column-major with 32-way interleaving.
layout = cutlass.LayoutType.ColumnMajorInterleaved32
Use case: INT4/INT8 operations on Tensor Cores.

LayoutType.RowMajorInterleaved32

Row-major with 32-way interleaving.
layout = cutlass.LayoutType.RowMajorInterleaved32

LayoutType.ColumnMajorInterleaved64

Column-major with 64-way interleaving.
layout = cutlass.LayoutType.ColumnMajorInterleaved64

LayoutType.RowMajorInterleaved64

Row-major with 64-way interleaving.
layout = cutlass.LayoutType.RowMajorInterleaved64

Tensor Layouts

Tensor layouts are used for convolution operations and multi-dimensional tensors.

LayoutType.TensorNHWC

Tensor layout with dimensions ordered as (N, H, W, C) - commonly used in computer vision.
  • N: Batch size
  • H: Height
  • W: Width
  • C: Channels
layout = cutlass.LayoutType.TensorNHWC
Use case: Standard for image processing and CNNs in frameworks like TensorFlow.
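
PyTorch allocates 4D tensors as NCHW by default, but the channels_last memory format stores the same logical tensor with NHWC strides, which is a convenient way to produce NHWC data without changing indexing code:

import torch

x = torch.randn(8, 3, 32, 32)                     # logical NCHW, default strides
x_nhwc = x.to(memory_format=torch.channels_last)  # same shape, NHWC in memory

print(x.stride())       # (3072, 1024, 32, 1): W is innermost
print(x_nhwc.stride())  # (3072, 1, 96, 3):    C is innermost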

LayoutType.TensorNCHW

Tensor layout with dimensions ordered as (N, C, H, W).
layout = cutlass.LayoutType.TensorNCHW
Use case: Standard for PyTorch CNNs and cuDNN operations.

LayoutType.TensorNDHWC

5D tensor layout for 3D convolutions: (N, D, H, W, C).
layout = cutlass.LayoutType.TensorNDHWC
Use case: 3D convolutions in video processing and volumetric data.
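
PyTorch's channels_last_3d memory format provides the matching physical layout for 5D tensors:

import torch

v = torch.randn(2, 3, 8, 32, 32)                      # logical NCDHW
v_ndhwc = v.to(memory_format=torch.channels_last_3d)  # NDHWC in memory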

LayoutType.TensorNWC

3D tensor layout: (N, W, C).
layout = cutlass.LayoutType.TensorNWC
Use case: 1D convolutions and sequence processing.

Layout Selection Guide

Performance Considerations

Rule of Thumb: Use the layout that matches your input data to avoid expensive transpose operations.
Scenario              Recommended Layout        Reason
PyTorch matrices      RowMajor                  Default PyTorch layout
NumPy matrices        RowMajor                  Default NumPy layout
BLAS/LAPACK interop   ColumnMajor               BLAS convention
INT8 GEMM             ColumnMajorInterleaved32  Optimized Tensor Core access
CNN inputs (TF)       TensorNHWC                TensorFlow default
CNN inputs (PyTorch)  TensorNCHW                PyTorch default
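
One way to apply this rule is to derive the layout from the tensor you already have rather than copying the tensor into a fixed layout. The helper below is hypothetical (infer_layout is not part of the CUTLASS API) and assumes a 2D tensor:

import cutlass
import torch

def infer_layout(t: torch.Tensor) -> cutlass.LayoutType:
    # Row-major if rows are contiguous, column-major if columns are.
    if t.stride(1) == 1:
        return cutlass.LayoutType.RowMajor
    if t.stride(0) == 1:
        return cutlass.LayoutType.ColumnMajor
    raise ValueError("tensor is neither row- nor column-major")

A = torch.randn(128, 64, device='cuda')
layout_A = infer_layout(A)  # LayoutType.RowMajor, no transpose needed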

Alignment Requirements

Some layouts require specific alignment:
  • Interleaved layouts require dimensions divisible by the interleaving factor (see the check after the snippet below)
  • TensorCore operations may require 8-byte or 16-byte alignment
# Ensure M, N, K are multiples of 8 for optimal FP16 performance
M = ((M + 7) // 8) * 8
N = ((N + 7) // 8) * 8  
K = ((K + 7) // 8) * 8
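
For the interleaved layouts, an explicit divisibility guard fails fast instead of configuring a kernel with an unsupported shape. The check below is illustrative and assumes the interleaved operand has extents M and K:

INTERLEAVE = 32  # factor for the *Interleaved32 layouts
for name, extent in (("M", M), ("K", K)):
    if extent % INTERLEAVE != 0:
        raise ValueError(f"{name}={extent} must be a multiple of {INTERLEAVE}")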

Layout Conversion

PyTorch Transpose

Convert between row-major and column-major in PyTorch:
import torch

M, K = 1024, 512  # illustrative sizes

# Row-major tensor (PyTorch default)
A_row = torch.randn((M, K), device='cuda')

# Convert to column-major: transpose, materialize, transpose back.
# A_col has shape (M, K) with column-major strides (1, M).
A_col = A_row.t().contiguous().t()

# Note: .t() creates a view; .contiguous() copies the data into the new order.

NumPy Transpose

import numpy as np

# Row-major (C order)
A_c = np.array([[1, 2, 3], [4, 5, 6]], order='C')

# Column-major (Fortran order)
A_f = np.array([[1, 2, 3], [4, 5, 6]], order='F')

# Convert
A_f_from_c = np.asfortranarray(A_c)
A_c_from_f = np.ascontiguousarray(A_f)

Layout in GEMM Operations

Example: Row-Major GEMM

import torch
import cutlass
from cutlass.op import Gemm

plan = Gemm(
    element=torch.float32,
    layout_A=cutlass.LayoutType.RowMajor,
    layout_B=cutlass.LayoutType.RowMajor,
    layout_C=cutlass.LayoutType.RowMajor
)

M, N, K = 1024, 1024, 1024  # illustrative problem size

A = torch.randn((M, K), device='cuda')  # Row-major
B = torch.randn((K, N), device='cuda')  # Row-major
C = torch.zeros((M, N), device='cuda')  # Row-major
D = torch.zeros((M, N), device='cuda')  # Row-major

plan.run(A, B, C, D)

Example: Mixed Layouts

import torch
import cutlass
from cutlass.op import Gemm

# A is column-major, B is row-major, C is row-major
plan = Gemm(
    element=torch.float32,
    layout_A=cutlass.LayoutType.ColumnMajor,
    layout_B=cutlass.LayoutType.RowMajor,
    layout_C=cutlass.LayoutType.RowMajor
)

M, N, K = 1024, 1024, 1024  # illustrative problem size

# Column-major (M, K) operand: allocate the transpose and view it
# transposed, giving shape (M, K) with column-major strides (1, M).
A_col = torch.randn((K, M), device='cuda').t()
B_row = torch.randn((K, N), device='cuda')
C_row = torch.zeros((M, N), device='cuda')
D_row = torch.zeros((M, N), device='cuda')

plan.run(A_col, B_row, C_row, D_row)

Layout Naming Convention

CUTLASS uses shorthand notation in kernel names:
Layout                    Shorthand  Example
ColumnMajor               n          cutlass_gemm_n
RowMajor                  t          cutlass_gemm_t
ColumnMajorInterleaved32  n32        cutlass_gemm_n32
RowMajorInterleaved32     t32        cutlass_gemm_t32
TensorNHWC                nhwc       cutlass_conv_nhwc

C++ Mapping

Python layout types map directly to C++ CUTLASS layout types:
// Python: cutlass.LayoutType.RowMajor
// C++:    cutlass::layout::RowMajor

// Python: cutlass.LayoutType.ColumnMajor  
// C++:    cutlass::layout::ColumnMajor

// Python: cutlass.LayoutType.TensorNHWC
// C++:    cutlass::layout::TensorNHWC
