Introduction
CuTe (CUDA Templates) is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. It provides a composable, structured approach to writing high-performance CUDA kernels.
CuTe is the foundation for CUTLASS 3.x kernels and provides a more modern, composable alternative to the CUTLASS 2.x hierarchy.
Core Concepts
CuTe is built around three fundamental abstractions:
Layout: describes the mapping from logical coordinates to linear indices.
Tensor: combines data (a pointer) with a Layout to provide structured access.
Algorithms: operations on Tensors (copy, fill, GEMM, etc.).
Layouts
Layouts define the relationship between multi-dimensional coordinates and linear memory addresses.
Layout Definition
include/cute/layout.hpp:98
Creating Layouts
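As a rough illustration of what a layout computes, the sketch below models a rank-2 layout in plain C++ as a shape plus a stride, mapping a coordinate (i, j) to `i * stride0 + j * stride1`. `Layout2D`, `column_major`, and `row_major` are hypothetical names used only for this sketch. They are not CuTe types; CuTe's own `Layout` is a template over hierarchical shape and stride tuples.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a rank-2 layout: a shape plus a stride.
struct Layout2D {
    std::size_t shape[2];
    std::size_t stride[2];

    // Map a logical coordinate (i, j) to a linear index.
    constexpr std::size_t operator()(std::size_t i, std::size_t j) const {
        assert(i < shape[0] && j < shape[1]);
        return i * stride[0] + j * stride[1];
    }

    // Number of logical elements in the layout's domain.
    constexpr std::size_t size() const { return shape[0] * shape[1]; }
};

// Column-major: stepping along mode 0 moves by one element.
constexpr Layout2D column_major(std::size_t m, std::size_t n) {
    return Layout2D{{m, n}, {1, m}};
}

// Row-major: stepping along mode 1 moves by one element.
constexpr Layout2D row_major(std::size_t m, std::size_t n) {
    return Layout2D{{m, n}, {n, 1}};
}

// The same coordinate lands at different offsets under the two layouts.
static_assert(column_major(4, 8)(1, 2) == 9, "1*1 + 2*4");
static_assert(row_major(4, 8)(1, 2) == 10, "1*8 + 2*1");
```

The point of the sketch is that the data never changes; only the shape and stride pair decides where a coordinate lands.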
Layout Operations
make_layout(shape, stride): creates a Layout from a shape and stride.
size(layout): returns the total number of elements in the layout.
rank(layout): returns the number of dimensions (modes) in the layout.
depth(layout): returns the hierarchical depth of the layout.
coalesce(layout): simplifies the layout by merging adjacent modes with compatible strides.
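To make "compatible strides" concrete, here is a small sketch on flat (non-nested) modes, assuming the leftmost mode varies fastest: two neighbouring modes can merge when the second mode's stride equals the first mode's shape times its stride. `Mode` and `coalesce_flat` are hypothetical names. This is a simplified model, not CuTe's implementation, which also handles nested and compile-time modes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One mode of a flat layout: an extent and the stride used to step along it.
struct Mode {
    std::size_t shape;
    std::size_t stride;
};

// Merge neighbouring modes when the second continues exactly where the first
// ends (next.stride == prev.shape * prev.stride), and drop modes of extent 1,
// which never contribute to the index.
inline std::vector<Mode> coalesce_flat(const std::vector<Mode>& modes) {
    std::vector<Mode> out;
    for (const Mode& m : modes) {
        assert(m.shape > 0);
        if (m.shape == 1) continue;                    // extent-1 mode: skip
        if (!out.empty() &&
            m.stride == out.back().shape * out.back().stride) {
            out.back().shape *= m.shape;               // contiguous: merge
        } else {
            out.push_back(m);                          // gap or reorder: keep
        }
    }
    if (out.empty()) out.push_back(Mode{1, 0});        // every mode had extent 1
    return out;
}
```

For instance, the modes (2:1), (1:7), (4:2) collapse to the single mode (8:1), while (2:4), (4:1) stay separate because the second mode does not continue the first.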
Tensors
Tensors combine a pointer with a Layout to provide structured multi-dimensional views of data.
Tensor Definition
include/cute/tensor_impl.hpp
Creating Tensors
Tensor Operations
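make_tensor(ptr, layout): creates a Tensor from a pointer and layout.
local_tile(tensor, tiler, coord): extracts a tile from a tensor at the given coordinate.
local_partition(tensor, thread_layout, thread_idx): partitions a tensor across threads according to a thread layout.
make_fragment_like(tensor): creates a new tensor with the same shape and layout (useful for accumulators).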
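The idea that a tensor is "a pointer plus a layout", and that extracting a tile only adjusts the pointer and shape, can be sketched in standalone C++. `View2D` and `tile_at` are hypothetical names for this illustration and assume a rank-2 tensor whose extents are multiples of the tile size; they are not CuTe's `Tensor` or its tiling functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical rank-2 view: a non-owning pointer plus shape and stride.
template <class T>
struct View2D {
    T* data;
    std::size_t shape[2];
    std::size_t stride[2];

    // Element access goes through the layout: offset = i*stride0 + j*stride1.
    T& operator()(std::size_t i, std::size_t j) const {
        assert(i < shape[0] && j < shape[1]);
        return data[i * stride[0] + j * stride[1]];
    }
};

// Return the (tile_m x tile_n) sub-view at tile coordinate (bm, bn).
// Only the base pointer and the shape change; the strides are inherited,
// so no data is copied.
template <class T>
View2D<T> tile_at(View2D<T> t, std::size_t tile_m, std::size_t tile_n,
                  std::size_t bm, std::size_t bn) {
    assert((bm + 1) * tile_m <= t.shape[0]);
    assert((bn + 1) * tile_n <= t.shape[1]);
    T* origin = t.data + bm * tile_m * t.stride[0] + bn * tile_n * t.stride[1];
    return View2D<T>{origin, {tile_m, tile_n}, {t.stride[0], t.stride[1]}};
}
```

Because the tile shares the parent's strides, `tile(i, j)` and `parent(bm * tile_m + i, bn * tile_n + j)` refer to the same element.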
Practical Example: GEMM Kernel
From examples/cute/tutorial/sgemm_1.cu:
Algorithms
CuTe provides high-level algorithms for common operations.
Copy Operations
include/cute/algorithm/copy.hpp
Fill and Clear
Fill Operations
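Copy and fill are easiest to picture as loops over logical coordinates rather than over raw memory. The sketch below is a host-side model under that assumption: `View2D`, `copy_view`, and `fill_view` are hypothetical names, and nothing here reflects how CuTe vectorizes or dispatches its own `copy` and `fill`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical rank-2 view (pointer + shape + stride) standing in for a tensor.
template <class T>
struct View2D {
    T* data;
    std::size_t shape[2];
    std::size_t stride[2];

    T& operator()(std::size_t i, std::size_t j) const {
        assert(i < shape[0] && j < shape[1]);
        return data[i * stride[0] + j * stride[1]];
    }
};

// Copy by logical coordinate. Shapes must match; strides may differ, so the
// same loop also converts between column-major and row-major storage.
template <class S, class D>
void copy_view(View2D<S> src, View2D<D> dst) {
    assert(src.shape[0] == dst.shape[0] && src.shape[1] == dst.shape[1]);
    for (std::size_t j = 0; j < src.shape[1]; ++j)
        for (std::size_t i = 0; i < src.shape[0]; ++i)
            dst(i, j) = src(i, j);
}

// Set every logical element of the view to `value`.
template <class T, class V>
void fill_view(View2D<T> dst, const V& value) {
    for (std::size_t j = 0; j < dst.shape[1]; ++j)
        for (std::size_t i = 0; i < dst.shape[0]; ++i)
            dst(i, j) = value;
}
```

Since both operations address elements through the view, the source and destination can use different strides without any change to the loops.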
GEMM Algorithm
include/cute/algorithm/gemm.hpp
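As a reference for what a GEMM over tensors computes, the following host-side sketch accumulates `C(m, n) += A(m, k) * B(n, k)`. It assumes the mode convention A is (M, K), B is (N, K), and C is (M, N); `View2D` and `gemm_view` are hypothetical names, and the triple loop is a plain reference, not the dispatch that CuTe's `gemm` performs onto MMA instructions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical rank-2 view (pointer + shape + stride) standing in for a tensor.
template <class T>
struct View2D {
    T* data;
    std::size_t shape[2];
    std::size_t stride[2];

    T& operator()(std::size_t i, std::size_t j) const {
        assert(i < shape[0] && j < shape[1]);
        return data[i * stride[0] + j * stride[1]];
    }
};

// Reference accumulation: C(m, n) += sum over k of A(m, k) * B(n, k).
// Assumed shapes: A is (M, K), B is (N, K), C is (M, N).
template <class TA, class TB, class TC>
void gemm_view(View2D<TA> A, View2D<TB> B, View2D<TC> C) {
    assert(A.shape[0] == C.shape[0]);  // M
    assert(B.shape[0] == C.shape[1]);  // N
    assert(A.shape[1] == B.shape[1]);  // K
    for (std::size_t m = 0; m < C.shape[0]; ++m)
        for (std::size_t n = 0; n < C.shape[1]; ++n)
            for (std::size_t k = 0; k < A.shape[1]; ++k)
                C(m, n) += A(m, k) * B(n, k);
}
```

Note that the function accumulates into C rather than overwriting it, so C must be initialized (for example, cleared) before the first call.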
AXPBY (Linear Combination)
AXPBY
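The linear combination itself is `y = alpha * x + beta * y`, applied element by element. A minimal strided sketch in plain C++ follows; `axpby_strided` is a hypothetical helper for rank-1 operands, not CuTe's own function.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = alpha * x[i] + beta * y[i] for n logical elements, where each
// operand is addressed through its own stride.
template <class T>
void axpby_strided(T alpha, const T* x, std::size_t x_stride,
                   T beta, T* y, std::size_t y_stride, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        T& yi = y[i * y_stride];
        yi = alpha * x[i * x_stride] + beta * yi;
    }
}
```

With `beta` equal to zero this degenerates to a scaled copy, and with `alpha` equal to zero it scales `y` in place.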
Special Layouts and Patterns
Swizzling
Swizzled layouts reduce shared memory bank conflicts:
Swizzled Layout
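One way to see why swizzling helps is to model it as an XOR on the linear index. The sketch below follows the three-parameter pattern (number of bits, base position, shift) and assumes a non-negative shift at least as large as the bit count; `xor_swizzle` and `bank_of` are hypothetical names, and the four-bank model is deliberately tiny so the effect is visible. It is not a drop-in replacement for CuTe's swizzle types.

```cpp
#include <cassert>
#include <cstdint>

// XOR swizzle on a linear index: take `Bits` bits starting at bit position
// Base + Shift and XOR them into the `Bits` bits starting at position Base.
template <unsigned Bits, unsigned Base, unsigned Shift>
constexpr std::uint32_t xor_swizzle(std::uint32_t idx) {
    static_assert(Shift >= Bits, "source and destination bit fields must not overlap");
    constexpr std::uint32_t field = (1u << Bits) - 1u;
    constexpr std::uint32_t src_mask = field << (Base + Shift);
    return idx ^ ((idx & src_mask) >> Shift);
}

// Simplified bank model: consecutive words map to consecutive banks.
constexpr std::uint32_t bank_of(std::uint32_t word_idx, std::uint32_t num_banks) {
    return word_idx % num_banks;
}

// A 4x4 row-major tile with 4 banks: column 0 lives at words 0, 4, 8, 12,
// which all fall in bank 0 without swizzling.
static_assert(bank_of(4, 4) == 0 && bank_of(8, 4) == 0 && bank_of(12, 4) == 0,
              "unswizzled column hits one bank");

// After swizzling, the same column spreads over banks 0, 1, 2, 3.
static_assert(bank_of(xor_swizzle<2, 0, 2>(0), 4) == 0, "row 0");
static_assert(bank_of(xor_swizzle<2, 0, 2>(4), 4) == 1, "row 1");
static_assert(bank_of(xor_swizzle<2, 0, 2>(8), 4) == 2, "row 2");
static_assert(bank_of(xor_swizzle<2, 0, 2>(12), 4) == 3, "row 3");
```

Because the source bits are left untouched, applying the same swizzle twice returns the original index, so the mapping is a permutation and no two elements collide.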
Blocked Layouts
Type System
Compile-Time Integers
Integral Constants
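The benefit of compile-time integers is that arithmetic on them can stay in the type system. The sketch below imitates that idea with a hypothetical `StaticInt<N>` and a helper `tile_elems`; both names are invented for this illustration and only approximate the behaviour of CuTe's static integer types.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical static integer: the value is part of the type.
template <std::size_t N>
struct StaticInt {
    static constexpr std::size_t value = N;
    constexpr operator std::size_t() const { return N; }
};

// static * static stays static: the product is encoded in the result type.
template <std::size_t A, std::size_t B>
constexpr StaticInt<A * B> operator*(StaticInt<A>, StaticInt<B>) { return {}; }

// Works for any mix of static and runtime extents. With a runtime operand,
// the static one converts to std::size_t and the result is a runtime value.
template <class M, class N>
constexpr auto tile_elems(M m, N n) { return m * n; }

// Both extents static: the result type itself carries 32.
static_assert(std::is_same<decltype(tile_elems(StaticInt<4>{}, StaticInt<8>{})),
                           StaticInt<32>>::value,
              "static extents produce a static size");
```

Code that receives `StaticInt<32>` can size register arrays or unroll loops from the type alone, whereas a runtime `std::size_t` forces those decisions to be made at run time.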
Tuples
Tuple Types
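Hierarchical shapes are nested tuples, and quantities such as size, rank, and depth are simple recursions over that nesting. The sketch below uses `std::tuple` as a stand-in for CuTe's tuple type (an assumption made so the example compiles on its own); `size_of`, `rank_of`, and `depth_of` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>

// Leaves: a plain integer is a rank-1, depth-0 shape whose size is its value.
constexpr std::size_t size_of(std::size_t n) { return n; }
constexpr std::size_t rank_of(std::size_t) { return 1; }
constexpr std::size_t depth_of(std::size_t) { return 0; }

// Forward declarations so that nested tuples can recurse.
template <class... Ts> constexpr std::size_t size_of(const std::tuple<Ts...>& t);
template <class... Ts> constexpr std::size_t depth_of(const std::tuple<Ts...>& t);

// Size: product of all leaf extents.
template <class... Ts>
constexpr std::size_t size_of(const std::tuple<Ts...>& t) {
    return std::apply(
        [](const auto&... e) { return (std::size_t{1} * ... * size_of(e)); }, t);
}

// Rank: number of top-level modes.
template <class... Ts>
constexpr std::size_t rank_of(const std::tuple<Ts...>&) { return sizeof...(Ts); }

// Depth: one more than the deepest child.
template <class... Ts>
constexpr std::size_t depth_of(const std::tuple<Ts...>& t) {
    return std::apply(
        [](const auto&... e) {
            return std::size_t{1} + std::max({std::size_t{0}, depth_of(e)...});
        },
        t);
}
```

For the nested shape (2, (2, 4)), built as `std::make_tuple(2, std::make_tuple(2, 4))`, these helpers give a size of 16, a rank of 2, and a depth of 2.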
Atom Types
Atoms describe hardware-specific instruction patterns.
Copy Atoms
MMA Atoms
Debugging and Visualization
Print Utilities
Printing
Compile-Time Assertions
Static Assertions
Key Functions Reference
Shape and Stride Utilities
make_shape(...): creates a Shape tuple from arguments.
make_stride(...): creates a Stride tuple from arguments.
make_coord(...): creates a coordinate tuple for indexing.
size<I>(x): returns the size of mode I (or the total size if I is omitted).
shape<I>(x): returns the shape of mode I (or the full shape if I is omitted).
stride<I>(x): returns the stride of mode I (or the full stride if I is omitted).
Composition and Manipulation
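composition(layout_a, layout_b): composes two layouts as layout_a ∘ layout_b, so the result maps a coordinate c to layout_a(layout_b(c)).
complement(layout, shape): returns the complementary layout within a given shape.
logical_divide(layout, tiler): divides a layout into tiles.
zipped_divide(layout, tiler): divides and interleaves (for thread partitioning).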
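Composition and division are ordinary function manipulations once a layout is viewed as a function from coordinates to offsets. The sketch below works that out for single-mode layouts only: `Layout1D`, `Divided`, `compose`, and `divide` are hypothetical names, and the closed forms used here hold for this restricted case, not for CuTe's general hierarchical algebra.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-mode layout: i -> i * stride for 0 <= i < shape.
struct Layout1D {
    std::size_t shape;
    std::size_t stride;
    constexpr std::size_t operator()(std::size_t i) const { return i * stride; }
};

// compose(a, b) is function composition: result(i) = a(b(i)).
// For single-mode operands this is (b.shape : a.stride * b.stride),
// provided every output of b is a valid coordinate of a.
constexpr Layout1D compose(Layout1D a, Layout1D b) {
    assert(b.shape == 0 || (b.shape - 1) * b.stride < a.shape);
    return Layout1D{b.shape, a.stride * b.stride};
}

// Result of splitting a layout into tiles: one mode walks inside a tile,
// the other walks from tile to tile.
struct Divided {
    Layout1D within_tile;
    Layout1D across_tiles;
    constexpr std::size_t operator()(std::size_t i, std::size_t t) const {
        return within_tile(i) + across_tiles(t);
    }
};

// Divide a single-mode layout into tiles of `tile` elements each.
constexpr Divided divide(Layout1D a, std::size_t tile) {
    assert(tile != 0 && a.shape % tile == 0);
    return Divided{{tile, a.stride}, {a.shape / tile, a.stride * tile}};
}

// (8:2) composed with (4:2) picks every other element of the first layout.
static_assert(compose(Layout1D{8, 2}, Layout1D{4, 2}).stride == 4, "(4:4)");
// Element 1 of tile 1 (tile size 4) is element 5 of the original layout.
static_assert(divide(Layout1D{8, 2}, 4)(1, 1) == Layout1D{8, 2}(5), "same offset");
```

Division does not move any data: element `i` of tile `t` in the divided layout is the same offset as element `t * tile + i` of the original.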
Best Practices
Use Static Shapes When Possible
Static shapes (Int<N>) enable compile-time optimizations and better code generation.
Leverage Layout Composition
Build complex layouts by composing simpler ones - this is more maintainable and often more efficient.
Partition Before Loop
Partition tensors across threads outside loops to avoid recomputation.
Use Typed Tensors
Let CuTe’s type system catch shape mismatches at compile time.
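The "partition before loop" advice can be sketched on the host. The example assumes a thread arrangement of `tm_count` by `tn_count` threads with thread index varying fastest along mode 0, where each thread owns the elements at its own coordinate plus multiples of the thread-grid extents. `View2D`, `partition_for_thread`, and `add_one_per_step` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical rank-2 view (pointer + shape + stride) standing in for a tensor.
template <class T>
struct View2D {
    T* data;
    std::size_t shape[2];
    std::size_t stride[2];

    T& operator()(std::size_t i, std::size_t j) const {
        assert(i < shape[0] && j < shape[1]);
        return data[i * stride[0] + j * stride[1]];
    }
};

// Sub-view owned by thread `tid` in a (tm_count x tn_count) thread grid:
// elements (tm + i*tm_count, tn + j*tn_count) of the parent view.
template <class T>
View2D<T> partition_for_thread(View2D<T> t, std::size_t tm_count,
                               std::size_t tn_count, std::size_t tid) {
    assert(t.shape[0] % tm_count == 0 && t.shape[1] % tn_count == 0);
    assert(tid < tm_count * tn_count);
    const std::size_t tm = tid % tm_count;   // thread coordinate along mode 0
    const std::size_t tn = tid / tm_count;   // thread coordinate along mode 1
    return View2D<T>{t.data + tm * t.stride[0] + tn * t.stride[1],
                     {t.shape[0] / tm_count, t.shape[1] / tn_count},
                     {t.stride[0] * tm_count, t.stride[1] * tn_count}};
}

// The partition is computed once, outside the step loop, and reused.
inline void add_one_per_step(View2D<int> tile, std::size_t tm_count,
                             std::size_t tn_count, std::size_t tid,
                             std::size_t steps) {
    const View2D<int> mine = partition_for_thread(tile, tm_count, tn_count, tid);
    for (std::size_t s = 0; s < steps; ++s)
        for (std::size_t j = 0; j < mine.shape[1]; ++j)
            for (std::size_t i = 0; i < mine.shape[0]; ++i)
                mine(i, j) += 1;
}
```

Running `add_one_per_step` for every thread index over the same tile touches each element exactly once per step, which is a quick way to check that a partition neither overlaps nor leaves gaps.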
Advanced Topics
TMA (Tensor Memory Accelerator)
Hopper architecture’s hardware-accelerated tensor loads:
TMA Copy
Warp-Specialized Kernels
Different warps perform different roles:
Warp Specialization
See Also
CuTe Tutorials
Examples in examples/cute/tutorial/ demonstrate progressive complexity.
GEMM API
High-level GEMM API built on CuTe (CUTLASS 3.x)
Architecture Guide
Learn about architecture-specific features
Performance Guide
Optimize CuTe-based kernels