Overview
CuTe DSL allows you to write CUDA kernels in Python that are compiled to efficient GPU code, with full access to:
- Layout Algebra - Express complex memory access patterns
- Tensor Abstractions - High-level tensor operations
- Hardware Features - Tensor Cores, TMA, async copy, barriers
- JIT Compilation - Automatic optimization and compilation
CuTe DSL kernels achieve performance comparable to hand-written CUDA C++ while offering Python’s development productivity.
Installation
Requirements:
- CUDA 12.0+ (CUDA 13 for latest features)
- Python 3.8+
- GPU: Ampere (SM80+), Hopper (SM90), or Blackwell
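Assuming the standard PyPI package name for the CUTLASS 4.x Python DSL, `nvidia-cutlass-dsl` (verify against the official CUTLASS documentation for your platform), installation is a single pip command:

```shell
# Install the CuTe DSL (package name assumed; check the CUTLASS docs for your platform)
pip install nvidia-cutlass-dsl
```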
Core Concepts
Decorators
@cute.kernel
Defines a CUDA kernel that runs on the GPU:
- Type annotations specify tensor types
- Access thread/block indices with cute.arch
- Use CuTe operations inside the kernel
@cute.jit
Marks a host function for JIT compilation:
Layouts
Layouts define the mapping between logical coordinates and memory addresses.
Creating Layouts
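Conceptually, a layout is a (shape, stride) pair: the offset of a logical coordinate is the dot product of the coordinate with the strides. The plain-Python model below (illustrative only, not the CuTe API, where you would call `cute.make_layout(shape, stride)`) shows how row-major and column-major strides map the same coordinate to different offsets:

```python
# Plain-Python model of a CuTe layout: offset(coord) = sum(c * s).
# In CuTe you would write cute.make_layout(shape, stride) instead.

def layout_offset(coord, stride):
    """Map a logical coordinate to a linear memory offset."""
    return sum(c * s for c, s in zip(coord, stride))

shape = (4, 8)
row_major = (8, 1)   # advancing one row skips 8 elements; one column skips 1
col_major = (1, 4)   # advancing one row skips 1 element; one column skips 4

print(layout_offset((2, 3), row_major))  # -> 19  (2*8 + 3*1)
print(layout_offset((2, 3), col_major))  # -> 14  (2*1 + 3*4)
```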
Composed Layouts
Create hierarchical layouts for tiling:
Thread-Value (TV) Layouts
Map threads to data elements:
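A TV layout assigns each (thread, value) pair to one data element. The plain-Python model below (illustrative only; in CuTe you would build this with `cute.make_layout_tv(thr_layout, val_layout)`) contrasts a blocked assignment, where each thread owns a contiguous chunk, with a strided one, where consecutive threads touch consecutive elements:

```python
# Plain-Python model of a thread-value (TV) layout: (thread, value) -> element.
# Illustrative only -- CuTe expresses this with cute.make_layout_tv(...).

def tv_offset(thread, value, thr_stride, val_stride):
    """Element index owned by (thread, value) under the given strides."""
    return thread * thr_stride + value * val_stride

num_threads, num_values = 8, 4

# Blocked: thread t owns the contiguous chunk [4t, 4t+4).
blocked = [tv_offset(t, v, 4, 1) for t in range(num_threads) for v in range(num_values)]

# Strided: consecutive threads touch consecutive elements (coalesced access).
strided = [tv_offset(t, v, 1, 8) for t in range(num_threads) for v in range(num_values)]

print(blocked[:4])   # thread 0 owns 0, 1, 2, 3
print(strided[:4])   # thread 0 owns 0, 8, 16, 24; thread 1 starts at 1
```

Both assignments cover all 32 elements exactly once; they differ only in which thread touches which element, which is what determines coalescing.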
Tensors
Tensors combine data pointers with layouts:
Tensor Operations
Copy Operations
Copy Atoms
Define how data is copied:
Tiled Copy
Distribute copy operations across threads:
Fragments
Register memory storage:
MMA (Matrix Multiply-Accumulate)
Utilize Tensor Cores for matrix multiplication:
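The contract a Tensor Core instruction implements is a small-tile multiply-accumulate, d = a @ b + c. The plain-Python model below is illustrative only; on the GPU this is built with `cute.make_tiled_mma(...)` and invoked with `cute.gemm(...)`:

```python
# Plain-Python model of the MMA contract on a small tile: returns a @ b + c.
# On the GPU this maps to a Tensor Core instruction via cute.gemm(...).

def mma(a, b, c):
    """Matrix multiply-accumulate on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[c[i][j] + sum(a[i][p] * b[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]

a = [[1, 2], [3, 4]]   # 2x2 A tile
b = [[5, 6], [7, 8]]   # 2x2 B tile
c = [[1, 1], [1, 1]]   # 2x2 accumulator
print(mma(a, b, c))    # -> [[20, 23], [44, 51]]
```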
Shared Memory
Allocate and manage shared memory:
Synchronization
Control Flow
Compile-Time Ranges
Dynamic Control Flow
Complete Example: Elementwise Operation
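The original CuTe DSL listing for this example is not reproduced here; as a stand-in, the plain-Python model below shows the per-thread index math such a kernel performs: each thread computes its global index from `cute.arch.block_idx()`, `cute.arch.block_dim()`, and `cute.arch.thread_idx()`, bounds-checks it, and handles one element:

```python
# Plain-Python model of an elementwise-add kernel's thread indexing.
# On the GPU each thread computes gidx = bidx * bdim + tidx and writes
# one element; the loops below simulate the grid and block.

def elementwise_add(a, b, block_dim=4):
    n = len(a)
    out = [0] * n
    grid_dim = (n + block_dim - 1) // block_dim   # ceil-div: enough blocks to cover n
    for bidx in range(grid_dim):                  # simulate the grid
        for tidx in range(block_dim):             # simulate threads within a block
            gidx = bidx * block_dim + tidx
            if gidx < n:                          # bounds check (predication)
                out[gidx] = a[gidx] + b[gidx]
    return out

print(elementwise_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# -> [11, 22, 33, 44, 55]
```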
Advanced Features
Multi-Stage Pipelines
Overlap memory transfers with computation:
TMA (Tensor Memory Accelerator)
Hopper+ feature for efficient bulk transfers:
PyTorch Integration
Seamless integration with PyTorch:
JAX Integration
Debugging
Print Statements
Compile with Debug Info
Use NCU (NVIDIA Nsight Compute profiler)
Performance Tips
Memory Coalescing
Ensure contiguous threads access contiguous memory:
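The hardware services a warp's load as one transaction per 128-byte segment touched, so contiguous per-thread accesses are cheap and strided ones multiply traffic. A small model of the segment count (float32 elements, 128-byte segments assumed):

```python
# Model of coalescing: count the 128-byte memory segments the 32 threads
# of a warp touch in one load, as a function of the per-thread stride.

ELEM_BYTES = 4      # float32
SEGMENT = 128       # bytes per memory transaction

def segments_touched(stride_elems, warp=32):
    addrs = [t * stride_elems * ELEM_BYTES for t in range(warp)]
    return len({a // SEGMENT for a in addrs})

print(segments_touched(1))    # coalesced: 1 transaction for the whole warp
print(segments_touched(32))   # strided (e.g. column walk of a 32-wide row-major tile): 32
```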
Bank Conflicts
Add padding to shared memory layouts:
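Shared memory is split into 32 four-byte banks, and accesses by a warp serialize when threads hit the same bank. A quick model of why one element of row padding fixes a column access over a 32-wide tile:

```python
# Model of shared-memory bank conflicts: 32 threads each read column 0 of
# their own row. With a 32-wide row every address lands in the same bank
# (32-way conflict); padding the row to 33 spreads the column over all banks.

NUM_BANKS = 32

def banks_hit(row_width, warp=32):
    """Distinct banks touched when thread t reads element (t, 0)."""
    return len({(t * row_width) % NUM_BANKS for t in range(warp)})

print(banks_hit(32))   # unpadded: 1 bank -> 32-way conflict
print(banks_hit(33))   # padded (32 + 1): 32 banks -> conflict-free
```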
Vectorization
Use 128-bit loads/stores when possible:
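A 128-bit load moves 16 bytes per instruction (given 16-byte alignment), so the element count per load depends only on the dtype width:

```python
# Elements moved per 128-bit vectorized load/store, by dtype width.

VECTOR_BITS = 128

def elems_per_vector(dtype_bits):
    return VECTOR_BITS // dtype_bits

print(elems_per_vector(32))   # float32: 4 per load (float4 in CUDA C++)
print(elems_per_vector(16))   # float16/bfloat16: 8 per load
print(elems_per_vector(8))    # int8/fp8: 16 per load
```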
Occupancy
Balance threads, registers, and shared memory:
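Resident blocks per SM are capped by whichever resource runs out first: threads, registers, or shared memory. A sketch of that calculation, using typical Ampere-class per-SM limits (assumed values; check your GPU's specs):

```python
# Model of occupancy: blocks per SM = min over resource limits.
# Per-SM capacities below are typical Ampere values (assumptions).

SM_THREADS = 2048
SM_REGISTERS = 65536
SM_SMEM = 100 * 1024   # ~100 KiB usable shared memory

def blocks_per_sm(threads, regs_per_thread, smem_bytes):
    limits = [
        SM_THREADS // threads,
        SM_REGISTERS // (regs_per_thread * threads),
        SM_SMEM // smem_bytes if smem_bytes else float("inf"),
    ]
    return min(limits)

# 256 threads, 64 regs/thread, 16 KiB smem: the register file is the bottleneck
# (4 blocks) before shared memory (6) or the thread limit (8).
print(blocks_per_sm(256, 64, 16 * 1024))   # -> 4
```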
API Summary
Decorators
@cute.kernel - Define GPU kernel
@cute.jit - Mark function for JIT compilation
Layout Functions
cute.make_layout(shape, stride) - Create layout
cute.make_ordered_layout(shape, order) - Create ordered layout
cute.make_layout_tv(thr_layout, val_layout) - Create TV layout
Tensor Functions
cute.make_tensor(ptr, layout) - Create tensor
cute.make_fragment_like(tensor) - Create fragment
cute.make_rmem_tensor(shape, dtype) - Allocate registers
cute.make_identity_tensor(shape) - Create coordinate tensor
cute.zipped_divide(tensor, tiler) - Tile tensor
cute.local_tile(tensor, tiler, coord, proj) - Local tile
cute.size(tensor, mode) - Get size
Copy Functions
cute.make_copy_atom(op, dtype, ...) - Create copy atom
cute.make_tiled_copy_tv(atom, thr, val) - Create tiled copy
cute.copy(atom, src, dst, pred) - Perform copy
cute.autovec_copy(src, dst) - Auto-vectorized copy
MMA Functions
cute.nvgpu.MmaUniversalOp(dtype) - Create MMA op
cute.make_tiled_mma(op, layout, ...) - Create tiled MMA
cute.gemm(mma, c, a, b, c) - Matrix multiply-accumulate (c = a @ b + c)
Arch Functions
cute.arch.thread_idx() - Thread index
cute.arch.block_idx() - Block index
cute.arch.block_dim() - Block dimensions
cute.arch.grid_dim() - Grid dimensions
cute.arch.syncthreads() - Block barrier
cute.arch.cp_async_commit_group() - Commit async copy group
cute.arch.cp_async_wait_group(n) - Wait for async copy groups
Compilation
cute.compile(fn, *args, **opts) - Compile function
cute.compile[options](fn, *args) - Compile with options
Next Steps
Examples
Explore complete kernel examples
Quickstart
Quick introduction to CuTe DSL
PyTorch Integration
Build PyTorch extensions