Example Categories
Ampere Examples
Kernels for NVIDIA Ampere architecture (SM80)
Hopper Examples
Advanced kernels using Hopper features (SM90)
Blackwell Examples
Latest kernels for Blackwell architecture
Jupyter Notebooks
Interactive tutorials and guides
Ampere Examples (SM80)
Elementwise Addition
Basic example demonstrating CuTe DSL fundamentals.
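To make the idea concrete, here is a minimal host-side sketch of what an elementwise addition kernel does. It is plain Python, not CuTe DSL code: the function name `elementwise_add` and the `threads_per_block` parameter are hypothetical, and the nested loops only emulate how a 1D grid of thread blocks would cover the data.

```python
def elementwise_add(a, b, threads_per_block=4):
    """Emulate a 1D SIMT launch on the host: one 'thread' per output element.

    Hypothetical helper for illustration only; a real kernel runs these
    iterations in parallel on the GPU.
    """
    n = len(a)
    assert len(b) == n, "inputs must have the same length"
    out = [0.0] * n
    # Ceil-divide so that a partially filled last block is still launched.
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # global index
            if i < n:  # bounds guard for the partially filled last block
                out[i] = a[i] + b[i]
    return out


result = elementwise_add([1.0, 2.0, 3.0, 4.0, 5.0],
                         [10.0, 20.0, 30.0, 40.0, 50.0])
print(result)  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

With five elements and four threads per block, two blocks are launched and the guard disables three threads in the second block.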
SIMT GEMM (FP32)
Dense matrix multiplication using floating-point units.
- 3-stage software pipeline (overlaps gmem→smem with compute)
- 2-stage register pipeline (overlaps smem→rmem with compute)
- Shared memory padding to avoid bank conflicts
- Vectorized memory accesses (128-bit loads)
- Predication for irregular tile shapes
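The shared memory padding item can be checked with a little arithmetic. The sketch below assumes the common configuration of 32 banks, each one 4-byte word wide, and a row-major FP32 tile in which thread `t` of a warp reads `tile[t][col]`. The helper `max_conflict_degree` is hypothetical and only models bank indices; it does not touch a GPU.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each 4 bytes (one FP32 word) wide


def max_conflict_degree(row_stride, col=0, rows=32):
    """Worst-case number of threads in a warp that hit the same bank
    when thread t reads tile[t][col] from a row-major FP32 tile whose
    rows are `row_stride` words apart."""
    banks = [(t * row_stride + col) % NUM_BANKS for t in range(rows)]
    return max(Counter(banks).values())


print(max_conflict_degree(row_stride=32))  # unpadded 32-word rows -> 32
print(max_conflict_degree(row_stride=33))  # one padding word per row -> 1
```

With a stride of 32 words every thread in the column access lands in the same bank, so the accesses serialize. One extra word per row shifts each row by one bank and makes the column access conflict-free, at the cost of a small amount of unused shared memory.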
TensorOp GEMM (FP16)
GEMM using Ampere Tensor Cores.
Flash Attention v2
Fused multi-head attention with tiling for memory efficiency.
- Online softmax computation (the full attention matrix is never materialized)
- Tiling for long sequences
- Shared memory management
- Causal masking support
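The online softmax item is the core trick, so a small reference sketch may help. It is plain Python for a single query vector, not the example kernel itself: the function names and the `tile` parameter are hypothetical, and causal masking is left out for brevity. The tiled version keeps only a running maximum, a running sum, and an unnormalized accumulator, and rescales them whenever a new tile raises the maximum.

```python
import math


def attention_reference(q, keys, values):
    """Naive single-query attention: computes every score before normalizing."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_online(q, keys, values, tile=2):
    """Same result, processing keys/values one tile at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of value vectors
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.5, -0.3], [0.7, 0.7], [-1.0, 2.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
ref = attention_reference(q, keys, values)
out = attention_online(q, keys, values, tile=2)
print(all(math.isclose(r, o, rel_tol=1e-9) for r, o in zip(ref, out)))
```

Because only one tile of scores exists at a time, memory use depends on the tile size rather than on the sequence length.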
Hopper Examples (SM90)
TMA GEMM
GEMM using Tensor Memory Accelerator for efficient bulk transfers.
- TMA for high-bandwidth gmem→smem transfers
- Asynchronous operation
- Hardware-managed addressing
- Reduced register pressure
Warp-Specialized GEMM
Separate producer/consumer warps for maximum throughput.
Blackwell Examples
FP16 GEMM Tutorial
Optimized GEMM for Blackwell architecture.
Blockwise GEMM
Decompose large GEMMs into independent blocks.
Jupyter Notebooks
Interactive tutorials available in examples/python/CuTeDSL/notebooks/:
Hello World
hello_world.ipynb: Introduction to CuTe DSL basics
CuTe Layout Algebra
cute_layout_algebra.ipynb: Deep dive into layout algebra and composition
Tensor Operations
tensor.ipynb: Working with tensors, indexing, and partitioning
Data Types
data_types.ipynb: Using different numeric types (FP16, FP8, INT8, etc.)
Async Pipeline
async_pipeline.ipynb: Multi-stage pipelines with async copies
Autotuning
benchmark_autotune.ipynb: Automatically find best kernel parameters
CUDA Graphs
cuda_graphs.ipynb: Reduce kernel launch overhead with CUDA graphs
Tour to Sol GEMM
tour_to_sol_gemm.ipynb: Step-by-step GEMM optimization guide
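As a small taste of the layout topic covered in the layout algebra notebook: a CuTe layout pairs a shape with a stride and maps a coordinate to a linear index by taking the inner product of the coordinate with the stride. The sketch below is plain Python with hypothetical helper names (`layout_index`, `layout_table`), not the CuTe DSL API, and covers only flat 2D layouts.

```python
def layout_index(coord, stride):
    """Map a coordinate to a linear index: inner product with the stride."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_table(shape, stride):
    """Tabulate the index of every (row, col) coordinate of a 2D layout."""
    rows, cols = shape
    return [[layout_index((i, j), stride) for j in range(cols)]
            for i in range(rows)]


# Layout (2,3):(3,1) is row-major; layout (2,3):(1,2) is column-major.
print(layout_table((2, 3), (3, 1)))  # [[0, 1, 2], [3, 4, 5]]
print(layout_table((2, 3), (1, 2)))  # [[0, 2, 4], [1, 3, 5]]
```

The same shape with different strides gives a different memory order, which is why layouts can describe row-major, column-major, and padded tiles without changing the code that indexes them.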
Running Notebooks
More Examples
Advanced Examples
Framework Integration Examples
PyTorch CUDA Extension
Benchmarking
All examples support benchmarking mode:
Profiling
NCU (NVIDIA Nsight Compute)
Nsight Systems
Next Steps
CuTe DSL Guide
Learn CuTe DSL concepts in depth
PyTorch Integration
Integrate kernels with PyTorch
GitHub Repository
Quickstart
Get started quickly