Peak Performance Overview
CUTLASS primitives exhibit nearly optimal utilization when used to construct device-wide GEMM kernels. The table below shows CUTLASS 3.8's performance on the NVIDIA Blackwell SM100 architecture, reported as a percentage of the theoretical peak FLOPS for each data type configuration.
Blackwell Architecture (SM100)
CUTLASS 3.8 achieves exceptional performance on Blackwell:

| Data Type Configuration | Peak Utilization | Notes |
|---|---|---|
| FP16 × FP16 → FP32 | ~95% | Optimal for mixed precision training |
| BF16 × BF16 → FP32 | ~95% | Best for transformer models |
| FP8 × FP8 → FP32 | ~98% | Highest throughput |
| TF32 × TF32 → FP32 | ~93% | FP32 accuracy with tensor core speed |
| FP64 × FP64 → FP64 | ~92% | Scientific computing workloads |
| INT8 × INT8 → INT32 | ~97% | Quantized inference |
Performance by Architecture
- Hopper (H100)
- Ampere (A100)
- Turing (T4)
- Volta (V100)
NVIDIA H100 Performance
CUTLASS 3.5.1 compiled with CUDA 12.5u1 on H100 (Hopper architecture):

FP16/BF16 Tensor Core GEMM:
- Large square matrices (M=N=K=8192): 235+ TFLOPS (~75% of peak)
- Medium matrices (M=N=K=4096): 220+ TFLOPS (~70% of peak)
- Small matrices (M=N=K=1024): 180+ TFLOPS (~58% of peak)
FP8 Tensor Core GEMM:
- Large matrices: 450+ TFLOPS (~90% of peak)
- Medium matrices: 420+ TFLOPS (~84% of peak)
- Optimal for large language model inference
TF32 Tensor Core GEMM:
- Large matrices: 160+ TFLOPS (~80% of peak)
- Provides FP32 accuracy with ~8× speedup
Hopper-specific optimizations used:
- Thread Block Clusters with 2×1×1 or 2×2×1 configurations
- TMA (Tensor Memory Accelerator) for async copies
- Warp specialization for better pipeline efficiency
- 4-7 pipeline stages optimal for most workloads
Benchmark Results by Problem Size
Square Matrix Multiplication (M=N=K)
Square-matrix GEMM performance on H100 (FP16, CUTLASS 3.5.1) is summarized by the following metrics.
Understanding the metrics
- Runtime: Wall-clock time excluding data transfer
- TFLOPS: Actual computational throughput
- Memory BW: Measured global memory bandwidth
- Efficiency: TFLOPS / Theoretical Peak × 100%
Larger problems achieve higher efficiency due to:
- Better amortization of kernel launch overhead
- Higher arithmetic intensity (more reuse per byte)
- Better cache locality
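The metric definitions above can be checked from first principles. A minimal sketch; the 4.7 ms runtime and 313 TFLOPS peak are illustrative placeholders, not measured values:

```python
def gemm_metrics(m, n, k, runtime_s, bytes_per_elem=2, peak_tflops=313.0):
    """Derive the reported metrics for a single M x N x K GEMM.

    peak_tflops is device- and datatype-specific; 313 is a placeholder.
    """
    flops = 2 * m * n * k                    # one multiply + one add per MAC
    tflops = flops / runtime_s / 1e12
    # Minimum global memory traffic: read A and B once, write C once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    intensity = flops / bytes_moved          # FLOPs per byte (data reuse)
    efficiency = tflops / peak_tflops * 100.0
    return tflops, intensity, efficiency

# A hypothetical 8192^3 FP16 GEMM finishing in 4.7 ms -> ~234 TFLOPS
tflops, intensity, efficiency = gemm_metrics(8192, 8192, 8192, 4.7e-3)
```

For a square GEMM the arithmetic intensity grows linearly with the dimension (8192/3 ≈ 2731 FLOP/byte here), which is exactly why larger problems move from the memory-bound to the compute-bound regime.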
Rectangular Matrices
Performance characteristics for non-square problems (H100, FP16):
- Tall & Skinny (M >> N,K)
- Short & Wide (N >> M,K)
- Large K (K >> M,N)
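Part of why shape matters is wave quantization: the grid of threadblock tiles rarely divides evenly across the SMs. A sketch, assuming 128×128 output tiles, one tile per SM per wave, and 132 SMs (H100 SXM):

```python
import math

def wave_efficiency(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Fraction of SM-wave capacity doing useful work for an M x N output."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = tiles / num_sms
    # The final, partially filled wave takes full latency for partial work.
    return waves / math.ceil(waves)

square = wave_efficiency(8192, 8192)   # 4096 tiles -> ~97% of wave capacity
skinny = wave_efficiency(4096, 128)    # only 32 tiles -> ~24%: GPU underfilled
```

Smaller tiles (e.g., 64×64) quadruple the tile count in the skinny case, which is why matching the threadblock shape to the problem shape matters.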
Batch GEMM Performance
Strided Batch GEMM
Performance for batched matrix multiplication on A100:
- Small batch sizes benefit from batching (amortize launch overhead)
- Large batch sizes achieve near single-GEMM efficiency
- Use batch count to fill GPU when individual GEMMs are small
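The last point can be made concrete: multiply the tiles produced by one GEMM by the batch count and compare against the SM count. A sketch with the same illustrative 128×128 tiles and 132 SMs:

```python
import math

def min_batch_to_fill(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Smallest batch count whose combined tile grid occupies every SM."""
    tiles_per_gemm = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return math.ceil(num_sms / tiles_per_gemm)

print(min_batch_to_fill(512, 512))    # 16 tiles each -> batch of 9 fills 132 SMs
print(min_batch_to_fill(8192, 8192))  # 4096 tiles -> a single GEMM already fills
```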
Split-K Performance
Comparison of Split-K modes for small M,N with large K:
- Use Split-K when M×N < 2× SM count
- Serial Split-K for exact results
- Parallel Split-K for lower latency
- Slice count = 2-16 typically optimal
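Those rules of thumb combine into a simple slice-count heuristic. This is a sketch of the guidance above, not a CUTLASS API; tile sizes and SM count are illustrative:

```python
import math

def split_k_slices(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Pick a Split-K slice count when the M x N tile grid underfills the GPU."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    if tiles >= 2 * num_sms:
        return 1                   # enough output-tile parallelism already
    # Split K until roughly every SM has work, clamped to the 2-16 range.
    return max(2, min(16, math.ceil(num_sms / tiles)))

print(split_k_slices(256, 256))    # 4 tiles << 264 -> clamped at 16 slices
print(split_k_slices(8192, 8192))  # 4096 tiles -> no split needed: 1
```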
Convolution Performance
2D Convolution Forward Propagation
Performance on ResNet-50-like workloads (H100, FP16):
- Threadblock: 128×128×32 for large feature maps
- Threadblock: 64×128×32 for small feature maps
- Iterator algorithm: optimized (not analytic)
- Tensor layout: NHWC (channel-last)
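CUTLASS computes forward convolution as an implicit GEMM, so the tile guidance above applies to the implied GEMM shape. A sketch of the standard fprop mapping; the layer dimensions are illustrative:

```python
def implicit_gemm_shape(n, h, w, c, k, r, s, stride=1, pad=1):
    """Map an NHWC fprop convolution onto its implicit GEMM problem size."""
    p = (h + 2 * pad - r) // stride + 1    # output height
    q = (w + 2 * pad - s) // stride + 1    # output width
    gemm_m = n * p * q                     # one GEMM row per output pixel
    gemm_n = k                             # one GEMM column per output channel
    gemm_k = c * r * s                     # reduction over the filter window
    return gemm_m, gemm_n, gemm_k

# A ResNet-50-style layer: batch 32, 56x56x64 activations, 64 filters of 3x3
print(implicit_gemm_shape(32, 56, 56, 64, 64, 3, 3))   # (100352, 64, 576)
```

Large feature maps yield a tall implied GEMM (large GEMM-M), which is what the larger 128×128 threadblock recommendation targets.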
Mixed Precision Performance
Mixed Input Data Types (Hopper)
Performance for mixed precision GEMM on H100:

| A Type | B Type | Output | TFLOPS | Notes |
|---|---|---|---|---|
| E4M3 | FP16 | FP32 | 425.3 | FP8 inference with high precision |
| E5M2 | FP16 | FP32 | 418.7 | Alternative FP8 encoding |
| INT8 | FP16 | FP32 | 380.2 | Quantized weights, FP16 activations |
| INT4 | FP16 | FP32 | 520.6 | 4-bit weights (shuffled layout) |
Mixed-input GEMM supports several conversion modes:
- Convert-only mode: Type conversion without scaling
- Scale-only mode: Per-tensor scaling factors
- Scale-with-zero mode: Asymmetric quantization support
- Shuffled layouts: Better memory access patterns
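The three modes differ only in how each narrow-type weight is widened before the tensor-core MMA. A plain-Python sketch of the arithmetic, not the CUTLASS epilogue API:

```python
def widen(q, scale=1.0, zero=0.0, mode="scale_with_zero"):
    """Widen one quantized value the way each mixed-input mode would."""
    if mode == "convert_only":
        return float(q)            # type conversion, no scaling
    if mode == "scale_only":
        return scale * q           # symmetric per-tensor/per-channel scale
    return scale * (q - zero)      # asymmetric quantization: nonzero zero point

# An INT4 weight q = -5 with scale 0.02 and zero point 3
print(widen(-5, scale=0.02, zero=3))   # 0.02 * (-8) = -0.16
```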
Memory Bandwidth Utilization
Comparison: CUTLASS vs. Theoretical Peak
Memory bandwidth utilization is measured during large GEMM operations where arithmetic intensity is high enough to be compute-bound. Memory-bound operations may show lower utilization.
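The compute-bound condition is the roofline criterion: arithmetic intensity must exceed the machine balance (peak FLOPs divided by peak bandwidth). A sketch using published H100 SXM figures (~989 FP16 TFLOPS dense, ~3.35 TB/s HBM3); swap in your device's numbers:

```python
def is_compute_bound(m, n, k, bytes_per_elem=2,
                     peak_tflops=989.0, peak_tb_per_s=3.35):
    """Roofline test for a GEMM: intensity vs. machine balance (FLOP/byte)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    balance = peak_tflops / peak_tb_per_s    # ~295 FLOP/byte for these figures
    return flops / bytes_moved > balance

print(is_compute_bound(8192, 8192, 8192))   # True: large GEMM is compute-bound
print(is_compute_bound(128, 128, 128))      # False: tiny GEMM is memory-bound
```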
Scaling with Problem Size
Performance vs. Matrix Dimension
How TFLOPS scales with problem size on H100 (FP16):
- Small problems (<1024): Memory-bound, low efficiency
- Medium problems (1024-4096): Transition region
- Large problems (>4096): Compute-bound, high efficiency
- Optimal efficiency at 8K-16K dimensions
Real-World Workload Performance
Transformer Model Inference
Benchmark for BERT-Large inference on H100:

| Operation | Shape (M×N×K) | Batch | TFLOPS | Time (ms) |
|---|---|---|---|---|
| Q projection | 512×1024×1024 | 32 | 185.2 | 0.285 |
| K projection | 512×1024×1024 | 32 | 185.2 | 0.285 |
| V projection | 512×1024×1024 | 32 | 185.2 | 0.285 |
| Attention scores | 512×512×128 | 32×16 | 142.5 | 0.124 |
| Output projection | 512×1024×1024 | 32 | 185.2 | 0.285 |
Throughput: 25,400 tokens/second (batch=32, seq_len=512)
ResNet-50 Training (Forward Pass)
End-to-end forward pass performance on A100.
Profiler Command Reference
Reproduce these benchmarks with the following commands:
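The cutlass_profiler tool (built via the `cutlass_profiler` CMake target) drives these measurements. The invocations below are hedged examples: the exact kernel-name patterns and the set of available flags vary by CUTLASS release, so confirm with `cutlass_profiler --help`.

```shell
# Sweep square FP16 GEMMs at the sizes reported above
./cutlass_profiler --operation=Gemm \
  --m=1024,4096,8192 --n=1024,4096,8192 --k=1024,4096,8192 \
  --A=f16:column --B=f16:column --C=f32:column \
  --output=gemm_report.csv

# Profile only kernels matching a name pattern (pattern is illustrative)
./cutlass_profiler --kernels="cutlass3x_sm90*gemm*" \
  --m=8192 --n=8192 --k=8192
```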
Comparison with Other Libraries
CUTLASS vs. cuBLAS
Performance comparison on common operations (H100):

| Operation | Problem Size | CUTLASS | cuBLAS | Speedup |
|---|---|---|---|---|
| SGEMM | 4096×4096×4096 | 15.2 ms | 15.8 ms | 1.04× |
| HGEMM | 4096×4096×4096 | 2.45 ms | 2.52 ms | 1.03× |
| FP8 GEMM | 4096×4096×4096 | 1.22 ms | 1.28 ms | 1.05× |
| Batched GEMM | 1024³ × 100 | 68.5 ms | 71.2 ms | 1.04× |
Key differences:
- CUTLASS provides a template library for customization
- cuBLAS is a pre-built binary library
- CUTLASS enables kernel fusion and custom epilogues
- Performance is comparable for standard operations
Performance Tips Summary
For Large Square Matrices
- Use 256×128×32 or 128×128×32 threadblocks
- Enable 4-7 pipeline stages (Hopper)
- Use TMA on Hopper for best performance
- Expect 70-75% of peak TFLOPS
For Small Matrices
- Consider Split-K to increase parallelism
- Use smaller threadblock tiles (64×64×32)
- Batch multiple operations when possible
- Expect 50-60% of peak TFLOPS
For Rectangular Matrices
- Match threadblock aspect ratio to problem
- Use horizontal rasterization for tall matrices
- Consider Split-K if one dimension is very large
- Profile multiple configurations
For Mixed Precision
- Use shuffled layouts for narrow types
- Enable scale-only mode when possible
- Profile convert vs. scale-with-zero modes
- FP8 provides best throughput on Hopper
Next Steps
Profiling Guide
Learn to profile and measure your own kernels
Optimization Techniques
Apply advanced optimizations to improve performance