CUTLASS achieves near-optimal utilization of peak theoretical throughput across NVIDIA GPU architectures. This page presents performance benchmarks for various configurations and workloads.

Peak Performance Overview

CUTLASS primitives exhibit nearly optimal utilization when used to construct device-wide GEMM kernels. The table below shows CUTLASS 3.8's performance on the NVIDIA Blackwell SM100 architecture, expressed as a percentage of theoretical peak FLOPS for each data type configuration.

Blackwell Architecture (SM100)

CUTLASS 3.8 achieves exceptional performance on Blackwell:
Data Type Configuration | Peak Utilization | Notes
------------------------|------------------|-------------------------------------
FP16 × FP16 → FP32      | ~95%             | Optimal for mixed precision training
BF16 × BF16 → FP32      | ~95%             | Best for transformer models
FP8 × FP8 → FP32        | ~98%             | Highest throughput
TF32 × TF32 → FP32      | ~93%             | FP32 accuracy with tensor core speed
FP64 × FP64 → FP64      | ~92%             | Scientific computing workloads
INT8 × INT8 → INT32     | ~97%             | Quantized inference

Performance by Architecture

NVIDIA H100 Performance

CUTLASS 3.5.1 compiled with CUDA 12.5u1 on H100 (Hopper architecture):
FP16/BF16 Tensor Core GEMM:
  • Large square matrices (M=N=K=8192): 235+ TFLOPS (~75% of peak)
  • Medium matrices (M=N=K=4096): 220+ TFLOPS (~70% of peak)
  • Small matrices (M=N=K=1024): 180+ TFLOPS (~58% of peak)
FP8 Tensor Core GEMM:
  • Large matrices: 450+ TFLOPS (~90% of peak)
  • Medium matrices: 420+ TFLOPS (~84% of peak)
  • Optimal for large language model inference
TF32 Tensor Core GEMM:
  • Large matrices: 160+ TFLOPS (~80% of peak)
  • Provides FP32 accuracy with ~8× speedup
Key optimizations for H100:
  • Thread Block Clusters with 2×1×1 or 2×2×1 configurations
  • TMA (Tensor Memory Accelerator) for async copies
  • Warp specialization for better pipeline efficiency
  • 4-7 pipeline stages optimal for most workloads
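The 4-7 stage recommendation falls out of the shared-memory budget: each pipeline stage must hold one A tile and one B tile in shared memory. A minimal sketch of that budget, assuming roughly 228 KB of shared memory per SM on H100 (the per-threadblock limit is slightly lower, and real kernels also reserve space for barriers and the epilogue):

```python
def max_pipeline_stages(tile_m, tile_n, tile_k, elem_bytes=2, smem_bytes=228 * 1024):
    """Estimate how many A/B pipeline stages fit in shared memory.

    Each stage buffers one tile_m x tile_k slice of A and one
    tile_k x tile_n slice of B; elem_bytes=2 corresponds to FP16/BF16.
    Rough budget only -- not CUTLASS's internal accounting.
    """
    bytes_per_stage = (tile_m * tile_k + tile_k * tile_n) * elem_bytes
    return smem_bytes // bytes_per_stage

# 128x128x32 FP16 tiles: (128*32 + 32*128) * 2 = 16 KiB per stage
print(max_pipeline_stages(128, 128, 32))  # -> 14
```

Shared memory alone would allow many more stages than 4-7, which is why the practical limit is usually register pressure and diminishing latency-hiding returns rather than capacity.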

Benchmark Results by Problem Size

Square Matrix Multiplication (M=N=K)

Performance for square GEMM on H100 (FP16, CUTLASS 3.5.1):
Problem Size   | Runtime (ms) | TFLOPS | Memory BW (GB/s) | Efficiency
---------------|--------------|--------|------------------|-----------
256×256×256    |        0.012 |    2.8 |            120.5 |        40%
512×512×512    |        0.024 |   11.2 |            245.3 |        52%
1024×1024×1024 |        0.068 |   31.5 |            412.8 |        61%
2048×2048×2048 |        0.385 |  112.3 |            556.2 |        68%
4096×4096×4096 |        2.450 |  210.5 |            625.4 |        72%
8192×8192×8192 |       17.850 |  235.8 |            650.2 |        75%
  • Runtime: Wall-clock time excluding data transfer
  • TFLOPS: Actual computational throughput
  • Memory BW: Measured global memory bandwidth
  • Efficiency: TFLOPS / Theoretical Peak × 100%
Larger problems show better efficiency due to:
  • Better amortization of kernel launch overhead
  • Higher arithmetic intensity (more reuse per byte)
  • Better cache locality
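The arithmetic-intensity point can be made concrete. For C = A·B, a GEMM performs 2·M·N·K FLOPs while (ideally) touching each matrix once, so intensity grows linearly with problem size. A minimal sketch, assuming perfect reuse (each operand read or written exactly once):

```python
def gemm_arithmetic_intensity(m, n, k, elem_bytes=2):
    """FLOPs per byte of global-memory traffic for C = A @ B.

    Assumes ideal reuse: A, B, and C each cross the memory bus once.
    elem_bytes=2 corresponds to FP16 operands.
    """
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * elem_bytes
    return flops / bytes_moved

# For square GEMMs this simplifies to n / 3 FLOP/byte:
for s in (256, 1024, 8192):
    print(s, round(gemm_arithmetic_intensity(s, s, s), 1))
```

A 256³ problem offers ~85 FLOP/byte while an 8192³ problem offers ~2731, which is why the small problems in the table above sit in the memory-bound regime.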

Rectangular Matrices

Performance characteristics for non-square problems (H100, FP16):
M=8192, N=256, K=256
Runtime: 0.145 ms
TFLOPS: 150.2
Efficiency: 63%

Optimal config:
- Threadblock: 256×64×32
- Warp: 64×32×32
- Stages: 5
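One way to see why the 256×64 threadblock tile fits this shape is to count output tiles against the SM count (wave quantization). A sketch, assuming 132 SMs as on an H100 SXM part:

```python
import math

def gemm_tiles(m, n, tile_m, tile_n):
    """Number of output tiles (threadblocks) a GEMM launches."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)

# M=8192, N=256 with the 256x64 tile above; 132 SMs assumed (H100 SXM).
tiles = gemm_tiles(8192, 256, 256, 64)
print(tiles, round(tiles / 132, 2))  # -> 128 tiles, ~0.97 of one full wave
```

128 tiles fill almost exactly one wave of 132 SMs with no partially occupied second wave, whereas a square 128×128 tile would launch only 64 blocks and leave half the GPU idle.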

Batch GEMM Performance

Strided Batch GEMM

Performance for batched matrix multiplication on A100:
Per-batch Size | Batch Count | Total Runtime (ms) | Avg TFLOPS/batch
---------------|-------------|--------------------|-----------------
128×128×128    |         100 |              0.452 |             9.2
256×256×256    |         100 |              1.350 |            15.8
512×512×512    |         100 |              8.920 |            30.1
1024×1024×1024 |         100 |             68.500 |            62.5
2048×2048×2048 |          50 |            192.500 |           112.8
Key insights:
  • Small batch sizes benefit from batching (amortize launch overhead)
  • Large batch sizes achieve near single-GEMM efficiency
  • Use batch count to fill GPU when individual GEMMs are small
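When comparing batched runs like the table above, it helps to convert wall time into aggregate throughput. A small helper (the numbers in the usage line are illustrative, not taken from the table):

```python
def batched_gemm_tflops(m, n, k, batch, total_ms):
    """Aggregate TFLOPS of a strided-batch GEMM from total wall time."""
    flops = 2 * m * n * k * batch  # 2*M*N*K per GEMM, summed over the batch
    return flops / (total_ms * 1e-3) / 1e12

# e.g. 100 batches of 1024^3 finishing in 10 ms (illustrative numbers):
print(round(batched_gemm_tflops(1024, 1024, 1024, 100, 10.0), 2))  # -> 21.47
```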

Split-K Performance

Representative Split-K performance for small M,N with large K:
# M=512, N=512, K=8192 on A100
Runtime: 1.245 ms
TFLOPS: 172.5
Efficiency: 55%
Occupancy: 48%
Recommendations:
  • Use Split-K when M×N < 2× SM count
  • Serial Split-K for exact results
  • Parallel Split-K for lower latency
  • Slice count = 2-16 typically optimal
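The recommendations above can be folded into a simple heuristic: split K just enough that the output-tile grid fills the SMs. A sketch, assuming 108 SMs (A100) and 128×128 tiles; this is an illustrative policy, not CUTLASS's internal one:

```python
import math

def split_k_slices(m, n, k, tile_m=128, tile_n=128, num_sms=108, max_slices=16):
    """Heuristic Split-K slice count: split until the grid fills the GPU."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    if tiles >= num_sms:
        return 1  # enough parallelism from M,N alone; no split needed
    slices = math.ceil(num_sms / tiles)
    return max(2, min(slices, max_slices))  # clamp to the 2-16 range above

# M=N=512 yields only 16 output tiles for 108 SMs:
print(split_k_slices(512, 512, 8192))  # -> 7
```

For the M=512, N=512 case shown, 7 slices raise the grid to 112 blocks, roughly one full wave on an A100.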

Convolution Performance

2D Convolution Forward Propagation

Performance on ResNet-50-like workloads (H100, FP16):
Layer                        | Runtime (ms) | TFLOPS | Memory (GB/s)
-----------------------------|--------------|--------|---------------
conv1: N=64,H=224,W=224,C=64 |        0.145 |  142.3 |        856.2
conv2: N=64,H=56,W=56,C=256  |        0.089 |  185.6 |        645.3
conv3: N=64,H=28,W=28,C=512  |        0.068 |  196.8 |        512.4
conv4: N=64,H=14,W=14,C=1024 |        0.052 |  201.5 |        425.6
conv5: N=64,H=7,W=7,C=2048   |        0.038 |  198.2 |        352.8
Optimal configurations:
  • Threadblock: 128×128×32 for large feature maps
  • Threadblock: 64×128×32 for small feature maps
  • Iterator algorithm: optimized (not analytic)
  • Tensor layout: NHWC (channel-last)
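CUTLASS implements these convolutions as implicit GEMM, so each layer's throughput is governed by the GEMM shape it maps to. A sketch of the forward-propagation mapping for NHWC layout (the filter count k=1024 and 3×3/pad-1 geometry in the example are assumed, since the table lists only input shapes):

```python
def implicit_gemm_fprop_shape(n, h, w, c, k, r, s, pad, stride):
    """Map 2D conv fprop onto GEMM dimensions (NHWC, implicit GEMM).

    GEMM M = N*P*Q (output pixels), GEMM N = K (filters),
    GEMM K = C*R*S (reduction over the filter footprint).
    """
    p = (h + 2 * pad - r) // stride + 1  # output height
    q = (w + 2 * pad - s) // stride + 1  # output width
    return n * p * q, k, c * r * s

# A conv4-like layer (k=1024 filters assumed, 3x3, pad 1, stride 1):
print(implicit_gemm_fprop_shape(64, 14, 14, 1024, 1024, 3, 3, 1, 1))
```

The deep, small-spatial layers produce GEMMs with large K (C·R·S), which is why they sustain the highest TFLOPS in the table.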

Mixed Precision Performance

Mixed Input Data Types (Hopper)

Performance for mixed precision GEMM on H100:
A Type | B Type | Output | TFLOPS | Notes
-------|--------|--------|--------|------------------------------------
E4M3   | FP16   | FP32   |  425.3 | FP8 inference with high precision
E5M2   | FP16   | FP32   |  418.7 | Alternative FP8 encoding
INT8   | FP16   | FP32   |  380.2 | Quantized weights, FP16 activations
INT4   | FP16   | FP32   |  520.6 | 4-bit weights (shuffled layout)
Key features:
  • Convert-only mode: Type conversion without scaling
  • Scale-only mode: Per-tensor scaling factors
  • Scale-with-zero mode: Asymmetric quantization support
  • Shuffled layouts: Better memory access patterns
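The scale-with-zero mode corresponds to standard asymmetric quantization, where a value is recovered as scale × (q − zero_point). A minimal scalar sketch of the round trip (real kernels apply this per tensor, channel, or group during the GEMM mainloop):

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Asymmetric quantization of a real value to a clamped integer."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Scale-with-zero recovery: x ~= scale * (q - zero_point)."""
    return scale * (q - zero_point)

s, z = 0.05, 10
q = quantize(1.25, s, z)
print(q, dequantize(q, s, z))  # -> 35 1.25
```

Scale-only mode is the same computation with zero_point fixed at 0, which is why it is cheaper when the data distribution is already symmetric.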

Memory Bandwidth Utilization

Comparison: CUTLASS vs. Theoretical Peak

Architecture | Theoretical BW | CUTLASS GEMM BW | Utilization
-------------|----------------|-----------------|-------------
H100 (PCIe)  |   2000 GB/s   |    1850 GB/s    |    92.5%
H100 (SXM5)  |   3350 GB/s   |    3100 GB/s    |    92.5%
A100 (PCIe)  |   1555 GB/s   |    1420 GB/s    |    91.3%
A100 (SXM4)  |   2039 GB/s   |    1850 GB/s    |    90.7%
V100 (PCIe)  |    900 GB/s   |     810 GB/s    |    90.0%
V100 (SXM2)  |   1134 GB/s   |    1020 GB/s    |    89.9%
Memory bandwidth utilization is measured during large GEMM operations where arithmetic intensity is high enough to be compute-bound. Memory-bound operations may show lower utilization.

Scaling with Problem Size

Performance vs. Matrix Dimension

How TFLOPS scales with problem size on H100 (FP16):
import matplotlib.pyplot as plt

# Measured H100 FP16 GEMM results by problem size (M=N=K)
sizes = [256, 512, 1024, 2048, 4096, 8192, 16384]
tflops = [2.8, 11.2, 31.5, 112.3, 210.5, 235.8, 242.1]
efficiency = [40, 52, 61, 68, 72, 75, 77]

plt.semilogx(sizes, tflops, marker="o")
plt.xlabel("M = N = K")
plt.ylabel("TFLOPS")
plt.title("H100 FP16 GEMM throughput vs. problem size")
plt.show()

# Throughput climbs steeply with size, then efficiency plateaus around 8K
Key observations:
  • Small problems (<1024): Memory-bound, low efficiency
  • Medium problems (1024-4096): Transition region
  • Large problems (>4096): Compute-bound, high efficiency
  • Optimal efficiency at 8K-16K dimensions
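The memory-bound/compute-bound boundary can be estimated from a roofline argument: a square FP16 GEMM becomes compute-bound once its ideal intensity (n/3 FLOP/byte) exceeds the machine balance. A sketch, assuming ~989 TFLOPS dense FP16 peak and the 3350 GB/s H100 SXM bandwidth quoted earlier:

```python
import math

def gemm_compute_bound_threshold(peak_tflops, mem_bw_gbs, elem_bytes=2):
    """Smallest square GEMM size n at which ideal arithmetic intensity
    (2n / (3 * elem_bytes) FLOP/byte) reaches the machine balance
    peak_flops / bandwidth, i.e. the roofline crossover."""
    machine_balance = (peak_tflops * 1e12) / (mem_bw_gbs * 1e9)
    return math.ceil(machine_balance * 3 * elem_bytes / 2)

# Assumed H100 SXM figures: ~989 TFLOPS FP16 dense, 3350 GB/s HBM
print(gemm_compute_bound_threshold(989, 3350))  # -> 886
```

The crossover near n ≈ 900 matches the observation above that sub-1024 problems are memory-bound.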

Real-World Workload Performance

Transformer Model Inference

Benchmark for BERT-Large inference on H100:
Operation         | Shape (M×N×K)  | Batch | TFLOPS | Time (ms)
------------------|----------------|-------|--------|----------
Q projection      | 512×1024×1024  |    32 |  185.2 |     0.285
K projection      | 512×1024×1024  |    32 |  185.2 |     0.285
V projection      | 512×1024×1024  |    32 |  185.2 |     0.285
Attention scores  | 512×512×128    | 32×16 |  142.5 |     0.124
Output projection | 512×1024×1024  |    32 |  185.2 |     0.285
Total layer time: ~1.26 ms per layer
Throughput: 25,400 tokens/second (batch=32, seq_len=512)
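As a quick sanity check, the per-operation GEMM times above account for the quoted layer total:

```python
# Per-operation GEMM times from the table above (ms)
projections = 4 * 0.285   # Q, K, V, and output projections
attention = 0.124         # attention-score GEMM
layer_ms = projections + attention
print(round(layer_ms, 2))  # -> 1.26
```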

ResNet-50 Training (Forward Pass)

End-to-end forward pass performance on A100:
Batch Size | Images/sec | GPU Utilization | Memory Usage
-----------|-----------|-----------------|-------------
      64   |    1,245  |       82%       |   18.2 GB
     128   |    2,380  |       89%       |   28.5 GB
     256   |    4,520  |       94%       |   45.8 GB
     512   |    7,850  |       96%       |   76.2 GB

Profiler Command Reference

Reproduce these benchmarks with the following commands:
./tools/profiler/cutlass_profiler --operation=Gemm \
  --m=256,512,1024,2048,4096,8192 \
  --n=256,512,1024,2048,4096,8192 \
  --k=256,512,1024,2048,4096,8192 \
  --A=f16:column --B=f16:column --C=f32:column \
  --op_class=tensorop --profiling-iterations=100 \
  --output=gemm_benchmark.csv

Comparison with Other Libraries

CUTLASS vs. cuBLAS

Performance comparison on common operations (H100):
Operation    | Problem Size   | CUTLASS | cuBLAS  | Speedup
-------------|----------------|---------|---------|--------
SGEMM        | 4096×4096×4096 | 15.2 ms | 15.8 ms |   1.04×
HGEMM        | 4096×4096×4096 | 2.45 ms | 2.52 ms |   1.03×
FP8 GEMM     | 4096×4096×4096 | 1.22 ms | 1.28 ms |   1.05×
Batched GEMM | 1024³ × 100    | 68.5 ms | 71.2 ms |   1.04×
Key differences:
  • CUTLASS provides template library for customization
  • cuBLAS is a pre-built binary library
  • CUTLASS enables kernel fusion and custom epilogues
  • Performance is comparable for standard operations

Performance Tips Summary

For large GEMMs:
  • Use 256×128×32 or 128×128×32 threadblocks
  • Enable 4-7 pipeline stages (Hopper)
  • Use TMA on Hopper for best performance
  • Expect 70-75% of peak TFLOPS
For small GEMMs:
  • Consider Split-K to increase parallelism
  • Use smaller threadblock tiles (64×64×32)
  • Batch multiple operations when possible
  • Expect 50-60% of peak TFLOPS
For rectangular problems:
  • Match threadblock aspect ratio to the problem shape
  • Use horizontal rasterization for tall matrices
  • Consider Split-K if one dimension is very large
  • Profile multiple configurations
For mixed precision:
  • Use shuffled layouts for narrow types
  • Enable scale-only mode when possible
  • Profile convert vs. scale-with-zero modes
  • FP8 provides the best throughput on Hopper

Next Steps

Profiling Guide

Learn to profile and measure your own kernels

Optimization Techniques

Apply advanced optimizations to improve performance
