The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations defined in the CUTLASS Instance Library. It can execute and profile GEMM, Sparse GEMM, Conv2d, and Conv3d kernels with comprehensive performance metrics.
Building the Profiler
The CUTLASS Profiler can be compiled from the CUTLASS build directory:
$ make cutlass_profiler -j
By default, only one tile size (typically 128x128) is instantiated for each data type, math instruction, and layout to reduce compilation time.
Building All Kernels
To instantiate all available kernels (warning: this results in tens of thousands of kernels and very long build times):
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
$ make cutlass_profiler -j16
Building a Subset of Kernels
For practical use, compile only the kernels you need using wildcard patterns:
For example, to build only Tensor Core GEMM kernels for FP16 inputs with NT layout:
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' \
-DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
$ make cutlass_profiler -j16
Basic Usage
View the help message to see all available options:
$ ./tools/profiler/cutlass_profiler --help
Profiling Modes
The profiler supports several execution modes:
--mode=profile - Regular verification and profiling (default)
--mode=dry_run - No kernels are launched or workspaces allocated
--mode=enumerate - Lists all operation kinds and operations
--mode=trace - Executes a single device-side computation with no other kernel launches
Display the properties of the installed CUDA devices:
$ ./tools/profiler/cutlass_profiler --device-info
Profiling GEMM Operations
Basic GEMM Profiling
Profile a single problem size:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--m=3456 --n=4096 --k=4096
Example Output
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
Status: Success
Verification: ON
Disposition: Passed
cuBLAS: Passed
Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column \
--C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
--batch_count=1 --op_class=simt --accum=f32 \
--cta_m=128 --cta_n=128 --cta_k=8 --stages=2
Bytes: 180355072 bytes
FLOPs: 115992428544 flops
Runtime: 6.73655 ms
Memory: 24.934 GiB/s
Math: 17218.4 GFLOP/s
=============================
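The figures in this report are internally consistent and can be reproduced by hand: the byte count matches reading A and B and writing C (with beta=0, C is not read), and the FLOP count matches 2*m*n*k mainloop multiply-accumulates plus a 2*m*n epilogue term. A quick sanity check (a sketch of the accounting, not the profiler's source code):

```python
# Reproduce the Bytes/FLOPs/throughput figures from the example output above.
# Assumes f32 operands and beta=0 (C is written but not read) -- a sketch of
# the accounting, not the profiler's actual implementation.
m, n, k = 3456, 4096, 4096
sizeof_f32 = 4
runtime_s = 6.73655e-3

bytes_moved = sizeof_f32 * (m * k + k * n + m * n)  # read A, read B, write C
flops = 2 * m * n * k + 2 * m * n                   # mainloop MACs + epilogue

print(bytes_moved)                       # 180355072
print(flops)                             # 115992428544
print(bytes_moved / runtime_s / 2**30)   # ~24.93 GiB/s
print(flops / runtime_s / 1e9)           # ~17218 GFLOP/s
```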
Sweeping Problem Sizes
Use ranges to sweep over problem dimensions:
# Sweep K dimension from 8 to 4096 in increments of 8
$ ./tools/profiler/cutlass_profiler \
--kernels=cutlass_simt_sgemm_128x128_nn \
--m=4352 --n=4096 --k=8:4096:8
# Sweep multiple dimensions with comma-delimited values
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--m=1024:4096:256 --n=1024:4096:256 --k=128:8192:128 \
--beta=0,1,2.5
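Ranges use start:end:increment syntax, and comma-delimited lists combine as a cross product. Assuming the end value is inclusive (as the "8 to 4096" comment above suggests), the size of a sweep can be estimated beforehand:

```python
# Estimate how many problems a sweep will run. Assumes the range syntax
# start:end:increment includes the end value and that the profiler
# enumerates the full cross product of all swept axes.
def sweep_len(start, end, inc):
    return (end - start) // inc + 1

k_problems = sweep_len(8, 4096, 8)
print(k_problems)  # 512

# The multi-dimensional sweep above: 13 m-values x 13 n-values x 64 k-values
# x 3 beta values.
total = (sweep_len(1024, 4096, 256)
         * sweep_len(1024, 4096, 256)
         * sweep_len(128, 8192, 128)
         * 3)
print(total)  # 32448
```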
Tensor Core Operations
Profile kernels using Tensor Cores:
$ ./tools/profiler/cutlass_profiler --op_class=tensorop \
--m=3456 --n=4096 --k=8192
Example output:
Operation: cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8
Arguments: --m=3456 --n=4096 --k=8192 --A=f16:column --B=f16:column \
--C=f32:column --op_class=tensorop --accum=f32 \
--cta_m=256 --cta_n=128 --cta_k=32 --stages=3 \
--inst_m=16 --inst_n=8 --inst_k=16
Bytes: 180355072 bytes
FLOPs: 231956545536 flops
Runtime: 0.98647 ms
Memory: 170.272 GiB/s
Math: 235138 GFLOP/s
Advanced Profiling Options
Filtering Kernels
Use wildcard patterns to filter which kernels to profile:
# Profile kernels matching pattern
$ ./tools/profiler/cutlass_profiler \
--kernels="s1688*nt, s884*tn*align8" \
--m=3456 --n=4096 --k=4096
# Exclude specific kernels
$ ./tools/profiler/cutlass_profiler \
--kernels=sgemm --ignore-kernels="*splitk*"
Controlling Profiling Duration
The profiler can run a fixed number of iterations, run for a fixed duration, or execute warmup iterations that are excluded from timing. For example:
# Profile with exactly 100 iterations
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--m=2048 --n=2048 --k=2048 \
--profiling-iterations=100
Data Distribution
Control how input tensors are initialized:
# Uniform distribution
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--dist=uniform,min:0,max:3 --m=1024 --n=1024 --k=128
# Gaussian distribution
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--dist=gaussian,mean:0,stddev:3 --m=1024 --n=1024 --k=128
# Sequential distribution
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--dist=sequential,start:0,delta:1 --m=1024 --n=1024 --k=128
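The three modes above correspond to familiar random and deterministic fills. A rough host-side illustration of what each produces (illustrative only; CUTLASS initializes tensors with its own device-side routines):

```python
import random

# Illustrates the shape of each --dist mode's output. This mimics the
# distributions conceptually; it is not CUTLASS's initialization code.
random.seed(0)

uniform = [random.uniform(0, 3) for _ in range(8)]     # uniform,min:0,max:3
gaussian = [random.gauss(0, 3) for _ in range(8)]      # gaussian,mean:0,stddev:3
sequential = [0 + 1 * i for i in range(8)]             # sequential,start:0,delta:1

print(sequential)  # [0, 1, 2, 3, 4, 5, 6, 7]
```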
Verification Options
# Disable verification for faster profiling
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--verification-enabled=false --m=4096 --n=4096 --k=4096
# Save workspace when results are incorrect
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--save-workspace=incorrect --m=1024 --n=1024 --k=1024
# Set error tolerance
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--epsilon=0.01 --nonzero-floor=1e-8
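The epsilon and nonzero-floor options work together: epsilon bounds the relative error, while the floor keeps the comparison stable when the reference value is near zero. A sketch of how such a check typically works (an illustration of the concept, not CUTLASS's verification code):

```python
# Conceptual relative-error check with an epsilon and a nonzero floor.
# This mirrors the common pattern; it is not CUTLASS's actual implementation.
def passes(got, expected, epsilon=0.01, nonzero_floor=1e-8):
    # The floor prevents the relative error from blowing up when the
    # reference value is approximately zero.
    denom = max(abs(expected), nonzero_floor)
    return abs(got - expected) / denom <= epsilon

print(passes(1.004, 1.0))   # True:  0.4% relative error is under 1%
print(passes(1.02, 1.0))    # False: 2% relative error exceeds epsilon
print(passes(1e-12, 0.0))   # True:  floored denominator absorbs tiny noise
```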
Output and Reporting
CSV Output
Save profiling results to a CSV file:
$ ./tools/profiler/cutlass_profiler \
--kernels=cutlass_simt_sgemm_128x128_nn \
--m=3456 --n=4096 --k=8:4096:8 \
--output=report.csv
Prepend custom columns for easier pivot table generation:
$ ./tools/profiler/cutlass_profiler \
--kernels=sgemm --m=3456 --n=4096 --k=8:4096:8 \
--output=report.csv \
--tags=cutlass:4.4,date:2026-03-01,config:baseline
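Once results are in CSV form they are easy to post-process. A sketch that picks the fastest kernel out of a report (the column names below are illustrative stand-ins; check the header row of your own report.csv for the real ones):

```python
import csv
import io

# Find the highest-throughput kernel in a profiler CSV report.
# Column names here are hypothetical -- inspect your report's header row.
sample = """Operation,Runtime,GFLOPs
cutlass_simt_sgemm_128x128_8x2_nn_align1,6.73655,17218.4
cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8,0.98647,235138
"""

rows = list(csv.DictReader(io.StringIO(sample)))
best = max(rows, key=lambda r: float(r["GFLOPs"]))
print(best["Operation"])
```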
JUnit XML Output
Generate JUnit-compatible XML reports:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--junit-output=results --m=1024 --n=1024 --k=1024
Exhaustive Kernel Search
Find the best performing kernel for your problem:
Search across all parameters
Enable exhaustive search to find the optimal kernel configuration:
$ ./tools/profiler/cutlass_profiler \
--kernels="*gemm*" \
--enable-kernel-performance-search \
--sort-results-flops-per-sec
Optimize for fixed problem size
Tune kernel parameters for a specific GEMM shape:
$ ./tools/profiler/cutlass_profiler \
--kernels="*gemm*" \
--enable-best-kernel-for-fixed-shape \
--m=6144 --n=6144 --k=6144 \
--sort-results-flops-per-sec
Sweep multiple shapes
Test multiple problem sizes:
$ ./tools/profiler/cutlass_profiler \
--kernels="*gemm*" \
--enable-best-kernel-for-fixed-shape \
--m=1024,2048,4096 --n=1024,2048,4096 --k=1024,2048,4096 \
--sort-results-flops-per-sec
CUDA Graphs
Reduce kernel launch overhead by using CUDA graphs:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--use-cuda-graphs=true --m=1024 --n=1024 --k=1024
Cluster Configuration (Hopper/Blackwell)
Control cluster shapes on modern architectures:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--cluster_m=2 --cluster_n=1 --cluster_k=1 \
--m=2048 --n=2048 --k=2048
Profiling Convolutions
Conv2d Forward Propagation
$ ./tools/profiler/cutlass_profiler --operation=Conv2d \
--Activation=f16:nhwc --Filter=f16:nhwc --Output=f16 \
--n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 \
--pad_h=1 --pad_w=1 --stride_h=1 --stride_w=1
Example output:
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
Bytes: 1130659840 bytes
FLOPs: 118482796544 flops
Runtime: 0.711496 ms
Memory: 1479.99 GiB/s
Math: 166526 GFLOP/s
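The output spatial extent implied by the Conv2d arguments above follows the standard convolution formula; with 1-pixel padding, unit stride, and a 3x3 filter, the 224x224 input maps to a 224x224 output:

```python
# Output extent for a convolution dimension:
# P = (H + 2*pad - dilation*(R - 1) - 1) // stride + 1
def out_extent(extent, pad, filt, stride, dilation=1):
    return (extent + 2 * pad - dilation * (filt - 1) - 1) // stride + 1

# Arguments from the Conv2d example: h=w=224, r=s=3, pad=1, stride=1.
p = out_extent(224, pad=1, filt=3, stride=1)
q = out_extent(224, pad=1, filt=3, stride=1)
print(p, q)  # 224 224
```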
Hopper-Specific Features
Instantiation Levels
Control the number of kernel variants generated:
$ cmake .. \
-DCUTLASS_NVCC_ARCHS="90a" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="500" \
-DCUTLASS_UNITY_BUILD_ENABLED=ON
The instantiation level is a 4-digit number controlling:
Digit 0: Instruction Shape
Digit 1: MMA Shape Multiplier
Digit 2: Cluster Shape
Digit 3: Schedule Pruning
Runtime Data Types
Profile mixed-precision kernels whose input data types are selected at runtime:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--A=e4m3:column --B=f16:row \
--runtime_input_datatype_a=e4m3 \
--runtime_input_datatype_b=f16 \
--m=2048 --n=2048 --k=2048
Understanding GFLOP/s Results
The “Math” metric shows computational throughput in GFLOP/s. Compare this against the theoretical peak of your GPU:
NVIDIA H100: ~989 TFLOPS (FP16 Tensor Core)
NVIDIA A100: ~312 TFLOPS (FP16 Tensor Core)
NVIDIA V100: ~125 TFLOPS (FP16 Tensor Core)
Efficiency = (Measured GFLOP/s) / (Theoretical Peak) × 100%
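Applying this to the Tensor Core run above: 235138 GFLOP/s against an H100's approximate 989 TFLOPS FP16 peak works out to roughly 24% of peak:

```python
# Efficiency = measured / theoretical peak * 100%.
# Peak figure is an approximate dense FP16 Tensor Core number (assumption).
measured_gflops = 235138   # from the Tensor Core example above
peak_gflops = 989_000      # ~989 TFLOPS (H100, approximate)

efficiency = measured_gflops / peak_gflops * 100
print(f"{efficiency:.1f}% of peak")  # 23.8% of peak
```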
Compare the “Memory” and “Math” metrics:
If memory bandwidth is saturated but GFLOP/s is low, the kernel is memory-bound
If GFLOP/s is high but memory bandwidth has headroom, the kernel is compute-bound
Optimal kernels on modern GPUs should be compute-bound for large problem sizes.
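Whether a kernel can be compute-bound at all follows from its arithmetic intensity (FLOPs per byte moved) compared against the machine balance (peak FLOP/s divided by peak bandwidth). A roofline-style sketch using the Tensor Core example above and approximate H100 SXM peaks (assumed: ~989 TFLOPS dense FP16, ~3.35 TB/s HBM):

```python
# Roofline-style classification: a kernel whose arithmetic intensity
# exceeds the machine balance has enough work per byte to be compute-bound.
# Peak numbers are approximate H100 SXM figures (assumptions).
flops = 231956545536     # from the Tensor Core GEMM example
bytes_moved = 180355072
peak_flops = 989e12      # ~989 TFLOPS dense FP16
peak_bw = 3.35e12        # ~3.35 TB/s HBM3

intensity = flops / bytes_moved  # FLOPs per byte of this kernel
balance = peak_flops / peak_bw   # FLOPs per byte needed to saturate compute

print(f"intensity={intensity:.0f}, balance={balance:.0f}")
print("compute-bound" if intensity > balance else "memory-bound")
```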
Tile sizes affect both performance and resource usage:
Larger tiles (256×256) are better for large matrices
Smaller tiles (64×64 or 128×128) may be better for smaller problems
Use --cta_m, --cta_n, --cta_k to override tile sizes
Common Profiling Workflows
Functional Testing
Quick functional test across various problem sizes:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--m=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
--n=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
--k=8,16,32,64,128,256,288,384,504,512,520 \
--beta=0,1,2 --profiling-iterations=1 \
--providers=cutlass --output=functional-test.csv
Performance Comparison
Compare CUTLASS against cuBLAS:
$ ./tools/profiler/cutlass_profiler --operation=Gemm \
--verification-providers=cublas \
--m=2048 --n=2048 --k=2048 \
--providers=cutlass,cublas
Batch Processing with Kernel Lists
Profile specific kernels from a file:
# Create file with kernel names
$ echo "cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8" > kernels.txt
$ echo "cutlass_tensorop_s16816gemm_f16_128x256_32x3_nn_align8" >> kernels.txt
# Profile kernels from file
$ ./tools/profiler/cutlass_profiler \
--kernels-file=kernels.txt \
--m=4096 --n=4096 --k=4096
Next Steps
Optimization Techniques: learn strategies to optimize CUTLASS kernels for your workload
Benchmarks: view performance benchmarks across different architectures