Convolution Example
This example demonstrates how to run 2D convolution kernels using CUTLASS with Tensor Core acceleration on NVIDIA Turing GPUs.

Overview
CUTLASS implements convolution as an implicit GEMM operation, transforming the convolution into a matrix multiplication that can leverage highly optimized GEMM kernels and Tensor Cores. The convolution operation computes:
- Input: NHWC layout (batch, height, width, channels)
- Filter: KRSC layout (output channels, filter height, filter width, input channels)
- Output: NPQK layout (batch, output height, output width, output channels)
Key Concepts
- Implicit GEMM: Convolution implemented as matrix multiplication
- Tensor Cores: Using specialized hardware for int4 operations
- Tile sizes: Hierarchical tiling at threadblock, warp, and instruction levels
- Pipeline stages: Overlapping data movement with computation
- NHWC layout: Optimized tensor layout for modern GPUs
Implementation
```cpp
#include "cutlass/conv/kernel/default_conv2d_fprop.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"

// Data types
using ElementAccumulator = int32_t;     // Accumulator type
using ElementComputeEpilogue = float;   // Epilogue computation (alpha, beta)
using ElementInputA = cutlass::int4b_t; // Input tensor (activations)
using ElementInputB = cutlass::int4b_t; // Filter tensor (weights)
using ElementOutput = cutlass::int4b_t; // Output tensor

// Tensor layouts (NHWC format)
using LayoutInputA = cutlass::layout::TensorNHWC;
using LayoutInputB = cutlass::layout::TensorNHWC;
using LayoutOutput = cutlass::layout::TensorNHWC;

// Use Tensor Cores
using MMAOp = cutlass::arch::OpClassTensorOp;

// Target Turing architecture (SM75)
using SmArch = cutlass::arch::Sm75;

// Threadblock tile: 128x128x128 (MxNxK)
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 128>;

// Warp tile: 64x64x128 (MxNxK)
using WarpShape = cutlass::gemm::GemmShape<64, 64, 128>;

// Tensor Core instruction shape: 8x8x32 (MxNxK)
using InstructionShape = cutlass::gemm::GemmShape<8, 8, 32>;

// Threadblock swizzling
using SwizzleThreadBlock = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>;

// Number of pipeline stages (for hiding latency)
constexpr int NumStages = 2;

using EpilogueOp = cutlass::epilogue::thread::LinearCombinationClamp<
    ElementOutput,            // Output data type
    8,                        // Elements per vector access
    ElementAccumulator,       // Accumulator type
    ElementComputeEpilogue>;  // Alpha/beta type

using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
    ElementInputA, LayoutInputA,
    ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput,
    ElementAccumulator,
    MMAOp,
    SmArch,
    ThreadblockShape,
    WarpShape,
    InstructionShape,
    EpilogueOp,
    SwizzleThreadBlock,
    NumStages,
    cutlass::arch::OpMultiplyAddSaturate,
    cutlass::conv::IteratorAlgorithm::kAnalytic
>::Kernel;

using ImplicitGemm = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
```
```cpp
// Input dimensions: NHWC (batch, height, width, channels)
cutlass::Tensor4DCoord input_size(1, 32, 32, 32);  // N=1, H=32, W=32, C=32

// Filter dimensions: KRSC (output channels, height, width, input channels)
cutlass::Tensor4DCoord filter_size(32, 3, 3, 32);  // K=32, R=3, S=3, C=32

// Padding: (top, bottom, left, right)
cutlass::Tensor4DCoord padding(1, 1, 1, 1);

// Stride and dilation
cutlass::MatrixCoord conv_stride(1, 1);
cutlass::MatrixCoord dilation(1, 1);

// Compute output dimensions
int output_h = (input_size.h() + padding.n() + padding.h() - filter_size.h()) / conv_stride.row() + 1;
int output_w = (input_size.w() + padding.w() + padding.c() - filter_size.w()) / conv_stride.column() + 1;
cutlass::Tensor4DCoord output_size(input_size.n(), output_h, output_w, filter_size.n());

// Allocate tensors
cutlass::HostTensor<ElementInputA, LayoutInputA> tensor_a(input_size);
cutlass::HostTensor<ElementInputB, LayoutInputB> tensor_b(filter_size);
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c(output_size);
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_c(output_size);

// Fill with random data in the int4 range [-8, 7]
cutlass::reference::host::TensorFillRandomUniform(
    tensor_a.host_view(), 1, ElementInputA(7), ElementInputA(-8), 0);
cutlass::reference::host::TensorFillRandomUniform(
    tensor_b.host_view(), 1, ElementInputB(7), ElementInputB(-8), 0);
cutlass::reference::host::TensorFill(tensor_c.host_view());
cutlass::reference::host::TensorFill(tensor_ref_c.host_view());

// Copy to device
tensor_a.sync_device();
tensor_b.sync_device();
tensor_c.sync_device();
tensor_ref_c.sync_device();

// Convolution mode (cross-correlation vs convolution)
cutlass::conv::Mode mode = cutlass::conv::Mode::kCrossCorrelation;

// Split-K dimension (parallel reduction)
int split_k_slices = 1;

// Create problem size descriptor
cutlass::conv::Conv2dProblemSize problem_size(
    input_size,
    filter_size,
    padding,
    conv_stride,
    dilation,
    output_size,
    mode,
    split_k_slices);
```
```cpp
// Create kernel arguments
ElementComputeEpilogue alpha = 1.0f;
ElementComputeEpilogue beta = 0.0f;
typename ImplicitGemm::Arguments arguments{
    problem_size,
    tensor_a.device_ref(),
    tensor_b.device_ref(),
    tensor_c.device_ref(),
    tensor_c.device_ref(),
    {alpha, beta},
};

// Initialize kernel
ImplicitGemm implicit_gemm_op;
size_t workspace_size = implicit_gemm_op.get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

cutlass::Status status = implicit_gemm_op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Kernel cannot implement this problem" << std::endl;
  return -1;
}

status = implicit_gemm_op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Failed to initialize kernel" << std::endl;
  return -1;
}

// Launch kernel
status = implicit_gemm_op();
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Kernel execution failed" << std::endl;
  return -1;
}
```
```cpp
// Compute reference result on the host
cutlass::reference::host::Conv2dFprop<
    ElementInputA, LayoutInputA,
    ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput,
    ElementComputeEpilogue,
    ElementAccumulator,
    ElementOutput,
    cutlass::NumericConverterClamp<ElementOutput, ElementComputeEpilogue>
>(
    problem_size,
    tensor_a.host_ref(),
    tensor_b.host_ref(),
    tensor_c.host_ref(),
    tensor_ref_c.host_ref(),
    alpha,
    beta
);

// Compare results
tensor_c.sync_host();
bool passed = cutlass::reference::host::TensorEquals(
    tensor_c.host_view(),
    tensor_ref_c.host_view());
if (passed) {
  std::cout << "Passed." << std::endl;
} else {
  std::cout << "Failed." << std::endl;
}
```
Building and Running
Build the example
Run the example
Command-line options
- --n=<int>: Batch size (N)
- --h=<int>: Input height (H)
- --w=<int>: Input width (W)
- --c=<int>: Input channels (C)
- --k=<int>: Output channels (K)
- --r=<int>: Filter height (R)
- --s=<int>: Filter width (S)
- --alpha=<float>: Scaling factor alpha
- --beta=<float>: Scaling factor beta
- --ref-check: Enable reference verification
- --perf-check: Enable performance measurement
- --benchmark: Run performance benchmarks on multiple layer configurations
- --iterations=<int>: Number of profiling iterations
Source Code Location
The complete source code for this example is available at:
examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu
What This Example Demonstrates
- Implicit GEMM convolution: How CUTLASS transforms convolution into matrix multiplication
- Tensor Core usage: Leveraging int4 Tensor Core operations for high performance
- Hierarchical tiling: Composing threadblock, warp, and instruction-level tiles
- Pipeline optimization: Using multiple stages to hide memory latency
- NHWC layout: Working with channel-last tensor format
- Alignment requirements: Handling int4 data type alignment constraints
Performance Optimization
Key factors for optimal performance:

Tile Sizes
- Threadblock: 128×128×128 balances occupancy and reuse
- Warp: 64×64×128 fits well within register file limits
- Instruction: 8×8×32 matches Tensor Core operation size
Pipeline Stages
- NumStages = 2 overlaps global memory loads with computation
- Higher values may improve performance but increase shared memory usage
Alignment
- int4 operations require 32-element (128-bit) alignment
- Input channels (C) and output channels (K) must be divisible by 32
Benchmark Results
The --benchmark mode tests various convolution layer configurations typical of ResNet:
- 56×56×64 → 56×56×256 (1×1 convolution)
- 56×56×64 → 56×56×64 (3×3 convolution)
- 28×28×128 → 28×28×512 (1×1 convolution)
- And many more…
Key Takeaways
- CUTLASS implements convolution as implicit GEMM for maximum performance
- Tensor Cores provide massive acceleration for low-precision convolution
- Hierarchical tiling (threadblock → warp → instruction) enables efficient data reuse
- Pipeline stages overlap data movement with computation to hide latency
- Proper alignment is critical for int4 Tensor Core operations
- NHWC layout is optimal for modern GPU architectures
Next Steps
- Learn about Basic GEMM to understand the underlying GEMM operation
- Explore Fused Operations to combine convolution with activations
- Check out examples/16_ampere_tensorop_conv2dfprop/ for Ampere-specific optimizations
- See examples/42_ampere_tensorop_group_conv/ for grouped convolutions