
Convolution Example

This example demonstrates how to run 2D convolution kernels using CUTLASS with Tensor Core acceleration on NVIDIA Turing GPUs.

Overview

CUTLASS implements convolution as an implicit GEMM operation, transforming the convolution into a matrix multiplication that can leverage highly optimized GEMM kernels and Tensor Cores. The convolution operation computes:
Output[n,p,q,k] = alpha * Conv2D(Input[n,h,w,c], Filter[k,r,s,c]) + beta * Output[n,p,q,k]
Where:
  • Input: NHWC layout (batch, height, width, channels)
  • Filter: KRSC layout (output channels, filter height, filter width, input channels)
  • Output: NPQK layout (batch, output height, output width, output channels)

Key Concepts

  • Implicit GEMM: Convolution implemented as matrix multiplication
  • Tensor Cores: Using specialized hardware for int4 operations
  • Tile sizes: Hierarchical tiling at threadblock, warp, and instruction levels
  • Pipeline stages: Overlapping data movement with computation
  • NHWC layout: Optimized tensor layout for modern GPUs

Implementation

Step 1: Define data types and layouts

Configure the convolution for int4 precision with Turing Tensor Cores:
#include "cutlass/conv/kernel/default_conv2d_fprop.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"

// Data types
using ElementAccumulator = int32_t;                  // Accumulator type
using ElementComputeEpilogue = float;                // Epilogue computation (alpha, beta)
using ElementInputA = cutlass::int4b_t;              // Input tensor (activations)
using ElementInputB = cutlass::int4b_t;              // Filter tensor (weights)
using ElementOutput = cutlass::int4b_t;              // Output tensor

// Tensor layouts (NHWC format)
using LayoutInputA = cutlass::layout::TensorNHWC;
using LayoutInputB = cutlass::layout::TensorNHWC;
using LayoutOutput = cutlass::layout::TensorNHWC;
Step 2: Configure architecture and tile sizes

Set up the kernel configuration for optimal performance:
// Use Tensor Cores
using MMAOp = cutlass::arch::OpClassTensorOp;

// Target Turing architecture (SM75)
using SmArch = cutlass::arch::Sm75;

// Threadblock tile: 128x128x128 (MxNxK)
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 128>;

// Warp tile: 64x64x128 (MxNxK)
using WarpShape = cutlass::gemm::GemmShape<64, 64, 128>;

// Tensor Core instruction shape: 8x8x32 (MxNxK)
using InstructionShape = cutlass::gemm::GemmShape<8, 8, 32>;

// Threadblock swizzling
using SwizzleThreadBlock = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>;

// Number of pipeline stages (for hiding latency)
constexpr int NumStages = 2;
Step 3: Define the epilogue operation

Configure output scaling and clamping:
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationClamp<
    ElementOutput,                                     // Output data type
    8,                                                 // Elements per vector access
    ElementAccumulator,                                // Accumulator type
    ElementComputeEpilogue>;                           // Alpha/beta type
Step 4: Instantiate the convolution kernel

Combine all configurations into a convolution kernel:
using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
  ElementInputA, LayoutInputA,
  ElementInputB, LayoutInputB,
  ElementOutput, LayoutOutput,
  ElementAccumulator,
  MMAOp,
  SmArch,
  ThreadblockShape,
  WarpShape,
  InstructionShape,
  EpilogueOp,
  SwizzleThreadBlock,
  NumStages,
  cutlass::arch::OpMultiplyAddSaturate,
  cutlass::conv::IteratorAlgorithm::kAnalytic
>::Kernel;

using ImplicitGemm = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
Step 5: Set up problem dimensions

Define the convolution problem size:
// Input dimensions: NHWC (batch, height, width, channels)
cutlass::Tensor4DCoord input_size(1, 32, 32, 32);  // N=1, H=32, W=32, C=32

// Filter dimensions: KRSC (output channels, height, width, input channels)
cutlass::Tensor4DCoord filter_size(32, 3, 3, 32);  // K=32, R=3, S=3, C=32

// Padding: (top, bottom, left, right)
cutlass::Tensor4DCoord padding(1, 1, 1, 1);

// Stride and dilation
cutlass::MatrixCoord conv_stride(1, 1);
cutlass::MatrixCoord dilation(1, 1);

// Compute output dimensions (dilation is 1 here, so it drops out of the formula).
// The padding coordinate maps as (n, h, w, c) = (top, bottom, left, right).
int output_h = (input_size.h() + padding.n() + padding.h() - filter_size.h()) / conv_stride.row() + 1;
int output_w = (input_size.w() + padding.w() + padding.c() - filter_size.w()) / conv_stride.column() + 1;
cutlass::Tensor4DCoord output_size(input_size.n(), output_h, output_w, filter_size.n());
Step 6: Allocate and initialize tensors

Create host and device tensors:
// Allocate tensors
cutlass::HostTensor<ElementInputA, LayoutInputA> tensor_a(input_size);
cutlass::HostTensor<ElementInputB, LayoutInputB> tensor_b(filter_size);
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c(output_size);
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_c(output_size);

// Fill inputs with uniformly random int4 values in [-8, 7]
// (arguments: view, seed, max, min, bits)
cutlass::reference::host::TensorFillRandomUniform(
    tensor_a.host_view(), 1, ElementInputA(7), ElementInputA(-8), 0);

cutlass::reference::host::TensorFillRandomUniform(
    tensor_b.host_view(), 1, ElementInputB(7), ElementInputB(-8), 0);

// Zero the output tensors (TensorFill fills with zero by default)
cutlass::reference::host::TensorFill(tensor_c.host_view());
cutlass::reference::host::TensorFill(tensor_ref_c.host_view());

// Copy to device
tensor_a.sync_device();
tensor_b.sync_device();
tensor_c.sync_device();
tensor_ref_c.sync_device();
Step 7: Launch the convolution kernel

Set up and execute the convolution:
// Convolution mode (cross-correlation vs convolution)
cutlass::conv::Mode mode = cutlass::conv::Mode::kCrossCorrelation;

// Split-K dimension (parallel reduction)
int split_k_slices = 1;

// Create problem size descriptor
cutlass::conv::Conv2dProblemSize problem_size(
    input_size,
    filter_size,
    padding,
    conv_stride,
    dilation,
    output_size,
    mode,
    split_k_slices);

// Create kernel arguments
ElementComputeEpilogue alpha = 1.0f;
ElementComputeEpilogue beta = 0.0f;

typename ImplicitGemm::Arguments arguments{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  tensor_c.device_ref(),   // source tensor C
  tensor_c.device_ref(),   // destination tensor D (written in place over C)
  {alpha, beta},
};

// Initialize kernel
ImplicitGemm implicit_gemm_op;

size_t workspace_size = implicit_gemm_op.get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

cutlass::Status status = implicit_gemm_op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Kernel cannot implement this problem" << std::endl;
  return -1;
}

status = implicit_gemm_op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Failed to initialize kernel" << std::endl;
  return -1;
}

// Launch kernel
status = implicit_gemm_op();
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Kernel execution failed" << std::endl;
  return -1;
}
Step 8: Verify correctness

Compare against a host reference implementation:
// Compute reference
cutlass::reference::host::Conv2dFprop<
  ElementInputA, LayoutInputA,
  ElementInputB, LayoutInputB,
  ElementOutput, LayoutOutput,
  ElementComputeEpilogue,
  ElementAccumulator,
  ElementOutput,
  cutlass::NumericConverterClamp<ElementOutput, ElementComputeEpilogue>
>(
  problem_size,
  tensor_a.host_ref(),
  tensor_b.host_ref(),
  tensor_c.host_ref(),
  tensor_ref_c.host_ref(),
  alpha,
  beta
);

// Copy device results back to the host for comparison
tensor_c.sync_host();

bool passed = cutlass::reference::host::TensorEquals(
  tensor_c.host_view(),
  tensor_ref_c.host_view());

if (passed) {
  std::cout << "Passed." << std::endl;
} else {
  std::cout << "Failed." << std::endl;
}

Building and Running

Build the example

cd /path/to/cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS='75;80;86'
make 09_turing_tensorop_conv2dfprop

Run the example

# Run with default parameters
./examples/09_turing_tensorop_conv2dfprop/09_turing_tensorop_conv2dfprop

# Run with custom dimensions
./examples/09_turing_tensorop_conv2dfprop/09_turing_tensorop_conv2dfprop \
  --n=1 --h=224 --w=224 --c=32 --k=32 --r=3 --s=3

# Run with reference check
./examples/09_turing_tensorop_conv2dfprop/09_turing_tensorop_conv2dfprop \
  --n=32 --h=224 --w=224 --c=128 --k=256 --r=1 --s=1 --ref-check

# Benchmark mode
./examples/09_turing_tensorop_conv2dfprop/09_turing_tensorop_conv2dfprop --benchmark

Command-line options

  • --n=<int>: Batch size (N)
  • --h=<int>: Input height (H)
  • --w=<int>: Input width (W)
  • --c=<int>: Input channels (C)
  • --k=<int>: Output channels (K)
  • --r=<int>: Filter height (R)
  • --s=<int>: Filter width (S)
  • --alpha=<float>: Scaling factor alpha
  • --beta=<float>: Scaling factor beta
  • --ref-check: Enable reference verification
  • --perf-check: Enable performance measurement
  • --benchmark: Run performance benchmarks on multiple layer configurations
  • --iterations=<int>: Number of profiling iterations

Source Code Location

The complete source code for this example is available at:
  • examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu

What This Example Demonstrates

  1. Implicit GEMM convolution: How CUTLASS transforms convolution into matrix multiplication
  2. Tensor Core usage: Leveraging int4 Tensor Core operations for high performance
  3. Hierarchical tiling: Composing threadblock, warp, and instruction-level tiles
  4. Pipeline optimization: Using multiple stages to hide memory latency
  5. NHWC layout: Working with channel-last tensor format
  6. Alignment requirements: Handling int4 data type alignment constraints

Performance Optimization

Key factors for optimal performance:

Tile Sizes

  • Threadblock: 128×128×128 balances occupancy and reuse
  • Warp: 64×64×128 fits well within register file limits
  • Instruction: 8×8×32 matches Tensor Core operation size

Pipeline Stages

  • NumStages=2 overlaps global memory loads with computation
  • Higher values may improve performance but increase shared memory usage

Alignment

  • int4 operations require 32-element (128-bit) alignment
  • Input channels (C) and output channels (K) must be divisible by 32

Benchmark Results

The --benchmark mode tests various convolution layer configurations typical in ResNet:
./examples/09_turing_tensorop_conv2dfprop/09_turing_tensorop_conv2dfprop --benchmark
Example layers tested:
  • 56×56×64 → 56×56×256 (1×1 convolution)
  • 56×56×64 → 56×56×64 (3×3 convolution)
  • 28×28×128 → 28×28×512 (1×1 convolution)
  • And many more…

Key Takeaways

  • CUTLASS implements convolution as implicit GEMM for maximum performance
  • Tensor Cores provide massive acceleration for low-precision convolution
  • Hierarchical tiling (threadblock → warp → instruction) enables efficient data reuse
  • Pipeline stages overlap data movement with computation to hide latency
  • Proper alignment is critical for int4 Tensor Core operations
  • NHWC layout is optimal for modern GPU architectures

Next Steps

  • Learn about Basic GEMM to understand the underlying GEMM operation
  • Explore Fused Operations to combine convolution with activations
  • Check out examples/16_ampere_tensorop_conv2dfprop/ for Ampere-specific optimizations
  • See examples/42_ampere_tensorop_group_conv/ for grouped convolutions
