Convolution Example
This example demonstrates how to run 2D convolution kernels using CUTLASS with Tensor Core acceleration on NVIDIA Turing GPUs.

Overview
CUTLASS implements convolution as an implicit GEMM operation, transforming the convolution into a matrix multiplication that can leverage highly optimized GEMM kernels and Tensor Cores. The convolution operation computes:
- Input: NHWC layout (batch, height, width, channels)
- Filter: KRSC layout (output channels, filter height, filter width, input channels)
- Output: NPQK layout (batch, output height, output width, output channels)
Key Concepts
- Implicit GEMM: Convolution implemented as matrix multiplication
- Tensor Cores: Using specialized hardware for int4 operations
- Tile sizes: Hierarchical tiling at threadblock, warp, and instruction levels
- Pipeline stages: Overlapping data movement with computation
- NHWC layout: Optimized tensor layout for modern GPUs
Implementation
```cpp
#include "cutlass/conv/kernel/default_conv2d_fprop.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"

// Data types
using ElementAccumulator = int32_t;     // Accumulator type
using ElementComputeEpilogue = float;   // Epilogue computation (alpha, beta)
using ElementInputA = cutlass::int4b_t; // Input tensor (activations)
using ElementInputB = cutlass::int4b_t; // Filter tensor (weights)
using ElementOutput = cutlass::int4b_t; // Output tensor

// Tensor layouts (NHWC format)
using LayoutInputA = cutlass::layout::TensorNHWC;
using LayoutInputB = cutlass::layout::TensorNHWC;
using LayoutOutput = cutlass::layout::TensorNHWC;

// Use Tensor Cores
using MMAOp = cutlass::arch::OpClassTensorOp;

// Target Turing architecture (SM75)
using SmArch = cutlass::arch::Sm75;

// Threadblock tile: 128x128x128 (MxNxK)
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 128>;

// Warp tile: 64x64x128 (MxNxK)
using WarpShape = cutlass::gemm::GemmShape<64, 64, 128>;

// Tensor Core instruction shape: 8x8x32 (MxNxK)
using InstructionShape = cutlass::gemm::GemmShape<8, 8, 32>;

// Threadblock swizzling
using SwizzleThreadBlock = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>;

// Number of pipeline stages (for hiding latency)
constexpr int NumStages = 2;

using EpilogueOp = cutlass::epilogue::thread::LinearCombinationClamp<
    ElementOutput,            // Output data type
    8,                        // Elements per vector access
    ElementAccumulator,       // Accumulator type
    ElementComputeEpilogue>;  // Alpha/beta type

using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
    ElementInputA, LayoutInputA,
    ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput,
    ElementAccumulator,
    MMAOp,
    SmArch,
    ThreadblockShape,
    WarpShape,
    InstructionShape,
    EpilogueOp,
    SwizzleThreadBlock,
    NumStages,
    cutlass::arch::OpMultiplyAddSaturate,
    cutlass::conv::IteratorAlgorithm::kAnalytic
>::Kernel;

using ImplicitGemm = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
```
```cpp
// Input dimensions: NHWC (batch, height, width, channels)
cutlass::Tensor4DCoord input_size(1, 32, 32, 32);  // N=1, H=32, W=32, C=32

// Filter dimensions: KRSC (output channels, height, width, input channels)
cutlass::Tensor4DCoord filter_size(32, 3, 3, 32);  // K=32, R=3, S=3, C=32

// Padding: (top, bottom, left, right)
cutlass::Tensor4DCoord padding(1, 1, 1, 1);

// Stride and dilation
cutlass::MatrixCoord conv_stride(1, 1);
cutlass::MatrixCoord dilation(1, 1);

// Compute output dimensions
int output_h = (input_size.h() + padding.n() + padding.h() - filter_size.h()) / conv_stride.row() + 1;
int output_w = (input_size.w() + padding.w() + padding.c() - filter_size.w()) / conv_stride.column() + 1;
cutlass::Tensor4DCoord output_size(input_size.n(), output_h, output_w, filter_size.n());

// Allocate tensors
cutlass::HostTensor<ElementInputA, LayoutInputA> tensor_a(input_size);
cutlass::HostTensor<ElementInputB, LayoutInputB> tensor_b(filter_size);
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c(output_size);
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_c(output_size);

// Fill with random data in the int4 range [-8, 7]
cutlass::reference::host::TensorFillRandomUniform(
    tensor_a.host_view(), 1, ElementInputA(7), ElementInputA(-8), 0);
cutlass::reference::host::TensorFillRandomUniform(
    tensor_b.host_view(), 1, ElementInputB(7), ElementInputB(-8), 0);
cutlass::reference::host::TensorFill(tensor_c.host_view());
cutlass::reference::host::TensorFill(tensor_ref_c.host_view());

// Copy to device
tensor_a.sync_device();
tensor_b.sync_device();
tensor_c.sync_device();
tensor_ref_c.sync_device();

// Convolution mode (cross-correlation vs convolution)
cutlass::conv::Mode mode = cutlass::conv::Mode::kCrossCorrelation;

// Split-K dimension (parallel reduction)
int split_k_slices = 1;

// Create problem size descriptor
cutlass::conv::Conv2dProblemSize problem_size(
    input_size,
    filter_size,
    padding,
    conv_stride,
    dilation,
    output_size,
    mode,
    split_k_slices);
```
```cpp
// Create kernel arguments
ElementComputeEpilogue alpha = 1.0f;
ElementComputeEpilogue beta = 0.0f;
typename ImplicitGemm::Arguments arguments{
    problem_size,
    tensor_a.device_ref(),
    tensor_b.device_ref(),
    tensor_c.device_ref(),
    tensor_c.device_ref(),
    {alpha, beta},
};

// Initialize kernel
ImplicitGemm implicit_gemm_op;
size_t workspace_size = implicit_gemm_op.get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

cutlass::Status status = implicit_gemm_op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Kernel cannot implement this problem" << std::endl;
  return -1;
}

status = implicit_gemm_op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Failed to initialize kernel" << std::endl;
  return -1;
}

// Launch kernel
status = implicit_gemm_op();
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Kernel execution failed" << std::endl;
  return -1;
}
```
```cpp
// Compute reference result on the host
cutlass::reference::host::Conv2dFprop<
    ElementInputA, LayoutInputA,
    ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput,
    ElementComputeEpilogue,
    ElementAccumulator,
    ElementOutput,
    cutlass::NumericConverterClamp<ElementOutput, ElementComputeEpilogue>
>(
    problem_size,
    tensor_a.host_ref(),
    tensor_b.host_ref(),
    tensor_c.host_ref(),
    tensor_ref_c.host_ref(),
    alpha,
    beta
);

// Compare results
tensor_c.sync_host();
bool passed = cutlass::reference::host::TensorEquals(
    tensor_c.host_view(),
    tensor_ref_c.host_view());
if (passed) {
  std::cout << "Passed." << std::endl;
} else {
  std::cout << "Failed." << std::endl;
}
```
Building and Running
Build the example
Run the example
Command-line options
- --n=<int>: Batch size (N)
- --h=<int>: Input height (H)
- --w=<int>: Input width (W)
- --c=<int>: Input channels (C)
- --k=<int>: Output channels (K)
- --r=<int>: Filter height (R)
- --s=<int>: Filter width (S)
- --alpha=<float>: Scaling factor alpha
- --beta=<float>: Scaling factor beta
- --ref-check: Enable reference verification
- --perf-check: Enable performance measurement
- --benchmark: Run performance benchmarks on multiple layer configurations
- --iterations=<int>: Number of profiling iterations
Source Code Location
The complete source code for this example is available at:
examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu
What This Example Demonstrates
- Implicit GEMM convolution: How CUTLASS transforms convolution into matrix multiplication
- Tensor Core usage: Leveraging int4 Tensor Core operations for high performance
- Hierarchical tiling: Composing threadblock, warp, and instruction-level tiles
- Pipeline optimization: Using multiple stages to hide memory latency
- NHWC layout: Working with channel-last tensor format
- Alignment requirements: Handling int4 data type alignment constraints
Performance Optimization
Key factors for optimal performance:

Tile Sizes
- Threadblock: 128×128×128 balances occupancy and reuse
- Warp: 64×64×128 fits well within register file limits
- Instruction: 8×8×32 matches Tensor Core operation size
Pipeline Stages
- NumStages = 2 overlaps global memory loads with computation
- Higher values may improve performance but increase shared memory usage
Alignment
- int4 operations require 32-element (128-bit) alignment
- Input channels (C) and output channels (K) must be divisible by 32
Benchmark Results
The --benchmark mode tests various convolution layer configurations typical of ResNet:
- 56×56×64 → 56×56×256 (1×1 convolution)
- 56×56×64 → 56×56×64 (3×3 convolution)
- 28×28×128 → 28×28×512 (1×1 convolution)
- And many more…
Key Takeaways
- CUTLASS implements convolution as implicit GEMM for maximum performance
- Tensor Cores provide massive acceleration for low-precision convolution
- Hierarchical tiling (threadblock → warp → instruction) enables efficient data reuse
- Pipeline stages overlap data movement with computation to hide latency
- Proper alignment is critical for int4 Tensor Core operations
- NHWC layout is optimal for modern GPU architectures
Next Steps
- Learn about Basic GEMM to understand the underlying GEMM operation
- Explore Fused Operations to combine convolution with activations
- Check out examples/16_ampere_tensorop_conv2dfprop/ for Ampere-specific optimizations
- See examples/42_ampere_tensorop_group_conv/ for grouped convolutions