
Overview

The Stitcher library is highly optimized for performance through SIMD vectorization, multi-threading, and aggressive compiler optimizations. This guide explains the performance characteristics and how to maximize throughput.

SIMD Optimizations

SIMDe Library

The library uses SIMDe (SIMD Everywhere) to provide portable SIMD operations across different platforms. SIMDe automatically maps to native SIMD instructions when available.
// From image_operations.c:4
#include "simde/simde/x86/avx2.h"

AVX2 Vectorization

Key operations such as Gaussian convolution are vectorized using AVX2 instructions. The vertical convolution pass below processes 8 pixels at once:
// Processing 8 pixels in parallel using AVX2
for (; x < width - 8; x += 8) {
    simde__m256i r0 = simde_mm256_loadu_si256((const simde__m256i *)(row0 + x));
    simde__m256i r1 = simde_mm256_loadu_si256((const simde__m256i *)(row1 + x));
    simde__m256i r2 = simde_mm256_loadu_si256((const simde__m256i *)(row2 + x));
    simde__m256i r3 = simde_mm256_loadu_si256((const simde__m256i *)(row3 + x));
    simde__m256i r4 = simde_mm256_loadu_si256((const simde__m256i *)(row4 + x));
    
    // Compute: r0 + r4 + 2*r2 + 4*(r1 + r3 + r2) == r0 + 4*r1 + 6*r2 + 4*r3 + r4
    simde__m256i sum = simde_mm256_add_epi32(r0, r4);
    sum = simde_mm256_add_epi32(sum, simde_mm256_slli_epi32(r2, 1));
    simde__m256i t = simde_mm256_add_epi32(simde_mm256_add_epi32(r1, r3), r2);
    sum = simde_mm256_add_epi32(sum, simde_mm256_slli_epi32(t, 2));
    // (normalization and store of `sum` are omitted in this excerpt)
}
With eight 32-bit values per 256-bit register, this gives 8x parallelism for single-channel processing and efficient vectorization for RGB images.
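The AVX2 loop covers only whole groups of eight pixels; the remainder needs a scalar tail. A minimal sketch of such a tail (the function name and loop bounds are illustrative, not the library's actual code):

```c
#include <stddef.h>

/* Scalar remainder: apply the vertical [1, 4, 6, 4, 1] kernel to the
 * pixels the 8-wide AVX2 loop did not cover. Illustrative sketch. */
static void vertical_tail(const int *row0, const int *row1, const int *row2,
                          const int *row3, const int *row4,
                          int *dst, size_t x, size_t width)
{
    for (; x < width; x++) {
        /* r0 + r4 + 2*r2 + 4*(r1 + r3 + r2) == r0 + 4*r1 + 6*r2 + 4*r3 + r4 */
        dst[x] = row0[x] + row4[x] + (row2[x] << 1)
               + ((row1[x] + row3[x] + row2[x]) << 2);
    }
}
```

The shift-based weighting mirrors the vector code: `<< 1` multiplies by 2 and `<< 2` by 4, so no multiply instructions are needed.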

Gaussian Kernel

The 5x5 Gaussian kernel is pre-computed for fast convolution:
static const float GAUSSIAN_KERNEL[5][5] = {
    {1.0 / 256, 4.0 / 256, 6.0 / 256, 4.0 / 256, 1.0 / 256},
    {4.0 / 256, 16.0 / 256, 24.0 / 256, 16.0 / 256, 4.0 / 256},
    {6.0 / 256, 24.0 / 256, 36.0 / 256, 24.0 / 256, 6.0 / 256},
    {4.0 / 256, 16.0 / 256, 24.0 / 256, 16.0 / 256, 4.0 / 256},
    {1.0 / 256, 4.0 / 256, 6.0 / 256, 4.0 / 256, 1.0 / 256}
};
This kernel is used for downsampling and upsampling operations in the Laplacian pyramid.
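The 5x5 kernel is separable: each entry is the product of two entries of the 1D binomial kernel [1, 4, 6, 4, 1]/16, which is what enables the two-pass convolution described under Optimization Techniques. A quick numeric check (names are illustrative):

```c
/* The 2D Gaussian kernel as the outer product of the 1D binomial
 * kernel with itself: K2D[i][j] = K1D[i] * K1D[j].  For example,
 * (4/16) * (6/16) = 24/256, matching row 1, column 2 of the table. */
static const float K1D[5] = {1.0f/16, 4.0f/16, 6.0f/16, 4.0f/16, 1.0f/16};

static float kernel_2d(int i, int j)
{
    return K1D[i] * K1D[j];
}
```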

Compiler Optimization Flags

Build Configuration

From CMakeLists.txt:30-32, the library uses aggressive optimization:
target_compile_options(${PROJECT_NAME} PRIVATE -O3 -pthread)
target_link_libraries(${PROJECT_NAME} PRIVATE -pthread)
target_compile_definitions(${PROJECT_NAME} PRIVATE SIMDE_ENABLE_NATIVE_ALIASES)
Key Flags:
  • -O3: Maximum optimization level (loop unrolling, vectorization, inlining)
  • -pthread: Enable POSIX threading support
  • SIMDE_ENABLE_NATIVE_ALIASES: Expose SIMDe operations under the native intrinsic names (e.g. _mm256_add_epi32 instead of simde_mm256_add_epi32)

Platform-Specific Optimizations

SIMDe compiles to the target CPU's native SIMD instructions whenever they are available:
  • x86/x64: AVX2, SSE4.2, SSE2
  • ARM: NEON
  • Other platforms: Software fallback (slower but portable)
For best performance the CPU should support AVX2; on older CPUs SIMDe falls back to narrower instruction sets or scalar code.
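An application can verify at startup which path is available. A hedged sketch using GCC/Clang's `__builtin_cpu_supports` (x86 targets only; this helper is not part of the Stitcher API):

```c
/* Runtime feature check via GCC/Clang's __builtin_cpu_supports
 * (x86 only; other compilers/architectures need their own mechanism).
 * Shown only to illustrate confirming which SIMD path SIMDe can take. */
static int has_avx2(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
}
```

Logging the result of `has_avx2()` once at startup makes it easy to tell whether a slow run is due to the scalar fallback.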

Optimization Techniques

Separable Convolution

Gaussian convolution is performed as two 1D operations instead of a single 2D operation:
  1. Horizontal pass: Convolve each row with [1, 4, 6, 4, 1]
  2. Vertical pass: Convolve each column with [1, 4, 6, 4, 1]
This reduces complexity from O(n² × k²) to O(n² × k) where k = kernel size.
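The two passes can be sketched as follows (an illustrative scalar version, not the library's actual implementation; edge pixels are skipped here for brevity, whereas the library uses reflection padding):

```c
/* Separable 5-tap Gaussian blur: horizontal pass into tmp, then
 * vertical pass into dst.  Borders (2 pixels) are left untouched. */
static const int KW[5] = {1, 4, 6, 4, 1};   /* weights, sum = 16 */

static void blur_separable(const int *src, int *tmp, int *dst,
                           int width, int height)
{
    /* Horizontal pass: convolve each row with [1, 4, 6, 4, 1] / 16 */
    for (int y = 0; y < height; y++)
        for (int x = 2; x < width - 2; x++) {
            int s = 0;
            for (int k = -2; k <= 2; k++)
                s += KW[k + 2] * src[y * width + x + k];
            tmp[y * width + x] = s / 16;
        }
    /* Vertical pass: convolve each column with the same kernel */
    for (int y = 2; y < height - 2; y++)
        for (int x = 0; x < width; x++) {
            int s = 0;
            for (int k = -2; k <= 2; k++)
                s += KW[k + 2] * tmp[(y + k) * width + x];
            dst[y * width + x] = s / 16;
        }
}
```

For a 5x5 kernel the two-pass form does 10 multiply-adds per pixel instead of 25.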

Cache-Friendly Processing

The library processes images in row-major order and uses temporary buffers to improve cache locality:
int *temp_dst_out = (int *)malloc(5 * width * RGB_CHANNELS * sizeof(int));
int *temp_dst_rows[5] = {
    temp_dst_out, 
    temp_dst_out + (width * RGB_CHANNELS),
    temp_dst_out + (2 * width * RGB_CHANNELS),
    temp_dst_out + (3 * width * RGB_CHANNELS),
    temp_dst_out + (4 * width * RGB_CHANNELS)
};
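Five row buffers like these are typically used as a sliding window: when processing advances one row, the pointers rotate so the oldest buffer is reused for the incoming row instead of re-reading from memory. Whether Stitcher rotates exactly this way is an assumption; a sketch:

```c
/* Slide a 5-row window down one row by rotating the pointers: the
 * oldest row's buffer is recycled for the next incoming row.
 * Illustrative sketch, not Stitcher's actual code. */
static void rotate_rows(int *rows[5])
{
    int *oldest = rows[0];
    for (int i = 0; i < 4; i++)
        rows[i] = rows[i + 1];
    rows[4] = oldest;   /* reuse the freed buffer */
}
```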

Boundary Handling

Reflection padding is used for edge pixels; out-of-range indices are remapped up front, so the vectorized inner loop stays free of boundary branches:
int reflect_index(int i, int n) {
    if (i < 0)
        return -i % n;
    else if (i >= n)
        return 2 * n - i - 2;
    else
        return i;
}
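One way to realize the branch-free inner loop is to precompute the reflected source indices for the 5-tap kernel footprint once, so the convolution only does table lookups. The lookup-table helper below is an assumption about how `reflect_index` is used, not the library's actual code:

```c
/* Precompute reflected source indices for a 5-tap kernel over a row
 * of n pixels; the convolution inner loop then reads idx[] with no
 * boundary branches.  reflect_index matches the function above. */
static int reflect_index(int i, int n)
{
    if (i < 0)  return -i % n;          /* e.g. -1 -> 1, -2 -> 2     */
    if (i >= n) return 2 * n - i - 2;   /* e.g. n -> n-2, n+1 -> n-3 */
    return i;
}

static void build_reflected_indices(int idx[][5], int n)
{
    for (int x = 0; x < n; x++)
        for (int k = -2; k <= 2; k++)
            idx[x][k + 2] = reflect_index(x + k, n);
}
```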

Benchmarking

Measuring Performance

From examples/stitch.c:10-28, use clock_gettime for accurate timing:
#include <time.h>

struct timespec start, end;
double duration;

// Start timing
clock_gettime(CLOCK_MONOTONIC, &start);

// Your operation here
Blender *b = create_blender(MULTIBAND, out_size, num_bands);

// End timing
clock_gettime(CLOCK_MONOTONIC, &end);
duration = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Elapsed time: %.2f seconds\n", duration);
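When timing several operations in a row, a small helper avoids repeating the conversion (the helper name is illustrative, not part of the Stitcher API):

```c
#include <time.h>

/* Elapsed wall-clock seconds between two CLOCK_MONOTONIC timestamps. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```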

Performance Metrics

Typical operations to benchmark (recompute duration after each measurement):
// Create blender
clock_gettime(CLOCK_MONOTONIC, &start);
Blender *b = create_blender(MULTIBAND, out_size, num_bands);
clock_gettime(CLOCK_MONOTONIC, &end);
duration = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Create blender: %.2f seconds\n", duration);

// Feed image 1
clock_gettime(CLOCK_MONOTONIC, &start);
feed(b, &img1, &mask1, pt1);
clock_gettime(CLOCK_MONOTONIC, &end);
duration = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Feed 1: %.2f seconds\n", duration);

// Feed image 2
clock_gettime(CLOCK_MONOTONIC, &start);
feed(b, &img2, &mask2, pt2);
clock_gettime(CLOCK_MONOTONIC, &end);
duration = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Feed 2: %.2f seconds\n", duration);

// Blend
clock_gettime(CLOCK_MONOTONIC, &start);
blend(b);
clock_gettime(CLOCK_MONOTONIC, &end);
duration = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Blend: %.2f seconds\n", duration);

Performance Tips

1. Choose Appropriate Number of Bands

// More bands = better quality but slower
Blender *b = create_blender(MULTIBAND, out_size, 7);  // Maximum quality

// Fewer bands = faster but visible seams
Blender *b = create_blender(MULTIBAND, out_size, 3);  // Faster

// Feather blending = fastest
Blender *b = create_blender(FEATHER, out_size, -1);   // Fastest

2. Image Size Considerations

The library automatically pads images to be divisible by 2^num_bands:
// From blending.c:28-33
out_size.width += ((1 << num_bands) - out_size.width % (1 << num_bands)) % (1 << num_bands);
out_size.height += ((1 << num_bands) - out_size.height % (1 << num_bands)) % (1 << num_bands);
Large images with many bands can consume significant memory. A 4000x3000 image with 7 bands creates a pyramid with ~8 levels, requiring ~1.33x the original memory.
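Both the padding rule and the ~1.33x memory factor are easy to check numerically: each pyramid level has a quarter of the previous level's area, so total memory approaches 1 + 1/4 + 1/16 + ... = 4/3 of the base image. A sketch mirroring the blending.c formula quoted above (the function name is illustrative):

```c
/* Round a dimension up to the next multiple of 2^num_bands, mirroring
 * the padding formula from blending.c quoted above. */
static int padded(int dim, int num_bands)
{
    int step = 1 << num_bands;
    return dim + (step - dim % step) % step;
}
```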

3. Reuse Blender Objects

Avoid creating/destroying blenders repeatedly:
// Bad: Creates blender for each image pair
for (int i = 0; i < num_pairs; i++) {
    Blender *b = create_blender(MULTIBAND, out_size, 5);
    feed(b, &imgs[i*2], &masks[i*2], pt1);
    feed(b, &imgs[i*2+1], &masks[i*2+1], pt2);
    blend(b);
    destroy_blender(b);  // Expensive!
}

// Note: The current API requires a new blender per blend operation, so
// true reuse is not yet possible. Until it is, minimize create/destroy
// cycles by batching as much work as possible into each blender.

4. Profile Your Code

Use profiling tools to identify bottlenecks:
# Linux: perf
perf record -g ./your_program
perf report

# Mac: Instruments
instruments -t "Time Profiler" ./your_program

# gprof
gcc -pg your_code.c -o your_program
./your_program
gprof your_program gmon.out > analysis.txt

Expected Performance

Typical performance characteristics (4-core Intel i7, AVX2 support):
Operation                    Image Size      Threads  Time
Downsample                   4000x3000 RGB   3        ~15ms
Upsample                     2000x1500 RGB   3        ~25ms
Multi-band blend (5 bands)   4000x3000 RGB   3        ~200ms
Feather blend                4000x3000 RGB   3        ~50ms
Performance scales nearly linearly with thread count up to the number of physical cores.

Next Steps

Threading

Learn how to control parallelism and thread count

Memory Management

Understand memory patterns and avoid leaks
