Overview
The Stitcher library uses POSIX threads (pthreads) to parallelize computationally intensive operations. Most image processing operations automatically use multiple threads without requiring manual thread management.
Thread Architecture
Core Threading Components
The library defines two key structures for thread management in image_operations.h:113-124:
typedef struct {
    int start_index;                     // Starting row/element for this thread
    int end_index;                       // Ending row/element (exclusive)
    WorkerThreadArgs *workerThreadArgs;  // Operation-specific data
} ThreadArgs;

typedef struct {
    int rows;                            // Total rows/elements to process
    WorkerThreadArgs *workerThreadArgs;  // Shared work data
} ParallelOperatorArgs;
Parallel Operator Function
The parallel_operator function (blending.c:540) is the core threading primitive:
void parallel_operator(OperatorType operatorType, ParallelOperatorArgs *arg);
Supported Operations:
DOWNSAMPLE - Gaussian pyramid downsampling
UPSAMPLE - Gaussian pyramid upsampling
LAPLACIAN - Laplacian computation (subtraction)
FEED - Feed image into blending pyramid
BLEND - Blend pyramid levels
NORMALIZE - Normalize by accumulated weights
Thread Count Control
Automatic Thread Determination
The library automatically determines the optimal thread count using get_cpus_count() from utils.c:21-24:
int get_cpus_count() {
    return (get_no_of_cpu() / 2) + 1;
}
Thread Count Formula:
threads = (logical_cpus / 2) + 1
Examples:
4 logical CPUs → 3 threads
8 logical CPUs → 5 threads
16 logical CPUs → 9 threads
Because the detected count includes hyperthreaded (logical) processors, halving it targets roughly one thread per physical core, and the +1 leaves resources for system processes. This provides good performance without oversubscribing the CPU.
From utils.c:8-19, the library detects CPU count across platforms:
int get_no_of_cpu() {
#if defined(_WIN32) || defined(_WIN64)
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    return sysinfo.dwNumberOfProcessors;
#elif defined(_SC_NPROCESSORS_ONLN)
    return (int)sysconf(_SC_NPROCESSORS_ONLN);
#else
    return 1; // Fallback to single-threaded
#endif
}
If CPU detection fails, the library falls back to single-threaded execution. This may result in significantly slower performance.
Work Distribution
Row-Based Parallelism
The parallel_operator function distributes work by rows (blending.c:541-553):
int numThreads = get_cpus_count();
int rowsPerThread = arg->rows / numThreads;
int remainingRows = arg->rows % numThreads;

pthread_t threads[numThreads];
ThreadArgs thread_data[numThreads];

int startRow = 0;
for (int i = 0; i < numThreads; ++i) {
    int endRow = startRow + rowsPerThread + (remainingRows > 0 ? 1 : 0);
    if (remainingRows > 0) {
        --remainingRows;
    }
    // ... create thread ...
    startRow = endRow;
}
Load Balancing:
Base rows per thread: total_rows / num_threads
Extra rows distributed one per thread to first N threads
Ensures even distribution (difference ≤ 1 row between threads)
Example: 1000 rows, 3 threads
Thread 0: rows 0-334 (334 rows)
Thread 1: rows 334-667 (333 rows)
Thread 2: rows 667-1000 (333 rows)
Thread Synchronization
All threads are joined before returning (blending.c:625-627):
for (int i = 0; i < numThreads; ++i) {
    pthread_join(threads[i], NULL);
}
This ensures all work completes before the function returns.
Worker Thread Patterns
1. Downsampling Worker
From image_operations.c:382-431:
void *down_sample_operation(void *args) {
    ThreadArgs *arg = (ThreadArgs *)args;
    int start_row = arg->start_index;
    int end_row = arg->end_index;
    SamplingThreadData *data = (SamplingThreadData *)arg->workerThreadArgs->std;

    // Process rows start_row to end_row
    for (int y = start_row; y < end_row; ++y) {
        // Gaussian convolution on this row
    }
    return NULL;
}
2. Feed Worker
From blending.c:182-229:
void *feed_worker(void *args) {
    ThreadArgs *arg = (ThreadArgs *)args;
    int start_row = arg->start_index;
    int end_row = arg->end_index;
    FeedThreadData *f = (FeedThreadData *)arg->workerThreadArgs->ftd;

    for (int k = start_row; k < end_row; ++k) {
        for (int i = 0; i < f->cols; ++i) {
            // Accumulate weighted pixels (index computation elided)
            for (char z = 0; z < RGB_CHANNELS; ++z) {
                float maskVal = f->mask_gaussian[f->level].data[maskIndex];
                float imgVal = f->img_laplacians[f->level].data[imgIndex];
                f->out[f->level].data[outLevelIndex] += imgVal * maskVal;
            }
        }
    }
    return NULL;
}
3. Normalize Worker
From blending.c:418-442:
void *normalize_worker(void *args) {
    ThreadArgs *arg = (ThreadArgs *)args;
    int start_row = arg->start_index;
    int end_row = arg->end_index;
    NormalThreadData *n = (NormalThreadData *)arg->workerThreadArgs->ntd;

    for (int y = start_row; y < end_row; ++y) {
        for (int x = 0; x < n->output_width; ++x) {
            float w = n->out_mask[n->level].data[maskIndex];
            for (char z = 0; z < RGB_CHANNELS; ++z) {
                // Divide by accumulated weight (index computation elided)
                n->final_out[n->level].data[imgIndex] =
                    (short)(n->out[n->level].data[imgIndex] / (w + WEIGHT_EPS));
            }
        }
    }
    return NULL;
}
Thread Safety
Write Separation
Each thread writes to non-overlapping memory regions, eliminating the need for locks:
// Thread 0 writes to rows 0-333
// Thread 1 writes to rows 334-666
// Thread 2 writes to rows 667-999
// No overlapping writes → no race conditions
Read-Only Shared Data
Input images and masks are read-only across all threads:
// Safely shared between threads (read-only)
ImageS *img_laplacians;  // Read by all threads
ImageS *mask_gaussian;   // Read by all threads

// Each thread writes to its own row range
ImageF *out;       // Written by threads (non-overlapping)
ImageF *out_mask;  // Written by threads (non-overlapping)
Do NOT modify input images or masks during processing. This will cause race conditions and undefined behavior.
Overhead vs. Benefit
Thread creation has overhead. Small images may not benefit from parallelism:
// Small image: overhead dominates
Image tiny = create_empty_image(100, 100, 3);
Image down = downsample(&tiny);  // 3 threads, only a few dozen rows each
// Thread overhead may exceed computation time!

// Large image: parallelism wins
Image large = create_empty_image(4000, 3000, 3);
Image down = downsample(&large);  // 3 threads, ~1000 rows each
// Significant speedup from parallelism
Scaling Efficiency
Parallel efficiency depends on image size:
Image Size    Rows per Thread    Speedup
100x100       ~16                Poor (~1.2x)
1000x1000     ~166               Good (~2.5x)
4000x3000     ~1000              Excellent (~2.9x)
Speedup is measured with 3 threads relative to single-threaded execution, so the theoretical maximum is 3x.
Common Patterns
Pattern 1: Automatic Parallelism
Most operations automatically use threads:
// These all use parallel_operator internally
Image down = downsample(&img);   // Parallelized
Image up = upsample(&img, 4.0f); // Parallelized
feed(blender, &img, &mask, tl);  // Parallelized
blend(blender);                  // Parallelized
Pattern 2: Sequential Operations with Parallel Steps
Multi-band blending uses sequential pyramid construction with parallel steps:
// From blending.c:280-294
for (int j = 0; j < num_bands; ++j) {
    // Step 1: Parallel downsample
    images[j + 1] = downsample_s(&images[j]);                   // Parallel
    // Step 2: Parallel upsample
    img_laplacians[j] = upsample_image_s(&images[j + 1], 4.f);  // Parallel
    // Step 3: Parallel Laplacian computation
    compute_laplacian(&images[j], &img_laplacians[j]);          // Parallel
}
Each pyramid level is processed in sequence, but each operation within uses threads.
Pattern 3: Pyramid Level Processing
From blending.c:318-348, each pyramid level is fed in parallel:
for (int level = 0; level <= num_bands; ++level) {
    FeedThreadData ftd;
    ftd.rows = (y_br - y_tl);
    ftd.cols = (x_br - x_tl);
    ftd.level = level;
    // ... setup data ...

    WorkerThreadArgs wtd;
    wtd.ftd = &ftd;
    ParallelOperatorArgs args = {rows, &wtd};
    parallel_operator(FEED, &args);  // Parallel feed at this level

    // Adjust for next level (half resolution)
    x_tl /= 2;
    y_tl /= 2;
    x_br /= 2;
    y_br /= 2;
}
Debugging Threading Issues
Enable Thread Sanitizer
Compile with ThreadSanitizer to detect race conditions:
target_compile_options(${PROJECT_NAME} PRIVATE -fsanitize=thread -g)
target_link_options(${PROJECT_NAME} PRIVATE -fsanitize=thread)
Add Debug Output
Add debug output to track thread execution:
void *my_worker(void *args) {
    ThreadArgs *arg = (ThreadArgs *)args;
    printf("Thread processing rows %d to %d\n",
           arg->start_index, arg->end_index);
    // ... do work ...
    return NULL;
}
Test Single-Threaded
Temporarily force single-threaded execution:
// In utils.c
int get_cpus_count() {
    return 1;  // Force single thread for debugging
}
Limitations
No Dynamic Thread Control
The thread count is derived automatically from the CPU count at runtime and cannot be configured by the caller:
// Not supported:
// set_thread_count(8);  // No such function exists

// Thread count is automatic based on CPU
int threads = get_cpus_count();  // Read-only
No Custom Thread Pools
Threads are created and destroyed for each operation:
// Each call creates new threads
Image down1 = downsample(&img1);  // Creates 3 threads
Image down2 = downsample(&img2);  // Creates 3 NEW threads
// No thread pool reuse
This adds overhead but simplifies the API.
Stack Size Limitations
Default pthread stack size (usually 2-8 MB) is sufficient for typical images. Very large images may require increasing stack size:
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 16 * 1024 * 1024);  // 16 MB
pthread_create(&thread, &attr, worker_func, args);
The current implementation does not expose stack size control. Extremely large images (>16K×16K) may cause stack overflow in worker threads.
Next Steps
Performance: learn about SIMD and optimization techniques
Memory Management: understand memory allocation and deallocation patterns