
Overview

nrvna-ai uses a multi-threaded architecture to process multiple LLM inference jobs concurrently. The design separates job discovery, queue management, and execution into distinct threads with minimal synchronization overhead.
The threading model leverages atomic filesystem operations instead of mutexes wherever possible, reducing lock contention and complexity.

Thread Architecture

Main Thread

    ├── creates Server
    │       │
    │       ├── creates Scanner (1 thread)
    │       ├── creates Pool (N worker threads)
    │       └── creates Processor (shared, thread-safe)

    └── waits for shutdown signal (SIGINT, SIGTERM)

Scanner Thread
    └── loops every 1 second
        └── scans input/ready/
        └── submits jobs to Pool queue

Worker Threads (N = NRVNA_WORKERS, default 4)
    ├── Worker-0
    ├── Worker-1
    ├── ...
    └── Worker-(N-1)
        └── wait on condition variable
        └── pop job from queue
        └── call Processor::process(job_id, worker_id)
        └── each uses a dedicated llama_context inside the shared Runner
Main thread responsibilities (thread name: Main):
  • Initialize Server and components
  • Create the workspace directory structure
  • Recover orphaned jobs from previous crashes
  • Start the Scanner thread and Worker threads
  • Wait for the shutdown signal
  • Coordinate graceful shutdown
#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

std::atomic<bool> shutdown_requested{false};

void signalHandler(int) { shutdown_requested.store(true); }

int main(int argc, char* argv[]) {
    // model_path, workspace, num_workers are parsed from argv / environment
    Server server(model_path, workspace, num_workers);

    if (!server.start()) {
        return 1;
    }

    // Wait for SIGINT or SIGTERM
    signal(SIGINT, signalHandler);
    signal(SIGTERM, signalHandler);

    while (!shutdown_requested) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    server.shutdown();
    return 0;
}

Thread Count Configuration

export NRVNA_WORKERS=8
nrvnad model.gguf workspace
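The environment lookup can be sketched with a small helper (a hypothetical function; the actual parsing code in nrvnad may differ):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch of worker-count resolution: read NRVNA_WORKERS,
// fall back to the default of 4 on missing or invalid values.
int resolveWorkerCount() {
    const char* env = std::getenv("NRVNA_WORKERS");
    if (env == nullptr) return 4;   // unset: use the default
    try {
        int n = std::stoi(env);
        return n > 0 ? n : 4;       // ignore zero/negative values
    } catch (const std::exception&) {
        return 4;                   // ignore unparsable values
    }
}
```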
Choosing thread count:
  • CPU-bound models: set workers = CPU cores (e.g., 4-8 for typical machines).
  • GPU-accelerated models: more workers can help (8-16), since inference happens on the GPU.
  • Memory constraints: each worker needs context memory; reduce workers if OOM errors occur.
  • I/O-bound tasks: more workers help with filesystem I/O parallelism.

Synchronization Mechanisms

Atomic Filesystem Operations

Most synchronization uses atomic directory renames instead of mutexes:
// NO MUTEX NEEDED - atomic rename guarantees only one thread succeeds
void Processor::process(const JobId& job_id, int worker_id) {
    auto ready = workspace_ / "input/ready" / job_id;
    auto processing = workspace_ / "processing" / job_id;
    
    try {
        std::filesystem::rename(ready, processing);  // ⚛️ ATOMIC
    } catch (const std::filesystem::filesystem_error&) {
        return;  // Another worker already claimed this job
    }
    
    // We now own this job exclusively...
}
Benefits:
  • Lock-free job claiming
  • No contention between workers
  • Simple, debuggable code
  • Crash-safe (no locks to deadlock)
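The claim above can be demonstrated with a self-contained race (an illustrative demo, not the nrvna-ai source): two threads try to rename the same directory, and exactly one succeeds.

```cpp
#include <atomic>
#include <filesystem>
#include <thread>

namespace fs = std::filesystem;

// Two threads race to claim the same "job" directory via rename.
// Exactly one rename succeeds; the loser sees an error and backs off,
// mirroring how workers claim jobs without a mutex.
int raceToClaim() {
    const fs::path base = fs::temp_directory_path();
    const fs::path ready = base / "nrvna_demo_ready";
    const fs::path won_a = base / "nrvna_demo_a";
    const fs::path won_b = base / "nrvna_demo_b";
    fs::remove_all(ready);
    fs::remove_all(won_a);
    fs::remove_all(won_b);
    fs::create_directory(ready);

    std::atomic<int> winners{0};
    auto claim = [&](const fs::path& dest) {
        std::error_code ec;
        fs::rename(ready, dest, ec);  // atomic on POSIX filesystems
        if (!ec) ++winners;           // only the winning thread counts
    };
    std::thread t1(claim, won_a);
    std::thread t2(claim, won_b);
    t1.join();
    t2.join();

    fs::remove_all(won_a);
    fs::remove_all(won_b);
    return winners.load();
}
```

Whichever thread's rename runs first wins; the other fails because the source directory no longer exists.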

Pool Queue Mutex

The only mutex in the system protects the job queue in Pool:
class Pool {
private:
    std::queue<JobId> jobs_;           // Job queue
    std::mutex queueMutex_;            // Protects jobs_
    std::condition_variable condition_; // Notifies workers
};
Operations:
void Pool::submit(const JobId& job_id) {
    std::lock_guard<std::mutex> lock(queueMutex_);
    jobs_.push(job_id);
    condition_.notify_one();  // Wake up one worker
}
The mutex is held only during queue manipulation. Actual job processing happens outside the lock, so workers don’t block each other.
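The producer/consumer pattern described above can be sketched as a self-contained miniature (illustrative names, not the actual Pool source): the mutex guards only queue access, and the job runs after the lock is released.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Miniature of the Pool pattern: lock held only for queue operations.
class MiniPool {
public:
    explicit MiniPool(int n) {
        for (int i = 0; i < n; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~MiniPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shutdown_ = true;
        }
        cv_.notify_all();               // wake everyone for shutdown
        for (auto& t : workers_) t.join();
    }
    void submit(std::string job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();               // wake one worker
    }
    int processed() const { return processed_.load(); }

private:
    void workerLoop() {
        while (true) {
            std::string job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !jobs_.empty() || shutdown_; });
                if (jobs_.empty()) return;   // shutdown and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }                                // lock released here
            ++processed_;                    // "processing" happens unlocked
        }
    }
    std::queue<std::string> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool shutdown_ = false;
    std::vector<std::thread> workers_;
    std::atomic<int> processed_{0};
};
```

Note that the destructor lets workers drain the queue before exiting, since the wait predicate stays true while jobs remain.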

Atomic Flags

Simple boolean flags use std::atomic for lock-free reads/writes:
class Server {
private:
    std::atomic<bool> running_{false};   // Server state
    std::atomic<bool> shutdown_{false};  // Shutdown signal
};

// Check running state (no lock needed)
bool Server::isRunning() const noexcept {
    return running_.load();
}

// Signal shutdown (no lock needed)
void Server::shutdown() noexcept {
    shutdown_.store(true);
}

Memory Model

Shared Model, Per-Thread Context

class Processor {
private:
    std::shared_ptr<Runner> runner_;  // Shared across workers
};

class Runner {
private:
    llama_model* model_;              // ✅ Shared (read-only)
    std::vector<llama_context*> contexts_;  // ✅ Per-worker contexts
};
Why this design:
  1. Model is expensive to load: loading a 7B model takes ~4GB RAM and several seconds. Load once, share across all workers.
  2. Context is mutable: llama.cpp’s llama_context maintains inference state (KV cache, RNG). Each worker needs its own.
  3. Parallel inference: multiple workers can run inference simultaneously, each with its own context but sharing the model weights.

Runner Initialization

Runner::Runner(const std::string& model_path, int num_workers) {
    // Load model once (expensive)
    model_ = llama_load_model_from_file(model_path.c_str(), params_);
    
    // Create per-worker contexts
    for (int i = 0; i < num_workers; ++i) {
        auto* ctx = llama_new_context_with_model(model_, ctx_params_);
        contexts_.push_back(ctx);
    }
}

std::string Runner::run(const std::string& prompt, int worker_id) {
    auto* ctx = contexts_[worker_id];  // Each worker uses its own context
    // ... inference using ctx ...
}
Memory usage: (model size) + (context size × num_workers)
Example: 7B Q4_0 model (~4GB) + 4 workers × 2GB context = ~12GB total
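That formula can be wrapped in a small helper (illustrative only; real context size depends on the model and its context parameters):

```cpp
#include <cstdint>

// Rough memory estimate in GB: model weights are loaded once,
// while each worker carries its own context.
std::uint64_t estimateMemoryGB(std::uint64_t model_gb,
                               std::uint64_t ctx_gb_per_worker,
                               int num_workers) {
    return model_gb + ctx_gb_per_worker * static_cast<std::uint64_t>(num_workers);
}
```

With the example numbers, estimateMemoryGB(4, 2, 4) yields 12.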

Thread Safety Analysis

Thread-Safe Components

class Logger {
private:
    static std::mutex logMutex_;  // Protects all logging
public:
    static void log(Level level, const std::string& message) {
        std::lock_guard<std::mutex> lock(logMutex_);
        // ... write to stdout/stderr ...
    }
};
All logging macros are safe to call from any thread:
LOG_INFO("Worker-" + std::to_string(worker_id) + " processing job");
class Processor {
public:
    // Safe to call from multiple workers concurrently
    void process(const JobId& job_id, int worker_id);
};
Why it’s safe:
  • Uses atomic filesystem renames for job claiming
  • Each worker processes a different job (exclusive ownership)
  • The shared Runner uses worker_id to select a dedicated context
  • No shared mutable state
Scanner runs in a dedicated thread, so its state is never accessed concurrently:
std::vector<JobId> Scanner::scan() {
    // Only called by Scanner thread
    // No synchronization needed
}
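A minimal sketch of what scan() might do (an assumption about its shape, not the actual implementation):

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical sketch: list the job directories currently waiting in
// input/ready/. Only the Scanner thread calls this, so no locking.
std::vector<std::string> scanReady(const fs::path& workspace) {
    std::vector<std::string> jobs;
    const fs::path ready = workspace / "input/ready";
    if (!fs::exists(ready)) return jobs;
    for (const auto& entry : fs::directory_iterator(ready)) {
        if (entry.is_directory()) {
            jobs.push_back(entry.path().filename().string());
        }
    }
    return jobs;
}
```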
class Pool {
private:
    std::queue<JobId> jobs_;      // Protected by queueMutex_
    std::mutex queueMutex_;
    std::condition_variable condition_;
};
All queue operations are protected.

NOT Thread-Safe (By Design)

These components are designed for single-threaded use:
  • Work: Client creates instance per thread
  • Flow: Client creates instance per thread
  • Server: Single instance, start() called once from main thread
// ❌ WRONG - sharing one Work instance across threads
Work work(workspace);
std::thread t1([&work] { work.submit("prompt 1"); });
std::thread t2([&work] { work.submit("prompt 2"); });

// ✅ CORRECT - each thread gets its own Work instance
std::thread t1([&workspace] { Work work(workspace); work.submit("prompt 1"); });
std::thread t2([&workspace] { Work work(workspace); work.submit("prompt 2"); });

Graceful Shutdown

Shutdown sequence:

1. The main thread receives the signal:

void signalHandler(int sig) {
    shutdown_requested = true;
}

2. Server::shutdown() is called:

void Server::shutdown() noexcept {
    LOG_INFO("Shutting down server...");
    shutdown_.store(true);   // Signal all threads
    condition_.notify_all(); // Wake up all workers
}

3. The Scanner thread exits:

while (!shutdown_.load()) {  // Loop exits
    // ...
}
LOG_INFO("Scanner thread exiting");

4. Worker threads finish their current jobs:

void Pool::workerLoop(int worker_id) {
    while (!shutdown_.load()) {
        // ... process job ...
    }
    LOG_INFO("Worker-" + std::to_string(worker_id) + " exiting");
}

5. The Server destructor joins the threads:

Server::~Server() {
    if (scannerThread_.joinable()) {
        scannerThread_.join();
    }
    // Pool destructor joins worker threads
}
In-progress jobs are interrupted. Jobs in processing/ will be moved back to ready/ on next server start.
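The recovery step can be sketched like this (a hypothetical helper name; the actual nrvna-ai code may differ):

```cpp
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical sketch of orphan recovery at startup: any job left in
// processing/ by a crash or interrupted shutdown goes back to ready/.
int recoverOrphanedJobs(const fs::path& workspace) {
    const fs::path processing = workspace / "processing";
    const fs::path ready = workspace / "input/ready";
    if (!fs::exists(processing)) return 0;
    fs::create_directories(ready);

    // Collect paths first so renames don't disturb the iteration
    std::vector<fs::path> orphans;
    for (const auto& entry : fs::directory_iterator(processing))
        orphans.push_back(entry.path());

    int recovered = 0;
    for (const auto& p : orphans) {
        std::error_code ec;
        fs::rename(p, ready / p.filename(), ec);
        if (!ec) ++recovered;
    }
    return recovered;
}
```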

Performance Considerations

Lock Contention

  • Low contention: the Pool queue mutex is held briefly (queue operations only).
  • No contention: job processing uses lock-free atomic renames.
  • Minimal locking: the Logger mutex is held only during log writes.
  • Cache-friendly: workers operate on independent job directories.

Context Switches

Workers sleep efficiently using condition variables:
condition_.wait(lock, [this] {
    return !jobs_.empty() || shutdown_.load();
});
  • No busy-waiting
  • No CPU spinning
  • Workers wake only when jobs available

Memory Locality

Each worker:
  • Has dedicated llama.cpp context (separate memory)
  • Processes different job directories (separate filesystem cache)
  • Minimal shared state (just the job queue)

Debugging Threading Issues

Thread Names

Set via platform-specific APIs:
#include <pthread.h>
#include <string>

void setThreadName(const std::string& name) {
#if defined(__linux__)
    // Linux limits thread names to 15 characters
    pthread_setname_np(pthread_self(), name.c_str());
#elif defined(__APPLE__)
    pthread_setname_np(name.c_str());  // macOS names only the current thread
#endif
}
Visible in logs:
[2026-01-12 10:30:45.123] [INFO ] [Scanner] Found 3 ready jobs
[2026-01-12 10:30:45.125] [INFO ] [Worker-0] Processing job: job_123
[2026-01-12 10:30:45.126] [INFO ] [Worker-1] Processing job: job_124

Debugging Tools

# List threads
pgrep nrvnad | xargs ps -T -p

# Expect: 1 main + 1 scanner + N workers

Best Practices

  • Match CPU cores: set NRVNA_WORKERS to match your CPU core count for CPU-bound inference.
  • Monitor memory: watch (model size) + (context × workers); reduce workers if OOM errors occur.
  • Don't share client APIs: create separate Work/Flow instances per thread. They’re not thread-safe.
  • Let jobs complete: avoid SIGKILL. Use SIGINT/SIGTERM for graceful shutdown.

See Also

  • Architecture: overall system design
  • Filesystem Queue: lock-free queue implementation
  • Configuration: environment variables, including NRVNA_WORKERS
  • Server API: Server class reference
