
Overview

In nrvna-ai, jobs move through a well-defined lifecycle represented by their location in the filesystem. The job’s directory path is the job’s state; no database or separate state-tracking layer is needed.

Job States

At any given time, every job is in exactly one of the following states:

WRITING (Staging)

Directory: input/writing/<job_id>/
Status enum: Not visible to system yet
Duration: Milliseconds (during job creation)
Jobs begin here during the submission process. The writing/ directory acts as a staging area where jobs are assembled before becoming visible to the queue.

What happens:
  • Work creates the job directory
  • Prompt is written to prompt.txt
  • Job is invisible to Scanner (not yet in ready/)
  • Atomic rename moves job to QUEUED state
Jobs in writing/ are incomplete and should never be processed. The atomic rename ensures workers never see half-written jobs.
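The staging pattern can be sketched with plain std::filesystem calls. This is an illustrative helper, not the nrvna-ai API (the real submission path is Work::submit()); the stage_job name and signature are assumptions:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Illustrative helper (not the nrvna-ai API): assemble a job under
// input/writing/, then publish it with one atomic rename so the Scanner
// never observes a half-written job.
void stage_job(const fs::path& workspace,
               const std::string& job_id,
               const std::string& prompt) {
    fs::path staging = workspace / "input/writing" / job_id;
    fs::path ready   = workspace / "input/ready" / job_id;

    fs::create_directories(staging);
    fs::create_directories(ready.parent_path());

    // Job is still invisible: nothing watches input/writing/.
    std::ofstream(staging / "prompt.txt") << prompt;

    // rename(2) is atomic within a single filesystem, so the job appears
    // in input/ready/ fully formed or not at all.
    fs::rename(staging, ready);
}
```

Note that this only holds when input/writing/ and input/ready/ live on the same filesystem; a cross-filesystem rename would fail rather than be atomic.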
QUEUED

Directory: input/ready/<job_id>/
Status enum: Status::Queued
Duration: Variable (depends on queue depth and worker availability)
Jobs wait here until a worker becomes available. This is the main queue for the inference system.

What happens:
  • Scanner discovers job during periodic scan (every 1 second)
  • Job ID submitted to Pool’s work queue
  • Job waits for available worker thread
  • When worker picks up job, Processor atomically moves it to RUNNING
Multiple jobs can be queued simultaneously. They’re processed in the order discovered by the Scanner.
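The discovery step can be sketched as follows. The scan_ready name is illustrative, and the sort is an assumption (the real Scanner's discovery order is its own; sorting by ID merely approximates chronological order because IDs begin with a Unix timestamp):

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Illustrative scan: list job directories currently in input/ready/.
// IDs start with a Unix timestamp, so lexicographic order is roughly
// chronological for same-width timestamps.
std::vector<std::string> scan_ready(const fs::path& workspace) {
    std::vector<std::string> ids;
    for (const auto& entry : fs::directory_iterator(workspace / "input/ready"))
        if (entry.is_directory())
            ids.push_back(entry.path().filename().string());
    std::sort(ids.begin(), ids.end());
    return ids;
}
```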
RUNNING

Directory: processing/<job_id>/
Status enum: Status::Running
Duration: Variable (depends on prompt length and model speed)
Jobs are actively being processed by a worker thread.

What happens:
  • Worker’s Processor atomically renames job from ready/ to processing/
  • Processor reads prompt.txt from the job directory
  • Runner executes llama.cpp inference
  • Tokens are generated and accumulated
  • On completion, result written and job moved to DONE or FAILED
You can monitor active jobs by listing the processing/ directory. Each subdirectory represents an in-flight job.
DONE

Directory: output/<job_id>/
Status enum: Status::Done
Duration: Indefinite (until client retrieves or manually cleaned)
Jobs that completed successfully end up here.

What happens:
  • Processor writes inference result to result.txt
  • Job atomically renamed from processing/ to output/
  • Client can retrieve result using Flow::get()
  • Job remains here until manually cleaned up
Directory contents:
output/<job_id>/
├── prompt.txt    ← original prompt
└── result.txt    ← inference output
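Because results are plain files, a client can also read them directly from the job directory. A sketch, with the illustrative helper read_result (Flow::get() is the supported API):

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Illustrative: read result.txt for a completed job, or nullopt if the
// job is not (yet) in output/.
std::optional<std::string> read_result(const fs::path& workspace,
                                       const std::string& job_id) {
    std::ifstream in(workspace / "output" / job_id / "result.txt");
    if (!in) return std::nullopt;
    std::ostringstream buf;
    buf << in.rdbuf();   // slurp the whole file
    return buf.str();
}
```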
FAILED

Directory: failed/<job_id>/
Status enum: Status::Failed
Duration: Indefinite (until manually cleaned)
Jobs that encountered errors during processing.

What happens:
  • Processor catches exception or error during inference
  • Error message written to error.txt
  • Job atomically renamed from processing/ to failed/
  • Client can retrieve error using Flow::get()
Directory contents:
failed/<job_id>/
├── prompt.txt    ← original prompt
└── error.txt     ← error message
Common failure reasons:
  • Out of memory during inference
  • Model file corruption
  • Invalid prompt format
  • Context length exceeded
MISSING

Directory: None
Status enum: Status::Missing
Duration: N/A
The job ID doesn’t exist in any directory.

Possible reasons:
  • Invalid or typo’d job ID
  • Job was manually deleted
  • Job hasn’t been submitted yet
  • Workspace was cleared

State Machine

The job lifecycle follows a strict state machine with atomic transitions:

writing/ ──▶ ready/ ──▶ processing/ ──▶ output/  (DONE)
                                    └─▶ failed/  (FAILED)

State Transitions

All state transitions are implemented as atomic directory renames:
// In Work::submit()
namespace fs = std::filesystem;

fs::path staging = workspace_ / "input/writing" / job_id;
fs::path ready = workspace_ / "input/ready" / job_id;

// Atomic rename - job becomes visible to Scanner
fs::rename(staging, ready);

Status Detection

The Flow class determines job status by checking directory existence in order:
Status Flow::status(const JobId& id) const noexcept {
    namespace fs = std::filesystem;
    
    // Check in priority order
    if (fs::exists(workspace_ / "output" / id))     return Status::Done;
    if (fs::exists(workspace_ / "failed" / id))     return Status::Failed;
    if (fs::exists(workspace_ / "processing" / id)) return Status::Running;
    if (fs::exists(workspace_ / "input/ready" / id)) return Status::Queued;
    
    return Status::Missing;
}
The Status enum is defined in types.hpp as a uint8_t for compact representation:
enum class Status : std::uint8_t { 
    Queued, Running, Done, Failed, Missing 
};
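For logging and CLI output, a small conversion helper can be handy. This is a sketch, not part of types.hpp; the enum is repeated here only so the snippet is self-contained:

```cpp
#include <cstdint>
#include <string_view>

enum class Status : std::uint8_t { Queued, Running, Done, Failed, Missing };

// Illustrative helper: render a Status for logs. Not part of the library.
constexpr std::string_view to_string(Status s) noexcept {
    switch (s) {
        case Status::Queued:  return "QUEUED";
        case Status::Running: return "RUNNING";
        case Status::Done:    return "DONE";
        case Status::Failed:  return "FAILED";
        case Status::Missing: return "MISSING";
    }
    return "UNKNOWN";  // unreachable for valid enumerators
}
```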

Job Identifier Format

Job IDs are generated using a timestamp-based format:
<unix_timestamp>_<process_id>_<counter>
Example: 1736700000_12345_0
  • 1736700000 - Unix timestamp (seconds since epoch)
  • 12345 - Process ID of the client
  • 0 - Atomic counter within the process
This ensures:
  • Uniqueness: Across processes and time
  • Sortability: Chronological ordering
  • Debuggability: Timestamp visible in the ID
using JobId = std::string;

// Defined in work.cpp; declared static in the Work class. (The `static`
// keyword may not be repeated on an out-of-class member definition.)
JobId Work::generateId() noexcept {
    static std::atomic<std::uint64_t> counter{0};
    auto timestamp = std::time(nullptr);  // <ctime>
    auto pid = getpid();                  // <unistd.h>, POSIX
    auto count = counter.fetch_add(1);

    return std::to_string(timestamp) + "_" +
           std::to_string(pid) + "_" +
           std::to_string(count);
}
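Because the format is underscore-delimited, an ID can be split back into its parts when debugging. The parse_id helper below is illustrative, not part of the library:

```cpp
#include <cstdint>
#include <string>

// Illustrative: split "<timestamp>_<pid>_<counter>" back into components.
struct IdParts {
    std::uint64_t timestamp;
    std::uint64_t pid;
    std::uint64_t counter;
};

IdParts parse_id(const std::string& id) {
    auto first  = id.find('_');
    auto second = id.find('_', first + 1);
    return { std::stoull(id.substr(0, first)),
             std::stoull(id.substr(first + 1, second - first - 1)),
             std::stoull(id.substr(second + 1)) };
}
```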

Orphaned Job Recovery

When the server starts, it checks for orphaned jobs that were left in processing/ due to crashes:
bool Server::recoverOrphanedJobs() noexcept {
    namespace fs = std::filesystem;

    auto processing = workspace_ / "processing";
    std::error_code ec;
    if (!fs::exists(processing, ec)) return true;

    // Move all orphaned jobs back to the ready queue. The error_code
    // overloads are used because throwing would violate noexcept.
    bool ok = true;
    for (const auto& entry : fs::directory_iterator(processing, ec)) {
        auto ready = workspace_ / "input/ready" / entry.path().filename();
        fs::rename(entry.path(), ready, ec);
        if (ec) { ok = false; continue; }
        LOG_WARN("Recovered orphaned job: " + entry.path().filename().string());
    }

    return ok;
}
Orphaned jobs are moved back to the ready/ queue on server restart. They are reprocessed from scratch; inference does not resume from a checkpoint.

Monitoring Job Progress

Monitor jobs by watching directory changes:
# Watch for new jobs
watch -n 1 'ls -lt workspace/input/ready/'

# Monitor active jobs
watch -n 1 'ls -lt workspace/processing/'

# Check completed jobs
ls -lt workspace/output/

Best Practices

Don't Poll Too Aggressively

Poll every 1-5 seconds. Jobs typically take seconds to minutes to complete.
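A client-side polling loop might look like the sketch below. The free function job_status mirrors the Flow::status logic shown earlier so the example is self-contained; when linking against the library, prefer the real Flow API. The wait_for helper and its 2-second interval are assumptions:

```cpp
#include <chrono>
#include <filesystem>
#include <string>
#include <thread>

namespace fs = std::filesystem;

enum class Status { Queued, Running, Done, Failed, Missing };

// Illustrative mirror of Flow::status: derive state from directory location.
Status job_status(const fs::path& workspace, const std::string& id) {
    if (fs::exists(workspace / "output" / id))      return Status::Done;
    if (fs::exists(workspace / "failed" / id))      return Status::Failed;
    if (fs::exists(workspace / "processing" / id))  return Status::Running;
    if (fs::exists(workspace / "input/ready" / id)) return Status::Queued;
    return Status::Missing;
}

// Poll at a modest interval until the job reaches a terminal state.
// MISSING is treated as terminal so a typo'd ID doesn't spin forever.
Status wait_for(const fs::path& workspace, const std::string& id) {
    for (;;) {
        Status s = job_status(workspace, id);
        if (s == Status::Done || s == Status::Failed || s == Status::Missing)
            return s;
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}
```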

Clean Up Completed Jobs

Manually delete jobs from output/ and failed/ to prevent disk buildup.

Check for FAILED State

Always handle the FAILED state in your client code.

Never Modify Directories Manually

Let the system manage state transitions. Manual moves can cause race conditions.

See Also

Architecture

Overall system design and components

Filesystem Queue

Directory-based queue implementation

Work API

Job submission API reference

Flow API

Result retrieval API reference
