Aeolos implements preemptive multitasking with a round-robin scheduler (priority levels are defined but not yet used in scheduling decisions). Tasks can run in kernel mode or user mode, and each task has its own kernel stack and an optional virtual address space.

Task Structure

Each task is represented by a task_t structure:
kernel/proc/task.h
typedef struct task_t {
    void* kstack_top;         // kernel stack top
    uint64_t cr3;             // virtual address space

    tid_t tid;                // task id
    priority_t priority;      // task priority
    tstatus_t status;         // current status of task
    timeval_t wakeuptime;     // time at which task should wake up
    tmode_t mode;             // kernel mode or usermode
    void* kstack_limit;       // kernel stack limit
    vec_struct(vfs_node_desc_t*) openfiles; // stores open files

    struct task_t* next;
    struct task_t* prev;
} task_t;

Task State

The CPU state for each task is stored on its kernel stack:
kernel/proc/task.h
typedef struct {
    uint64_t r15;
    uint64_t r14;
    uint64_t r13;
    uint64_t r12;
    uint64_t r11;
    uint64_t r10;
    uint64_t r9;
    uint64_t r8;
    uint64_t rbp;
    uint64_t rdi;
    uint64_t rsi;
    uint64_t rdx;
    uint64_t rcx;
    uint64_t rbx;
    uint64_t rax;
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
} task_state_t;

Task Status Flags

kernel/proc/task.h
typedef enum {
    TASK_RUNNING = 1 << 0,  // Currently executing on a CPU
    TASK_BLOCKED = 1 << 1,  // Blocked waiting for an event
    TASK_ASLEEP  = 1 << 2,  // Sleeping for a specified time
    TASK_DEAD    = 1 << 3   // Marked for cleanup
} tstatus_t;
#define TASK_READY 0  // Ready to run (no flags set)

Priority Levels

kernel/proc/task.h
typedef uint8_t priority_t;
#define PRIORITY_IDLE 0   // Idle tasks only
#define PRIORITY_BG   5   // Background tasks
#define PRIORITY_MIN  10  // Minimum priority
#define PRIORITY_MID  35  // Normal priority
#define PRIORITY_MAX  70  // Maximum priority
Currently, priority values are defined but not actively used by the scheduler. The scheduler implements round-robin scheduling regardless of priority.

Task Modes

kernel/proc/task.h
typedef enum {
    TASK_KERNEL_MODE,
    TASK_USER_MODE
} tmode_t;
Segment selectors are set based on task mode:
kernel/proc/task.h
#define KMODE_CS 0x08  // Kernel code segment
#define KMODE_SS 0x10  // Kernel stack segment
#define UMODE_CS 0x1b  // User code segment
#define UMODE_SS 0x23  // User stack segment
#define RFLAGS_DEFAULT 0x0202  // Interrupts enabled

Task Creation

Tasks are created with the task_make() function:
kernel/proc/task.c
task_t* task_make(void (*entry)(tid_t), priority_t priority, tmode_t mode, void* rsp, uint64_t pagemap)
{
    // try to allocate a tid
    tid_t ntask_tid = atomic_fetch_inc(&curr_tid);
    if (ntask_tid > TID_MAX) {
        ntask_tid = TID_MAX;
        klog_err("could not allocate tid\n");
        return NULL;
    }

    // allocate memory for the new task and its stack
    task_t* ntask = kmalloc(sizeof(task_t));
    ntask->kstack_limit = kmalloc(KSTACK_SIZE);
    ntask->kstack_top = ntask->kstack_limit + KSTACK_SIZE;

    // create the stack frame and update the state to defaults
    task_state_t* ntask_state = ntask->kstack_top - sizeof(task_state_t);
    if (mode == TASK_KERNEL_MODE) {
        ntask_state->cs = KMODE_CS;
        ntask_state->ss = KMODE_SS;
    } else {
        ntask_state->cs = UMODE_CS;
        ntask_state->ss = UMODE_SS;
    }
    ntask_state->rflags = RFLAGS_DEFAULT;
    ntask_state->rip = (uint64_t)entry;
    ntask_state->rsp = rsp ? (uint64_t)rsp : (uint64_t)ntask->kstack_top;
    ntask_state->rdi = ntask_tid; // pass the tid to the task

    // initialize the task
    if (pagemap)
        ntask->cr3 = pagemap;
    else
        read_cr("cr3", &(ntask->cr3));

    ntask->kstack_top = ntask_state;
    ntask->tid = ntask_tid;
    ntask->priority = priority;
    ntask->status = TASK_READY;
    ntask->wakeuptime = 0;
    vec_init(ntask->openfiles);

    return ntask;
}

Task ID Allocation

kernel/proc/task.c
// highest used tid
static tid_t curr_tid = 0;
Task IDs are allocated atomically starting from 0. Maximum TID is 65536:
kernel/proc/task.h
typedef size_t tid_t;
#define TID_MAX 65536

Kernel Stack

Each task has a 4KB kernel stack:
kernel/proc/task.h
#define KSTACK_SIZE 4096
The 4KB kernel stack is relatively small. Deep call chains or large stack allocations could cause overflow. Stack guards are not currently implemented.

Adding Tasks to Scheduler

The convenience function task_add() creates a task and adds it to the scheduler:
kernel/proc/task.c
tid_t task_add(void (*entry)(tid_t), priority_t priority, tmode_t mode, void* rsp, uint64_t pagemap)
{
    task_t* t = task_make(entry, priority, mode, rsp, pagemap);
    if (t) {
        sched_add(t);
        return t->tid;
    }
    return -1;
}

Scheduler

The scheduler implements preemptive multitasking using the APIC timer.

Scheduler State

kernel/proc/sched/sched.c
static lock_t sched_lock;

// an idle task for each cpu
static task_t* tasks_idle[CPU_MAX];

// currently running tasks on each cpu
static task_t* tasks_running[CPU_MAX];

// list of all tasks
vec_new_static(task_t*, tasks_all);

// asleep and dead tasks respectively
static tqueue_t tasks_asleep, tasks_dead;

Task Queues

The scheduler uses a doubly-linked list structure for task queues:
kernel/proc/sched/tqueue.h
typedef struct {
    task_t* front;
    task_t* back;
    uint64_t len;
} tqueue_t;
Queue operations:
void tq_push_front(tqueue_t* q, task_t* t)
{
    t->prev = NULL;
    t->next = q->front;
    if (q->front)
        q->front->prev = t;
    else
        q->back = t;
    q->front = t;
    q->len++;
}

Scheduler Initialization

kernel/proc/sched/sched.c
void sched_init(void (*first_task)(tid_t))
{
    uint16_t cpu_id = smp_get_current_info()->cpu_id;

    /*
        create idle task for current core and set it to running
        idle tasks are removed from task list to prevent the
        scheduler from attempting to schedule them
    */
    tid_t tid_idle = task_add(idle, 0, TASK_KERNEL_MODE, NULL, 0);
    tasks_idle[cpu_id] = tasks_all.data[tid_idle];
    tasks_running[cpu_id] = tasks_all.data[tid_idle];
    tasks_all.data[tid_idle] = NULL;

    // scheduler has been started on the bsp, do some initialization
    if (first_task) {
        task_add(first_task, PRIORITY_MID, TASK_KERNEL_MODE, NULL, 0);
        task_add(sched_janitor, PRIORITY_MIN, TASK_KERNEL_MODE, NULL, 0);
        klog_ok("started on bsp\n");
    }

    // configure and start the lapic timer
    apic_timer_set_period(TIMESLICE_DEFAULT);
    apic_timer_set_mode(APIC_TIMER_MODE_ONESHOT);
    apic_timer_set_handler(init_context_switch);
    apic_timer_start();
}

Time Slice

kernel/proc/sched/sched.c
#define TIMESLICE_DEFAULT MILLIS_TO_NANOS(2)
Each task gets a 2 millisecond time slice before preemption.

Idle Task

When no tasks are ready to run, the CPU executes the idle task:
kernel/proc/sched/sched.c
_Noreturn static void idle(tid_t tid)
{
    (void)tid;
    while (true)
        asm volatile("hlt");
}
The hlt instruction halts the CPU until the next interrupt, saving power when idle.

Context Switching

Context switches are triggered by the APIC timer interrupt. The entry stub pushes all general-purpose registers on top of the frame the CPU already pushed, so the stack then holds a complete task_state_t, whose address is passed to _do_context_switch. On the way out, finish_context_switch receives a task_t* in %rdi; because kstack_top is the first member of task_t, mov (%rdi), %rsp loads the saved stack pointer directly.

Assembly Entry Point

kernel/proc/sched/sched_asm.s
.global init_context_switch
.global finish_context_switch

.extern _do_context_switch

init_context_switch:
    push %rax
    push %rbx
    push %rcx
    push %rdx
    push %rsi
    push %rdi
    push %rbp
    push %r8
    push %r9
    push %r10
    push %r11
    push %r12
    push %r13
    push %r14
    push %r15

    mov %rsp, %rdi
    call _do_context_switch

finish_context_switch:
    mov (%rdi), %rsp

    pop %r15
    pop %r14
    pop %r13
    pop %r12
    pop %r11
    pop %r10
    pop %r9
    pop %r8
    pop %rbp
    pop %rdi
    pop %rsi
    pop %rdx
    pop %rcx
    pop %rbx
    pop %rax

    iretq

Scheduler Logic

kernel/proc/sched/sched.c
void _do_context_switch(task_state_t* state)
{
    lock_wait(&sched_lock);
    uint16_t cpu = smp_get_current_info()->cpu_id;

    // save state of current task
    task_t* curr = tasks_running[cpu];
    curr->kstack_top = state;
    curr->status &= ~TASK_RUNNING;

    // wake up tasks which need to be woken up
    while (tasks_asleep.back && tasks_asleep.back->wakeuptime < hpet_get_nanos()) {
        task_t* t = tq_pop_back(&tasks_asleep);
        t->status &= ~TASK_ASLEEP;
    }

    // choose next task
    task_t* next = NULL;
    for (tid_t i = curr->tid + 1; i < tasks_all.len; i++) {
        task_t* t = tasks_all.data[i];
        if (t != NULL && t->status == TASK_READY) {
            next = t;
            goto task_chosen;
        }
    }
    for (tid_t i = 0; i <= curr->tid; i++) {
        task_t* t = tasks_all.data[i];
        if (t != NULL && t->status == TASK_READY) {
            next = t;
            goto task_chosen;
        }
    }

task_chosen:
    // could not find a runnable task, so idle
    if (!next)
        next = tasks_idle[cpu];

    // we've chosen next task
    next->status |= TASK_RUNNING;
    tasks_running[cpu] = next;

    // set rsp0 in the tss
    smp_get_current_info()->tss.rsp0 = (uint64_t)(next->kstack_limit + KSTACK_SIZE);
    apic_timer_set_period(TIMESLICE_DEFAULT);
    apic_send_eoi();

    // release sched_lock and switch to next task
    // TODO: possible stack-related race condition?
    lock_release(&sched_lock);
    finish_context_switch(next);
}

Round-Robin Selection

The scheduler searches for the next ready task starting from curr->tid + 1, wrapping around to implement round-robin scheduling:
for (tid_t i = curr->tid + 1; i < tasks_all.len; i++) {
    task_t* t = tasks_all.data[i];
    if (t != NULL && t->status == TASK_READY) {
        next = t;
        goto task_chosen;
    }
}
for (tid_t i = 0; i <= curr->tid; i++) {
    task_t* t = tasks_all.data[i];
    if (t != NULL && t->status == TASK_READY) {
        next = t;
        goto task_chosen;
    }
}

Task Sleeping

Tasks can sleep for a specified duration:
kernel/proc/sched/sched.c
void sched_sleep(timeval_t nanos)
{
    // if sleep time is too little, busy sleep
    if (nanos <= TIMESLICE_DEFAULT) {
        hpet_nanosleep(nanos);
        return;
    }
    lock_wait(&sched_lock);

    task_t* curr = tasks_running[smp_get_current_info()->cpu_id];
    curr->wakeuptime = hpet_get_nanos() + nanos;
    curr->status |= TASK_ASLEEP;
    add_sleeping_sorted(curr);

    lock_release(&sched_lock);
    asm volatile("hlt");
}
Sleeping tasks are kept in a sorted queue by wakeup time:
kernel/proc/sched/sched.c
static void add_sleeping_sorted(task_t* t)
{
    if (!tasks_asleep.front)
        tq_push_front(&tasks_asleep, t);
    else if (tasks_asleep.back->wakeuptime > t->wakeuptime)
        tq_insert_after(&tasks_asleep, tasks_asleep.back, t);
    else {
        for (task_t* i = tasks_asleep.front; i; i = i->next) {
            if (i->wakeuptime <= t->wakeuptime) {
                tq_insert_after(&tasks_asleep, i->prev, t);
                break;
            }
        }
    }
}
For very short sleep durations (less than one time slice), the kernel uses busy-waiting with the HPET instead of scheduling.

Blocking and Unblocking

Tasks can be blocked to wait for events:
int64_t sched_block(tid_t tid)
{
    lock_wait(&sched_lock);

    if (!IS_TID_VALID(tid)) {
        klog_err("there's no task with tid %d\n", tid);
        lock_release(&sched_lock);
        return -1;
    }

    task_t* t = tasks_all.data[tid];
    if (t->status & TASK_BLOCKED) {
        klog_err("task %d is already blocked\n", tid);
        lock_release(&sched_lock);
        return -1;
    }
    t->status |= TASK_BLOCKED;

    bool is_current = (t == tasks_running[smp_get_current_info()->cpu_id]);

    lock_release(&sched_lock);

    if (is_current)
        asm volatile("hlt");
    return 0;
}

Task Cleanup

Dead tasks are cleaned up by a dedicated janitor task:
kernel/proc/sched/sched.c
_Noreturn static void sched_janitor(tid_t tid)
{
    (void)tid;
    while (true) {
        lock_wait(&sched_lock);
        task_t* t;
        while ((t = tq_pop_back(&tasks_dead))) {
            tasks_all.data[t->tid] = NULL;
            kmfree(t->kstack_limit);
            kmfree(t);
        }
        lock_release(&sched_lock);
        sched_sleep(SECONDS_TO_NANOS(1));
    }
}

Killing Tasks

kernel/proc/sched/sched.c
int64_t sched_kill(tid_t tid)
{
    lock_wait(&sched_lock);

    if (!IS_TID_VALID(tid)) {
        klog_err("there's no task with tid %d\n", tid);
        lock_release(&sched_lock);
        return -1;
    }

    // mark the task as dead, it will be cleaned up later
    task_t* t = tasks_all.data[tid];
    t->status |= TASK_DEAD;
    tasks_all.data[tid] = NULL;
    tq_push_front(&tasks_dead, t);

    // if it was sleeping, remove it from the sleeping list
    if (t->status & TASK_ASLEEP) {
        tq_remove(t, &tasks_asleep);
        t->status &= ~TASK_ASLEEP;
    }

    bool is_current = (t == tasks_running[smp_get_current_info()->cpu_id]);

    lock_release(&sched_lock);

    // we can't return if it was the current one
    if (is_current)
        asm volatile("hlt");
    return 0;
}
The janitor runs once per second, so dead tasks may linger briefly before their memory is reclaimed. Their entries in the task list, however, are cleared immediately by sched_kill().

SMP Support

The scheduler is designed for SMP systems:
  • Each CPU has its own idle task and running task pointer
  • The scheduler lock protects shared data structures
  • sched_init() must be called on each CPU

API Summary

Task Management

Function      Description
task_make()   Create a new task structure
task_add()    Create a task and add it to the scheduler

Scheduler Operations

Function             Description
sched_init()         Initialize the scheduler on a CPU
sched_add()          Add an existing task to the scheduler
sched_sleep()        Sleep the current task for a duration
sched_block()        Block a task indefinitely
sched_unblock()      Unblock a blocked task
sched_kill()         Kill a task
sched_get_current()  Get the currently running task

Design Summary

Tasks are preempted every 2ms by the APIC timer. Context switches save/restore all general-purpose registers and use iretq to switch privilege levels.
The scheduler selects the next ready task in a circular fashion. Priority values are defined but not currently used in scheduling decisions.
Tasks can be READY, RUNNING, BLOCKED, ASLEEP, or DEAD. Multiple flags can be set simultaneously (e.g., RUNNING | ASLEEP).
Dead tasks are queued and cleaned up asynchronously by the janitor task, preventing scheduler complexity during task termination.

Next Steps

  • Architecture Overview: return to the high-level architecture overview
  • Boot Process: learn about kernel initialization
