The Linux scheduler multiplexes CPU time across runnable tasks. It is structured as a hierarchy of scheduling classes, each implementing a distinct policy. The core dispatcher (kernel/sched/core.c) iterates classes from highest to lowest priority, calling pick_next_task() on each until one returns a runnable task.

Scheduling classes

Scheduling classes are defined by struct sched_class and form a linked priority list. Each class manages its own runqueues and implements hooks for enqueueing, dequeueing, and selecting the next task.

Stop (stop_sched_class)

Highest priority. Used internally for CPU migration and hotplug. Not accessible from user space.

Deadline (dl_sched_class)

Earliest Deadline First (EDF) scheduling for tasks with hard real-time deadlines declared via sched_setattr(2). Guarantees worst-case latency bounds.

Real-Time (rt_sched_class)

POSIX SCHED_FIFO and SCHED_RR. Uses 99 priority levels (1–99). Always preempts CFS tasks. Implemented in kernel/sched/rt.c.

Fair (fair_sched_class)

The default class for ordinary tasks (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE). Implemented by CFS/EEVDF in kernel/sched/fair.c.

Idle (idle_sched_class)

Runs the per-CPU idle thread when no other task is runnable. Puts the CPU into a low-power state.
Each class provides these hooks (partial list from kernel/sched/sched.h):
struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)(struct rq *rq);
    void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
    struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
    void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
    void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
};

Completely Fair Scheduler (CFS)

CFS (kernel/sched/fair.c) models an ideal, precise multi-tasking CPU that runs all tasks simultaneously, each at 1/nr_running speed. On real hardware it approximates this ideal using the concept of virtual runtime. From the kernel documentation (Documentation/scheduler/sched-design-CFS.rst):
80% of CFS’s design can be summed up in a single sentence: CFS basically models an “ideal, precise multi-tasking CPU” on real hardware.

vruntime

Every task tracks its CPU consumption in p->se.vruntime (nanoseconds). CFS normalises actual runtime by the task’s weight (derived from its nice value), so a higher-priority task accumulates vruntime more slowly than a lower-priority one.
/*
 * Minimal preemption granularity for CPU-bound tasks (kernel/sched/fair.c):
 * default 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds
 */
unsigned int sysctl_sched_base_slice = 750000ULL;
CFS always picks the task with the smallest vruntime — that is, the task that has received the least CPU time relative to its entitlement.

Red-black tree runqueue

CFS stores all runnable tasks in a time-ordered red-black tree keyed by vruntime. The leftmost node is always the next task to run. This provides O(log n) enqueue/dequeue and O(1) min lookup.
          [vruntime=50]
         /             \
   [vt=30]           [vt=80]
   /     \            /    \
[vt=20] [vt=40]  [vt=60]  [vt=100]
  ^
  next task to run (leftmost node)
The runqueue also maintains rq->cfs.min_vruntime, a monotonically increasing value used to place newly woken tasks near the left of the tree — preventing stale tasks from unfairly jumping to the front after a long sleep.

Priority and nice values

Nice values map to scheduling weights. The weight ratio between adjacent nice levels is approximately 1.25×, so each step represents roughly a 10% change in CPU share.
# Run a process at nice 10 (lower priority)
nice -n 10 ./myprogram

# Change the nice value of a running process
renice -n -5 -p <pid>

# Use SCHED_BATCH for non-interactive workloads
chrt --batch 0 ./batch_job
nice value   weight   relative share
-20          88761    highest
0            1024     default
19           15       lowest

EEVDF scheduler

Starting with Linux 6.6, the Earliest Eligible Virtual Deadline First (EEVDF) scheduler is integrated into the fair class, progressively replacing the classical CFS pick logic. From Documentation/scheduler/sched-eevdf.rst:
Similarly to CFS, EEVDF aims to distribute CPU time equally among all runnable tasks with the same priority. To do so, it assigns a virtual run time to each task, creating a “lag” value that can be used to determine whether a task has received its fair share of CPU time.

How EEVDF works

1. Compute lag. Each task has a lag value: positive lag means the task is owed CPU time; negative lag means it has exceeded its share.

2. Find eligible tasks. Only tasks with lag >= 0 are eligible to run next. This prevents tasks that have already overconsumed from being selected.

3. Select earliest virtual deadline. Among eligible tasks, EEVDF picks the one with the earliest virtual deadline (VD), calculated from the task's vruntime and its requested time slice.

4. Handle sleeping tasks. When a task sleeps, it can remain on the runqueue marked for "delayed dequeue". Its lag decays over virtual runtime, preventing short sleeps from gaming the scheduler.
Tasks can request specific time slices using sched_setattr(2) with the sched_runtime field, enabling latency-sensitive applications to receive shorter, more frequent time slices.
/* Note: glibc gained a sched_setattr() wrapper only in version 2.41;
 * on older systems invoke it via syscall(SYS_sched_setattr, ...). */
struct sched_attr attr = {
    .size           = sizeof(attr),
    .sched_policy   = SCHED_NORMAL,
    .sched_runtime  = 1000000,   /* 1 ms time slice request */
};
sched_setattr(0, &attr, 0);

Real-time scheduling

Real-time tasks use SCHED_FIFO or SCHED_RR and always preempt CFS tasks. RT tasks run at priorities 1–99 (99 is highest).
A FIFO task runs until it voluntarily yields, blocks, or is preempted by a higher-priority RT task. There is no time quantum — it can run indefinitely.
struct sched_param param = { .sched_priority = 50 };
sched_setscheduler(0, SCHED_FIFO, &param);
RT tasks can starve CFS tasks indefinitely. The real-time throttling mechanism (/proc/sys/kernel/sched_rt_runtime_us and sched_rt_period_us) limits the fraction of CPU time RT tasks can consume, defaulting to 950 ms out of every 1000 ms.

SMP load balancing

On multi-core systems, the scheduler must distribute work across CPUs. Linux models the CPU topology as a hierarchy of scheduling domains built from the system’s cache and NUMA topology.
NUMA node 0          NUMA node 1
┌──────────────┐    ┌──────────────┐
│ Socket 0     │    │ Socket 1     │
│ ┌───┐ ┌───┐ │    │ ┌───┐ ┌───┐ │
│ │C0 │ │C1 │ │    │ │C2 │ │C3 │ │
│ └───┘ └───┘ │    │ └───┘ └───┘ │
└──────────────┘    └──────────────┘
Load balancing is triggered periodically from each CPU's scheduler tick via the SCHED_SOFTIRQ softirq (run_rebalance_domains(), renamed sched_balance_softirq() in newer kernels), or immediately when a CPU goes idle. The balancer moves tasks from overloaded CPUs to idle ones while respecting NUMA affinity preferences to minimise remote memory accesses.
# View per-CPU scheduler statistics
cat /proc/schedstat

# Pin a process to specific CPUs
taskset -c 0,1 ./myprogram

# Pin using a cpuset cgroup (legacy cgroup v1 interface)
mkdir /sys/fs/cgroup/cpuset/myset
echo 0-3 > /sys/fs/cgroup/cpuset/myset/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/myset/cpuset.mems
echo $$  > /sys/fs/cgroup/cpuset/myset/tasks

CPU affinity and cgroups

CPU affinity

CPU affinity restricts which CPUs a task is allowed to run on. The kernel stores the allowed mask in task_struct->cpus_mask.
/* Set affinity from kernel code */
set_cpus_allowed_ptr(task, cpumask);

/* From user space */
// sched_setaffinity(pid, cpusetsize, mask);

cgroup CPU control

cgroups v2 provides two CPU-related controllers, cpu and cpuset, exposing the following interface files:

cpu.max (bandwidth limiting)

Limits CPU time using the CFS bandwidth mechanism. A group with cpu.max = 200000 1000000 gets at most 200 ms of CPU per 1000 ms period.
mkdir /sys/fs/cgroup/limited
echo "200000 1000000" > /sys/fs/cgroup/limited/cpu.max
echo $$ > /sys/fs/cgroup/limited/cgroup.procs

cpu.weight (proportional shares)

Proportional scheduling via cpu.weight (1–10000, default 100). Analogous to the classic cpu.shares.
# Give group A twice the CPU of group B
echo 200 > /sys/fs/cgroup/groupA/cpu.weight
echo 100 > /sys/fs/cgroup/groupB/cpu.weight

cpuset (CPU and memory pinning)

Pins tasks in a cgroup to specific CPUs and NUMA memory nodes. Useful for real-time and latency-sensitive workloads that must avoid cache interference.
echo "0-3"  > /sys/fs/cgroup/myset/cpuset.cpus
echo "0"    > /sys/fs/cgroup/myset/cpuset.mems

Group scheduling with CFS

With CONFIG_FAIR_GROUP_SCHED, CFS operates on scheduling entities that can represent either individual tasks or task groups. This allows fair CPU distribution among users or containers before distributing within each group.
# Legacy cgroup v1 example: give multimedia group twice the share of browser
mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
mkdir /sys/fs/cgroup/cpu/multimedia
mkdir /sys/fs/cgroup/cpu/browser
echo 2048 > /sys/fs/cgroup/cpu/multimedia/cpu.shares
echo 1024 > /sys/fs/cgroup/cpu/browser/cpu.shares
