The core scheduler (kernel/sched/core.c) iterates classes from highest to lowest priority, calling pick_next_task() on each until one returns a runnable task.
Scheduling classes
Scheduling classes are defined by struct sched_class and form a linked priority list. Each class manages its own runqueues and implements hooks for enqueueing, dequeueing, and selecting the next task.
Stop (stop_sched_class)
Highest priority. Used internally for CPU migration and hotplug. Not accessible from user space.
Deadline (dl_sched_class)
Earliest Deadline First (EDF) scheduling for tasks with hard real-time deadlines declared via sched_setattr(2). Guarantees worst-case latency bounds.
Real-Time (rt_sched_class)
POSIX SCHED_FIFO and SCHED_RR. Uses 99 priority levels (1–99). Always preempts CFS tasks. Implemented in kernel/sched/rt.c.
Fair (fair_sched_class)
The default class for ordinary tasks (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE). Implemented by CFS/EEVDF in kernel/sched/fair.c.
Idle (idle_sched_class)
Runs the per-CPU idle thread when no other task is runnable. Puts the CPU into a low-power state.
The class priority order is defined in kernel/sched/sched.h.
Completely Fair Scheduler (CFS)
CFS (kernel/sched/fair.c) models an ideal, precise multi-tasking CPU that runs all tasks simultaneously, each at 1/nr_running speed. On real hardware it approximates this ideal using the concept of virtual runtime.
From the kernel documentation (Documentation/scheduler/sched-design-CFS.rst):
80% of CFS’s design can be summed up in a single sentence: CFS basically models an “ideal, precise multi-tasking CPU” on real hardware.
vruntime
Every task tracks its CPU consumption in p->se.vruntime (nanoseconds). CFS normalises actual runtime by the task’s weight (derived from its nice value), so a higher-priority task accumulates vruntime more slowly than a lower-priority one.
Red-black tree runqueue
CFS stores all runnable tasks in a time-ordered red-black tree keyed by vruntime. The leftmost node is always the next task to run. This provides O(log n) enqueue/dequeue and O(1) lookup of the minimum. Each CFS runqueue also tracks rq->cfs.min_vruntime, a monotonically increasing value used to place newly woken tasks near the left of the tree, preventing a task returning from a long sleep from carrying a stale vruntime and unfairly jumping to the front.
Priority and nice values
Nice values map to scheduling weights. The weight ratio between adjacent nice levels is approximately 1.25×, so each step represents roughly a 10% change in CPU share.

| nice value | weight | relative share |
|---|---|---|
| -20 | 88761 | highest |
| 0 | 1024 | default |
| 19 | 15 | lowest |
EEVDF scheduler
Starting with Linux 6.6, the Earliest Eligible Virtual Deadline First (EEVDF) scheduler is integrated into the fair class, progressively replacing the classical CFS pick logic. From Documentation/scheduler/sched-eevdf.rst:
Similarly to CFS, EEVDF aims to distribute CPU time equally among all runnable tasks with the same priority. To do so, it assigns a virtual run time to each task, creating a “lag” value that can be used to determine whether a task has received its fair share of CPU time.
How EEVDF works
Compute lag
Each task has a lag value: positive lag means the task is owed CPU time; negative lag means it has exceeded its share.
Find eligible tasks
Only tasks with lag >= 0 are eligible to run next. This prevents tasks that have already overconsumed from being selected.
Select earliest virtual deadline
Among eligible tasks, EEVDF picks the one with the earliest virtual deadline (VD), calculated from the task’s vruntime and its requested time slice.
Tasks can request a custom slice length via sched_setattr(2) with the sched_runtime field, enabling latency-sensitive applications to receive shorter, more frequent time slices.
Real-time scheduling
Real-time tasks use SCHED_FIFO or SCHED_RR and always preempt CFS tasks. RT tasks run at priorities 1–99 (99 is highest).
- SCHED_FIFO
- SCHED_RR
- SCHED_DEADLINE
A FIFO task runs until it voluntarily yields, blocks, or is preempted by a higher-priority RT task. There is no time quantum — it can run indefinitely.
SMP load balancing
On multi-core systems, the scheduler must distribute work across CPUs. Linux models the CPU topology as a hierarchy of scheduling domains built from the system’s cache and NUMA topology. Load balancing runs via run_rebalance_domains() on each CPU’s scheduler tick, or immediately when a CPU goes idle. The balancer moves tasks from overloaded CPUs to idle ones while respecting NUMA affinity preferences to minimise remote memory accesses.
CPU affinity and cgroups
CPU affinity
CPU affinity restricts which CPUs a task is allowed to run on. The kernel stores the allowed mask in task_struct->cpus_mask.
cgroup CPU control
cgroups v2 provides CPU control through the cpu and cpuset controllers:
cpu controller (bandwidth)
Limits CPU time using the CFS bandwidth mechanism. A group with cpu.max = 200000 1000000 gets at most 200 ms of CPU per 1000 ms period.
cpu controller (weight)
Proportional scheduling via cpu.weight (1–10000, default 100). Analogous to the classic cpu.shares.
cpuset controller
Pins tasks in a cgroup to specific CPUs and NUMA memory nodes. Useful for real-time and latency-sensitive workloads that must avoid cache interference.
Group scheduling with CFS
With CONFIG_FAIR_GROUP_SCHED, CFS operates on scheduling entities that can represent either individual tasks or task groups. This allows fair CPU distribution among users or containers before distributing within each group.
